Lead Machine Learning Engineer (m/w/d) at ThoughtWorks
50825 Köln, Nordrhein-Westfalen, Germany
Full Time


Start Date

Immediate

Expiry Date

24 May, 25

Salary

0.0

Posted On

24 Feb, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Failure Modes, Fine Tuning, Accountability, Training, Buy In, Stakeholder Management

Industry

Information Technology/IT

Description

THE TEAM

This team will provide 24x7 white-glove support to people using large blocks of GPUs (6,000+ contiguous GPUs) for a short period of time (e.g., 6 or 12 weeks) to perform Managed Post Training. This includes help with preparation, 24x7 support during training to ensure full utilization of the GPU clusters, and off-boarding. The team spans three time zones (US, Europe and India), with hand-off protocols to enable 24x7 support.

TECHNICAL SKILLS

  • You have proven experience in distributed training of large language models (LLMs) across multiple worker nodes and GPUs.
  • You have a deep understanding of LLM architectures, including transformer-based models, and a demonstrated ability to design and implement custom models.
  • You have expertise in monitoring large training jobs in a distributed environment and the ability to debug job failures.
  • You have deep expertise in PyTorch (or TensorFlow) and in debugging training failure modes.
  • You have deep knowledge of fine-tuning or training with open-weight Gen AI models (e.g., Llama, Mistral, Gemma).
  • You have previous experience with Weights & Biases, Run:ai, PyTorch, TensorFlow, and Hugging Face libraries.
  • You have experience with, but not limited to, the NVIDIA NeMo stack (for both training and inference).

PROFESSIONAL SKILLS

  • You will be part of a client-facing, white-glove service where a high level of professionalism is required.
  • You understand the importance of stakeholder management and can easily liaise between clients and other key stakeholders throughout projects, ensuring buy-in and gaining trust along the way.
  • You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives.
  • You don’t shy away from risks or conflicts; instead, you take them on and manage them skillfully.
  • You are eager to coach, mentor and motivate others and you aspire to influence teammates to take positive action and accountability for their work.
  • You enjoy influencing others and always advocate for technical excellence while being open to change when needed.

Responsibilities

THE ROLE

While you can be a specialist in machine learning engineering (MLE), you also need a working knowledge of cluster operations.

JOB RESPONSIBILITIES

  • You will help shape and iterate on this new white-glove support service.
  • You will work in close collaboration with a Lead Cluster Operations Support Engineer.
  • You will contribute to accelerator development: identify gaps in the tooling, missing automation, or recurring patterns for which we could develop accelerators that make the next round faster and more efficient. For example, we may need to improve observability, automate user onboarding, or bring in a new tool that everyone seems to want to use.
  • You will help assess the model training readiness and data preparation.
  • You will provide model training support in rotating daytime weekend shifts, with pagers, for any issues users may encounter. These can range from infrastructure issues to data science issues or anything in between, e.g., AWS changed a configuration in EKS that affects the training.
  • You will facilitate collaborative problem solving within the team by actively listening, communicating effectively and mentoring other engineers.
  • You will contribute to the development and execution of the team’s overall ML strategy, aligning technical capabilities with business objectives.
  • You will proactively identify and address challenges related to the white-glove service for continued pre-training, proposing solutions and implementing improvements.