Lead Cluster Operations Support Engineer (m/f/d) at ThoughtWorks
20355 Hamburg, Neustadt, Germany
Full Time


Start Date

Immediate

Expiry Date

24 May, 25

Salary

0.0

Posted On

24 Feb, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Buy-In, Accountability, Training, Azure, Code, AWS, Stakeholder Management, Linux

Industry

Information Technology/IT

Description

This team will provide 24x7 white-glove support to people using large blocks of GPUs (6,000+ contiguous GPUs) for a short period of time (e.g., 6 weeks or 12 weeks) to perform Managed Post Training (MPT). This includes helping with preparation, providing 24x7 support during training to ensure full utilization of the GPU clusters, and off-boarding. The team is spread across three time zones (US, Europe and India), with hand-off protocols to enable 24x7 support. While you can be a specialist in infrastructure and cluster operations, you also need to know enough about ML to support training workloads.

TECHNICAL SKILLS

  • Deep expertise in Kubernetes administration and debugging at scale.
  • Deep knowledge of managing large Kubernetes clusters with thousands of nodes.
  • Knowledge of running training workloads on 1000s of GPUs.
  • Knowledge of working with the Lustre filesystem is a plus.
  • Knowledge of working with NVIDIA NeMo Framework (Docker image for model training).
  • Knowledge of working with NVIDIA NeMo NIMs (Docker images for inference).
  • Underlying Cloud: GCP, AWS, Azure.
  • Terraform / Pulumi, Helm Charts, Linux, other Infrastructure-as-code tools.
  • Nice to have: Run:ai, TrueFoundry, Hugging Face platform, etc. (training can be provided).
  • Knowledge of working with HPC technologies such as Slurm is a bonus.

PROFESSIONAL SKILLS

  • You will be part of a high-value, client-facing white-glove service, where a high level of professionalism is required.
  • You understand the importance of stakeholder management and can easily liaise between clients and other key stakeholders throughout projects, ensuring buy-in and gaining trust along the way.
  • You are resilient in ambiguous situations and can adapt your role to approach challenges from multiple perspectives.
  • You don’t shy away from risks or conflicts; instead, you take them on and skillfully manage them.
  • You are eager to coach, mentor and motivate others and you aspire to influence teammates to take positive action and accountability for their work.
  • You enjoy influencing others and always advocate for technical excellence while being open to change when needed.
  • You have an insatiable curiosity and a drive to learn new things.

RESPONSIBILITIES

  • You will help shape and iterate on this new white-glove model training support service on large GPU clusters.
  • You will work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers.
  • You will contribute to accelerator development: identify gaps in the tooling, missing automation, or recurring patterns for which we could develop accelerators that make the next engagement faster and more efficient. For example, we may need to improve observability, automate user onboarding, or bring in a new tool that everyone wants to use. This will probably involve a combination of Terraform/Pulumi, Helm Charts, Python and shell scripts.
  • You will help assess the model training readiness and data preparation.
  • You will provide model training support in rotating daytime weekend shifts, with pagers, for any issues clients may encounter. These can range from infrastructure issues to data science issues or anything in between, e.g., GCP changing a GKE configuration in a way that affects the training.
  • You will facilitate collaborative problem solving within the team by actively listening, communicating effectively and mentoring other engineers.
  • You will proactively identify and address challenges related to the white-glove service for continued pre-training, proposing solutions and implementing improvements.