Software Engineer - LLM Training at CentML
SFBA, California, USA
Full Time


Start Date

Immediate

Expiry Date

18 Jul, 25

Salary

0.0

Posted On

18 Apr, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Good communication skills

Industry

Computer Software/Engineering

Description

ABOUT US

We believe AI will fundamentally transform how people live and work. CentML’s mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.
Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.

ABOUT THE POSITION

We are seeking highly skilled and motivated software engineers to join our team and empower AI practitioners to develop AI models on the CentML Platform productively and affordably. If you have launched multi-node distributed training jobs before and experienced firsthand how painful and cumbersome it is to get them functional, let alone high-performing, and you want to be part of the team that builds solutions to this challenge so that other AI practitioners don’t feel the same pain you did, please come and join us!

Responsibilities
  • Design and implement highly efficient distributed training systems for large-scale deep learning models.
  • Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.
  • Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.
  • Productionize the training systems on the CentML Platform.
  • Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.
  • Contribute to the design of APIs, abstractions and UX that make it easier to scale models while maintaining usability and flexibility.
  • Profile, debug, and tune performance at the system, model, and hardware levels.
  • Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals.
  • Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.