Software Engineer, Accelerator Systems & Technologies at Meta
Remote, Oregon, USA
Full Time


Start Date

Immediate

Expiry Date

21 Nov, 25

Salary

85.1

Posted On

21 Aug, 25

Experience

6 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Scalability, Computer Engineering, CUDA, Computer Science, Design, Measurements, Specifications

Industry

Information Technology/IT

Description

Meta is seeking an experienced software engineer to join our Accelerator Solutions & Technologies group, supporting the development of the collective communications software library for Meta’s accelerators and optimizing the performance of distributed AI/ML workloads. This is an opportunity to work with a highly skilled engineering team, collaborating with a large set of cross-functional and international partners. Meta’s next-generation, super-cluster AI/ML platforms offer the opportunity to work in an extremely dynamic environment, enabling core technologies deployed in some of the world’s largest-scale clusters.

MINIMUM QUALIFICATIONS:

  • Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Master’s or PhD in Computer Science, Computer Engineering, or any other relevant technical field
  • 6+ years of experience developing C++ codebases
  • Understanding of performance measurement, benchmarking, and optimization for collective communication and distributed model training at scale

PREFERRED QUALIFICATIONS:

  • Understanding of the transport stack (e.g., RoCE), its constraints and performance measures, and how transport considerations enable the collective communications stack.
  • Knowledge of AI/HPC hardware requirements and specifications (e.g., configuring hardware components, GPU, memory, network for AI/HPC workloads).
  • Full-stack experience with and understanding of AI/HPC systems, from hardware and infrastructure through the application layer and performance optimization, including familiarity with relevant tools, libraries, and frameworks (e.g., NCCL, PyTorch, CUDA).
  • Experience in one or more of the following machine learning domains: hardware accelerators, AI infrastructure, and/or high-performance computing (HPC), particularly as they pertain to interconnects and collective communication.

RESPONSIBILITIES:

  • Understand and contribute to the collective communications library, intended to be deployed on Meta’s AI/ML superclusters
  • Design and implement communication features for next generation AI/ML workloads
  • Support networking and compute hardware acceleration techniques to improve ML model inference and training performance
  • Support large-scale deployment of collective communication libraries across Meta’s infrastructure
  • Perform architectural analysis to ensure system designs meet performance, scalability, and reliability requirements
  • Analyze simulation results to guide firmware development and optimization efforts