Software Engineer, Accelerator Systems & Technologies at Meta

Remote, Oregon, USA

Full Time

Start Date

Immediate

Expiry Date

21 Nov, 25

Salary

85.1

Posted On

21 Aug, 25

Experience

6 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Scalability, Computer Engineering, CUDA, Computer Science, Design, Measurements, Specifications

Industry

Information Technology/IT

Description

Meta is seeking an experienced software engineer to join our Accelerator Solutions & Technologies group, supporting the development of Meta’s collective communications software library for accelerators and optimizing the performance of distributed AI/ML workloads. This is an opportunity to work with a highly skilled engineering team and to collaborate with a large set of cross-functional and international partners. Meta’s next-generation, super-cluster AI/ML platforms offer the opportunity to work in an extremely dynamic environment, enabling core technologies deployed in some of the world’s largest-scale clusters.

MINIMUM QUALIFICATIONS:

Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Master’s or PhD in Computer Science, Computer Engineering, or any other relevant technical field
6+ years of experience developing C++ codebases
Understanding of performance measurement, benchmarking, and optimization of collective communication and distributed model training at scale

PREFERRED QUALIFICATIONS:

Understanding of the transport stack (e.g., RoCE), its constraints and performance measures, and how transport considerations enable the collective communications stack.
Knowledge of AI/HPC hardware requirements and specifications (e.g., configuring hardware components, GPU, memory, network for AI/HPC workloads).
Full-stack experience and understanding of AI/HPC systems, from hardware and infrastructure through the application layer, including performance optimization and familiarity with relevant tools, libraries, and frameworks (e.g., NCCL, PyTorch, CUDA).
Experience in one or more of the following machine learning domains: hardware accelerators, AI infrastructure, and/or high-performance computing (HPC), particularly as they pertain to interconnects and collective communication.

Responsibilities

Understand and contribute to the collective communications library, intended to be deployed on Meta’s AI/ML superclusters
Design and implement communication features for next-generation AI/ML workloads
Support networking and compute hardware acceleration techniques to improve ML model inference and training performance
Support large-scale deployment of collective communication libraries across Meta’s infrastructure
Perform architectural analysis to ensure system designs meet performance, scalability, and reliability requirements
Analyze simulation results to guide firmware development and optimization efforts