Software Engineer, Accelerator Solutions & Technologies at Meta
Menlo Park, CA 94025, USA
Full Time


Start Date

Immediate

Expiry Date

16 Nov, 25

Salary

56.25

Posted On

16 Aug, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Optimization, Architects, CUDA, Computer Science, SystemC, Teams, Computer Engineering, Performance Measurement, Scalability, Specifications, Engineers

Industry

Information Technology/IT

Description

ENGINEERING

Meta is seeking an experienced software engineer to join our Accelerator Solutions & Technologies group, supporting the development of the collective communications software library for Meta’s accelerators and optimizing the performance of distributed AI/ML workloads. This is an opportunity to work with a highly skilled engineering team, collaborating with a large set of cross-functional and international partners. Meta’s next-generation, super-cluster AI/ML platforms offer the opportunity to work in an extremely dynamic environment, enabling core technologies deployed in some of the world’s largest-scale clusters.

MINIMUM QUALIFICATIONS

  • Currently has, or is in the process of obtaining, a Bachelor’s degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Master’s or PhD in Computer Science, Computer Engineering, or any other relevant technical field
  • 2+ years of experience developing in a C++ codebase
  • 2+ years of experience developing in a Python codebase
  • Understanding of performance measurement, benchmarking, and optimization for collective communications and distributed at-scale model training

PREFERRED QUALIFICATIONS

  • Experience with SystemC
  • Knowledge of AI/HPC hardware requirements and specifications (e.g., configuring hardware components for AI/HPC workloads)
  • Understanding of the transport stack (e.g., RoCE) and its constraints, particularly as they pertain to interconnects and collective communications
  • Familiarity with relevant tools, libraries, and frameworks (e.g., PyTorch, CUDA)
  • Full-stack experience and understanding of AI/HPC systems, with a focus on the application layer and performance optimizations

RESPONSIBILITIES

  • Contribute to our developer infrastructure, including simulation and HW emulation platforms, to enable performance measurement and optimization for Meta’s in-house accelerator programs
  • Understand and contribute to the collective communications library, intended to be deployed on Meta’s AI/ML superclusters
  • Support networking and compute hardware acceleration techniques to improve the performance of ML model training and inference
  • Perform architectural analysis to ensure system designs meet performance, scalability, and reliability requirements
  • Implement simulation models for Meta’s accelerator ASICs, and develop and analyze scenarios to evaluate data center performance and identify potential improvements
  • Collaborate with architects and engineers to integrate simulation results into system design processes
  • Use instruction set simulators to define performant firmware for Meta’s training/inference accelerators
  • Collaborate with hardware and firmware teams to ensure accurate modeling and simulation of accelerator functionalities
  • Analyze simulation results to guide firmware development and optimization efforts