Software Engineer - XR Codec Interactions and Avatars Team at OCULUS

Redmond, Washington, USA

Full Time

Start Date

Immediate

Expiry Date

01 Aug, 25

Salary

70.67

Posted On

01 May, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Forecasting, LAMP, Network Security, Kubernetes, File Systems, Computer Science, Capacity Planning, Unit Testing, Web Technologies, MySQL, Python, Optimization, User Experience, Telemetry, Computer Engineering, Performance Measurement, Rust, C++, Logging

Industry

Information Technology/IT

Description

XR Codec Interactions and Avatars (XRCIA) brings together a highly interdisciplinary team of researchers and engineers to create the future of augmented and virtual reality. On the Research Oriented Cluster Foundations team, you’ll build and maintain tools, libraries, and frameworks that help researchers collaborate with each other and empower their research towards the generation of Codec Interactions and Avatars. Our team cultivates an honest and considerate environment where self-motivated individuals thrive. We encourage ownership and embrace the ambiguity that comes with working on the frontiers of research.

In this software engineer role, you will serve as the point of contact for Meta’s research GPU super clusters. You are a hybrid software/systems/infrastructure engineer who ensures that Meta’s Research Super Clusters run smoothly and have the capacity for future growth. Our team is composed of people with varied levels of experience and backgrounds. Relevant industry experience is important (Software Engineer, Site Reliability Engineer (SRE), Systems Engineer, DevOps Engineer, Network Engineer, or similar role), but ultimately less so than your demonstrated attitude. We sail into uncharted waters every day at Meta’s large-scale ML model training GPU clusters, and we are always learning.

MINIMUM QUALIFICATIONS:

Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
3+ years of experience in UNIX/Linux and a clear understanding of TCP/IP network fundamentals
5+ years of experience coding in at least one of the following languages: C++, Python, or Rust
Experience with software development practices such as source control, code reviews, unit testing, debugging and profiling
Experience with Internet service architecture capacity planning and/or handling needs for urgent capacity augmentation
Knowledge of common web technologies and/or Internet service architectures (such as LAMP or MEAN stacks, CDN, Load Balancing techniques, etc.)
Experience configuring and running infrastructure level applications, such as Kubernetes, Terraform, MySQL, SLURM, etc.

PREFERRED QUALIFICATIONS:

Thorough understanding of the Linux operating system, including the networking subsystem
Experience in distributed system performance measurement, logging, and optimization
Experience with Python library management systems such as Conda
Prior experience in cluster oncall operations, including troubleshooting server/scheduler/storage errors, maintaining compute/storage environments/libraries/tools, helping onboard users to the cluster, and answering general questions from users
Prior experience in cluster coordination and strategy planning, including collecting/understanding needs of users, developing tools to improve user experience, providing guidance on best practices, forecasting compute/storage needs, and developing long-term user experience/compute/storage strategies
Prior experience building tooling for monitoring and telemetry
Prior experience in developing/managing distributed network file systems
Prior experience in network security

Responsibilities

Leverage the scale and complexity of the larger Meta infrastructure to accelerate our Codec Interaction and Avatars projects
Influence outcomes within your immediate team, peer engineering teams, and with cross-functional stakeholders
Work independently, handle large projects simultaneously, and prioritize team roadmap and deliverables by balancing required effort with resulting impact
Own Research Super Cluster back-end services which handle fleet management, infrastructure components that drive Meta’s advances in AI, core services which are used by every team at XRCIA, networking systems, and everything in between
Develop and review code, develop documentation and capacity plans, and debug the hardest problems, all live, on some of the largest and most complex systems in the world
Share an on-call rotation with your engineering team and serve as an escalation contact for service incidents. Lead incident root cause analysis across multiple data engineering layers (compute, storage, network) for GPU clusters, and act as a final escalation point