Backend Software Engineer (ML Infra) at Rockstar

San Francisco, California, United States

Full Time

Start Date

Immediate

Expiry Date

22 Mar, 26

Salary

0.0

Posted On

22 Dec, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Backend Engineering, Distributed Systems, Networking, Backend Architecture, Python, Go, Cloud-Native Systems, Containers, Orchestration, Kubernetes, GPU Workloads, Monitoring, Logging, Observability, ML Infrastructure, Training Pipelines

Industry

Philanthropic Fundraising Services

Description

Rockstar is recruiting for a mobile-first digital product studio that turns ideas into extraordinary experiences. They are a team of dynamic and savvy professionals who know how to create killer digital products. Our lean structure and remote team mean we can move fast while still delivering top-notch technology and design.

Our client is building the AI backbone for the next generation of intelligent products. They help fast-growing AI startups design, fine-tune, evaluate, deploy, and maintain specialized models across text, vision, and embeddings. Think of them as “AWS for AI models”: not data or raw compute, but a full-stack backend for fine-tuning, reinforcement learning, inference, and long-term model maintenance. Their customers are Series A–C AI companies building enterprise-grade products. Their promise is simple: they make your AI system better.

They are hiring a Backend Software Engineer (ML Infrastructure) to help design, build, and scale the core systems that power large-scale model training and deployment. The candidate will work on distributed training pipelines, cloud-native infrastructure, and internal developer platforms that support fine-tuning, reinforcement learning, and inference at scale. This role sits at the intersection of backend engineering and ML systems; the candidate will collaborate closely with ML engineers while owning production-grade infrastructure. This is an ideal role for an early-career engineer who wants to work on real distributed systems, GPU workloads, and modern ML infrastructure, not dashboards or CRUD apps.

What You’ll Do

Build & Scale Core Infrastructure

- Design and implement backend systems that support large-scale ML workloads, including fine-tuning and reinforcement learning.
- Build distributed training and inference pipelines that are efficient, fault-tolerant, and observable.
- Develop internal developer tools and platforms that make it easier for ML engineers to train, evaluate, and deploy models.

Cloud & Systems Engineering

- Work on cloud-native systems using containers and orchestration (e.g., Kubernetes).
- Optimize systems for performance, reliability, and cost efficiency, especially for GPU-heavy workloads.
- Implement monitoring, logging, and observability for long-running training jobs and production services.

Collaborate with ML Engineers

- Partner closely with ML engineers to support evolving model architectures, training workflows, and evaluation needs.
- Translate ML requirements into scalable backend and infrastructure solutions.

Who You Are

Required

- 1–3 years of backend engineering experience, ideally working on production systems.
- Strong fundamentals in distributed systems, networking, and backend architecture.
- Experience building systems that scale under real load.
- Comfortable working in Python and/or Go (or similar backend languages).
- Excited to work on-site in San Francisco with a fast-moving early-stage team.

Strongly Preferred

- Experience with or exposure to ML infrastructure or ML platforms.
- Familiarity with GPU workloads, training pipelines, or inference systems.
- Experience with containerization and orchestration (Docker, Kubernetes).
- Contributions to or deep familiarity with ML infrastructure libraries such as:
  - Ray
  - vLLM
  - SGLang
  - or similar distributed ML systems

Bonus

- Computer science background from a top-tier program or equivalent demonstrated excellence.
- Open-source contributions, research projects, or side projects in systems or ML infrastructure.
- A track record of high ownership and technical curiosity.

Responsibilities

The candidate will design, build, and scale core systems for large-scale model training and deployment. They will work on distributed training pipelines and internal developer platforms that support ML workflows.