Senior Site Reliability Engineer - Control Plane at Lambda
San Francisco, California, USA
Full Time


Start Date

Immediate

Expiry Date

10 Jul, 25

Salary

$245,000 - $385,000 per year

Posted On

10 Apr, 25

Experience

5 year(s) or above

Remote Job

No (hybrid: San Francisco office 4 days per week)

Telecommute

Yes

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

In 2012, Lambda started with a crew of AI engineers publishing research at top machine-learning conferences. We began as an AI company built by AI engineers. That hasn’t changed. Today, we’re on a mission to be the world’s top AI computing platform. We equip engineers with the tools to deploy AI that is fast, secure, affordable, and built to scale. Whether they need powerhouse GPU hardware on-site or the flexibility of cloud-based solutions, we’ve got the horsepower to make it happen. Lambda’s AI Cloud has been adopted by the world’s leading companies and research institutions including Anyscale, Rakuten, The AI Institute, and multiple enterprises with over a trillion dollars of market capitalization. Our goal is to make computation as effortless and ubiquitous as electricity.

If you’d like to build the world’s best deep learning cloud, join us.

  • Note: This position requires presence in our San Francisco office 4 days per week; Lambda’s designated work-from-home day is currently Tuesday.

SALARY RANGE INFORMATION

Based on market data and other factors, the annual salary range for this position is $245,000 - $385,000. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

Responsibilities
  • Design and implement cloud-native architectures that deliver “four nines” (99.99%) reliability while balancing performance and cost efficiency
  • Develop comprehensive monitoring and alerting systems with actionable dashboards that provide real-time visibility into system health
  • Implement SLIs, SLOs, and SLAs across services and maintain error budgets to guide development priorities
  • Automate deployments using tools like Argo and Terraform
  • Create robust incident management processes, escalation paths, and documentation
  • Architect fault-tolerant systems with graceful degradation capabilities to handle component failures
  • Design and implement disaster recovery solutions with regular testing procedures
  • Lead post-incident reviews that focus on systemic improvements rather than individual blame
  • Champion reliability best practices and system design principles
  • Build automated, auditable, and compliant processes to improve efficiency and productivity