Senior Site Reliability Engineer - Control Plane at Lambda

San Francisco, California, USA

Full Time

Start Date

Immediate

Expiry Date

10 Jul, 25

Salary

$245,000 - $385,000 per year

Posted On

10 Apr, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Good communication skills

Industry

Information Technology/IT

Description

In 2012, Lambda started with a crew of AI engineers publishing research at top machine-learning conferences. We began as an AI company built by AI engineers. That hasn’t changed. Today, we’re on a mission to be the world’s top AI computing platform. We equip engineers with the tools to deploy AI that is fast, secure, affordable, and built to scale. Whether they need powerhouse GPU hardware on-site or the flexibility of cloud-based solutions, we’ve got the horsepower to make it happen. Lambda’s AI Cloud has been adopted by the world’s leading companies and research institutions including Anyscale, Rakuten, The AI Institute, and multiple enterprises with over a trillion dollars of market capitalization. Our goal is to make computation as effortless and ubiquitous as electricity.

If you’d like to build the world’s best deep learning cloud, join us.

Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

SALARY RANGE INFORMATION

Based on market data and other factors, the annual salary range for this position is $245,000 - $385,000. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

Responsibilities

Design and implement cloud-native architectures that deliver “four nines” (99.99%) reliability while balancing performance and cost efficiency
Develop comprehensive monitoring and alerting systems with actionable dashboards that provide real-time visibility into system health
Implement SLIs, SLOs, and SLAs across services and maintain error budgets to guide development priorities
Automate deployments using tools like Argo and Terraform
Create robust incident management processes, escalation paths, and documentation
Architect fault-tolerant systems with graceful degradation capabilities to handle component failures
Design and implement disaster recovery solutions with regular testing procedures
Lead post-incident reviews that focus on systemic improvements rather than individual blame
Champion reliability best practices and system design principles
Build automated, auditable, and compliant processes to improve efficiency and productivity
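
For context on the “four nines” target and the error budgets mentioned above, the arithmetic can be sketched as follows. This is a minimal illustration, not Lambda's tooling: the function names and the 30-day and 365-day windows are assumptions chosen for the example, and it treats availability purely as time-based uptime.

```python
# Illustrative only: convert an availability target into a downtime budget.
# Function names and window lengths are hypothetical, chosen for this example.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Return the downtime budget, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = allowed_downtime_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # 99.99% availability leaves roughly 4.32 minutes per 30 days
    # and roughly 52.56 minutes per 365 days.
    print(round(allowed_downtime_minutes(0.9999, 30), 2))
    print(round(allowed_downtime_minutes(0.9999, 365), 2))
```

Under these assumptions, a service that has already been down for about 2.16 minutes in a 30-day window has spent half of its budget, which is the kind of signal an error budget uses to guide development priorities.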