Site Reliability Engineer, AI Infrastructure & Observability
at Tesla

Palo Alto, California, USA

Start Date	Expiry Date	Salary	Posted On	Experience	Skills	Telecommute	Sponsor Visa
Immediate	29 May, 2024	USD 348000 Annual	01 Mar, 2024	3 year(s) or above	Bash,Python,Internal Customers,Distributed Systems,Splunk	No	No

Required Visa Status:

Citizen	GC
US Citizen	Student Visa
H1B	CPT
OPT	H4 Spouse of H1B
GC Green Card

Employment Type:

Full Time	Part Time
Permanent	Independent - 1099
Contract – W2	C2H Independent
C2H W2	Contract – Corp 2 Corp
Contract to Hire – Corp 2 Corp

Description:

Tesla participates in the E-Verify Program
What to Expect
Tesla’s NOC supports global Infrastructure, Manufacturing, and Applications to identify and resolve problems with high-speed cross-team collaboration. As part of this function, we work closely with the high-performance computing and AI infrastructure teams within IT Infrastructure. With the rapidly growing need for more data and optimized compute resources, our observability and service delivery need to scale in parallel. We are looking for a Site Reliability Engineer to join our team with a focus on AI Infrastructure and Observability. This hybrid role will work closely with Incident Management as well as observability, traffic, and other software & infrastructure leads to monitor and optimize Tesla’s AI Infrastructure.
As a Site Reliability Engineer, you will be responsible for problem detection and escalation for our AI Infrastructure, ensuring engineering teams across Autopilot/AI and Dojo have the necessary tools and resources to be productive. This is a hands-on technical role, and a successful candidate should combine strong technical, analytical, and service delivery backgrounds.

What You’ll Do

Collaborate with a cross-functional team of SRE engineers, architects, and other stakeholders to understand complex application architectures, enabling the implementation of an effective top-down monitoring strategy for holistic service visibility
Build, maintain, and monitor dashboards for critical infrastructure
Create and tune alerts for network and hardware so that potential problems are identified, routed, and remediated early
Facilitate knowledge sharing by creating and maintaining detailed and comprehensive documentation, diagrams, and runbooks
Respond to and resolve support requests in a timely fashion while managing project timelines and other responsibilities
Serve as a frontline support resource to AI Software teams to triage problems and engage relevant engineering support
Participate in 24x7 on-call rotation

What You’ll Bring

Sound judgement, outstanding communication, & ability to work with internal customers in a fast-paced, high visibility role
Proficiency in a high-level programming or scripting language (Python, Golang, Bash)
Experience with troubleshooting distributed systems
Strong knowledge of multiple observability tools: Splunk, Prometheus/Alertmanager, Synthetic Monitoring, Grafana
Prior experience with Catchpoint and/or Kentik is a plus
Strong understanding of Linux fundamentals (Ubuntu/RHEL OS)
Excellent understanding of Network and Traffic fundamentals
Experience in collaborating with network and data center teams for large scale infrastructure support
3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Compensation and Benefits
Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits at day 1 of hire:

Aetna PPO and HSA plans > 2 medical plan options with $0 payroll deduction
Family-building, fertility, adoption and surrogacy benefits
Dental (including orthodontic coverage) and vision plans, both have options with a $0 paycheck contribution
Company Paid (Health Savings Account) HSA Contribution when enrolled in the High Deductible Aetna medical plan with HSA
Healthcare and Dependent Care Flexible Spending Accounts (FSA)
LGBTQ+ care concierge services
401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
Company paid Basic Life, AD&D, short-term and long-term disability insurance
Employee Assistance Program
Sick and Vacation time (Flex time for salary positions), and Paid Holidays
Back-up childcare and parenting support resources
Voluntary benefits to include: critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
Weight Loss and Tobacco Cessation Programs
Tesla Babies program
Commuter benefits
Employee discounts and perks program

Expected Compensation
$104,000 - $348,000/annual salary + cash and stock awards + benefits
Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
Tesla is an Equal Opportunity / Affirmative Action employer committed to diversity in the workplace. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, age, national origin, disability, protected veteran status, gender identity or any other factor protected by applicable federal, state or local laws.
Tesla is also committed to working with and providing reasonable accommodations to individuals with disabilities. Please let your recruiter know if you need an accommodation at any point during the interview process.

REQUIREMENT SUMMARY

Experience: Min: 3.0, Max: 8.0 year(s)

Industry: Information Technology/IT

Functional area of job: IT Software - Other

Domain: Software Engineering

Qualifications: Graduate

English Proficiency: Proficient

Number of posts: 1

Address of job: Palo Alto, CA, USA
