Site Reliability Engineer, AI Infrastructure & Observability

at Tesla

Palo Alto, California, USA

Start Date: Immediate
Expiry Date: 29 May, 2024
Salary: USD 348000 Annual
Posted On: 01 Mar, 2024
Experience: 3 year(s) or above
Skills: Bash, Python, Internal Customers, Distributed Systems, Splunk
Telecommute: No
Sponsor Visa: No

Description:

Tesla participates in the E-Verify Program
What to Expect
Tesla’s NOC supports global Infrastructure, Manufacturing, and Applications to identify and resolve problems with high-speed cross-team collaboration. As part of this function, we work closely with the high-performance computing and AI infrastructure teams within IT Infrastructure. With the rapidly growing need for more data and optimized compute resources, our observability and service delivery need to scale in parallel. We are looking for a Site Reliability Engineer to join our team with a focus on AI Infrastructure and Observability. This hybrid role will work closely with Incident Management as well as observability, traffic, and other software and infrastructure leads to monitor and optimize Tesla’s AI Infrastructure.
As a Site Reliability Engineer, you will be responsible for problem detection and escalation for our AI Infrastructure, ensuring engineering teams across Autopilot/AI and Dojo have the necessary tools and resources to be productive. This is a hands-on technical role, and a successful candidate should combine strong technical, analytical, and service delivery backgrounds.

What You’ll Do

  • Collaborate with a cross-functional team of SREs, architects, and other stakeholders to understand complex application architectures, enabling the implementation of an effective top-down monitoring strategy for holistic service visibility
  • Build, maintain, and monitor dashboards for critical infrastructure
  • Create and tune alerts for network and hardware so that potential problems are identified, routed, and remediated early
  • Facilitate knowledge sharing by creating and maintaining detailed and comprehensive documentation, diagrams, and runbooks
  • Respond to and resolve support requests in a timely fashion while managing project timelines and other responsibilities
  • Serve as a frontline support resource to AI Software teams to triage problems and engage relevant engineering support
  • Participate in 24x7 on-call rotation

What You’ll Bring

  • Sound judgement, outstanding communication, and ability to work with internal customers in a fast-paced, high-visibility role
  • Proficiency in a high-level programming or scripting language (Python, Golang, Bash)
  • Experience with troubleshooting distributed systems
  • Strong knowledge of multiple observability tools: Splunk, Prometheus/Alertmanager, Synthetic Monitoring, Grafana
  • Prior experience with Catchpoint and/or Kentik is a plus
  • Strong understanding of Linux fundamentals (Ubuntu/RHEL OS)
  • Excellent understanding of Network and Traffic fundamentals
  • Experience in collaborating with network and data center teams for large scale infrastructure support
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Compensation and Benefits
Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits at day 1 of hire:

  • Aetna PPO and HSA plans: 2 medical plan options with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans, both with options at a $0 paycheck contribution
  • Company-paid Health Savings Account (HSA) contribution when enrolled in the High Deductible Aetna medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • LGBTQ+ care concierge services
  • 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
  • Company-paid Basic Life, AD&D, short-term and long-term disability insurance
  • Employee Assistance Program
  • Sick and Vacation time (Flex time for salary positions), and Paid Holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits to include: critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
  • Weight Loss and Tobacco Cessation Programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program

Expected Compensation
$104,000 - $348,000/annual salary + cash and stock awards + benefits
Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
Tesla is an Equal Opportunity / Affirmative Action employer committed to diversity in the workplace. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, age, national origin, disability, protected veteran status, gender identity or any other factor protected by applicable federal, state or local laws.
Tesla is also committed to working with and providing reasonable accommodations to individuals with disabilities. Please let your recruiter know if you need an accommodation at any point during the interview process.



REQUIREMENT SUMMARY

Min: 3.0, Max: 8.0 year(s)

Information Technology/IT

IT Software - Other

Software Engineering

Graduate

Proficient

1

Palo Alto, CA, USA