Site Reliability Engineer, Consultant at Blue Shield of California

Oakland, CA 94607, USA

Full Time

Start Date

Immediate

Expiry Date

06 Dec, 25

Salary

0.0

Posted On

07 Sep, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Python, Java, Bash, PowerShell, Containerization, Kubernetes, Azure, Dynatrace, New Relic, Splunk, Infrastructure Optimization, Production Systems, Orchestration, SolarWinds, Computer Science, Docker, Scripting Languages

Industry

Information Technology/IT

Description

YOUR WORK

In this role, you will:

Design, build, and maintain scalable infrastructure using modern cloud technologies.
Develop automation tools to streamline operations and reduce manual effort.
Implement and manage CI/CD pipelines for rapid and reliable software delivery.
Monitor system performance and availability using advanced observability tools.
Conduct root cause analysis and postmortems for production incidents.
Define and enforce service-level objectives (SLOs) and error budgets.
Collaborate with engineering teams to improve system architecture and reliability.

YOUR KNOWLEDGE AND EXPERIENCE

Requires a BS degree in computer science or an equivalent field with 5+ years of experience, or an MS degree.
Requires 7+ years of experience engineering and/or operating production systems, or equivalent.
Cloud Platforms: Azure, AWS, GCP.
Programming & Scripting Languages: Python, Go, Java, Bash, PowerShell or similar.
Containerization & Orchestration: Red Hat OpenShift, Kubernetes, Docker, Helm.
Monitoring & Observability: Prometheus, Grafana, Datadog, New Relic, ELK Stack, Dynatrace, Splunk, BigPanda, SolarWinds.
CI/CD & Configuration Management: Jenkins, GitHub Actions, GitLab CI, Argo CD, Spinnaker, Ansible, Chef, Puppet.
Intelligent Automation & Agentic Systems: Familiarity with Agentic AI systems and autonomous workflows for incident resolution, observability, and infrastructure optimization.

PHYSICAL REQUIREMENTS:

Office environment: roles involving a part-time to full-time schedule in an office environment, based in our physical offices or a work-from-home office, performing deskwork. Activity level: sedentary. Frequency: most of the workday.

Responsibilities

YOUR ROLE

The Technology Operations Center (TOC) team provides 24x7 coverage of observability monitoring events, including batch operations, to ensure that critical business services execute and complete successfully within required timelines. The Site Reliability Engineer will report to the Manager, TOC. In this role, you will be responsible for the reliability, scalability, and performance of our infrastructure and applications. You will work closely with development and operations teams to automate processes, monitor systems, and respond to incidents.

Our leadership model is about developing great leaders at all levels and creating opportunities for our people to grow personally, professionally, and financially. We are looking for leaders who are energized by creative and critical thinking, building and sustaining high-performing teams, getting results the right way, and fostering continuous learning.
