SITE RELIABILITY ENGINEER IV (HYBRID) at Green Shield

Toronto, ON M2N 6L7, Canada

Full Time

Start Date

Immediate

Expiry Date

08 Oct, 25

Salary

0.0

Posted On

08 Jul, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Incident Response, Recovery Plans, Investigation, Root, Reliability, Maintenance, Scalability

Industry

Information Technology/IT

Description

WHO WE ARE

When it comes to health, we’re always looking for ways to push for better. It’s why we were founded in the first place. In 1957, our founder, pharmacist William Wilkinson, witnessed a mother sacrifice her health by forgoing her own medicine to pay for her sick daughter’s prescription. He knew there had to be a better way. So, he introduced North America’s first prepaid drug plan, and GreenShield was born as a not-for-profit with a mission to support better health for all Canadians.
We aren’t just a health and benefits company. We’re the only not-for-profit social enterprise that brings worlds of coverage and care together, all in one place.
We’re noble challengers, purposefully building a better way, and we need the best people to help us create a more holistic approach that takes care of the mind and body.
Our mission is to create better health for all Canadians, and we know that starts with our employees.

A Site Reliability Engineer (SRE) works with development and operations teams to ensure the operation, reliability, performance and scalability of the GreenShield Labs core systems that are the backbone of the GreenShield+ experience. A successful candidate has a working knowledge of the inner workings of Google Cloud infrastructure and has experience applying SRE capabilities in observability, monitoring, analysis and incident response. Reporting to the Engineering Manager, SRE, they will:

Enhance service observability and monitoring in a cloud-first environment, predominantly in the GCP ecosystem
Set up, configure, and maintain automated tooling and alerts designed to monitor and report on application Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Create and update issue and incident playbooks around service disruptions and aid in root cause analysis and improvement activities
Add to the on-call capabilities for incident response and investigation while escalating as needed to the application development teams
Develop and test disaster recovery plans and resiliency events to ensure continuous and reliable operations and processes
Collaborate with the application development teams on pipeline creation and maintenance for zero-downtime delivery of the applications
Help design and implement phased rollout approaches for releases
