Manager, Site Reliability Engineering (SRE) - eCommerce

at The Home Depot Canada

Toronto, ON, Canada

Start Date: Immediate
Expiry Date: 31 Jan, 2025
Salary: Not Specified
Posted On: 31 Oct, 2024
Experience: 4 year(s) or above
Skills: Large Scale Systems, Architecture, Computer Science, Vendors
Telecommute: No
Sponsor Visa: No

Description:

WITH A CAREER AT THE HOME DEPOT, YOU CAN BE YOURSELF AND ALSO BE PART OF SOMETHING BIGGER.

Position Overview:
The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with emphasis on DevOps principles and GCP expertise.

Responsibilities:

  • Leadership & Management:
      • Lead and mentor a team of Site Reliability Engineers
      • Foster a culture of continuous improvement and innovation
      • Collaborate with cross-functional teams to align SRE practices with business objectives
  • Reliability & Performance:
      • Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments
      • Implement and promote performance engineering practices to ensure optimal system performance on GCP
      • Develop and maintain service level objectives (SLOs) and error budgets
  • Production Engineering & Operational Support:
      • Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices
      • Manage incident response and post-incident reviews to minimize downtime and improve system resilience
      • Implement monitoring, alerting, and observability solutions to proactively identify and address issues
      • Develop and maintain runbooks and playbooks for common operational tasks
      • Coordinate with security teams to ensure compliance with security policies and best practices
  • DevOps & Continuous Improvement:
      • Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services
      • Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments
      • Identify and implement automation opportunities to reduce manual tasks and improve efficiency
      • Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources
      • Continuously evaluate and integrate new tools and technologies to enhance DevOps practices
  • Release Management:
      • Implement and maintain release management best practices to minimize disruptions and maximize system stability
      • Collaborate with DevOps teams to integrate release management into CI/CD pipelines
      • Oversee release schedules, ensuring minimal impact on business operations
      • Ensure a rigorous release readiness process is in place that includes reviews and post-release retrospectives
      • Maintain a release calendar and communicate release plans to stakeholders
  • Strategic Planning:
      • Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements
      • Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency
      • Address customer pain points by developing and implementing solutions that improve user experience and system reliability
      • Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution
      • Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology

Experience:

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Strong problem-solving and analytical abilities
  • Excellent communication and collaboration skills
  • 4-6 years of relevant work experience, including significant experience with GCP
  • Extensive experience with cloud infrastructure, GCP services and architecture
  • Proven track record of managing and optimizing large-scale systems on GCP
  • Proven ability to effectively communicate with individuals at all levels of the organization
  • Ability to maintain relationships and negotiate with vendors.
  • Ability to operate in and leverage resources in a matrixed environment.
  • Ability to analyze and present data to support ideas.
  • Ability to clearly communicate to all levels of the organization.


REQUIREMENT SUMMARY

Min: 4.0, Max: 6.0 year(s)

Information Technology/IT

IT Software - Other

Other

Graduate

Computer science engineering or a related field

Proficient

1

Toronto, ON, Canada