Manager, Site Reliability Engineering (SRE) - eCommerce
at The Home Depot Canada
Toronto, ON, Canada
Start Date | Expiry Date | Salary | Posted On | Experience | Skills | Telecommute | Sponsor Visa |
---|---|---|---|---|---|---|---|
Immediate | 31 Jan, 2025 | Not Specified | 31 Oct, 2024 | 4 year(s) or above | Large Scale Systems, Architecture, Computer Science, Vendors | No | No |
Description:
WITH A CAREER AT THE HOME DEPOT, YOU CAN BE YOURSELF AND ALSO BE PART OF SOMETHING BIGGER.
Position Overview:
The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with emphasis on DevOps principles and GCP expertise.
Responsibilities:
- Leadership & Management:
  - Lead and mentor a team of Site Reliability Engineers
  - Foster a culture of continuous improvement and innovation
  - Collaborate with cross-functional teams to align SRE practices with business objectives
- Reliability & Performance:
  - Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments
  - Implement and promote performance engineering practices to ensure optimal system performance on GCP
  - Develop and maintain service level objectives (SLOs) and error budgets
- Production Engineering & Operational Support:
  - Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices
  - Manage incident response and post-incident reviews to minimize downtime and improve system resilience
  - Implement monitoring, alerting, and observability solutions to proactively identify and address issues
  - Develop and maintain runbooks and playbooks for common operational tasks
  - Coordinate with security teams to ensure compliance with security policies and best practices
- DevOps & Continuous Improvement:
  - Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services
  - Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments
  - Identify and implement automation opportunities to reduce manual tasks and improve efficiency
  - Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources
  - Continuously evaluate and integrate new tools and technologies to enhance DevOps practices
- Release Management:
  - Implement and maintain release management best practices to minimize disruptions and maximize system stability
  - Collaborate with DevOps teams to integrate release management into CI/CD pipelines
  - Oversee release schedules, ensuring minimal impact on business operations
  - Ensure there is a rigorous release readiness process in place that includes reviews and post-release retrospectives
  - Maintain a release calendar and communicate release plans to stakeholders
- Strategic Planning:
  - Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements
  - Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency
  - Address customer pain points by developing and implementing solutions that improve user experience and system reliability
  - Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution
  - Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology
Experience:
- Bachelor’s degree in Computer Science, Engineering, or a related field
- Strong problem-solving and analytical abilities
- Excellent communication and collaboration skills
- 4-6 years of relevant work experience, including significant experience with GCP
- Extensive experience with cloud infrastructure, GCP services and architecture
- Proven track record of managing and optimizing large-scale systems on GCP
- Proven ability to effectively communicate with individuals at all levels of the organization
- Ability to maintain relationships and negotiate with vendors
- Ability to operate in and leverage resources in a matrixed environment
- Ability to analyze and present data to support ideas
- Ability to clearly communicate to all levels of the organization
REQUIREMENT SUMMARY
Min: 4.0, Max: 6.0 year(s)
Information Technology/IT
IT Software - Other
Other
Graduate
Computer science engineering or a related field
Proficient
1
Toronto, ON, Canada