Site Reliability Engineer - SRE at SPS Commerce

Brampton, ON L6T 4B5, Canada

Full Time

Start Date

Immediate

Expiry Date

08 Jul, 25

Salary

71,300

Posted On

08 Apr, 25

Experience

1 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

EC2, Kubernetes, Infrastructure, Amazon Web Services, Task Execution, Python, Linux, IT, Logging, Docker

Industry

Information Technology/IT

Description

SPS Commerce is a leading provider of cloud-based supply chain management solutions, serving a global network of retail trading partners. We foster a collaborative and inclusive work environment where innovation and continuous improvement are highly valued. Join SPS Commerce and be part of a dynamic team that’s transforming the global retail supply chain!

POSITION SUMMARY:

You would join the SPS Commerce Site Reliability Engineering (SRE) team, which is responsible for delivering highly available platform services and deployment automation. These give our product engineering teams services that are secure, reliable, and cost-effective, and that let them ship quickly.
You would work in a fast-paced, collaborative environment, partnering with the development team to deliver market-leading products and services. This role uses automation and other technologies to handle challenging failures intelligently, while collaborating with various engineering organizations to address failure risks at their source.
The SRE team at SPS approaches operations as a software problem and applies software engineering practices to solving it.

REQUIRED QUALIFICATIONS:

2+ years of IT experience with a Bachelor’s degree; or 1 year and a Master’s degree; or a PhD with no experience; or equivalent work experience
Experience with Golang
Experience with platform and service mesh technologies such as Docker and Kubernetes
Experience administering Linux
Experience with Amazon Web Services (EC2, IAM, Route 53, CloudFormation)
Experience with immutable and scalable infrastructure (infrastructure-as-code concepts)
Demonstrated understanding of networking systems and of various identity and authorization systems

PREFERRED EXPERIENCE:

Experience working with Agile development methodologies and task execution
Experience building or operating CI/CD pipelines or other deployment automation solutions
Experience with advanced monitoring solutions such as metrics platforms, logging, distributed tracing
Experience developing in Python
Experience with Terraform

Responsibilities

Engineer and maintain highly available, secure, and cost-effective container orchestration platforms such as Kubernetes and ECS
Engineer Continuous Integration & Continuous Delivery (CI/CD) solutions that simplify and improve software deployments to enable high velocity for our Product Engineering partners
Develop robust monitoring and observability services and patterns that continually improve the team’s ability to identify, respond to, and recover from complex failures
Collaborate with Technology Engineering, Development, and Product Management to help develop, scale, and improve production systems and services
Partner with service teams to provide appropriate documentation, cross-training, architecture planning, capacity management, and recommendations for future state
Engineer technical solutions to prevent or reduce the frequency of failures
Write clean, correct code and test plans, and identify code quality improvements when reviewing code