Senior Site Reliability Engineer at EPAM Systems Inc

Remote, British Columbia, Canada

Full Time

Start Date

Immediate

Expiry Date

09 Dec, 25

Salary

135000.0

Posted On

10 Sep, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Server Administration, Docker, PostgreSQL, AWS, Code, Security, Kubernetes, Production Systems

Industry

Information Technology/IT

Description

EPAM is uniquely positioned to become this client’s key technology partner, with the opportunity to scale the account to over 100 team members within just a few months. Our core team has already established strong relationships with key executives and product owners, paving the way for EPAM to contribute meaningfully to the client’s objectives.
This is a fast-paced and dynamic role, offering exciting opportunities for professional growth and learning due to the breadth of technologies involved and the importance of the client. The initial scope of work is expected to span approximately six months, with significant potential for long-term engagement and meaningful contributions from the right candidates.
Req.#875770562

Requirements

Experience: 4–5 years in Site Reliability Engineering or Systems Engineering roles, with 2+ years supporting critical enterprise production systems
Intermediate proficiency in SQL/RDBMS tools (e.g., MySQL, PostgreSQL, or similar)
Comfortable with Linux/Bash and Windows Server administration
Strong Python development skills (mandatory)
Familiarity with configuration management tools like Chef or similar
Experience with public cloud platforms (AWS required; OCI experience is advantageous)
Proficiency with Terraform or similar Infrastructure-as-Code tools
Experience with containerization and orchestration technologies such as Docker or Kubernetes
Hands-on experience with monitoring and logging tools like Prometheus, Grafana, and ELK Stack
Knowledge of security best practices for cloud and hybrid infrastructure environments

Responsibilities

Ensure systems and services function reliably with minimal downtime
Monitor uptime, latency, and overall health of production systems
Implement and enforce Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for applications and systems
Set up monitoring and alerting tools to track system performance and detect anomalies
Respond to incidents, troubleshoot production issues, and restore services as quickly as possible
Automate repetitive tasks, including deployment, scaling, and monitoring
Build and maintain infrastructure-as-code (IaC) using tools like Terraform, Ansible, or CloudFormation