Senior Site Reliability Engineer at EPAM Systems Inc
Remote, British Columbia, Canada
Full Time


Start Date

Immediate

Expiry Date

09 Dec, 25

Salary

135,000

Posted On

10 Sep, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Server Administration, Docker, PostgreSQL, AWS, Code, Security, Kubernetes, Production Systems

Industry

Information Technology/IT

Description

EPAM is uniquely positioned to become this client’s key technology partner, with the opportunity to scale the account to over 100 team members within just a few months. Our core team has already established strong relationships with key executives and product owners, paving the way for EPAM to contribute meaningfully to the client’s objectives.
This is a fast-paced and dynamic role, offering exciting opportunities for professional growth and learning due to the breadth of technologies involved and the importance of the client. The initial scope of work is expected to span approximately six months, with significant potential for long-term engagement and meaningful contributions from the right candidates.
Req. #875770562

REQUIREMENTS

  • Experience: 4–5 years in Site Reliability Engineering or Systems Engineering roles, with 2+ years supporting critical enterprise production systems
  • Intermediate proficiency in SQL/RDBMS tools (e.g., MySQL, PostgreSQL, or similar)
  • Comfortable with Linux/Bash and Windows Server administration
  • Strong Python development skills (mandatory)
  • Familiarity with configuration management tools like Chef or similar
  • Experience with public cloud platforms (AWS required, OCI experience is advantageous)
  • Proficiency with Terraform or similar Infrastructure-as-Code tools
  • Experience with containerization technologies such as Docker or Kubernetes
  • Hands-on experience with monitoring and logging tools like Prometheus, Grafana, and ELK Stack
  • Knowledge of security best practices for cloud and hybrid infrastructure environments

RESPONSIBILITIES

  • Ensure systems and services function reliably with minimal downtime
  • Monitor uptime, latency, and overall health of production systems
  • Implement and enforce Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for applications and systems
  • Set up monitoring and alerting tools to track system performance and detect anomalies
  • Respond to incidents, troubleshoot production issues, and restore services as quickly as possible
  • Automate repetitive tasks, including deployment, scaling, and monitoring
  • Build and maintain infrastructure-as-code (IaC) using tools like Terraform, Ansible, or CloudFormation