Senior Site Reliability Engineer at EPAM Systems Inc

Remote (work from home), Cauca, Colombia

Full Time

Start Date

Immediate

Expiry Date

03 Aug, 25

Salary

200.0

Posted On

03 May, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Working Experience, Operational Support, Application Monitoring, B2, Communication Skills, AWS

Industry

Information Technology/IT

Description

We are seeking a highly skilled and passionate Senior Site Reliability Engineer to join our team and support a cutting-edge supply-chain data analytics platform. In this role, you will collaborate directly with the client’s engineering teams to ensure the reliability, scalability, and efficiency of infrastructure and operations while driving innovation and automation.
This is an exciting opportunity to work with a forward-thinking startup that values innovation, efficiency, and collaboration. If you’re passionate about reliability engineering and thrive in a dynamic environment, we’d love to hear from you!
We accept CVs in English only.

Requirements

3+ years of experience in operational DevOps/SRE roles
Proven expertise in cloud and application monitoring, specifically with Datadog
Working experience with AWS and Terraform
Proficiency in Python scripting for task automation and tool development
Demonstrated background in working alongside software development teams, learning applications, and providing operational support
Ability to work independently and tackle ambiguous, open-ended problems with confidence
Strong communication skills with the ability to proactively collaborate and articulate ideas
Experience in client-facing positions
Desire to work in a transparent, fast-moving startup environment
English level – B2 or higher, both spoken and written

Responsibilities

Improve monitoring systems, with a strong focus on cloud and application monitoring using Datadog
Manage and optimize infrastructure provisioning and automation through AWS and Terraform
Develop and maintain Python scripts to automate tasks and build tools that enhance operational efficiency
Collaborate closely with software development teams to understand applications, advise on operational improvements, and contribute to engineering processes
Research and implement best practices for reliability and scalability across infrastructure systems
Troubleshoot and resolve critical application and system issues to minimize downtime
Ensure smooth functioning of cloud-based infrastructures while optimizing for performance, security, and cost efficiency
Drive process improvements through automation and infrastructure-as-code principles
Work independently on open-ended, ambiguous tasks and deliver actionable outcomes
Interface with clients and stakeholders to understand requirements, deliver solutions, and foster strong relationships
Contribute to a dynamic, fast-paced startup environment with flexibility and focus on team goals