Senior Site Reliability Engineer at EPAM Systems Inc
Remote (work from home), Cauca, Colombia
Full Time


Start Date

Immediate

Expiry Date

03 Aug, 25

Salary

200.0

Posted On

03 May, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Working Experience, Operational Support, Application Monitoring, English (B2), Communication Skills, AWS

Industry

Information Technology/IT

Description

We are seeking a highly skilled and passionate Senior Site Reliability Engineer to join our team and support a cutting-edge supply-chain data analytics platform. In this role, you will collaborate directly with the client’s engineering teams to ensure the reliability, scalability, and efficiency of infrastructure and operations while driving innovation and automation.
This is an exciting opportunity to work with a forward-thinking startup that values innovation, efficiency, and collaboration. If you’re passionate about reliability engineering and thrive in a dynamic environment, we’d love to hear from you!
We accept CVs in English only.

REQUIREMENTS

  • 3+ years of experience in operational DevOps/SRE roles
  • Proven expertise in cloud and application monitoring, specifically with Datadog
  • Working experience with AWS and Terraform
  • Proficiency in Python scripting for task automation and tool development
  • Demonstrated background in working alongside software development teams, learning applications, and providing operational support
  • Ability to work independently and tackle ambiguous, open-ended problems with confidence
  • Strong communication skills with the ability to proactively collaborate and articulate ideas
  • Experience in client-facing positions
  • Desire to work in a transparent, fast-moving startup environment
  • English level – B2 or higher, both spoken and written

RESPONSIBILITIES

  • Improve monitoring systems, with a strong focus on cloud and application monitoring using Datadog
  • Manage and optimize infrastructure provisioning and automation through AWS and Terraform
  • Develop and maintain Python scripts to automate tasks and build tools that enhance operational efficiency
  • Collaborate closely with software development teams to understand applications, advise on operational improvements, and contribute to engineering processes
  • Research and implement best practices for reliability and scalability across infrastructure systems
  • Troubleshoot and resolve critical application and system issues to minimize downtime
  • Ensure smooth functioning of cloud-based infrastructures while optimizing for performance, security, and cost efficiency
  • Drive process improvements through automation and infrastructure-as-code principles
  • Work independently on open-ended, ambiguous tasks and deliver actionable outcomes
  • Interface with clients and stakeholders to understand requirements, deliver solutions, and foster strong relationships
  • Contribute to a dynamic, fast-paced startup environment with flexibility and a focus on team goals