Senior Engineer - SRE at Weekday AI

Hyderabad, Telangana, India

Full Time

Start Date

Immediate

Expiry Date

13 Apr, 26

Salary

0.0

Posted On

13 Jan, 26

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, DevOps, Platform Engineering, Kubernetes, Terraform, Ansible, AWS, GCP, Azure, CI/CD, Python, Bash, Linux, Networking, Monitoring, Troubleshooting

Industry

Technology; Information and Internet

Description

This role is for one of Weekday's clients.

Min Experience: 5 years
Location: Hyderabad
Job Type: Full-time

We are looking for a highly skilled and motivated Senior Engineer – Site Reliability Engineering (SRE) to join our growing engineering team. In this role, you will be responsible for ensuring the reliability, scalability, performance, and availability of mission-critical systems across multi-cloud environments. You will work closely with platform, infrastructure, and application teams to build resilient systems using automation-first and cloud-native best practices.

This role is ideal for someone who is passionate about operational excellence, enjoys solving complex infrastructure challenges, and thrives in fast-paced, high-availability environments.

Key Responsibilities

Design, build, and operate highly available, scalable, and fault-tolerant systems using SRE principles and best practices
Manage and operate containerized workloads using Kubernetes, including cluster setup, upgrades, monitoring, and troubleshooting
Implement and maintain Infrastructure as Code (IaC) using Terraform and configuration management using Ansible
Support and optimize cloud infrastructure across AWS, GCP, and Azure, ensuring cost efficiency, security, and performance
Build, maintain, and enhance CI/CD pipelines to enable reliable and automated application deployments
Develop automation scripts and tools using Python and Bash to reduce manual operations and improve system reliability
Define and track SLIs, SLOs, and SLAs, and participate in error budget planning and incident response
Lead incident management, root cause analysis (RCA), and post-mortem reviews to drive continuous improvement
Implement monitoring, alerting, and observability solutions to proactively detect and resolve issues
Collaborate with development teams to improve system design, deployment processes, and operational readiness
Mentor junior engineers and contribute to SRE standards, documentation, and best practices

Required Skills & Qualifications

5–10 years of hands-on experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
Strong expertise in Kubernetes and container orchestration in production environments
Proven experience with Terraform and Ansible for infrastructure provisioning and configuration management
Extensive experience working with at least one major cloud provider (AWS, GCP, or Azure); multi-cloud experience is a strong plus
Deep understanding of CI/CD systems, deployment strategies, and release automation
Strong scripting and automation skills using Python and Bash
Solid understanding of Linux systems, networking, and distributed systems concepts
Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, ELK, or similar)
Strong troubleshooting skills and experience handling production incidents

Nice to Have

Experience with security, compliance, and cloud cost optimization
Knowledge of service meshes, load balancing, and auto-scaling strategies
Prior experience in high-scale or high-availability production systems

Responsibilities

The Senior Engineer - SRE will design, build, and operate highly available systems while managing containerized workloads and optimizing cloud infrastructure. They will also lead incident management and collaborate with development teams to enhance operational readiness.