Senior Site Reliability Engineer (SRE) at Salla

Makkah Al Mukarramah, Makkah Region, Saudi Arabia

Full Time

Start Date

Immediate

Expiry Date

04 Apr, 26

Salary

0.0

Posted On

04 Jan, 26

Experience

5 years or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Kubernetes, Service Mesh Technologies, Cloud Platforms, Linux, Networking, Distributed Systems, Load Balancers, Terraform, Observability Tools, Scripting, Programming, CI/CD, GitOps, Debugging, Incident Response, Performance Analysis

Industry

Information Technology & Services

Description

As a Senior SRE at Salla, you will lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems. You will also participate in the on-call rotation as part of our commitment to platform reliability.

Reliability & Incident Management
Lead high-severity incident response and drive post-incident reviews.
Troubleshoot complex issues across applications, infrastructure, and networks.
Improve MTTR through better monitoring, alerts, and diagnostic tooling.
Participate in the on-call rotation supporting production systems.

Performance & Scalability
Identify and resolve performance bottlenecks and scaling challenges.
Conduct load testing and capacity planning for high-traffic scenarios.

Infrastructure & Operations
Enhance cloud-native infrastructure, deployment processes, and automation.
Improve resilience, fault-tolerance, and recovery mechanisms across systems.

Observability
Build and refine dashboards, alerts, metrics, logs, and traces.
Define SLIs/SLOs and improve visibility into system behavior.

Tooling & Automation
Develop tools that reduce operational toil and increase reliability.
Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows.

Collaboration
Work closely with engineering teams to ensure services are robust and production-ready.
Mentor engineers on reliability, debugging, and operational best practices.

Required Skills
Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS/GCP/Azure).
Deep understanding of Linux, networking, distributed systems, and load balancers.
Hands-on with Terraform or similar IaC tools.
Experience with Prometheus, Grafana, Loki, Mimir, Elastic, or similar observability tools.
Proficiency in scripting/programming (Bash, Python, Go).
Experience with CI/CD and GitOps.
Strong debugging, incident response, and performance analysis skills.

Bonus Skills
Background in large-scale, high-traffic systems.
Experience with fault-tolerant design, DR, and HA patterns.
Familiarity with SLOs, SLIs, and error budgets.

Location Preference
Candidates located within GMT 0 to +6 time zones are preferred to align with team collaboration and on-call coverage.

Responsibilities

As a Senior SRE, you will lead reliability initiatives and handle complex incidents while improving platform performance. You will also guide engineering teams toward building resilient systems and participate in the on-call rotation.