Site Reliability Engineer at NTT DATA
Noida, Uttar Pradesh, India
Full Time


Start Date

Immediate

Expiry Date

14 Mar, 26

Salary

0.0

Posted On

14 Dec, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Site Reliability Engineering, DevOps, Observability, Splunk, Monitoring, Incident Management, Root Cause Analysis, SLIs, SLOs, Cloud Platforms, Container Technologies, Scripting, OpenTelemetry, Infrastructure as Code, Microservices Architecture, CI/CD Pipelines

Industry

IT Services and IT Consulting

Description
Implement and maintain observability across metrics, logs, traces, and events.
Build and optimize monitoring dashboards and service health indicators using Splunk or similar tools.
Configure, fine-tune, and maintain proactive alerts with a high signal-to-noise ratio.
Lead incident response, conduct root cause analysis (RCA), and drive long-term corrective measures.
Define, measure, and enhance SLIs, SLOs, reliability KPIs, and error budgets.
Improve system performance, scalability, and availability across environments.
Automate monitoring, alerting, and operational workflows to reduce manual toil.
Standardize and maintain telemetry instrumentation across services.
Own and optimize logging pipelines, ingestion, parsing, indexing, and retention.
Collaborate with engineering teams to integrate reliability best practices into application development.
Participate in on-call rotations and ensure timely incident resolution.
Partner with cloud/platform teams to enhance deployment readiness and operational stability.

5-8 years of experience in SRE, DevOps, or system reliability roles.
Strong hands-on experience with Splunk (queries, dashboards, alerts, ingestion).
Solid understanding of observability tools (Splunk, Prometheus, Grafana, Datadog, OpenTelemetry, etc.).
Strong knowledge of Linux, networking fundamentals, and distributed systems.
Experience with cloud platforms (AWS / Azure / GCP) and container technologies (Docker, Kubernetes).
Proficiency in scripting (Python, Shell, or similar).
Experience with production on-call environments and incident management.
Familiarity with SLIs/SLOs, capacity planning, and reliability engineering concepts.

Experience with OpenTelemetry-based instrumentation.
Exposure to APM tools (Dynatrace, AppDynamics, New Relic).
Knowledge of IaC tools like Terraform or Ansible.
Understanding of microservices architecture and CI/CD pipelines.
Responsibilities
Implement and maintain observability across metrics, logs, traces, and events. Lead incident response, conduct root cause analysis, and drive long-term corrective measures.