Site Reliability Engineer III at JPMC Candidate Experience page

Hyderabad, Telangana, India

Full Time

Start Date

Immediate

Expiry Date

11 Mar, 26

Salary

0.0

Posted On

11 Dec, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, DevOps, Application Support, Incident Management, Automation, Monitoring, CI/CD Pipelines, Infrastructure as Code, Version Control, Containerization, Cloud Platforms, Troubleshooting, Collaboration, Continuous Improvement, Customer Support, Documentation

Industry

Financial Services

Description

There’s nothing more exciting than being at the center of a rapidly growing field in technology and applying your skills to drive innovation and modernize the world's most complex and mission-critical systems.

As a Site Reliability Engineer III at JPMorgan Chase within the Chief Technology Office, you will solve complex and broad business problems with simple and straightforward solutions. We are seeking a Site Reliability Engineer (SRE) to help drive reliable, scalable, and intelligent platform operations in a global financial environment. This role combines technical support, DevOps practices, and SRE principles, including on-call incident response, automation, and a customer-first mindset. You will work with modern tools to ensure our applications and services remain robust and available.

Job Responsibilities

Collaborate with engineering, support, and operations teams to maintain and improve the reliability of mission-critical applications.
Participate in incident management, troubleshooting, and continuous improvement initiatives.
Implement automation and monitoring solutions to enhance system reliability.
Join an on-call rotation and respond effectively to production incidents.
Share knowledge and follow best practices to foster a culture of learning and innovation.
Communicate clearly with stakeholders and proactively solve problems.
Focus on customer needs and deliver high-quality support.
Document solutions and incident responses for future reference.
Analyze system performance and recommend improvements.
Contribute to post-incident reviews and drive process enhancements.
Support the integration of new tools and technologies to improve operational efficiency.

Required Qualifications, Capabilities, and Skills

Formal training or certification on SRE and Application Support concepts and 3+ years applied experience.
Demonstrate experience in SRE, DevOps, or application support roles, including knowledge of SLIs, SLOs, incident response, and troubleshooting.
Utilize monitoring and observability tools such as Grafana, Prometheus, Splunk, and OpenTelemetry.
Apply hands-on experience with CI/CD pipelines (Jenkins, including global libraries), infrastructure as code (Terraform), version control (Git), containerization (Docker), and orchestration (Kubernetes).
Work with cloud platforms such as AWS, GCP, or Azure, and automate infrastructure and deployments.
Participate in on-call rotation and respond to production incidents.
Break down complex issues, document solutions, and communicate effectively with team members and customers.
Implement automation and monitoring solutions to support operational goals.
Collaborate with cross-functional teams to resolve incidents and improve reliability.
Contribute to continuous improvement of support processes and system performance.

Preferred Qualifications, Capabilities, and Skills

Demonstrate experience in banking, fintech, or regulated environments.
Participate in resilience engineering activities such as game days or chaos engineering.
Mentor peers by sharing knowledge and best practices.
Contribute to the adoption of innovative tools and approaches in support operations.

Responsibilities

The Site Reliability Engineer III will collaborate with various teams to maintain and improve the reliability of mission-critical applications while participating in incident management and continuous improvement initiatives. The role also involves implementing automation and monitoring solutions to enhance system reliability and responding to production incidents.