Senior Site Reliability Engineer at PayPal

Chicago, Illinois, United States

Full Time

Start Date

Immediate

Expiry Date

30 Jan, 26

Salary

0.0

Posted On

01 Nov, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, Cloud Infrastructure, DevOps Engineering, Automation Frameworks, Monitoring Infrastructure, Capacity Planning, Disaster Recovery, Containerized Applications, Infrastructure-as-Code, CI/CD Pipelines, Google Cloud Platform, Performance Optimization, Leadership, Interpersonal Skills, Communication Skills, Problem Solving

Industry

Software Development

Description

Take ownership of system performance monitoring, identify inefficiencies, and lead initiatives to improve the overall availability and reliability of digital platforms and applications.
Lead and manage the response to complex, high-priority incidents, ensuring prompt resolution and a thorough root cause analysis to prevent future occurrences.
Design and implement advanced automation frameworks to improve operational efficiency, streamline processes, and reduce human error.
Lead reliability-focused initiatives, ensuring systems are highly available, resilient, and scalable, and promote best practices across engineering teams.
Enhance the monitoring infrastructure by identifying key metrics, optimizing alerting, and improving system observability to ensure the reliability of large-scale systems.
Forecast resource requirements and lead capacity planning activities to ensure systems can scale effectively to meet growing user demand.
Ensure robust disaster recovery strategies are in place and conduct regular testing to verify that systems can recover quickly from failures.
Partner with engineering and product teams to identify opportunities for improving system architecture, focusing on scalability, reliability, and fault tolerance.
Provide mentorship and technical guidance to junior site reliability engineers, fostering skill development and knowledge sharing.
Drive continuous improvement across operational workflows, identifying areas for optimization, cost reduction, and performance enhancement.

3+ years of relevant experience and a Bachelor's degree, or any equivalent combination of education and experience.
3+ years in Cloud Infrastructure, Site Reliability Engineering (SRE), DevOps Engineering, or related fields.
B.S. or M.S. degree in Computer Science, Engineering, or a related technical field; equivalent experience may be considered in lieu of a degree.
2+ years of hands-on experience deploying, managing, and optimizing containerized applications using GKE and Harness in both public and private cloud environments (AWS, GCP, Azure, etc.), preferably Google Cloud Platform (GCP).
2+ years of hands-on experience with Infrastructure-as-Code (Terraform, CloudFormation), CI/CD pipelines (CircleCI, Harness, Jenkins, ArgoCD), and experience in Node, Python, or Go.
Strong understanding of Google Cloud Logging, DataDog, or other monitoring and observability tools.
Ability to effectively diagnose and resolve performance bottlenecks within GCP at the infrastructure and application layers.
Strong leadership abilities, with customer focus and a commitment to quality.
Strong interpersonal skills and solid written and verbal communication skills.
Ability to remain composed and methodical and to think fast in a high-pressure environment.
Experience managing, collaborating with, and influencing global teams.
Must be organized, detail-oriented, and able to manage and appropriately prioritize multiple tasks simultaneously.

Responsibilities

Take ownership of system performance monitoring and lead initiatives to improve the overall availability and reliability of digital platforms. Manage the response to complex incidents and enhance the monitoring infrastructure to ensure system reliability.