Staff Site Reliability Engineer (Mobile) at PayPal

San Jose, California, United States

Full Time

Start Date

Immediate

Expiry Date

14 Jan, 26

Salary

0.0

Posted On

16 Oct, 25

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, Mobile Systems, iOS, Android, Datadog, SLIs, SLOs, Incident Response, Automation, Python, Go, Swift, Kotlin, CI/CD, Bazel, Gradle

Industry

Software Development

Description

Acts as a project or system leader, coordinating the activities of other engineers on the project or within the system. Determines the technical tasks that other engineers will follow. Actions result in the achievement of customer, operational, program or service objectives. Proactively improves existing structures & processes. Exercises judgement in reconciling diverse and competing priorities (time, quality, complexity risk) to identify optimal solutions to problems. Notices patterns and condenses repetition into densely meaningful generalized solutions. Collaborates with management to set/improve standards for engineering rigor.

Standards & Governance

Define mobile-specific SLIs and SLOs (e.g., crash-free sessions, ANRs, app startup time, network success rates, battery/memory usage). Establish best practices for observability, alerting, and incident response in Datadog. Drive consistency in how mobile reliability is measured and tracked across teams.

Automated regression detection and performance benchmarking. Ensure tooling aligns with existing systems (Harness for CI/CD, Gradle/Bazel for builds).

Provide technical leadership in complex incidents, ensuring issues are addressed efficiently and lessons are institutionalized. Establish long-term roadmaps for mobile reliability and influence company-wide reliability goals.

Minimum of 8 years of relevant work experience and a Bachelor's degree or equivalent experience.

Act as the primary liaison with backend/web SRE leadership to ensure seamless incident response and shared visibility into cross-system issues. Partner with Release Engineering, QA, and Product to ensure new features meet operational readiness requirements. Influence architecture and design decisions to prioritize mobile reliability at the planning stage.

Required

8+ years of experience in software engineering, SRE, or mobile systems roles. Strong understanding of iOS and/or Android performance and reliability challenges. Hands-on experience with Datadog (or equivalent observability platforms) for monitoring, alerting, and dashboards. Proven ability to define and implement SLIs/SLOs across complex, distributed systems. Experience leading on-call rotations, incident response, and postmortems. Demonstrated experience building automation and internal tools for reliability. Strong programming skills in Python, Go, or similar, with working knowledge of Swift/Kotlin for client instrumentation. Exceptional ability to influence and partner across engineering, product, and SRE orgs. Track record of mentoring engineers and leading distributed teams.

Experience with CI/CD for mobile (Harness, Fastlane, Jenkins). Familiarity with Bazel and Gradle build systems. Prior experience introducing cultural changes (e.g., adopting on-call or reliability practices) in a development org. Strong knowledge of backend service reliability concepts, to bridge between client and server.

Mobile reliability metrics are well-defined, tracked, and consistently met. Mobile development teams have adopted on-call and alerting practices, with clear operational ownership. Automation and tools are in place to ensure release health, regression detection, and efficient incident response. Cross-team incidents between mobile and backend SRE are handled smoothly, with no ownership gaps. The mobile user experience measurably improves through fewer crashes, faster startup times, and improved stability. The Mobile SRE team is seen as a trusted partner to both backend SRE and mobile engineering.

Responsibilities

The Staff Site Reliability Engineer coordinates engineering activities and improves existing processes to achieve operational objectives. They provide technical leadership during incidents and establish long-term roadmaps for mobile reliability.