Senior Site Reliability Engineer at Microsoft

Redmond, Washington, United States

Full Time

Start Date

Immediate

Expiry Date

03 Mar, 26

Salary

0.0

Posted On

03 Dec, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, Software Engineering, Network Engineering, Systems Administration, Cloud Services, Service Reliability, Observability, Logic Apps, Jupyter Notebooks, Root Cause Analysis, Problem-Solving, Communication Skills, Customer Support, Automation, Data Analysis, Incident Management

Industry

Software Development

Description

Collaborating closely with engineering teams to build and enhance tooling and automation that speeds up resolution of issues impacting SLOs and, where possible, averts incidents altogether.
Collaborating with customers to understand their pain points around supportability and SLO attainment, and formulating sustainable strategies for addressing recurring issues.
Communicating on a deeply technical level and serving as the single point of contact for enterprise customers, handling service escalations and driving issues to resolution.
Designing and implementing changes to service telemetry for the automation to consume when the required telemetry is not already available.
Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
Analyzing data and providing operational insights into customer experience to design and product teams, so that features are designed with supportability in mind.
Embodying our culture and values.
6+ years of technical experience in software engineering, network engineering, or systems administration;
OR a Bachelor's Degree in Computer Science, Information Technology, or a related field AND 3+ years of technical experience in software engineering, network engineering, or systems administration;
OR a Master's Degree in Computer Science, Information Technology, or a related field AND 2+ years of technical experience in software engineering, network engineering, or systems administration.
4+ years of experience running large-scale cloud services.
These requirements include, but are not limited to, the following specialized security screenings:
2+ years of operational experience in improving service reliability, availability, and performance.
Understanding of observability and MELT (metrics, events, logs, traces) implementation patterns for large-scale services.
Experience with Logic Apps and authoring Jupyter Notebooks.
Experience in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems.
A systematic problem-solving approach, coupled with effective communication skills and a sense of curiosity.
Ability to deal with the ambiguity associated with working in a fast-paced environment.
Influencing the product architecture and roadmap so that customer-experienced supportability is always a key consideration as the product evolves.

Responsibilities

Collaborate with engineering teams to enhance tooling and automation solutions for issue resolution and incident prevention. Interface with enterprise customers to handle service escalations and improve customer experience through proactive alerting and operational insights.