Member of Technical Staff, Hardware Health at Microsoft

Redmond, Washington, United States -

Full Time

Start Date

Immediate

Expiry Date

24 Feb, 26

Salary

0.0

Posted On

26 Nov, 25

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Hardware Health Monitoring, Predictive Analytics, Telemetry, Power Data, Thermal Data, Root Cause Analysis, System Health KPIs, Incident Triage, Automation, Reliability, Thermal Efficiency, Serviceability, GPU Architecture, High-Speed Interconnects, Failure Analysis, Machine Learning

Industry

Software Development

Description

Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale). Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues. Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies. Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms. Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters. Drive automation in health management to reduce manual intervention to the top 5% of anomalies. Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability. Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience. Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent). Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies. Proficiency in hardware telemetry, diagnostics, or failure analysis tools. Experience with exascale-class systems or cloud-scale AI clusters. Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance. Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.

Responsibilities

Design and develop hardware health monitoring frameworks for large GPU clusters. Collaborate with engineers to identify and remediate hardware anomalies.