Principal Network Architect at Microsoft

Redmond, Washington, United States

Full Time

Start Date

Immediate

Expiry Date

20 Feb, 26

Salary

0.0

Posted On

22 Nov, 25

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Network Architecture, Congestion Control, Telemetry, Quality Of Service, Performance Management, Automation, Traffic Generation, Switch Architecture, Optics, AI Workloads, Python, Go, Ansible, Distributed Training, Buffer Management, Routing

Industry

Software Development

Description

Own end-to-end network architecture for AI training/inference clusters: topology, routing, transport, congestion control, QoS, telemetry, reliability, and failure domains.
Lead and grow a high-performing team (~10 engineers) across architecture, performance, and validation; set goals, mentor, and drive execution.
Define scale-out/scale-up designs (e.g., leaf-spine, dragonfly/dragonfly+, Clos/fat-tree, 2D/3D torus variants) and network services for job schedulers and accelerator runtimes.
Drive congestion-control strategy (ECN/PFC, DCQCN, HPCC, TIMELY, HULL, adaptive load balancing such as CONGA/HULA) and transport tuning (RDMA/RoCEv2, QUIC/TCP variants) for tail-latency and throughput SLAs.
Hands-on analysis of switch/NIC behavior using counters, traces, and telemetry (PFC/ECN stats, INT (in-band network telemetry), gNMI/gNOI, sFlow/NetFlow, eBPF); create reproducible performance tests.
Evaluate and influence silicon and optics (ASIC feature roadmaps, queueing/scheduling, packet recirculation, shared buffer, VOQs, cut-through vs. store-and-forward, 400/800G, linear vs. retimed optics).
Prototype and validate in lab and pre-production: build testbeds, craft microbenchmarks and realistic AI workloads; automate with Python/Go/Ansible; codify SLOs and pass/fail gates.
Partner across teams (accelerator/HBM, storage, orchestration, reliability) to co-design network-aware collective operations (all-reduce/all-to-all/mixture-of-experts) and placement policies.
Influence standards and industry direction via active participation in IEEE 802.3/802.1, IETF, OCP, OIF, Ethernet Alliance, and vendor ecosystems; drive Microsoft requirements into roadmaps.
Operational excellence: define observability, fault isolation, failure testing (Jepsen-style chaos, link flap/black-hole, incast), capacity planning, and upgrade/rollout strategies.
Documentation and reviews: author design docs, RFCs, and executive briefs; run design and readiness reviews.

Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 9+ years technical engineering experience OR Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 11+ years technical engineering experience OR equivalent experience.
10+ years designing and operating large-scale L2/L3 Ethernet fabrics for HPC/AI or hyperscale services.
5+ years of experience with Ethernet, RDMA/RoCEv2, congestion control (ECN/PFC, DCQCN, HPCC, TIMELY), routing (BGP/ECMP, IS-IS/OSPF), and load balancing (CONGA/HULA/PLB).
5+ years of experience with switch/NIC architecture (ASIC pipelines, queueing/scheduling, buffers, telemetry, hash/ECMP behaviors) and optics (DR/FR/LR, PAM-4, FEC).
5+ years of experience with traffic generation and analysis (Ixia/Keysight, TRex, pktgen, iperf, Perfetto), switch/NIC telemetry, and packet capture (INT, ERSPAN, SPAN, pcaps).
3+ years of experience managing engineers (hiring, mentoring, performance management, org health).
Experience optimizing networks for AI collectives (all-reduce, all-gather, expert routing) and distributed training systems.
Familiarity with programmable data planes (P4, eBPF/XDP), in-network telemetry/compute, and NIC offloads (GRO/TSO/LRO, DPDK).
Depth in buffer management and queue disciplines (DWRR, WFQ, Deficit Round Robin, QCN, VOQ) and QoS for multi-tenant clusters.
Experience with optic/PHY roadmaps (800G/1.6T, linear pluggables, CPO/LPO, FEC trade-offs) and data center power/cooling constraints affecting network design.
Contributions to standards bodies/consortia (drafts, presentations) and vendor co-development.
Proven track record shipping production network designs with measurable latency/throughput improvements and reliability gains.
Proficiency in Python/Go and automation frameworks (Ansible/Terraform) for test, measurement, and CI.

Responsibilities

Own end-to-end network architecture for AI training/inference clusters and lead a high-performing team of engineers. Drive congestion-control strategy and prototype network designs while partnering across teams for network-aware operations.