Senior ML Software Engineer - Quantization & Numerics at Microsoft
Redmond, Washington, United States
Full Time


Start Date

Immediate

Expiry Date

23 Feb, 26

Salary

0.0

Posted On

25 Nov, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Model Optimization, Quantization, Performance Optimization, Deep Learning Frameworks, GPU Kernel Development, Low-Precision Numerics, Transformer Architectures, Large-Scale Model Evaluation, Python, C, C++, CUDA, Triton, ROCm, Model Compression, BLAS Kernels

Industry

Software Development

Description
Drive software development and model optimization tooling proof-of-concept efforts to streamline deployment of quantized models. Analyze performance bottlenecks in quantized state-of-the-art LLM architectures and drive performance improvements. Prototype and evaluate emerging low-precision data formats through proof-of-concept implementations on a novel hardware accelerator SDK. Co-design model architectures optimized for low-precision deployment in close collaboration with company-wide AI/ML teams. Work cross-functionally with data scientists and ML researchers/engineers across organizations to align on model accuracy and performance goals. Partner with hardware architecture and AI software framework teams to ensure end-to-end system efficiency.

Bachelor's Degree in Computer Science, Electrical or Computer Engineering, or a related field AND 4+ years of industry experience in high-performance ML systems, GPU kernel development, or ML runtime/infrastructure development; OR Master's Degree in Computer Science, Electrical or Computer Engineering, or a related field AND 3+ years of industry experience in the same areas; OR Doctorate in Computer Science, Electrical or Computer Engineering, or a related field AND 1+ year(s) of industry experience in the same areas.

Demonstrated experience delivering production-grade software in areas such as model compression, low-precision numerics (FP8, INT8/4, NVFP4, MX formats, etc.), low-level kernel development, and performance optimization. Proficiency with modern deep learning frameworks, including PyTorch, TensorFlow, TensorRT, and ONNX Runtime. Expertise in GPU/NPU kernel development using CUDA, Triton, ROCm, or comparable frameworks, and with fast model bring-up on a new stack. Strong understanding of Transformer and LLM architectures, with hands-on experience in optimization techniques such as quantization, pruning, tensor/parameter sharding, model parallelism, KV-cache optimization, and Flash Attention. Practical experience with large-scale model evaluation, including benchmarking state-of-the-art LLMs and fine-tuning (SFT or RL) large models. Solid programming skills in Python, C, and C++. Excellent communication abilities and a proven capacity to collaborate effectively in hybrid, team-oriented environments.

Hands-on experience implementing and optimizing low-level linear algebra routines, including custom BLAS kernels, would be a plus. Deep knowledge of mixed-precision arithmetic units, including numerical formats and microarchitecture, is highly desirable.
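To give a concrete flavor of the low-precision work described above, here is a minimal sketch of symmetric per-tensor INT8 quantization and dequantization in PyTorch. It is illustrative only; the function names, shapes, and the error check are assumptions for demonstration and are not taken from the posting or from any Microsoft tooling.

# Minimal sketch: symmetric per-tensor INT8 quantization in PyTorch.
# Illustrative assumptions only; not part of the posting or any internal tooling.
import torch

def quantize_int8(x: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Quantize an FP32 tensor to INT8 with a single symmetric scale."""
    scale = x.abs().max().item() / 127.0  # map the largest magnitude to 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Map INT8 values back to FP32 using the stored scale."""
    return q.to(torch.float32) * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)           # stand-in for a Transformer weight matrix
    qw, s = quantize_int8(w)
    w_hat = dequantize_int8(qw, s)
    # Reconstruction error is one proxy for the accuracy impact the role evaluates.
    print("max abs error:", (w - w_hat).abs().max().item())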
Responsibilities
Drive software development and model optimization tooling proof-of-concept efforts to streamline deployment of quantized models. Analyze performance bottlenecks in quantized state-of-the-art LLM architectures and drive performance improvements.
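As a rough illustration of the bottleneck-analysis responsibility, the sketch below profiles a single forward pass with torch.profiler to surface operator-level hotspots. The model, input shape, and sort key are placeholder assumptions chosen for demonstration.

# Illustrative sketch: locating hotspots in a forward pass with torch.profiler.
# Model and shapes are placeholders, not taken from the posting.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).eval()
x = torch.randn(8, 4096)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model = model.cuda()
    x = x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
    model(x)

# The operator-level table is a starting point for deciding where quantization
# or kernel-level optimization would pay off.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))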