Senior Researcher - LLM Systems at Microsoft
Bengaluru, Karnataka, India
Full Time


Start Date

Immediate

Expiry Date

17 Feb, 26

Salary

0.0

Posted On

19 Nov, 25

Experience

5 years or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Algorithm Development, Dynamic Batching, Routing, Scheduling, Caching Layers, GPU Utilization, Endpoint Configuration, Performance Optimization, Collaboration, Research Publication, Mentoring, C++, Python, Transformer Inference, Memory Management, Inference Serving Frameworks

Industry

Software Development

Description
- Invent and evaluate algorithms for dynamic batching, routing, and scheduling for transformer inference under multi-tenant SLOs and variable sequence lengths (an illustrative scheduling sketch follows after this list).
- Design and implement caching layers (e.g., KV cache paging/offload, prompt/result caching) and memory-pressure controls to maximize GPU/accelerator utilization.
- Develop endpoint configuration policies (e.g., tensor/pipeline parallelism, quantization/precision profiles, speculative decoding, chunked/streaming generation) and safe rollout mechanisms.
- Profile and optimize end-to-end serving pipelines: token-level latency, E2E p95/p99, throughput-per-$, cold-start behavior, warm-pool strategy, and capacity planning.
- Collaborate with model, kernel, and hardware teams to align serving algorithms with attention/KV innovations and accelerator features.
- Publish research, file patents, and, where appropriate, contribute to open-source serving frameworks.
- Document designs, benchmarks, and operational playbooks; mentor junior researchers/engineers.

Qualifications

- Doctorate in a relevant field, or equivalent experience.
- Demonstrated expertise in queuing/scheduling theory and practical request orchestration under SLO constraints.
- Proficiency in C++ and Python for high-performance systems; strong code quality and profiling/debugging skills.
- Proven record of research impact (publications and/or patents) and of shipping systems that run at scale.
- Deep understanding of transformer inference efficiency techniques (attention, paged KV cache, speculative decoding, LoRA, sequence packing/continuous batching, quantization).
- Background in cost/performance modeling, autoscaling, and multi-region DR.
- Hands-on experience with inference serving frameworks (e.g., vLLM, Triton Inference Server, TensorRT-LLM, ONNX Runtime/ORT, Ray Serve, DeepSpeed-MII).
- Familiarity with GPU/accelerator memory management concepts to co-design cache/throughput policies.
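To give a flavor of the scheduling work described above, the sketch below shows one way a continuous-batching scheduler could admit requests earliest-deadline-first under a shared token budget. It is a minimal illustration only: Request, ContinuousBatcher, max_batch_tokens, and schedule_step are hypothetical names and do not reflect any particular serving framework (vLLM, Triton, etc.) or Microsoft-internal system.

```python
import heapq
import itertools
from dataclasses import dataclass, field


# Hypothetical sketch: names here are illustrative, not any framework's API.
@dataclass(order=True)
class Request:
    deadline: float                                   # absolute SLO deadline; heap orders by this (EDF)
    arrival: float = field(compare=False)             # submission time, kept for bookkeeping
    prompt_tokens: int = field(compare=False)         # prefill length
    max_new_tokens: int = field(compare=False)        # generation budget
    generated: int = field(default=0, compare=False)  # tokens produced so far
    rid: int = field(default_factory=itertools.count().__next__, compare=False)


class ContinuousBatcher:
    """Admit waiting requests into the running batch whenever token budget
    frees up, rather than waiting for the whole batch to drain."""

    def __init__(self, max_batch_tokens: int = 8192):
        self.max_batch_tokens = max_batch_tokens
        self.waiting: list[Request] = []  # min-heap keyed on deadline
        self.running: list[Request] = []

    def submit(self, req: Request) -> None:
        heapq.heappush(self.waiting, req)

    def _tokens_in_flight(self) -> int:
        # Each running request pins its prompt plus everything generated so far
        # (a stand-in for KV-cache occupancy).
        return sum(r.prompt_tokens + r.generated for r in self.running)

    def schedule_step(self) -> list[Request]:
        """Admit as many waiting requests as fit under the token budget,
        earliest-deadline-first, then return the batch for one decode step."""
        budget = self.max_batch_tokens - self._tokens_in_flight()
        while self.waiting and self.waiting[0].prompt_tokens <= budget:
            req = heapq.heappop(self.waiting)
            budget -= req.prompt_tokens
            self.running.append(req)
        return self.running

    def step_done(self) -> None:
        """Account for one decoded token per request and retire finished ones."""
        for r in self.running:
            r.generated += 1
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]


if __name__ == "__main__":
    b = ContinuousBatcher(max_batch_tokens=64)
    b.submit(Request(deadline=1.0, arrival=0.0, prompt_tokens=16, max_new_tokens=4))
    b.submit(Request(deadline=0.5, arrival=0.1, prompt_tokens=32, max_new_tokens=2))
    print([r.rid for r in b.schedule_step()])  # both admitted: 16 + 32 <= 64
    b.step_done()
```

Ordering the wait queue by SLO deadline rather than arrival time is one simple way to reflect the multi-tenant SLO constraint mentioned above; a production scheduler would also account for KV-cache paging, preemption, and variable sequence lengths.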
Responsibilities
The Senior Researcher will invent and evaluate batching, routing, and scheduling algorithms for transformer inference and design caching layers that maximize GPU utilization. They will also collaborate with model, kernel, and hardware teams, publish research, and mentor junior researchers and engineers.