Principal Applied Scientist at Microsoft
Redmond, Washington, United States - Full Time


Start Date

Immediate

Expiry Date

20 Feb 2026

Salary

USD $296,400 per year

Posted On

22 Nov 2025

Experience

10+ years

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Machine Learning, Statistics, Measurement Science, LLM Evaluation, Agent Evaluation, Programming, Python, Distributed Systems, A/B Testing, Human-in-the-loop, Safety Assessment, Robustness, Anomaly Detection, Causal Impact, Collaboration, Technical Leadership

Industry

Software Development

Description
- Lead end-to-end science for evaluating LLM-powered agents on real-time and batch workloads: designing evaluation frameworks, metrics, and pipelines that capture planning quality, tool use, retrieval, safety, and end-user outcomes, and partnering with engineering for robust, low-latency deployment.
- Advance evaluation methodologies for agents across RTI surfaces by driving test set design, auto-raters (including LLM-as-judge), human-in-the-loop feedback loops, and measurable lifts in key quality metrics such as task success rate, reliability, and safety.
- Establish rigorous evaluation and reliability practices for LLM/agent systems: from offline benchmarks and scenario-based evals to online experiments and production monitoring, defining guardrails and policies that balance quality, cost, and latency at scale.
- Collaborate with PM, Engineering, and UX to translate evaluation insights into customer-visible improvements, shaping product requirements, de-risking launches, and iterating quickly based on telemetry, user feedback, and real-world failure modes.
- Provide technical leadership and mentorship within the applied science and engineering community, fostering inclusive, responsible-AI practices in agent evaluation, and influencing roadmap, platform investments, and cross-team evaluation strategy across Fabric.

- Bachelor's Degree in Statistics, Computer Science, Electrical or Computer Engineering, or a related field AND 8+ years of related experience; OR Master's Degree in Statistics, Computer Science, Electrical or Computer Engineering, or a related field AND 6+ years of related experience; OR Doctorate in Statistics, Computer Science, Electrical or Computer Engineering, or a related field AND 5+ years of related experience.
- 2+ years designing and running ML/LLM evaluation and experimentation (offline metrics plus online A/B tests).
- Proven experience applying machine learning, statistics, and measurement science to LLM and agent evaluation, ideally in real-time or streaming scenarios.
- Proficiency in agentic AI concepts (e.g., multi-step agents, tool orchestration, retrieval/RAG, workflow automation) and familiarity with techniques for assessing safety, robustness, anomaly detection, and causal impact of agent behaviors.
- Strong programming and modeling skills in languages such as Python, and experience building evaluation services or pipelines on distributed systems (e.g., running large-scale offline evals, auto-raters, or LLM-as-judge workloads; an illustrative sketch of such a rater appears after this list).
- Ability to design, implement, and interpret rigorous evaluations end to end: constructing eval sets and scenarios, combining offline metrics with human/LLM raters, running online experiments (A/B tests, holdouts), and instrumenting reliability monitoring at scale.
- Collaborative mindset with demonstrated success partnering across Engineering, PM, and UX to define quality bars, translate evaluation insights into roadmap decisions, and iterate quickly on customer-facing agent and LLM experiences.

azdat azuredata Applied Sciences IC6

The typical base pay range for this role across the U.S. is USD $163,000 - $296,400 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area; the base pay range for this role in those locations is USD $220,800 - $331,200 per year. Certain roles may be eligible for benefits and other compensation.
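For context on the auto-rater and LLM-as-judge workloads mentioned above, the following is a minimal Python sketch of one way such a rater could be structured: a judge model grades each agent run against a reference outcome, and the verdicts are aggregated into a task success rate. The AgentRun record, JUDGE_PROMPT, rate_runs helper, and stub_judge are hypothetical names introduced only for illustration; the judge callable stands in for whichever LLM client is actually used.

```python
import json
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class AgentRun:
    """One recorded agent run: the task, the agent's final output, and a
    short reference description of what a successful outcome looks like."""
    task: str
    agent_output: str
    reference: str


# Rubric-style prompt for the judge model; double braces keep the JSON
# braces literal when str.format is applied.
JUDGE_PROMPT = (
    "You are grading an AI agent's answer.\n"
    "Task: {task}\n"
    "Reference outcome: {reference}\n"
    "Agent answer: {agent_output}\n"
    'Reply with JSON only: {{"success": true or false, "reason": "<short reason>"}}'
)


def rate_runs(runs: List[AgentRun], judge: Callable[[str], str]) -> Dict[str, float]:
    """Score each run with the judge and aggregate a task success rate.

    `judge` is any callable mapping a prompt string to the model's text
    completion (a thin wrapper around whatever LLM client is in use).
    """
    verdicts = []
    for run in runs:
        raw = judge(JUDGE_PROMPT.format(task=run.task,
                                        reference=run.reference,
                                        agent_output=run.agent_output))
        try:
            verdicts.append(bool(json.loads(raw)["success"]))
        except (json.JSONDecodeError, KeyError, TypeError):
            verdicts.append(False)  # count unparseable judgments as failures
    rate = mean(verdicts) if verdicts else 0.0
    return {"n": float(len(verdicts)), "task_success_rate": rate}


if __name__ == "__main__":
    # Stub judge so the sketch runs offline; replace with a real model call.
    def stub_judge(prompt: str) -> str:
        return json.dumps({"success": "Paris" in prompt, "reason": "keyword stub"})

    runs = [
        AgentRun("Find the capital of France", "Paris", "Answer names Paris"),
        AgentRun("Find the capital of Japan", "Kyoto", "Answer names Tokyo"),
    ]
    print(rate_runs(runs, stub_judge))  # {'n': 2.0, 'task_success_rate': 0.5}
```

The same pattern can be extended to multi-dimensional rubrics (e.g., safety, groundedness, tool-use correctness) and the judged runs sampled into offline benchmark scorecards or online experiment readouts.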
Responsibilities
Lead the evaluation of LLM-powered agents by designing evaluation frameworks and metrics, collaborating with engineering on deployment. Establish rigorous evaluation practices and translate insights into product improvements.