Sr Technical Project Manager (AI Token Factory) at Nebius
Amsterdam, North Holland, Netherlands
Full Time


Start Date

Immediate

Expiry Date

16 Mar, 26

Salary

0.0

Posted On

16 Dec, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Technical Project Management, Cloud Platforms, Distributed Systems, Production Infrastructure, Kubernetes, Service Reliability, Observability, Data Analysis, SQL, Python, Scripting, Cross-Team Coordination, Risk Management, Incident Response, Capacity Planning, Autoscaling, LLM Serving

Industry

Technology; Information and Internet

Description
Why work at Nebius

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.

Where we work

Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 800 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.

About the Team

The TokenFactory team is building a high-performance inference platform for large-scale, production use of LLMs. Our mission is to make powerful models easy to consume via stable APIs while meeting strict requirements on latency, reliability, and cost efficiency. The platform runs GPU-intensive workloads at scale and integrates deeply with Nebius Cloud infrastructure, networking, observability, and capacity planning systems. We are looking for a Technical Program Manager (TPM) to drive cross-team execution for the inference platform as it scales in usage, regions, and complexity.

What You'll Do

As a TPM for the AI Studio inference platform, you will own end-to-end delivery of complex, cross-functional initiatives that span infrastructure, platform engineering, hardware, and customer-facing teams. You will:

- Drive large, cross-team programs related to platform scaling, reliability, performance, and cost efficiency
- Coordinate work across AI Studio engineers, Cloud Platform, and Observability teams
- Translate product and customer requirements (latency, throughput, SLAs, cost) into executable technical plans
- Define clear scope, milestones, dependencies, and success metrics for multi-quarter initiatives
- Unblock teams by driving decisions on architecture trade-offs, rollout strategies, and operational processes
- Track and communicate risks, incidents, and dependencies to stakeholders at both engineering and leadership levels
- Introduce and scale repeatable processes for launches, capacity planning, incident reviews, and platform changes
- Support execution around model rollouts, autoscaling changes, GPU capacity expansion, and regional launches

What We Expect

- 5+ years of experience as a TPM (or equivalent role) leading cross-team technical programs
- Strong technical foundation in cloud platforms, distributed systems, and production infrastructure
- Practical understanding of Kubernetes-based platforms, service reliability, and observability (metrics, logs, traces)
- Experience driving execution where you influence without formal authority
- Ability to reason about system-level trade-offs (latency vs. cost, reliability vs. utilization)
- Strong written and verbal communication skills; comfortable working with engineers and senior stakeholders
- Analytical mindset with hands-on experience using data (SQL, Python, or scripting) to track progress and inform decisions

Nice to Have / Ways to Stand Out

- Prior background as a Software Engineer, SRE, or Systems Engineer
- Experience working with GPU-based workloads or high-throughput inference systems
- Familiarity with LLM serving stacks (e.g. vLLM, TensorRT-LLM) or ML platform environments
- Experience running programs tied to capacity planning, autoscaling, or multi-region deployments
- Exposure to environments operating under strict SLOs/SLAs and fast incident response loops

What we offer

- Competitive salary and comprehensive benefits package
- Opportunities for professional growth within Nebius
- Flexible working arrangements
- A dynamic and collaborative work environment that values initiative and innovation

We're growing and expanding our products every day. If you're up to the challenge and are as excited about AI and ML as we are, join us!
Responsibilities
As a Technical Program Manager, you will own the end-to-end delivery of complex, cross-functional initiatives related to the AI Studio inference platform. This includes driving large programs for platform scaling, reliability, performance, and cost efficiency.