Head of ML Cloud Platform at UniversalAGI
United States
Full Time


Start Date

Immediate

Expiry Date

28 Feb, 26

Salary

0.0

Posted On

30 Nov, 25

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

ML Infrastructure, Cloud Platform Engineering, Distributed Training, Python, Kubernetes, Docker, Team Building, Customer Engagement, System Design, Performance Optimization, Security Compliance, Technical Leadership, Communication, Execution Velocity, Deep Learning, CAE Integration

Industry

Research Services

Description
📍 San Francisco | Work Directly with CEO & Founding Team | Report to CEO | OpenAI for Physics | 🏢 5 Days Onsite

Head of ML Cloud Platform

Location: Onsite in San Francisco
Compensation: Competitive Salary + Significant Equity

Who We Are

UniversalAGI is building OpenAI for Physics. We are an AI startup based in San Francisco, backed by Elad Gil (#1 Solo VC), Eric Schmidt (former Google CEO), Prith Banerjee (ANSYS CTO), Ion Stoica (Databricks Founder), Jared Kushner (former Senior Advisor to the President), David Patterson (Turing Award Winner), and Luis Videgaray (former Foreign and Finance Minister of Mexico).

We're building foundation AI models for physics that enable end-to-end industrial automation from initial design through optimization, validation, and production. We're building a high-velocity team of relentless researchers and engineers that will define the next generation of AI for industrial engineering. If you're passionate about AI, physics, or the future of industrial innovation, we want to hear from you.

About the Role

As the Head of ML Cloud Platform, you'll be in the arena from day one, building and leading the team that creates the backbone for AI-powered physics simulation at scale. This is your chance to own the entire ML infrastructure vision, from training foundation models on petabytes of CFD data to deploying them into mission-critical automotive and maritime production environments.

You'll work directly with the CEO and founding team to build a world-class ML platform organization, recruiting exceptional engineers and researchers while remaining deeply technical yourself. You'll architect systems that train models faster, serve predictions with lower latency, and integrate seamlessly into customers' existing CAE workflows, all while managing a team that ships with the velocity of a startup and the rigor of enterprise infrastructure.

This isn't a pure management role. You're a technical leader who codes, debugs production incidents at 2 AM when needed, and earns respect through hands-on contribution while simultaneously building the team and culture that will scale our platform to serve the world's largest industrial companies.

What You'll Do

Technical Leadership & Architecture

Define the ML platform vision: Architect the end-to-end infrastructure strategy for training, fine-tuning, serving, and deploying foundation models for physics simulation across cloud and on-premise environments
Build for scale and reliability: Design systems that can handle petabyte-scale CFD datasets, multi-day distributed training runs, and real-time inference for customers making million-dollar engineering decisions
Stay hands-on: Write code, debug critical production issues, review pull requests, and make key architectural decisions yourself; you're a technical leader who leads by doing
Bridge research and production: Translate cutting-edge research from our deep learning team into production-grade infrastructure that customers can depend on
Integrate with CAE ecosystems: Ensure our platform works seamlessly with existing simulation tools (Ansys, OpenFOAM, STAR-CCM+), HPC clusters, PLM systems, and enterprise security requirements

Team Building & Management

Recruit world-class talent: Build a team of exceptional ML infrastructure engineers, cloud platform engineers, and MLOps specialists who can execute at the highest level
Develop and mentor: Coach engineers to grow technically and professionally, fostering a culture of deep work, technical excellence, and customer obsession
Scale the organization: Grow the team from founding engineers to a robust platform organization as we scale from early customers to enterprise deployments
Set technical standards: Establish engineering practices, code review processes, and quality bars that enable the team to ship fast without breaking things
Foster collaboration: Work closely with deep learning researchers, product engineers, CFD domain experts, and customer success to ensure platform capabilities align with company needs

Execution & Delivery

Ship relentlessly: Drive the team to deliver infrastructure from prototype to production in weeks, not quarters, iterating based on real customer feedback
Own reliability: Take responsibility for platform uptime, performance, and customer success; when things break, you're in the arena fixing them
Make strategic tradeoffs: Balance innovation with stability, speed with quality, and custom solutions with scalable platforms
Work with customers: Engage directly with automotive and maritime customers to understand their infrastructure requirements, security constraints, and deployment challenges
Build for enterprise: Implement security, compliance, monitoring, and operational practices that meet the standards of Fortune 500 companies

Qualifications

Required Experience

8+ years in ML infrastructure or cloud platform engineering, with at least 3 years in technical leadership roles managing high-performing teams
Proven track record building and scaling ML platforms for training, serving, or deploying models in production environments, ideally at AI-first companies
Deep technical expertise in distributed training (PyTorch Distributed, DeepSpeed, Ray), cloud infrastructure (AWS/GCP/Azure), and container orchestration (Kubernetes, Docker)
Hands-on coding ability: Expert-level Python and infrastructure-as-code skills; you can still ship production code yourself and review your team's work deeply
Team building success: Track record of recruiting, developing, and retaining exceptional engineering talent, with experience building teams from 3-4 engineers to 15-20+
Strong product and customer intuition: Experience working closely with customers, understanding their workflows, and translating requirements into technical solutions
Outstanding execution velocity: Proven ability to ship infrastructure rapidly in fast-paced, high-growth environments while maintaining quality

Technical Requirements

ML infrastructure mastery: Deep understanding of training pipelines, model serving, distributed systems, GPU optimization, and the full ML lifecycle
Cloud platform expertise: Strong experience with cloud providers, infrastructure-as-code tools, and building hybrid cloud/on-premise solutions
System design excellence: Can architect complex, scalable systems and make smart tradeoff decisions under uncertainty
Performance optimization: Knowledge of GPU programming, model optimization techniques, and infrastructure cost management
Enterprise infrastructure: Experience with security, compliance, SSO, RBAC, and deploying into regulated or air-gapped environments

Leadership & Communication

Technical credibility: Earns respect through deep technical contribution, not just title or tenure
Clear communicator: Can explain complex technical decisions to customers, executives, researchers, and engineers at all levels
Strategic thinker: Balances short-term execution with long-term platform vision and architectural decisions
Player-coach mentality: Comfortable coding and debugging yourself while also managing, mentoring, and growing a team
High agency: Takes ownership of outcomes, doesn't wait for permission, and drives solutions to completion

Bonus Qualifications

Experience in industrial or scientific ML: Built infrastructure for physics simulation, computational chemistry, drug discovery, or other scientific computing domains
CAE/HPC background: Familiarity with simulation software, job schedulers (SLURM, PBS), parallel file systems, or high-performance computing environments
Founded or led platform teams at AI startups (Seed to Series B) through rapid growth and scaling challenges
Published or presented on ML infrastructure, distributed training, or MLOps topics at major conferences or venues
Experience with foundation models: Built infrastructure for training or serving large-scale pretrained models (LLMs, vision models, multimodal models)
Open-source contributions to major ML infrastructure projects (PyTorch, Ray, Kubernetes, MLflow, etc.)
PhD or MS in Computer Science, ML, or related field (or equivalent industry experience)
Enterprise B2B experience: Sold to or deployed infrastructure for Fortune 500 customers with complex security and compliance requirements

Cultural Fit

Technical Respect: Ability to earn respect through hands-on technical contribution, not just management authority
Intensity: Thrives in our unusually intense culture, willing to grind when needed and expecting the same from your team
Customer Obsession: Passionate about solving real customer problems and building infrastructure that enables their success
Deep Work: Values long, uninterrupted periods of focused work and fosters this culture in your team
High Availability: Ready to be deeply involved whenever critical issues arise, whether that's at 2 AM or on weekends
Communication: Can translate complex technical concepts to diverse audiences and bridge engineering, research, and business
Growth Mindset: Embraces continuous learning and develops this mindset in your team
Startup Mindset: Comfortable with ambiguity, rapid change, and wearing multiple hats; you're a builder first, manager second
Work Ethic: Willing to put in the extra hours when needed to hit critical milestones and holds your team to high standards
Low Ego, High Accountability: Collaborative leadership style with focus on outcomes over personal credit

What We Offer

Build the foundation: Shape the ML platform strategy for a rapidly growing foundational AI company from the ground up
Real-world impact: See your infrastructure power physics simulations that optimize automotive aerodynamics, maritime vessel design, and other critical engineering applications
Direct CEO collaboration: Work closely with the founder & CEO, influence company strategy, and have your voice heard on major decisions
Exceptional team: Recruit and work with world-class deep learning researchers, CFD experts, and infrastructure engineers
Competitive compensation: Base salary + significant equity upside as a founding leadership hire
In-person culture: 5 days a week in office with a team that values face-to-face collaboration, deep technical discussions, and building together
World-class network: Access to our investors and advisors including Eric Schmidt, Elad Gil, Ion Stoica, David Patterson, and others

Benefits

Competitive compensation and equity
Competitive health, dental, vision benefits paid by the company
401(k) plan offering
Flexible vacation
Team Building & Fun Activities
Great scope, ownership and impact
AI tools stipend
Monthly commute stipend
Monthly wellness / fitness stipend
Daily office lunch & dinner covered by the company
Immigration support

How We're Different

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again... who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly." - Teddy Roosevelt

At our core, we believe in being "in the arena." We are builders, problem solvers, and risk-takers who show up every day ready to put in the work: to sweat, to struggle, and to push past our limits. We know that real progress comes with missteps, iteration, and resilience. We embrace that journey fully knowing that daring greatly is the only way to create something truly meaningful.

If you're ready to build the ML platform that will revolutionize physics simulation, lead a world-class team, and deliver transformative impact to industrial engineering, UniversalAGI is the place for you.
Responsibilities
As the Head of ML Cloud Platform, you will lead the team responsible for building the infrastructure for AI-powered physics simulation. You will define the ML platform vision and ensure the systems are scalable and reliable while remaining hands-on in coding and debugging.