Lead GenAI Engineer at Apple

Seattle, Washington, United States

Full Time

Start Date

Immediate

Expiry Date

18 May 2026

Salary

0.0

Posted On

17 Feb 2026

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

GenAI, LLM Inference Optimization, Distributed Training, GPU Utilization, TPU Infrastructure, Model Serving Platforms, vLLM, Ray, TensorRT-LLM, KV-Cache Strategies, Speculative Decoding, Tensor Parallelism, Python, C++, CUDA, ML Systems Engineering

Industry

Computers and Electronics Manufacturing

Description

Imagine what you could do here. At Apple, innovative ideas have a way of becoming extraordinary products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish!

As part of Apple Cloud AI, we are building the next generation of ML infrastructure that powers AI capabilities across Apple's products and services. Our team tackles some of the most challenging problems in the industry: optimizing LLM inference at massive scale, building distributed training systems that push the boundaries of GPU and TPU utilization, and architecting model serving platforms that deliver sub-millisecond latency for real-time AI experiences. You'll work with cutting-edge technologies including vLLM, Ray, TensorRT-LLM, TPU infrastructure, and custom inference engines, while shaping how foundation models are trained, fine-tuned, and deployed across Apple's ecosystem.

As a Lead GenAI/ML Engineer, you will architect high-performance ML systems from the ground up: designing efficient KV-cache strategies, implementing speculative decoding, optimizing tensor parallelism across GPU and TPU clusters, and building the infrastructure that brings Apple's most ambitious AI capabilities to life.

This role requires translating cutting-edge ML research into production-ready systems that meet the demanding requirements of Apple's ML workloads. You will work closely with research teams to productionize new model architectures and optimization techniques. We are looking for candidates who thrive at the intersection of ML research and systems engineering: people who can read a paper on FlashAttention or PagedAttention and implement a production-grade version, or who can profile a training job and identify opportunities to improve GPU utilization from 40% to 80%.

Minimum Qualifications

8+ years of experience in ML systems engineering, with at least 3 years focused on LLM/GenAI infrastructure
Deep expertise in LLM inference optimization: KV-cache management, batching strategies, quantization, speculative decoding
Strong proficiency in Python and C++/CUDA for performance-critical code
Hands-on experience with inference frameworks: vLLM, TensorRT-LLM, Triton Inference Server, or equivalent
Experience with distributed training at scale using frameworks like DeepSpeed, Megatron-LM, FSDP, or Ray Train
Solid understanding of transformer architectures and attention mechanisms at the implementation level
Experience optimizing ML workloads on NVIDIA GPUs (profiling, memory optimization, kernel tuning)
Track record of taking ML systems from research/prototype to production at scale
MS or PhD in Computer Science, Machine Learning, or equivalent practical experience

Preferred Qualifications

Experience with TPU infrastructure (JAX/XLA, TPU training/serving optimization)
Contributions to open-source ML infrastructure projects (vLLM, Ray, TensorRT-LLM, etc.)
Experience with custom CUDA kernel development or Triton (OpenAI)
Deep knowledge of model compression techniques: pruning, distillation, mixed-precision training
Experience with multi-node training orchestration and fault tolerance
Familiarity with emerging architectures: MoE models, linear attention variants, state-space models
Experience building ML platforms serving high QPS with strict latency requirements

Responsibilities

This role involves architecting high-performance ML systems from the ground up, focusing on designing efficient KV-cache strategies, implementing speculative decoding, and optimizing tensor parallelism across GPU and TPU clusters. The engineer will translate cutting-edge ML research into production-ready systems that meet Apple's demanding ML workload requirements.