Principal Software Engineer, AI Model Serving at Red Hat Inc
Raleigh, NC 27601, USA
Full Time


Start Date

Immediate

Expiry Date

07 Dec, 25

Salary

$245,050 (USD)

Posted On

09 Sep, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

Are you ready to join a game-changing open-source AI platform that harnesses the power of hybrid cloud to drive innovation?
The Red Hat OpenShift AI (RHOAI) team is looking for a Principal Software Engineer with Kubernetes and MLOps (Machine Learning Operations) experience to join our rapidly growing engineering team. Our focus is to create a platform, partner ecosystem, and community through which enterprise customers can solve problems and accelerate business success using AI. This is an exciting opportunity to help build and shape the next generation of hybrid cloud MLOps platforms, contribute to the development of the RHOAI product, participate in open-source communities, and be at the forefront of the evolution of AI. You’ll join an ecosystem that fosters continuous learning, career growth, and professional development.
In this role, you’ll contribute as a model serving and monitoring subject matter expert for the model serving features of the open-source Open Data Hub project by actively participating in KServe, TrustyAI, Kubeflow, HuggingFace, vLLM, and several other open-source communities. You will work as part of an evolving development team to rapidly design, secure, build, test, and release model serving, trustworthy AI, and model registry capabilities. The role is primarily an individual contributor position; you will be a key contributor to MLOps upstream communities and will collaborate closely with internal cross-functional development teams.

What you will do

  • Lead the team strategy and implementation for Kubernetes-native components in Model Serving, including Custom Resources, Controllers, and Operators (a minimal reconciler sketch follows this list).
  • Be an influencer and leader in MLOps-related open source communities to help build an active MLOps open source ecosystem for Open Data Hub and OpenShift AI
  • Act as an MLOps SME within Red Hat by supporting customer-facing discussions, presenting at technical conferences, and evangelizing OpenShift AI within internal communities of practice
  • Architect and design new features for open-source MLOps communities such as Kubeflow and KServe
  • Provide technical vision and leadership on critical and high-impact projects
  • Mentor, influence, and coach a team of distributed engineers
  • Ensure non-functional requirements including security, resiliency, and maintainability are met
  • Write unit and integration tests and work with quality engineers to ensure product quality
  • Use CI/CD best practices to deliver solutions as productization efforts into RHOAI
  • Contribute to a culture of continuous improvement by sharing recommendations and technical knowledge with team members
  • Collaborate with product management, other engineering, and cross-functional teams to analyze and clarify business requirements
  • Communicate effectively to stakeholders and team members to ensure proper visibility of development efforts
  • Give thoughtful and prompt code reviews
  • Represent RHOAI in external engagements including industry events, customer meetings, and open-source communities
  • Proactively utilize AI-assisted development tools (e.g., GitHub Copilot, Cursor, Claude Code) for code generation, auto-completion, and intelligent suggestions to accelerate development cycles and enhance code quality.
  • Explore and experiment with emerging AI technologies relevant to software development, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
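As a minimal sketch of the Kubernetes reconciliation pattern behind the first responsibility above: the handler below uses the Python kopf operator framework, and the serving.example.com/modelservers custom resource and its fields are hypothetical placeholders rather than actual RHOAI or KServe APIs.

    # Minimal sketch of reconciliation logic for a hypothetical custom
    # resource, using the kopf framework (pip install kopf kubernetes).
    import kopf
    import kubernetes.client as k8s
    from kubernetes.client.rest import ApiException

    @kopf.on.create("serving.example.com", "v1alpha1", "modelservers")
    @kopf.on.update("serving.example.com", "v1alpha1", "modelservers")
    def reconcile(spec, name, namespace, logger, **kwargs):
        # Drive the cluster toward the state declared in the custom resource.
        replicas = spec.get("replicas", 1)
        image = spec.get("image", "example.com/model-server:latest")

        deployment = k8s.V1Deployment(
            metadata=k8s.V1ObjectMeta(name=f"{name}-server", namespace=namespace),
            spec=k8s.V1DeploymentSpec(
                replicas=replicas,
                selector=k8s.V1LabelSelector(match_labels={"app": name}),
                template=k8s.V1PodTemplateSpec(
                    metadata=k8s.V1ObjectMeta(labels={"app": name}),
                    spec=k8s.V1PodSpec(
                        containers=[k8s.V1Container(name="server", image=image)]
                    ),
                ),
            ),
        )

        apps = k8s.AppsV1Api()
        try:
            apps.create_namespaced_deployment(namespace, deployment)
            logger.info(f"Created deployment for {name}")
        except ApiException as exc:
            if exc.status == 409:  # already exists: converge by patching instead
                apps.patch_namespaced_deployment(f"{name}-server", namespace, deployment)
            else:
                raise

        # kopf records the returned dict under the resource's .status.
        return {"replicas": replicas}

Run with kopf run reconciler.py against a cluster where the matching CRD is installed; kopf supplies the watch loop, retries, and status updates that a Go controller would get from controller-runtime.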

What you will bring

  • Proven expertise with Kubernetes API development and testing (CRs, Operators, Controllers), including reconciliation logic.
  • Strong background in model serving (e.g., KServe, vLLM) and distributed inference strategies for LLMs (tensor, pipeline, and data parallelism); see the vLLM sketch after this list.
  • Deep understanding of GPU optimization, autoscaling (KEDA/Knative), and low-latency networking (e.g., NVLink, GPU peer-to-peer communication).
  • Experience architecting resilient, secure, and observable systems for model serving, including metrics and tracing.
  • Advanced skills in Go and Python; ability to design APIs for high-performance inference and streaming.
  • Excellent system troubleshooting skills in cloud environments and the ability to innovate in fast-paced environments.
  • Strong communication and leadership skills to mentor teams and represent projects in open-source communities.
  • Autonomous work ethic and passion for staying at the forefront of AI and open source.
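To ground the distributed inference strategies named above, here is a minimal sketch of offline batch inference with vLLM using tensor parallelism; the model ID is an example only, and the script assumes a host with at least two GPUs.

    # Minimal sketch: tensor-parallel LLM inference with vLLM.
    from vllm import LLM, SamplingParams

    # tensor_parallel_size=2 shards each layer's weight matrices across
    # two GPUs, the tensor parallelism strategy mentioned above.
    llm = LLM(model="facebook/opt-6.7b", tensor_parallel_size=2)

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
    outputs = llm.generate(["What is model serving?"], sampling)

    for out in outputs:
        print(out.outputs[0].text)

Pipeline parallelism (splitting layers across devices) and data parallelism (replicating the model behind a load balancer) are served through different knobs and topologies; this sketch covers only the tensor-parallel case.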

The following will be considered a plus:

  • Existing contributions to one or more MLOps open source projects such as Kubeflow, KServe, Ray Serve, or vLLM
  • Familiarity with optimization techniques for LLMs (quantization, TensorRT, Hugging Face Accelerate); a quantized-loading sketch follows this list.
  • Knowledge of end-to-end MLOps workflows, including model registry, explainability, and drift detection.
  • Bachelor’s degree in statistics, mathematics, computer science, operations research, or a related quantitative field, or equivalent expertise; Master’s or PhD is a big plus
  • Understanding of how Open Source and Free Software communities work
  • Experience with development for public cloud services (AWS, GCE, Azure)
  • Experience in engineering, consulting or another field related to model serving and monitoring, model registry, explainable AI, deep neural networks, in a customer environment or supporting a data science team
  • Highly experienced in OpenShift
  • Familiarity with popular Python machine learning libraries such as PyTorch, TensorFlow, and Hugging Face
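To make the quantization point above concrete, here is a minimal sketch of loading a causal language model in 4-bit precision with Hugging Face Transformers and bitsandbytes; the model ID is illustrative, and this is one common quantization route rather than a prescribed approach.

    # Minimal sketch: 4-bit quantized model loading with bitsandbytes.
    # Requires a CUDA GPU and the transformers, accelerate, and
    # bitsandbytes packages; the model ID below is an example only.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights as 4-bit NF4
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
    )

    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",
        quantization_config=quant_config,
        device_map="auto",                      # let accelerate place layers
    )
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

    inputs = tokenizer("Model serving is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

Quantizing to 4 bits roughly quarters the memory footprint of a 7B-parameter model (about 14 GB in fp16 down to roughly 4 GB), which is often what makes single-GPU serving feasible.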