Kubernetes Engineer at Radiant Digital
Dallas, TX 75202, USA
Full Time


Start Date

Immediate

Expiry Date

14 Nov, 25

Salary

0.0

Posted On

14 Aug, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

At Radiant Digital, we provide IT solutions and consulting services to help government agencies and businesses in the USA, Canada, the Middle East, and Southeast Asia. On the federal side, we support agencies like NASA, the Department of State (DOS), the IRS, ACL, ACF, USDA, and many others, along with numerous state and local government agencies.
We work with industries such as telecom, healthcare, entertainment, and oil and gas, offering solutions designed to meet their specific needs. We focus on improving systems, making better use of data, and updating applications to keep up with changing markets.

JOB DESCRIPTION:

In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.
You will have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Responsibilities
  • Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
  • Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
  • Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
  • Optimising GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
  • Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
  • Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
  • Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement tools such as OPA or Gatekeeper
  • Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as ArgoCD and FluxCD
  • Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
  • Participating in performance tuning, incident response and production readiness reviews

Flexible work-from-home options available.