HPC Engineer - Compute at World Wide Technology Healthcare Solutions
India
Full Time


Start Date

Immediate

Expiry Date

19 May, 26

Salary

0.0

Posted On

18 Feb, 26

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

HPC, AI Infrastructure, Cluster Provisioning, NVIDIA BCM, Hardware Architecture, DGX/HGX/NVL72/MGX, Slurm, Linux Administration, Ansible, Python, Bash, Git, InfiniBand, RoCEv2, Docker, Kubernetes

Industry

IT Services and IT Consulting

Description
Technical Competencies

Essential Skills

HPC & AI Infrastructure:
* Cluster Provisioning: Proficiency with NVIDIA Base Command Manager (BCM) for bare-metal provisioning and image management.
* Hardware Architecture: Deep understanding of NVIDIA DGX/HGX/NVL72/MGX architectures, including PCIe topology, NVLink/NVSwitch connectivity, and GPU memory hierarchy.
* Workload Management: Ability to troubleshoot basic Slurm issues (job dependencies, partition misconfigurations, node draining).

Linux & Automation:
* Linux Administration: Solid mastery of RHEL/Ubuntu internals, including systemd, kernel modules, and package management.
* Automation: Ability to read and execute Ansible playbooks and write basic Python/Bash scripts for task automation.
* Version Control: Familiarity with Git workflows (pulling code, creating branches, committing config changes).

Desirable Experience
* Networking: Experience with high-speed interconnects (InfiniBand NDR/HDR, RoCEv2) and debugging connectivity issues.
* Containerisation: Experience with Docker and Kubernetes (specifically the NVIDIA GPU Operator).
* Cisco Integration: Familiarity with Cisco UCS or Cisco Nexus configurations in an AI context.

Certifications

Highly Desirable:
* NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
* NVIDIA-Certified Professional: AI Infrastructure (NCP-AII)
* Red Hat Certified System Administrator (RHCSA)

Success Metrics (KPIs)
* Deployment Velocity: Achieving
Responsibilities
The engineer manages the physical and logical lifecycle of GPU compute fleets, focusing on hands-on configuration and validation of infrastructure such as NVIDIA SuperPOD and Cisco AI Factory environments. Key tasks include automated bare-metal provisioning using BCM, executing IaC playbooks, performing complex firmware upgrades, and running performance benchmarks such as HPL and NCCL-tests.