Start Date
Immediate
Expiry Date
13 Sep, 25
Salary
0.0
Posted On
15 Jun, 25
Experience
4 year(s) or above
Remote Job
Yes
Telecommute
Yes
Sponsor Visa
No
Skills
Patch Management, Readiness Assessments, Change Process, Cloud Services, IaaS, MTTR, New Set-ups, Security, PaaS
Industry
Information Technology/IT
Design and implement scalable cloud infrastructure tailored to AI Factory workloads, making full use of NVIDIA GPUs, in particular the H200. Optimise resource utilisation to balance performance and cost-efficiency for AI model training and inference tasks. Implement automation and orchestration tools, such as Terraform or Ansible, to manage infrastructure efficiently and reliably. Monitor system performance, availability, and reliability, addressing issues proactively to minimise downtime and disruptions.
Collaborate with AI engineers and data scientists to understand and fulfil their infrastructure requirements accurately. Troubleshoot and resolve infrastructure-related issues quickly to maintain high service quality and user satisfaction. Stay current with advancements in cloud technologies, GPU computing, and AI infrastructure best practices. Manage the assurance of Cloud services through:
- Ensuring high efficiency in incident resolution
- Maintaining architecture deployment standards
- Complying with all Process, Patch Management, and Change Process requirements for any deployed and managed Cloud services
- Producing reports to measure compliance with SLA and MTTR for customer managed cloud services
- Conducting customer cloud readiness assessments and reporting the outcome
- Deploying customer solutions based on best practices, including migrations, cloud networking, and new set-ups
Document infrastructure designs, configurations, and operational procedures for knowledge sharing and regulatory compliance. Support the creation of sales and marketing collateral and vertical playbooks for AI cloud services. Support non-standard/complex solutions: cross-functional engagements, sales support, Service, Products, Bid Office, and engagement with Engineering. Provide training for sales, customer employees, and product teams. Assist with technology selection and vendor selection for product management/ideation and eco-system enablement.
Lead the implementation and deployment team, providing technical insight, resolving technical dependencies, and escalating where necessary. Maintain Cloud services in line with agreed best practices and Service Level Agreements. Maintain an up-to-date skillset that translates into secure, practical implementations and service delivery for projects.
A bachelor’s or master’s degree in Computer Science, Engineering, or a related technical field is required for this role.
Familiarity with NVIDIA AI Enterprise software and tools to enhance GPU utilisation for AI workloads.
Relevant certifications in AI infrastructure management or related fields.
At least 4 years’ experience in cloud infrastructure engineering, preferably focused on AI or high-performance computing, is essential.
Expertise in cloud platforms like AWS, GCP, or Azure, and tools like Terraform or CloudFormation, is critical.
In-depth knowledge of NVIDIA GPUs, including the H200 series, and their application in AI workloads is necessary.
Experience with containerization and orchestration technologies, such as Docker and Kubernetes, is a key requirement.
Strong scripting skills in languages like Python or Bash for automation and infrastructure management are essential.
Background in AI or machine learning, with a grasp of the associated infrastructure needs.
Experience with high-performance computing environments and related technologies is a strong advantage.
GPU technologies like NVIDIA H200 GPUs or similar high-performance hardware
High-performance compute and storage
Virtualization and digitization of compute and networking
Strong knowledge of data science, analytics, and applied mathematics, with expertise in languages like SAS, R, and Python.
Machine learning and strong general programming skills