AI Cloud Infrastructure Engineer at Liquid Tech Pty Ltd
South Africa
Full Time


Start Date

Immediate

Expiry Date

13 Sep, 25

Salary

0.0

Posted On

15 Jun, 25

Experience

4 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Patch Management, Readiness Assessments, Change Process, Cloud Services, IaaS, MTTR, New Set-Ups, Security, PaaS

Industry

Information Technology/IT

Responsibilities

  • Design and implement scalable cloud infrastructure tailored to AI Factory workloads, making full use of NVIDIA GPUs such as the H200.
  • Optimise resource utilisation to balance performance and cost-efficiency for AI model training and inference tasks.
  • Implement automation and orchestration tools, such as Terraform or Ansible, to manage infrastructure efficiently and reliably.
  • Monitor system performance, availability, and reliability, addressing issues proactively to minimise downtime and disruptions.

  • Collaborate with AI engineers and data scientists to understand and fulfil their infrastructure requirements accurately.
  • Troubleshoot and resolve infrastructure-related issues quickly to maintain high service quality and user satisfaction.
  • Stay current with advancements in cloud technologies, GPU computing, and AI infrastructure best practices.
  • Manage the assurance of cloud services by:
      - Ensuring efficient resolution of incidents
      - Maintaining architecture deployment standards
      - Complying with all process, patch management, and change process requirements for any deployed and managed cloud services
      - Producing reports to measure compliance with SLA and MTTR targets for customer managed cloud services
      - Conducting customer cloud readiness assessments and reporting the outcomes
      - Deploying customer solutions based on best practices, including migrations, cloud networking, and new set-ups
  • Monitor resources, networks, security, IaaS, PaaS, SaaS

  • Document infrastructure designs, configurations, and operational procedures for knowledge sharing and regulatory compliance.
  • Support the creation of sales and marketing collateral and vertical playbooks for AI cloud services.
  • Support non-standard or complex solutions through cross-functional engagements with Sales Support, Service, Products, the Bid Office, and Engineering.
  • Provide training for sales, customer employees, and product teams.
  • Assist with technology and vendor selection for product management, ideation, and ecosystem enablement.
  • Lead the implementation and deployment team, provide technical insight, resolve technical dependencies, and escalate where necessary.
  • Maintain cloud services in line with agreed best practices and Service Level Agreements.
  • Maintain an up-to-date skillset that translates into secure, practical implementations and service delivery for projects.

Requirements

A bachelor’s or master’s degree in Computer Science, Engineering, or a related technical field is required for this role.
Familiarity with NVIDIA AI Enterprise software and tools to enhance GPU utilisation for AI workloads.
Relevant certifications in AI infrastructure management or related fields.
At least 4 years’ experience in cloud infrastructure engineering, preferably focused on AI or high-performance computing, is essential.
Expertise in cloud platforms like AWS, GCP, or Azure, and tools like Terraform or CloudFormation, is critical.
In-depth knowledge of NVIDIA GPUs, including the H200 series, and their application in AI workloads is necessary.
Experience with containerization and orchestration technologies, such as Docker and Kubernetes, is a key requirement.
Strong scripting skills in languages like Python or Bash for automation and infrastructure management are essential.
Strong knowledge of data science, analytics, and applied mathematics, with expertise in languages like SAS, R, and Python.
Background in AI or machine learning, with a grasp of infrastructure needs
Experience with high-performance computing environments and related technologies is a notable advantage.
GPU technologies like NVIDIA H200 GPUs or similar high-performance hardware
High performance compute and storage
Virtualization and digitization of compute and networking
Machine learning and strong general programming skills
