Senior DevOps Engineer at ByteMetrics LTD
Stoke-on-Trent ST1, England, United Kingdom – Full Time


Start Date

Immediate

Expiry Date

29 Apr, 25

Salary

14,400

Posted On

29 Jan, 25

Experience

9 year(s) or above

Remote Job

No

Telecommute

No

Sponsor Visa

No

Skills

Distributed Systems, Python, Infrastructure, Containerization, Code, DevOps, Scripting, Docker, Logging, Orchestration, CUDA, GitLab, Kubernetes, Bash, Jenkins

Industry

Information Technology/IT

Description

OVERVIEW:

We are seeking a Senior DevOps Engineer with expertise in designing and managing infrastructure tailored for AI/ML workloads. In this role, you will work at the intersection of cutting-edge AI technologies and DevOps practices to build and optimize platforms that empower AI-driven solutions. Collaborating with cross-functional teams, you will play a critical role in enabling scalable, secure, and high-performing AI pipelines and infrastructure.

REQUIRED SKILLS:

  • Cloud Expertise: Deep knowledge of cloud platforms (AWS, Azure, GCP), especially services tailored for AI/ML like SageMaker, Vertex AI, or Azure ML.
  • Containerization & Orchestration: Proficient in Docker and Kubernetes, with experience in managing large-scale clusters for AI/ML workloads.
  • CI/CD Proficiency: Strong experience in setting up CI/CD pipelines with tools like GitLab, Jenkins, or CircleCI, particularly for ML workflows.
  • Programming & Scripting: Proficient in Python, Bash, or similar languages, with experience in automating DevOps tasks and supporting AI frameworks (TensorFlow, PyTorch, etc.).
  • Monitoring & Logging: Hands-on experience with monitoring tools like Prometheus, Grafana, the ELK stack, or Datadog for tracking AI model and infrastructure performance (a small scripting example follows this list).
  • Distributed Systems: Familiarity with distributed training frameworks and technologies like Horovod, Ray, or Dask.
  • Infrastructure as Code (IaC): Expertise in Terraform, CloudFormation, or Ansible for automating and managing AI-focused infrastructure.
  • Networking: Strong knowledge of WAN/LAN technologies, VPC configurations, and networking setups optimized for AI workloads.
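The scripting and monitoring items above describe the kind of glue work this role covers day to day. Below is a minimal sketch of such a script: it queries a Prometheus server for average GPU utilization and flags idle nodes. The server address, the DCGM_FI_DEV_GPU_UTIL metric, and the Hostname label assume NVIDIA's DCGM exporter is being scraped; none of these are stated in the posting and are placeholders only.

```python
"""Minimal sketch: flag under-utilized GPU nodes via the Prometheus HTTP API.

Assumes a Prometheus server at PROM_URL scraping NVIDIA's DCGM exporter,
which exposes the DCGM_FI_DEV_GPU_UTIL gauge; the URL, metric name, and
label are illustrative, not requirements of this role.
"""
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address
UTIL_THRESHOLD = 20.0  # percent GPU utilization considered "idle"

def query(promql: str) -> list[dict]:
    # Prometheus instant-query endpoint: GET /api/v1/query?query=<promql>
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]

def idle_gpus(threshold: float = UTIL_THRESHOLD) -> list[tuple[str, float]]:
    # Average utilization per node over the last 30 minutes.
    results = query("avg by (Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]))")
    idle = []
    for series in results:
        host = series["metric"].get("Hostname", "unknown")
        util = float(series["value"][1])  # value is [timestamp, "string_value"]
        if util < threshold:
            idle.append((host, util))
    return idle

if __name__ == "__main__":
    for host, util in idle_gpus():
        print(f"{host}: {util:.1f}% avg GPU utilization over 30m")
```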

PREFERRED SKILLS:

  • Experience with MLOps tools like MLflow, Kubeflow, or Airflow.
  • Knowledge of AI/ML model versioning and governance (a short versioning sketch follows this list).
  • Familiarity with GPU optimization, including NVIDIA tools like CUDA and cuDNN.
  • Exposure to ethical AI practices and explainability frameworks.
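As one way to picture the versioning item above, here is a minimal sketch using MLflow, one of the MLOps tools named in this posting. The tracking URI, experiment name, registry model name, and the toy scikit-learn model are hypothetical placeholders, and registration assumes a registry-backed tracking server.

```python
"""Minimal sketch: track and register a model version with MLflow.

Assumes an MLflow tracking server at TRACKING_URI; the URI, experiment
name, and model name are illustrative placeholders.
"""
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

TRACKING_URI = "http://mlflow.tools.svc:5000"  # hypothetical tracking server
MODEL_NAME = "churn-classifier"                # hypothetical registry name

mlflow.set_tracking_uri(TRACKING_URI)
mlflow.set_experiment("demo-experiment")

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=200)

with mlflow.start_run() as run:
    model.fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the fitted model as a run artifact...
    mlflow.sklearn.log_model(model, artifact_path="model")
    # ...then promote it to a named, versioned entry in the model registry.
    mlflow.register_model(f"runs:/{run.info.run_id}/model", MODEL_NAME)
```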

RESPONSIBILITIES:

  • AI Infrastructure Design: Develop and manage cloud-based, on-premises, and hybrid infrastructure optimized for AI/ML workloads, leveraging GPU clusters, Kubernetes, and distributed systems.
  • CI/CD for AI Pipelines: Automate and enhance CI/CD pipelines for training, testing, and deploying AI models into production environments.
  • Model Deployment & Monitoring: Implement scalable solutions for deploying AI models, including containerized services and serverless architectures, while monitoring model performance and accuracy in production (see the sketch after this list).
  • Cloud Architecture: Design cost-effective and efficient cloud infrastructure using AWS, Azure, or Google Cloud tailored for AI operations.
  • Data Pipeline Integration: Collaborate with data engineers and scientists to manage large-scale data pipelines, ensuring seamless integration with AI systems.
  • Security & Compliance: Implement security best practices, including IAM policies, data encryption, and vulnerability management, ensuring compliance with industry standards.
  • Infrastructure Automation: Use Infrastructure as Code (IaC) tools like Terraform or CloudFormation to automate provisioning, scaling, and configuration of AI environments.
  • Collaboration: Work closely with development, operations, and AI/ML teams to drive innovation and deliver reliable AI solutions.
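To make the deployment and monitoring responsibilities above concrete, the sketch below shows the kind of post-deployment smoke check a CI/CD stage might run before promoting a new model revision. The service URL, the /healthz and /predict routes, and the sample payload are hypothetical, since the posting does not name a serving stack; a pipeline stage would gate promotion on this script's exit code.

```python
"""Minimal sketch: post-deployment smoke test for a containerized model service.

Assumes the model is served behind an HTTP endpoint with /healthz and /predict
routes; the URL, routes, and payload are hypothetical placeholders.
"""
import sys
import time

import requests

SERVICE_URL = "http://model-service.ai-prod.svc:8080"  # hypothetical service address
SAMPLE_PAYLOAD = {"inputs": [[0.1, 0.2, 0.3]]}          # hypothetical model input
MAX_LATENCY_S = 2.0

def wait_until_healthy(retries: int = 30, delay_s: float = 5.0) -> None:
    # Poll the health route until the new revision reports ready.
    for _ in range(retries):
        try:
            if requests.get(f"{SERVICE_URL}/healthz", timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(delay_s)
    raise TimeoutError("model service never became healthy")

def smoke_test() -> None:
    # One real inference request: checks status, response shape, and latency.
    start = time.monotonic()
    resp = requests.post(f"{SERVICE_URL}/predict", json=SAMPLE_PAYLOAD, timeout=10)
    latency = time.monotonic() - start
    resp.raise_for_status()
    if "predictions" not in resp.json():
        raise ValueError("response missing 'predictions' field")
    if latency > MAX_LATENCY_S:
        raise RuntimeError(f"latency {latency:.2f}s exceeds {MAX_LATENCY_S}s budget")

if __name__ == "__main__":
    try:
        wait_until_healthy()
        smoke_test()
    except Exception as exc:  # fail the pipeline stage on any check failure
        print(f"smoke test failed: {exc}", file=sys.stderr)
        sys.exit(1)
    print("deployment smoke test passed")
```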