DevOps Engineer - Machine Learning at CoMind
London, England, United Kingdom
Full Time


Start Date

Immediate

Expiry Date

10 Mar 2025

Salary

Not specified

Posted On

07 Nov 2024

Experience

0 years or above

Remote Job

No

Telecommute

No

Sponsor Visa

No

Skills

Pipelines, Docker, Code, Version Control, Containerization, Integration Testing, Git, Bitbucket, Parallel Processing, Infrastructure

Industry

Information Technology/IT

Description

At CoMind, we are developing a non-invasive neuromonitoring technology that will usher in a new era of clinical brain monitoring. By joining us, you will help create cutting-edge technologies that improve how we diagnose and treat brain disorders, ultimately improving and saving the lives of patients across the world.

SKILLS & EXPERIENCE:

  • Git or Bitbucket for version control, including experience with managing versioned infrastructure-as-code (IaC) repositories
  • CI/CD pipelines for automating workflows, including experience with integration testing and containerization pipelines
  • Experience managing and orchestrating complex cloud workflows (e.g., ECS Tasks, AWS Batch), with a focus on event-driven and parallel processing
  • Infrastructure as Code (IaC) experience (e.g., Terraform, AWS CloudFormation) for creating, maintaining, and scaling cloud infrastructure
  • Docker for containerization, including experience with containerizing machine learning workflows and publishing containers to repositories like AWS ECR (a minimal build-and-push sketch follows this list)
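
To make the Docker and ECR expectations concrete, below is a minimal sketch of building an image and publishing it to ECR using boto3 and the Docker CLI. The region, repository name, and tag are hypothetical placeholders, not details from this posting.

import base64
import subprocess

import boto3

REGION = "eu-west-2"   # hypothetical region
REPO = "ml-training"   # hypothetical ECR repository (must already exist)
TAG = "latest"

ecr = boto3.client("ecr", region_name=REGION)

# Fetch a temporary login token for the ECR registry.
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
registry = auth["proxyEndpoint"].removeprefix("https://")
image = f"{registry}/{REPO}:{TAG}"

# Log in, build, and push; in a CI/CD pipeline these would run as steps
# after the integration tests have passed.
subprocess.run(["docker", "login", "-u", user, "-p", password, registry], check=True)
subprocess.run(["docker", "build", "-t", image, "."], check=True)
subprocess.run(["docker", "push", image], check=True)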
Responsibilities

THE ROLE

CoMind is seeking a skilled DevOps Engineer to join our dynamic Research Data Science team and lead the orchestration of a robust ML training pipeline in AWS. This role is critical to enabling the scalable training and testing of a range of ML models on large volumes of an entirely new form of clinical neuromonitoring data.

RESPONSIBILITIES:

  • Architect and implement a scalable solution to support the Research Data Science Team in running a large number of assorted machine learning pipelines, including model training, evaluation, and inference
  • Create a CI/CD pipeline for building containers from in-house Python packages, running integration tests, and publishing to AWS ECR
  • Set up ECS tasks or AWS Batch jobs to run containers stored in AWS ECR (see the AWS Batch sketch after this list)
  • Establish a robust configuration management system to store, version, and retrieve configurations associated with multiple machine learning workflows (see the Parameter Store sketch after this list)
  • Implement robust error handling and monitoring solutions to ensure timely debugging across the pipeline with centralised logging and error reporting
  • Implement cost monitoring solutions to track and manage compute costs across different runs, building dashboards to provide insights into resource usage and cost optimization (see the Cost Explorer sketch after this list)
  • Ensure security and data protection are integrated into the pipelines by applying AWS best practices for security protocols and data management
  • Monitor and manage the team’s compute resources, including both cloud (AWS) and on-premise GPU nodes, ensuring efficient use and scalability
  • Implement Infrastructure as Code (IaC) to set up and manage the pipeline architecture, using Terraform, AWS CloudFormation, or similar tools.
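
To illustrate the orchestration work described above, here is a minimal sketch of submitting containerised training runs to AWS Batch with boto3. The job queue, job definition, and hyperparameter sweep are hypothetical placeholders; submitting one job per configuration is a simple way to get parallel processing across runs.

import boto3

batch = boto3.client("batch", region_name="eu-west-2")  # hypothetical region

# One Batch job per training configuration; Batch schedules them in
# parallel across the compute environment behind the queue.
for learning_rate in ("1e-3", "1e-4"):
    response = batch.submit_job(
        jobName=f"train-lr-{learning_rate}",
        jobQueue="ml-training-queue",        # hypothetical queue name
        jobDefinition="ml-training-job",     # hypothetical job definition
        containerOverrides={
            "command": ["python", "-m", "train", "--lr", learning_rate],
            "environment": [{"name": "RUN_GROUP", "value": "lr-sweep"}],
        },
    )
    print("submitted", response["jobId"])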
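
The posting does not name a configuration store, so as one plausible choice (an assumption, not the employer's stated stack), here is a minimal sketch using AWS SSM Parameter Store, which versions every overwrite automatically. The parameter path and payload are hypothetical.

import json

import boto3

ssm = boto3.client("ssm", region_name="eu-west-2")  # hypothetical region

config = {"model": "eeg-classifier", "batch_size": 64}  # hypothetical config

# Overwriting an existing parameter bumps its version instead of losing history.
put = ssm.put_parameter(
    Name="/ml/train/eeg-classifier/config",  # hypothetical parameter path
    Value=json.dumps(config),
    Type="String",
    Overwrite=True,
)
print("stored as version", put["Version"])

# Read back the latest value; "<name>:<version>" pins an earlier revision.
latest = ssm.get_parameter(Name="/ml/train/eeg-classifier/config")
print(json.loads(latest["Parameter"]["Value"]))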
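
For the cost-monitoring responsibility, here is a minimal sketch querying the AWS Cost Explorer API for daily spend broken down by service; the date window is illustrative, and Cost Explorer must be enabled on the account. A dashboard would chart the figures this loop prints.

import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1

result = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-11-01", "End": "2024-11-08"},  # illustrative window; End is exclusive
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print spend per day per service.
for day in result["ResultsByTime"]:
    for group in day["Groups"]:
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], group["Keys"][0], amount)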