MLOps Engineer (Remote) at CDM Smith

Toronto, ON, Canada

Full Time

Start Date

Immediate

Expiry Date

05 Dec, 25

Salary

144477.0

Posted On

06 Sep, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Kubernetes, Segmentation, Google Cloud Platform, Relational Databases, Containerization, Infrastructure, Go, Travel, Python, Java, Docker, AWS, International Travel, Machine Learning, JavaScript, Languages, Azure, Secure Network Architecture

Industry

Information Technology/IT

Description

Trinnex, a wholly owned subsidiary of CDM Smith, is seeking an MLOps Engineer specializing in AI platforms to join our growing team. Trinnex is building next-generation tools that integrate sensor/IoT data, models, geospatial data, and machine learning to solve unique engineering and environmental issues.
This position is based in Toronto, Ontario; candidates located in Vancouver, BC or Edmonton, Alberta may also be considered.
In this role, you will own the operational backbone for our AI and Data Engineering products. You will be responsible for the end-to-end production lifecycle of our ML models, from helping build the application services that wrap them to creating the automated systems for their deployment. Your ultimate goal is to ensure the overall health, scalability, and reliability of these machine learning systems in production. This requires close collaboration with internal resources to research and implement MLOps best practices, driving continuous improvement and automation across our platforms.

Responsibilities:

Design, build, and maintain scalable and reliable infrastructure to support the entire machine learning lifecycle, from experimentation and training to deployment and monitoring.
Develop and manage robust CI/CD pipelines for ML models and associated software services, ensuring automated, high-quality releases.
Collaborate closely with Data Scientists to containerize, deploy, and operationalize machine learning models, implementing solutions for both batch prediction and real-time inference use cases.
Collaborate with teams to architect generative AI applications, providing expert guidance on connecting LLMs to proprietary data sources and enabling them to execute tasks on behalf of users.
Champion MLOps best practices and empower the Data Science team by providing guidance, training, and support for new tools and automated workflows.
Partner with Software Engineers to define and implement modern service architectures, including microservices and APIs, for ML-powered applications.
Implement and manage cloud infrastructure using Infrastructure as Code (IaC) principles to ensure environments are reproducible, secure, and auditable.
Establish and maintain comprehensive monitoring, logging, and alerting systems to track model performance, data drift, and infrastructure health, and aid in incident response.
Work with cybersecurity and architecture teams to design and enforce security best practices across our cloud environment, including network configuration, identity management, and data protection.
Maintain clear and detailed documentation for MLOps processes, infrastructure, and best practices.

Skills and Abilities:

Excellent software engineering fundamentals, with a solid understanding of modern software service architecture (e.g., microservices, APIs) and CI/CD principles.
Deep, hands-on expertise with containerization (Docker) and container orchestration (Kubernetes).
Proven experience designing, building, and securing infrastructure on a major cloud platform (e.g., GCP, AWS, Azure), with a firm grasp of core concepts like identity and access management (IAM) and secure network architecture, including VPCs, firewall policies, and segmentation.
Demonstrable understanding of the end-to-end machine learning lifecycle and experience deploying models for both batch and real-time/live inference workloads.
Experience working with and understanding the trade-offs between different data storage paradigms, such as relational databases (e.g., PostgreSQL), analytical data warehouses (e.g., BigQuery), and cloud object storage (e.g., GCS, S3).
Solid understanding of Python.
Excellent communication, interpersonal, and organizational skills, with a demonstrated ability to manage and prioritize multiple tasks effectively, both independently and as part of a team.


MINIMUM QUALIFICATIONS

Bachelor’s Degree.
5 years of related experience.
Equivalent additional directly related experience will be considered in lieu of a degree.
Domestic and/or international travel may be required. The frequency of travel is contingent on specific duties, responsibilities, and the essential functions of the position, which may vary depending on workload and project demands.

PREFERRED QUALIFICATIONS

Professional experience with Google Cloud Platform (GCP), especially its AI/ML services like Vertex AI.
Hands-on experience building applications that connect LLMs to external systems, such as using Retrieval-Augmented Generation (RAG) for querying data or enabling tool use (function calling). Familiarity with frameworks like LangChain is a plus.
Experience with core MLOps components, including experiment tracking (e.g., MLflow, Vertex AI Experiments) and model registries.
Experience with modern workflow orchestration frameworks designed for machine learning (e.g., Kubeflow Pipelines, Flyte, or Prefect).
Intermediate to advanced knowledge of Infrastructure as Code (IaC) tools, particularly Terraform.
Experience managing Kubernetes applications using Helm.
Experience with specific CI/CD tools (e.g., Azure DevOps Pipelines).
Hands-on experience with service mesh technologies like Istio.
Broader coding and debugging skills in languages such as JavaScript, C#, Java, or Go.
