Machine Learning Engineer at Alberta Machine Intelligence Institute
Edmonton, AB T5J 3B1, Canada
Full Time


Start Date

Immediate

Expiry Date

06 Dec, 25

Salary

0.0

Posted On

07 Sep, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Computer Science, Training, Resource Efficiency, Proxmox, Information Technology, Linux, Kubernetes, Data Science, Docker

Industry

Information Technology/IT

Description

“I’m incredibly excited to welcome a new Machine Learning Engineer to our team! This role is perfect for someone passionate about diving deep into system architecture and large-scale, GPU-enabled high-performance computing clusters, optimizing AI workflows, and shaping the future of our infrastructure. We’re looking for a collaborative individual who thrives on both technical excellence and guiding others, ultimately making a significant impact on our research productivity and the advancement of state-of-the-art AI models. I can’t wait to see the innovative solutions you’ll bring to Amii!”
– Greg Burlet, Director of Engineering

QUALIFICATIONS:

  • Post-Secondary Degree in Computer Science, Information Technology, Data Science, or a related field
  • Advanced Degrees or Certifications in High-Performance Computing (HPC), Computer Science, ML/AI, or Cloud Infrastructure (nice to have)
  • 3+ years of experience in systems architecture, DevOps/MLOps, and multi-node compute clusters, including networking, IaC (Terraform, Ansible), CI/CD pipelines, VMs and hypervisors (Proxmox), Linux, node provisioning (Warewulf), containers and container orchestration (Docker, Kubernetes), logging stacks (ELK), and job schedulers (Slurm)
  • 5+ years of experience in Python programming applied to ML/RL (PyTorch, TensorFlow, distributed training with DDP or FSDP; a minimal DDP sketch follows this list), systems administration, or software development roles in UNIX/Linux or large-scale HPC environments (experience working with GPUs preferred), as well as associated cloud computing platforms
  • Knowledge of and experience with training reinforcement learning models (nice to have)
  • Source Code Analysis: Proficiency in analyzing the source code of AI-based software
  • Experience in academic or industry research environments
  • Experience with optimizing large-scale AI models for resource efficiency
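
To give a feel for the distributed-training experience listed above, here is a purely illustrative sketch of a minimal PyTorch DistributedDataParallel (DDP) training loop launched with torchrun. The model, dataset, and hyperparameters are placeholders for the example, not anything specific to Amii's stack.

"""
Minimal multi-GPU data-parallel training sketch with PyTorch DDP.
Illustrative only; model, dataset, and hyperparameters are placeholders.
Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
"""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; a real job would load its own
    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)          # shards data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                        # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()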

WHAT YOU’LL LOVE ABOUT US

  • A professional yet casual work environment that encourages the growth and development of your skills.
  • Opportunities to participate in professional development activities.
  • Access to the Amii community and events.
  • A chance to learn from amazing teammates who support one another to succeed.
  • Competitive compensation, including paid time off and flexible health benefits.
  • A modern office located in downtown Edmonton, Alberta.

Responsibilities

ABOUT THE ROLE

The Machine Learning (ML) Engineer plays a key role in ensuring machine learning research and applied AI projects operate securely and effectively. As a core member of the team, the ML Engineer will collaborate with senior engineering leaders to deploy and manage computing infrastructure, optimize AI workflows, develop training materials, and contribute to the technical development of both individuals and the organization.
The ML Engineer will work with cross-functional teams and external partners to support the execution of research and applied projects. Specifically, this engineer will focus on implementing and managing the software stack of High-Performance Computing (HPC) systems and pipelines, ensuring efficient and effective allocation and utilization of compute resources such as GPUs, and providing support for users of these systems. This role is critical to advancing research productivity and enabling state-of-the-art machine learning models.
In addition to hands-on technical work, the ML Engineer will contribute to the strategic planning of our infrastructure, working alongside the Director, Engineering, and the Director, IT to develop strategies, playbooks, and best practices for optimizing our tools, frameworks, and services.

The role focuses on achieving excellence in three main accountabilities:

  • Infrastructure and Systems Management
  • AI Workflow Optimization
  • Technical Coaching and Collaboration

KEY RESPONSIBILITIES:

  • Assists in the design, implementation, and management of High-Performance Computing (HPC) clusters to support AI research in machine learning (ML) and reinforcement learning (RL)
  • Identifies and resolves end-user queries related to AI workflows, providing configuration support for software coding issues and infrastructure setups
  • Oversees computing resources in the cloud or on premises to ensure secure and efficient operations, with a focus on optimizing resource utilization and availability for AI workflows
  • Provides support by diagnosing and resolving technical issues, performing routine system maintenance, and enhancing the performance of supporting infrastructures
  • Assists in the development and refinement of playbooks and strategies to maximize the use of tools, infrastructure, libraries, and frameworks
  • Collaborates with university system administrators, AI researchers, and support staff to facilitate the integration and operation of AI/ML systems
  • Leverages strong knowledge of Linux-based systems (e.g. Red Hat, CentOS, Ubuntu), proficiency in scripting languages (e.g. Bash, Python), Infrastructure as Code (e.g. Terraform, Ansible), container and orchestration systems (e.g. Docker, Kubernetes), and GitHub workflows to automate tasks and optimize infrastructure performance
  • Establishes monitoring and logging systems to track infrastructure performance, proactively detect anomalies, and ensure real-time alerts for system integrity and uptime (a small GPU-monitoring sketch follows this list)
  • Implements and maintains automated AI pipelines to streamline model development and ensure effective use of computational resources
  • Monitors, assesses and analyzes data from ML projects to ensure effective model performance and project outcomes
  • Utilizes strong analytical skills to visualize data, analyze statistical trends, and assess the effectiveness of AI software outputs
  • Applies hands-on experience with AI tools and frameworks (e.g. TensorFlow, PyTorch, Keras) to support deep learning initiatives
  • Installs, configures, and diagnoses software applications for machine learning algorithms and GPU-based computing to ensure seamless operation
  • Participates in training, code reviews, and coaching to enhance team members’ technical capabilities
  • Designs and delivers technical training on AI/ML workflows, guiding researchers on how to optimize their workflows and make the most of available infrastructure resources
  • Partners with the Director, Engineering and external organizations (e.g. academic institutions and industry partners) to support ML projects
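
To make the monitoring responsibility above concrete, here is a purely illustrative sketch of a small GPU-utilization poller. The CSV output, file name, and polling interval are assumptions made for the example; in practice these metrics would feed an existing monitoring and logging stack rather than a local file.

"""
Minimal GPU-utilization poller (illustrative sketch only).
Assumes nodes expose NVIDIA GPUs via nvidia-smi; the output file and
polling interval are placeholder choices, not part of any Amii system.
"""
import csv
import subprocess
import time

def sample_gpus():
    # nvidia-smi's --query-gpu CSV mode is a stable, scriptable interface
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples = []
    for line in out.strip().splitlines():
        idx, util, mem_used, mem_total = [field.strip() for field in line.split(",")]
        samples.append((int(idx), int(util), int(mem_used), int(mem_total)))
    return samples

def main(interval_s=60, log_path="gpu_usage.csv"):
    # Append one row per GPU per polling interval: timestamp, GPU index,
    # utilization (%), memory used (MiB), memory total (MiB)
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = time.time()
            for gpu_index, util_pct, mem_used, mem_total in sample_gpus():
                writer.writerow([now, gpu_index, util_pct, mem_used, mem_total])
            f.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    main()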