Site Reliability Engineer - SRE at AIT Global Inc

Montréal, QC, Canada

Full Time

Start Date

Immediate

Expiry Date

05 Dec, 25

Salary

$45.00-$60.00 per hour

Posted On

06 Sep, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

PowerShell, Distributed Systems, ITIL, Splunk, Security, Firewalls, Docker, Jenkins, Analytics, Scripting, Automation, Orchestration, Python, Kubernetes, DevOps

Industry

Information Technology/IT

Description

JOB DESCRIPTION:

We are seeking a Production Support Engineer with expertise in Azure Cloud for the IT Reliability & Production Engineering (RPE) domain. The role focuses on ensuring the stability, performance, and reliability of production systems in a high-availability environment.

REQUIRED SKILLS:

Strong experience in Azure Cloud (Azure Monitor, App Insights, Log Analytics).
Scripting & Automation: PowerShell, Python, Terraform, or Ansible.
Monitoring & Observability: Prometheus, Grafana, Splunk, or Datadog.
Incident Management: ITIL, SRE principles, RCA methodologies.
CI/CD & DevOps: Azure DevOps, GitHub Actions, Jenkins.
Containers & Orchestration: Kubernetes (AKS), Docker.
Networking & Security: Load balancers, firewalls, IAM, VPNs.

PREFERRED QUALIFICATIONS:

Experience with large-scale distributed systems.
Familiarity with SQL/NoSQL databases.
Knowledge of cloud-native architecture.

Job Type: Fixed term contract
Contract length: 12 months
Pay: $45.00-$60.00 per hour
Work Location: In person

Responsibilities

Monitor, troubleshoot, and resolve production incidents and service disruptions.
Ensure system reliability, availability, and performance using SRE principles.
Automate manual processes and optimize cloud infrastructure on Azure.
Analyze logs, metrics, and alerts to prevent incidents.
Collaborate with development, DevOps, and infrastructure teams for issue resolution.
Implement CI/CD pipelines, observability, and proactive monitoring strategies.
Maintain Azure resources (VMs, AKS, Storage, Networking, etc.).
Participate in on-call rotations for critical production support.