Site Reliability Engineer, Performance at CentML
Toronto, ON, Canada
Full Time


Start Date

Immediate

Expiry Date

14 Mar, 25

Salary

0.0

Posted On

07 Feb, 25

Experience

3 year(s) or above

Remote Job

No

Telecommute

No

Sponsor Visa

No

Skills

Good communication skills

Industry

Information Technology/IT

Description

ABOUT US

We believe AI will fundamentally transform how people live and work. CentML’s mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.
Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.

ABOUT THE POSITION

As a Site Reliability Engineer, you will play a pivotal role in shaping the infrastructure and reliability practices at CentML. You will work on complex projects and collaborate with cross-functional teams to ensure our systems meet the highest standards of reliability, performance, and security. We’re looking for a Site Reliability Engineer with a strategic focus on performance optimization and testing.

Responsibilities
  • Build large-scale, distributed systems that support complex workloads, ensuring high availability and fault tolerance.
  • Contribute to automation, configuration management, and infrastructure-as-code efforts, minimizing manual operations and ensuring consistency.
  • Optimize the performance and scalability of our systems, identifying and addressing bottlenecks before they impact users.
  • Participate in incident response efforts, including real-time troubleshooting, root cause analysis, and postmortem reviews.
  • Develop and maintain comprehensive monitoring, alerting, and logging systems that provide deep visibility into system health and performance.
  • Drive continuous improvement in system reliability, performance, and scalability through the adoption of new technologies, tools, and methodologies.
  • Stay current with industry trends and innovations in SRE and ML infrastructure, bringing new ideas and approaches to the team.