Site Reliability Engineer, Performance at CentML

Toronto, ON, Canada - Full Time

Start Date: Immediate
Expiry Date: 14 Mar, 25
Salary: 0.0
Posted On: 07 Feb, 25
Experience: 3 year(s) or above

Remote Job

Telecommute

Sponsor Visa

Skills: Good communication skills
Industry: Information Technology/IT

Description

ABOUT US

We believe AI will fundamentally transform how people live and work. CentML’s mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.
Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.

ABOUT THE POSITION

As a Site Reliability Engineer, you will play a pivotal role in shaping the infrastructure and reliability practices at CentML. You will work on complex projects and collaborate with cross-functional teams to ensure our systems meet the highest standards of reliability, performance, and security. We’re looking for a Site Reliability Engineer with a strategic focus on performance optimization and testing.

Responsibilities

Build large-scale, distributed systems that support complex workloads, ensuring high availability and fault tolerance.
Contribute to automation, configuration management, and infrastructure-as-code efforts, minimizing manual operations and ensuring consistency.
Optimize the performance and scalability of our systems, identifying and addressing bottlenecks before they impact users.
Participate in incident response efforts, including real-time troubleshooting, root cause analysis, and postmortem reviews.
Develop and maintain comprehensive monitoring, alerting, and logging systems that provide deep visibility into system health and performance.
Drive continuous improvement in system reliability, performance, and scalability through the adoption of new technologies, tools, and methodologies.
Stay current with industry trends and innovations in SRE and ML infrastructure, bringing new ideas and approaches to the team.