Senior Site Reliability Engineer at Comfy

San Francisco, California, USA

Full Time

Start Date

Immediate

Expiry Date

30 Oct, 25

Salary

300000.0

Posted On

30 Jul, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

AWS, Azure, Communication Skills, Kubernetes

Industry

Information Technology/IT

Description

YOU ARE A GOOD FIT IF THIS DESCRIBES YOU:

You possess a strong understanding of foundational cloud infrastructure (AWS/GCP/Azure) and Linux provisioning/management tools.
You know how to design for reliability and scale with minimal operational overhead.
You learn new technologies rapidly because you’re excited by solving hard infrastructure challenges.
You’ve scaled infrastructure before and understand the tradeoffs that matter.
You think most infrastructure moves too slowly and could be way better automated and optimized.
You’re comfortable diving into unfamiliar systems and making them work reliably.
You are a self-starter who executes quickly, takes ownership, and constantly seeks improvement.

REQUIREMENTS:

You have relevant experience as an SRE at a high-tech startup.
Experience participating in incident management processes.
Strong foundation and experience in managing cloud infrastructure (AWS, GCP, or Azure). Experience with bare metal is a plus.
Solid understanding of container orchestration (Kubernetes preferred) and CI/CD principles and tools.
Excellent communication skills.
Proven ability to learn fast and ship quality infrastructure code and configurations.

ABOUT US

We are a small, intense, and well-funded team in San Francisco that pushes ComfyUI and its ecosystem forward. Our team comes from Stability AI and Google, and many of us contributed to the ComfyUI ecosystem long before working here.
Our organization is flat and there is no hierarchy, only categories: dev, arts, prod, ops, etc. (And no, there is no one here with the title of Member of Technical Staff; it’s long and silly for a job title.)
The only thing that matters is the quality of your cultural fit and execution. We work hard and demand a lot of each other. But we have fun: everyone is here to make something meaningful that will end up being our life’s work. If this mission excites you and you view yourself as a top-tier talent, your future latent self is waiting for you at Comfy.
Check out our GitHub and blog for what we’ve been working on. Our investors include Pace Capital, Chemistry, Abstract Venture, and Guillermo Rauch.
Compensation Range: $150K - $300K

Responsibilities

THE ROLE

We are looking for an SRE to join our infrastructure team. This role will be responsible for ensuring the reliability of our back-end systems, working with engineers who develop them, and planning for our future growth. Our core infrastructure relies heavily on Kubernetes (K8s), Terraform, and GCP, but we care more about your ability to learn, adapt, and ship robust solutions than whether you’ve used these exact tools before.

WHAT YOU’LL DO:

Develop and maintain our core Python platform for routing requests, orchestrating AI workloads, managing GPU server capacity, observability, and more.
Develop and maintain our infrastructure layer using Terraform and cloud provider APIs to manage our fleet of GPU workers across cloud and potentially bare metal environments.
Own and operate the technologies underpinning our platform, potentially including K8s, FluxCD, Nomad, Prometheus, Thanos, Datadog, Loki, distributed networking/storage, etc.
Architect and implement solutions that directly impact the performance and availability of services for millions of ComfyUI users.
Work closely with our core engineering team to design and build new infrastructure systems.
Help create the vision and lay the foundation for where our infrastructure should go over the next 1, 2, and 5 years.
Help shape our technical direction and infrastructure best practices as we grow.