Senior Manager - Reliability Engineering and Observability Platforms at Inspire Brands

Atlanta, GA 30328, USA

Full Time

Start Date

Immediate

Expiry Date

06 Dec, 25

Salary

0.0

Posted On

07 Sep, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

New Relic, Azure, Infrastructure, AWS, Code, Reliability Engineering, DevOps, IT Operations, Dynatrace, Logging, Computer Science, Information Technology

Industry

Information Technology/IT

Description

We are seeking an experienced and dynamic Senior Manager, Reliability Engineering & Observability Platforms to lead our observability initiatives and reliability engineering efforts. This role is accountable for designing and managing platforms that ensure visibility, uptime, performance, and seamless operation of critical systems and services.
The ideal candidate will bring deep technical expertise in observability tools and reliability engineering practices, along with proven leadership experience. They will lead a team responsible for enabling high availability, incident response, performance monitoring, and operational resilience through automation, process improvement, and cross-functional collaboration.
This leader will partner closely with IT, DevOps, Infrastructure, and Business Units to deliver scalable and reliable services with a focus on proactive issue detection and resolution. This is an in-office position based in Atlanta (80% onsite).

EDUCATION AND EXPERIENCE QUALIFICATIONS

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
Master’s degree is a plus.
5–7 years of experience managing and leading engineering teams.
7+ years in IT operations, DevOps, or software/platform engineering.
3+ years in a leadership role focused on observability or reliability engineering.

Responsibilities

Own and evolve observability platforms (monitoring, logging, tracing) to meet organizational needs for performance and availability.
Improve observability maturity by driving adoption of best practices and platform-wide instrumentation.
Lead the reliability engineering function focused on ensuring system uptime, operability, and resilience.
Define and track SLOs/SLIs/SLAs, partnering with product and infrastructure teams to uphold service quality standards.
Drive adoption of reliability best practices into application design, deployments, and operations.
Develop and mature incident management processes including alerting, triage, resolution, and post-incident reviews.
Oversee and continuously improve on-call strategies, ensuring the team is prepared for high-impact production events.
Champion automation of monitoring, diagnostics, deployment validation, and platform operations to reduce manual effort.
Integrate observability and reliability engineering practices into CI/CD pipelines and deployment workflows.
Mentor and lead a team of engineers with a focus on operational excellence, continuous learning, and accountability.
Build a high-performing team culture aligned to business outcomes and platform stability.
Collaborate with cross-functional teams including application developers, DevOps, cloud infrastructure, and security to ensure reliable and observable service delivery.
Partner with architecture and engineering teams to ensure new systems are designed with reliability in mind.
Provide regular reporting and insights on system health, incidents, and reliability trends to leadership.
Use telemetry data to identify system bottlenecks, recurring issues, and areas for proactive improvement.
Manage observability and reliability tool vendors, including evaluation, contracts, renewals, and integrations.