Site Reliability Engineer, Distribution Engineering at NBCUniversal
Stamford, CT 06902, USA
Full Time


Start Date

Immediate

Expiry Date

10 Oct, 25

Salary

145000.0

Posted On

11 Jul, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Splunk, AWS, Multicast, Infrastructure, Ansible, Bash, Containerization, Docker, Code, Linux System Administration, Python, Jenkins, Ownership, Kubernetes, Automation, Computer Science, Orchestration

Industry

Information Technology/IT

Description

Company Description
NBCUniversal is one of the world’s leading media and entertainment companies. We create world-class content, which we distribute across our portfolio of film, television, and streaming, and bring to life through our theme parks and consumer experiences. We own and operate leading entertainment and news brands, including NBC, NBC News, MSNBC, CNBC, NBC Sports, Telemundo, NBC Local Stations, Bravo, USA Network, and Peacock, our premium ad-supported streaming service. We produce and distribute premier filmed entertainment and programming through Universal Filmed Entertainment Group and Universal Studio Group, and have world-renowned theme parks and attractions through Universal Destinations & Experiences. NBCUniversal is a subsidiary of Comcast Corporation.
Our impact is rooted in improving the communities where our employees, customers, and audiences live and work. We have a rich tradition of giving back and ensuring our employees have the opportunity to serve their communities. We champion an inclusive culture and strive to attract and develop a talented workforce to create and deliver a wide range of content reflecting our world.
Comcast NBCUniversal has announced its intent to create a new publicly traded company (‘Versant’) comprising most of NBCUniversal’s cable television networks, including USA Network, CNBC, MSNBC, Oxygen, E!, SYFY and Golf Channel, along with complementary digital assets Fandango, Rotten Tomatoes, GolfNow, GolfPass, and SportsEngine. The well-capitalized company will have significant scale as a pure-play set of assets anchored by leading news, sports and entertainment content. The spin-off is expected to be completed during 2025.
Job Description

NBCUniversal is seeking creative and driven Site Reliability Engineers to join our Distribution Engineering team. This team supports the infrastructure and systems that power NBCU’s broadcast, streaming, and monitoring platforms. Within Distribution Engineering, we’re hiring SREs across three closely integrated focus areas: Video Streaming, Monitoring & Control, and Playout. As an SRE, you will be responsible for the engineering, operations, support, deployment, and maintenance of critical systems across on-premises and cloud environments. You will work in a fast-paced, agile environment where innovation and reliability are key.

  • Develop automation to deploy, maintain, and monitor infrastructure and applications.
  • Troubleshoot and resolve issues in live, on-air environments.
  • Participate in CI/CD pipelines, including code deployment, testing, and monitoring.
  • Create and maintain system metrics, dashboards, and alerting to ensure high availability.
  • Collaborate with engineering, operations, and vendor teams to support system health and performance.
  • Act as a Level 2 support resource for broadcast-related incidents, including root cause analysis and documentation.
  • Participate in on-call rotation for 24/7 support coverage.
  • Evaluate new technologies and contribute to proof-of-concept deployments.
  • Document system configurations, incident resolutions, and operational procedures.

Qualifications

REQUIREMENTS:

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 3+ years of SRE experience in the technology sector, supporting and maintaining production-quality software or software-defined infrastructure in high-traffic cloud environments (AWS preferred).
  • Experience with IP video and broadcast technologies.
  • Proficiency in Linux system administration.
  • Experience with Infrastructure as Code (Terraform or CloudFormation) and configuration management technologies (Ansible).
  • Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, ArgoCD).
  • Experience with containerization and orchestration (Docker, Kubernetes, EKS).
  • Scripting experience (Python, Bash, or similar).
  • Strong understanding of networking fundamentals and troubleshooting.
  • Experience with monitoring/logging tools (e.g., Grafana, Splunk, ELK, CloudWatch).
  • Comfortable working in agile, fast-paced environments.

Hybrid: This position has been designated as hybrid, generally contributing from the Stamford, CT office a minimum of 3 days per week.

PREFERRED QUALIFICATIONS:

  • Experience maintaining both Linux and Windows environments.
  • Familiarity with broadcast and monitoring tools such as DataMiner, TAG systems, and/or MediaProxy.
  • Strong hands-on experience debugging and troubleshooting distributed microservices in Kubernetes, including analyzing pod logs.
  • Solid understanding of networking concepts relevant to video streaming, including multicast, unicast, RTP/RTMP, and CDN workflows.
  • Ability to take ownership of problems and drive solutions through automation where applicable (automation-first mentality).