Site Reliability Engineer at Haystack

London, England, United Kingdom

Full Time

Start Date

Immediate

Expiry Date

08 Nov, 25

Salary

0.0

Posted On

09 Aug, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Programming Languages, Code, Linux, Typescript, Kubernetes, Deployment Strategies, Automation, Incident Response

Industry

Information Technology/IT

Description

REQUIRED SKILLS & EXPERIENCE

Proficiency in TypeScript or similar programming languages.
Strong knowledge of OpenTelemetry and observability tools (e.g., Datadog, Grafana).
Solid grasp of SRE principles: SLIs/SLOs, automation, monitoring, and incident response.
Hands-on experience with AWS services (e.g., Lambda, ECS, S3, DynamoDB).
Proficient with Linux, command-line tools, and system-level debugging.
Experience using infrastructure-as-code tools such as Pulumi or Terraform.
Familiarity with Kubernetes, CI/CD pipelines, and automated deployment strategies.
Strong analytical and problem-solving abilities with attention to detail.

Responsibilities

ABOUT THE ROLE

Our client is seeking a Senior Site Reliability Engineer to enhance observability practices, boost system reliability, and support high availability across its platforms. This role involves close collaboration with engineering and infrastructure teams, combining software development with systems expertise to deliver dependable, observable, and efficient services.

KEY RESPONSIBILITIES

Observability Leadership: Enhance telemetry collection and processing using OpenTelemetry, prioritizing actionable and cost-efficient metrics and traces.
Reliability Standards: Guide teams in defining and adopting SLIs/SLOs and foster a culture of service ownership.
Incident Management: Lead incident response efforts, facilitate post-incident reviews, and drive implementation of long-term solutions.
Infrastructure Automation: Use tools such as Pulumi, Terraform, or AWS CDK to manage cloud infrastructure and CI/CD pipelines.
Software Development: Build tools and automation in TypeScript (Rust is optional), and contribute to shared libraries and internal platforms.
Mentorship & Collaboration: Support and mentor other engineers, promoting a reliability-focused mindset across teams.
Continuous Improvement: Explore innovative tools and practices in observability and reliability; lead proof-of-concepts and improvement initiatives.

THIS ROLE SUITS SOMEONE WHO:

Takes ownership and builds trust.
Communicates openly and supports teammates.
Is curious and continuously looks for ways to improve.
Embraces change and approaches challenges with flexibility.
Thinks long-term and works collaboratively across teams.
Role tech stack
OpenTelemetry