Site Reliability Engineer at CLOUD BRIDGE

Marlow SL7 3AA, United Kingdom

Full Time

Start Date

Immediate

Expiry Date

09 May, 25

Salary

0.0

Posted On

09 Feb, 25

Experience

0 year(s) or above

Remote Job

Telecommute

Sponsor Visa

Skills

Jenkins, Ruby, Kubernetes, Microservices, Python, Cloud Security, Azure, Docker, Infrastructure, Incident Response, Large Scale Systems, Storage, Firewalls, AWS, Containerization, Distributed Systems, Orchestration, Bash

Industry

Information Technology/IT

Description

The Site Reliability Engineer (SRE) will play a key role in maintaining and scaling infrastructure to ensure reliability and performance. You will collaborate closely with development, operations, and security teams to improve the reliability and efficiency of applications by addressing incidents, automating processes, and managing infrastructure as code.

REQUIRED SKILLS & EXPERIENCE:

Hands-on experience with AWS, GCP, or Azure for managing compute, storage, and networking services.
Proficiency in using Terraform, CloudFormation, Ansible, or similar tools for automating infrastructure.
Strong experience in monitoring and incident response using tools like Prometheus, Grafana, and the ELK Stack.
Strong scripting skills in Python, Bash, Go, or Ruby for automating tasks and building custom tools.
Experience with CI/CD pipelines (Jenkins, GitLab CI) and optimizing performance for large-scale systems.
Familiarity with cloud security, access controls, firewalls, and networking best practices.

PREFERRED QUALIFICATIONS:

Certifications: AWS Certified DevOps Engineer, Google Professional Cloud Architect, or similar.
Containerization & Orchestration: Experience with Docker, Kubernetes, or ECS/EKS for containerized applications.
SRE Experience: Familiarity with SRE principles like SLAs, SLOs, and error budgets, and practical application of those in large-scale systems.
Distributed Systems: Understanding of microservices, service discovery, and fault-tolerant architectures.

If you are an experienced Site Reliability Engineer with a passion for building and maintaining highly available systems, we want to hear from you!

Responsibilities

Build and scale cloud infrastructure (AWS, GCP, or Azure), automate provisioning using Terraform or CloudFormation, and manage resources for optimal performance.
Monitor, troubleshoot, and resolve incidents, optimizing systems to ensure reliability and minimize downtime.
Implement monitoring (Prometheus, Grafana, Datadog) and set up alerting systems to proactively address issues and ensure scalability.
Work with DevOps, engineering, and security teams to improve application deployment, infrastructure management, and system resilience.
Develop disaster recovery strategies, ensure infrastructure security through best practices, and maintain business continuity.
Maintain system documentation and lead continuous improvement initiatives to streamline operations.