Site Reliability Engineer II

at Stacklok

London, England, United Kingdom

Start Date: Immediate
Expiry Date: 16 Feb, 2025
Salary: Not Specified
Posted On: 18 Nov, 2024
Experience: 5 year(s) or above
Skills: Key Performance Indicators, Code, Issue Identification, Security, Kubernetes, Configuration Management, Google, Operational Excellence, Platforms, Service Quality, Supply Chain, Performance Reviews, Azure, Docker, Communication Skills, Automation Tools
Telecommute: No
Sponsor Visa: No

Description:

Stacklok is an innovative software supply chain security startup founded by Kubernetes co-founder Craig McLuckie and Sigstore founder Luke Hinds. Our mission is to make it easier to securely develop software. With our deep expertise in open source technologies and commitment to enhancing software security, we are seeking highly skilled and motivated individuals to join our team. This is a rare opportunity to join a startup at an early stage, and to be part of a team that is committed to building something truly innovative and impactful.

ELEVATOR PITCH

Stacklok Cloud is a comprehensive security platform that combines open source package intelligence with a policy platform built on the open source project Minder. It allows developers to securely consume open source software while enabling security teams to effectively manage and maintain a robust security posture across the entire software supply chain.
As Stacklok Cloud is delivered to major companies across the world, ensuring its scalability, security, performance, and reliability is essential. We’re seeking a Site Reliability Engineer II to contribute to initiatives within a product team, focusing on automation, monitoring, configuration management, continuous delivery, and incident response. This role involves applying both SRE and software engineering expertise to ship new features and serve as a resource for best practices in reliability, performance metrics, and system resilience. Additionally, participation in Stacklok’s SRE guild will be integral, collaborating with peers to drive consistent practices in automation, observability, and reliability across all products, fostering a seamless and high-performing SaaS platform.
Join our team of exceptionally talented engineers and become part of a groundbreaking field that tackles critical challenges for developers and the OSS community. Contribute to an open source strategy that focuses on building and expanding an ecosystem for diverse OSS tools, and help shape the future of open source development with innovative and impactful work.

DESIRED SKILL & EXPERIENCE

  • Experience in Site Reliability Engineering supporting an enterprise SaaS service with evidence of maintaining high availability and performance in production environments.
  • Proficient in programming languages, particularly Go or Python, demonstrating the ability to write clean, efficient, and maintainable code.
  • Familiarity with Infrastructure as Code (IaC) principles, with proficiency in automation tools like Terraform for environment provisioning and configuration management.
  • Experience with a major cloud provider (AWS, Azure, Google), preferably AWS.
  • Understanding of cloud-native application deployment and management using technologies like Docker and Kubernetes, with exposure to scaling and recovery strategies.
  • Experience in automating incident response processes using platforms such as PagerDuty to improve response times and incident management efficiency.
  • Proficient in log aggregation and analysis tools such as AWS Athena and CloudWatch, enabling thorough performance reviews and proactive issue identification.
  • Exposure to defining and implementing Service Level Objectives (SLOs) and key performance indicators (KPIs) to drive service quality and operational excellence.
  • Knowledge of security best practices in site reliability, with an emphasis on operational security measures and maintaining a secure software supply chain.
  • Impact-Driven and Collaborative: Track record of delivering solutions that drive business outcomes; excellent written and verbal communication skills for engaging diverse stakeholders. Committed to fostering growth and continuous improvement within teams.
  • Versatile and Self-Starting: Adaptable in dynamic, startup environments, comfortable in varied roles—from individual contributor to conference presenter—and skilled at making technical topics accessible to broad audiences.

Responsibilities:

SUCCESS IN THE ROLE: 6-12 MONTHS EXPECTATIONS

  • Acclimatize to the Team: Familiarize yourself with our engineering processes. Build connections with team members, immerse yourself in our company culture, understand our virtues, and learn the way we work and collaborate.
  • Understand Our Products and Services: Develop a strong grasp of Stacklok’s products and services, our platform vision and goals, both immediate and future, to enable alignment between your contributions and our objectives.
  • Deep Dive Into Stacklok Cloud Architecture: Become comfortable with the current infrastructure-as-code environment using Terraform to deploy SaaS software to Kubernetes on AWS. Apply tools like Argo CD for continuous delivery, Helm for managing Kubernetes packages, and GitHub Actions for workflow automation.
  • Proficiency in Go and Python: Develop proficiency in Go, our primary programming language, focusing on best practices, idiomatic design patterns, effective error handling, and unit testing. Demonstrate intermediate knowledge of Python, specifically in leveraging its capabilities for automation, scripting, and building internal tools.
  • Hybrid Contribution: As part of the SRE team, you’ll primarily balance contributions between product reliability improvements and company-wide infrastructure enhancements, advancing Stacklok’s platform underpinnings. Additionally, you’ll make direct contributions to feature development, further enhancing the capabilities of our products and services.
  • Technical Guidance and Documentation: Support production infrastructure by contributing to and maintaining comprehensive documentation, including playbooks and architectural diagrams, to ensure team alignment.
  • On-Call Rotation Responsibilities: You will be on call every 5-6 weeks, with each rotation lasting 2 weeks. During each rotation, you will alternate with the other on-call engineer between primary and secondary roles. The primary role involves leading incident resolution and communication, while the secondary role provides support with troubleshooting and monitoring.

IN THIS ROLE YOU WILL HAVE THE OPPORTUNITY TO:

  • Shape The Future of Stacklok Cloud: As a site reliability engineer, you’ll play a key role in supporting and enhancing our platform’s reliability and performance. Your focus will include regular platform upgrades and the instrumentation and monitoring of production systems. You’ll help advance our platform and shape strategies for the future of software supply chain security.
  • Embrace an Automate Everything Mindset: Contribute to a culture of automation by streamlining operational tasks and enhancing efficiency across the environment. You’ll support automation initiatives for incident management tooling, application autoscaling, and recovery processes to ensure resilient systems that adapt to changing demands. Collaborating with a skilled team, you’ll help automate playbooks, continuous delivery pipelines, and GitHub Terraform processes, driving improvements in service delivery and incident response.
  • Monitor and Improve Service Performance: Support end-to-end monitoring of service KPIs to drive improvements and maintain optimal performance. You’ll regularly review logs and performance metrics, using shared tools and incident response automations to enhance system reliability. With an analytical mindset, you’ll contribute to identifying areas for KPI improvement, helping us consistently meet and exceed our performance goals.
  • Learn and Grow with Mentorship Opportunities: Work alongside experienced engineers who will support your professional growth and skill development. By collaborating in a culture of empathy, curiosity, and psychological safety, you’ll deepen your understanding of infrastructure and site reliability best practices. Engaging in code reviews and team discussions will allow you to refine your skills, share insights, and contribute to a strong, capable team. This role offers a clear path for growth, helping you build toward new responsibilities and technical expertise.

We understand that not everyone will meet every requirement listed, and that’s perfectly okay! We encourage you to apply regardless of your self-assessment. We value a diverse range of skills and experiences and believe that your unique attributes can make a significant impact. We want to hear from you!


REQUIREMENT SUMMARY

Min: 5.0, Max: 12.0 year(s)

Information Technology/IT

IT Software - Application Programming / Maintenance

Software Engineering

Graduate

Proficient

1

London, United Kingdom