Head of SRE/DevOps at Leaf Space

22074 Lomazzo, Lombardia, Italy

Full Time

Start Date

Immediate

Expiry Date

18 Jul, 25

Salary

0.0

Posted On

14 Jun, 25

Experience

0 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Interpersonal Skills, Computer Science, Reliability Engineering, Microservices, Logging, Architecture, Security, Complex Systems, Distributed Systems

Industry

Information Technology/IT

Description

Leaf Space is a rapidly growing scale-up company and a leading provider of ground segment as-a-service (GSaaS) solutions. Our innovative and proprietary concept is focused on providing satellite and launch vehicle connectivity as-a-service, enabling clients to efficiently manage their assets and fully exploit data. Our GSaaS solutions have been recognized by the market for their efficiency, security, and effectiveness in supporting different applications, from remote sensing to IoT communications.
At Leaf Space, we operate with a flat organizational structure, we are growing both financially and in headcount, and we work professionally and autonomously in a fast-paced environment. We prioritize hiring top talent and cultivating a collaborative, high-achieving, and supportive workplace. Our core team is headquartered in Lomazzo (Como), Italy, and we have expanded our presence in the U.S. with headquarters based in Northern Virginia.
As we continue to develop innovative technologies to support the NewSpace economy, we are looking for world-class talent ready to tackle challenging projects to drive expansion and sustainability of the space ecosystem.
Leaf Space offers comprehensive benefits and flexible remote work options to support our employees in achieving their goals.

HEAD OF DEVOPS/SERVICE RELIABILITY ENGINEER

We are seeking a highly experienced Head of Service Reliability Engineering (SRE) to lead our dynamic and fast-paced team. In this leadership position, you will oversee a team responsible for shaping our cloud architecture while ensuring the deployment, reliability, scalability, and performance of our systems and services, maintaining our infrastructure to the highest standards of excellence.
As the Head of SRE, you will collaborate with cross-functional teams, including developers, operations, mission managers, and quality assurance, to design, implement, and maintain robust and efficient infrastructure solutions. You will set the strategic direction for the SRE team, fostering a culture of reliability and continuous improvement.
Initially, you will work alongside the current Head of Service Reliability for a minimum of six months before assuming full leadership of the team.

RESPONSIBILITIES

Strategic Oversight of System Reliability: Lead the development and maintenance of highly available and scalable infrastructure systems, ensuring optimal performance and reliability across all services.
Incident Management Leadership: Oversee the incident management process, ensuring prompt and effective responses to production incidents. Facilitate post-incident analyses to identify root causes and implement preventive measures.
Monitoring and Alerting Strategy: Guide the development of comprehensive monitoring and alerting systems to proactively identify potential issues and ensure system health. Drive continuous improvement in monitoring processes and tools.
Automation and Tooling Direction: Provide leadership in the design and implementation of automation tools and frameworks to streamline operational processes, enhance system resilience, and minimize manual efforts.
Infrastructure Optimization Management: Identify opportunities for infrastructure optimization and collaborate with development teams to implement performance tuning strategies and improve system architecture.
Capacity Planning Oversight: Analyze system capacity and usage patterns, forecast future growth, and work with stakeholders to strategically plan and scale infrastructure resources as needed.
Collaboration and Communication Facilitation: Foster effective collaboration and communication among cross-functional teams, including developers, operations, and quality assurance, to align efforts towards common goals.
Documentation and Knowledge Management Leadership: Ensure comprehensive documentation of system configurations, troubleshooting guides, and operational procedures. Promote knowledge sharing initiatives and maintain an up-to-date knowledge base.
Proactive Incident Prevention and Reliability Engineering: Lead efforts to identify potential system vulnerabilities, performance bottlenecks, and reliability risks. Develop and implement preventive measures, conduct system failure simulations, and participate in architectural reviews to enhance overall system reliability.

QUALIFICATIONS AND REQUIREMENTS

Proven experience in managing and mentoring teams, with a track record of fostering a collaborative and high-performance work environment.
Excellent communication and interpersonal skills, with the ability to collaborate effectively with diverse teams and stakeholders.
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience in previous roles.
Proven experience in Service Reliability Engineering (SRE) or a related role (DevOps or similar roles are preferable), with a strong focus on managing large-scale, distributed systems.
Advanced experience with cloud platforms (e.g., AWS, Azure, GCP); the current technology stack is on GCP.
Knowledge and understanding of containerization technologies (e.g., Docker, Kubernetes) and microservices architecture.
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack) and experience in building observability solutions.
Understanding of networking principles, protocols (e.g., TCP/IP, HTTP), and load balancing techniques.
Strong problem-solving skills and the ability to analyze complex systems to identify root causes and implement effective solutions.
Demonstrated ability to work in a fast-paced, dynamic environment and adapt to changing priorities and technologies.
Love for Space

Preferred Qualifications:

Experience with serverless computing and event-driven architecture.
Experience with the Python programming language.
Knowledge of security best practices.
