Lead Site Reliability Engineer (f/m/x) - API Platforms at Deutsche Bank
Berlin, Berlin, Germany
Full Time


Start Date

Immediate

Expiry Date

08 May, 25

Salary

0.0

Posted On

09 Feb, 25

Experience

0 year(s) or above

Remote Job

No

Telecommute

No

Sponsor Visa

No

Skills

Software Development, Crucible, Splunk, JavaScript, TeamCity, Artifactory, Kibana, SonarQube, Vault, Java, Python, KMS, Google Cloud, Bitbucket, GitHub, WSO2

Industry

Information Technology/IT

Description

YOUR SKILLS AND EXPERIENCES

  • Hands-on experience with cloud ecosystems running on Google Cloud
  • Hands-on experience with Docker / Kubernetes operations with GKE or similar technology
  • Expert experience with automated infrastructure provisioning based on Terraform/Terragrunt, Terraform Enterprise, Ansible
  • Advanced hands-on experience with Continuous Integration / Continuous Deployment (GitHub) and patterns for CI/CD pipelines
  • Advanced hands-on experience with monitoring tools like Prometheus, Grafana, Kibana and alerting tools like OpsGenie, New Relic, Datadog, Splunk, Google Operations Suite (Stackdriver)
  • Very good knowledge of security capabilities (TLS, OAuth2, KMS, Vault, Admission Controllers, Let's Encrypt or similar technologies)
  • Very good understanding of microservice architectures and experience with API Management with Apigee or WSO2
  • Experience in software development in at least one language (Java, JavaScript, Python, Go)
  • Good knowledge of Software Development Life Cycle processes and related tools such as:
      • TeamCity, Bitbucket, Artifactory
      • SonarQube, Veracode, Crucible
      • JIRA, Confluence, ServiceNow
Responsibilities

YOUR KEY RESPONSIBILITIES

  • As Lead Site Reliability Engineer, you:
      • Orchestrate and contribute to SRE activities across API Platforms and Integration services
      • Introduce engineering disciplines that combine software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems
      • Implement the core of DevOps with specific principles and practices, focusing on “what” and “how” to improve reliability
      • Establish and support capacity planning procedures and keep a close eye on SLIs and SLOs, both for production readiness and in the live environment
      • Coordinate with the rest of the division and the teams working on different layers of the application and infrastructure, with full commitment to collaborative problem solving
  • For Infrastructure & Service Management, you:
      • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
      • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
      • Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity
      • Develop and enforce policies, standards and guidelines for site reliability
      • Automate application and infrastructure deployment activities to production environments
  • For Incident & Problem Management, you:
      • Perform troubleshooting and emergency response
      • Investigate root causes and suggest solutions
      • Increase productivity by leading blameless post-mortems
  • For Application Maintenance, you:
      • Work collaboratively with Product Owners and Engineers to run reliable services
      • Configure and maintain applications and monitoring
      • Identify business objects for monitoring
      • Track system performance and capacity, and use your experience to create effective strategies for maintaining and improving system performance and availability
  • For Operational Continuous Improvement, you:
      • Identify issues and optimization potential and introduce related user stories
      • Contribute automation know-how to reduce the risk of bad changes
      • Identify, design, develop, and deploy tools and processes to monitor, maintain, and report site performance and availability
  • For Service Onboarding, you:
      • Support your Squad and your Chapter population in onboarding & promotions

As a Lead Site Reliability Engineer, you will be responsible for the SRE activities across platforms, portals and enabling services together with other SREs and engineers.

You love this job but feel you cannot tick 100% of the boxes? Send us your CV anyway.