PW - Sr. SRE B. - Job3730 at Taller Technologies

Full Time

Start Date

Immediate

Expiry Date

22 Mar, 26

Salary

0.0

Posted On

23 Dec, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, Kubernetes, Cloud Environments, Incident Response, Documentation, CI/CD, Source Code Management, Troubleshooting, Communication, Collaboration, Infrastructure as Code, Security Practices, Containerization, Scripting, Event-Driven Patterns

Industry

IT Services and IT Consulting

Description

Summary

We are looking for a seasoned Site Reliability Engineer (SRE) to join our team and support our strategy of driving products and technology to accelerate business growth. As an SRE, you will work alongside a team of problem solvers, helping to solve complex business issues from strategy to execution.

Responsibilities

Defining reliability and resilience standards for infrastructure and application components.
Proactively optimizing redundancies, monitoring practices, and alerting patterns.
Developing resilient and highly available distributed systems.
Building infrastructure as code tools for cloud environments.
Monitoring systems and services, and providing incident response to triage and resolve system or client issues.
Managing the application ecosystem and improving platform infrastructure and applications for high reliability, resiliency, performance, and quality.
Creating documentation, knowledge articles, and runbooks.
Designing and implementing SRE patterns that adhere to our client's security guidelines and policies.

Requirements

Bachelor's degree in Computer Science or a related field (or equivalent work experience).
At least 4 years of relevant working experience as a Site Reliability Engineer or in a similar role.
Advanced Kubernetes expertise - Strong skills in Kubernetes at scale using AKS, EKS, or GKE. Experience with kubectl and Helm. Familiarity with tools like Lens or Rancher.
Observability - Experience setting up tools like Datadog and Splunk for actionable insights on microservice environments, including synthetics, application performance monitoring, logging, and alerting (PagerDuty/OpsGenie integrations).
Good CI/CD expertise - Experience using Azure DevOps and GitHub Actions for continuous integration and continuous deployment processes.
SCM proficiency - Working with tools like GitHub for source code management, along with experience in branching strategies like GitFlow or trunk-based development.
Strong troubleshooting skills - Ability to dive deep into code-level analysis to give development teams a head start on resolving application issues. Effective contribution to root cause analysis exercises.
Good communication skills - Active listening, verbal and non-verbal communication, clarity, concision, confidence, open-mindedness, and respect.
Good documentation skills - Ability to document automation and technical efforts effectively so that solutions are easy to adapt.
Collaboration skills - Ability to work effectively with Scrum/Dev teams using a push/pull philosophy, managing expectations and contributing to the stability and improvement of the platform.

Nice to Have

Infrastructure as Code tools (Terraform, Pulumi) - Preferably with experience developing modules, not just using them.
Security practices - Encryption at rest and in transit with tools like Azure Key Vault, HashiCorp Vault, and Google KMS.
Containerization - Experience deploying Java (Spring Boot) microservices in Docker environments.
Automation - Must be able to identify toil and opportunities to reduce it within the team.
Authentication/Authorization - Familiarity with Authn/Authz schemes like OpenID, OAuth 2.0, and SAML.
Scripting and Programming - Experience with Python, PowerShell, Java, or Node.
Event-driven architecture - Familiarity with event-driven/event sourcing patterns using platforms like Kafka, EventHub, and RabbitMQ, and patterns like CQRS.

Responsibilities

The Site Reliability Engineer will define reliability standards for infrastructure and applications, optimize monitoring practices, and develop resilient distributed systems. They will also manage the application ecosystem and create documentation for operational processes.