Staff Site Reliability Engineer at TRINET USA INC

Hyderabad, Telangana, India

Full Time

Start Date

Immediate

Expiry Date

07 Apr, 26

Salary

0.0

Posted On

07 Jan, 26

Experience

10 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, DevOps, Public Cloud, Container Technologies, Programming Languages, High Availability Planning, Disaster Recovery, Infrastructure as Code, Kubernetes, REST APIs, Security, Monitoring Tools, Automation, Linux/Unix, Microservices, Networking

Industry

Human Resources Services

Description

Studies have shown that many potential applicants discourage themselves from applying to jobs unless they meet every single requirement. So if you're excited about this role but your past experience doesn't align perfectly with every qualification in the job description, remember that nobody's perfect, and we encourage you to apply. You may just be the right candidate for this or other roles.

Bachelor's degree in Computer Science, Engineering, or a related field. 8+ years of relevant experience in SRE/DevOps or similar roles. At least 5 years of experience in public cloud (AWS, Azure, etc.) and container technologies. Strong experience with programming languages such as Java and Python. Strong experience in high availability planning, capacity planning, and disaster recovery is required.

Technical proficiency: Strong hands-on experience with Ansible or Terraform and with building services in AWS, and a strong understanding of in-memory data stores such as Redis and Memcached. Experience working with IaC tools like Terraform and Ansible and managing Kubernetes services, including Helm. In-depth knowledge of REST APIs, OAuth, OpenID Connect (OIDC), and SAML, with proven experience in implementing secure authentication and authorization mechanisms. Knowledge of network protocols such as IPv4/IPv6, TCP/IP, FTP, SMTP, UDP, SSL, and HTTP/HTTPS. Practical understanding of messaging technologies such as ActiveMQ and RabbitMQ. Ability to leverage monitoring and logging analytics tools such as Prometheus, Grafana, Splunk, and AppDynamics.

Ability to architect applications and solutions that are highly available, scalable, and highly fault tolerant. Ability to stay cool-headed while troubleshooting production issues on incident bridges and to focus on problem resolution. Hands-on experience with container technologies such as Docker and Kubernetes. Deep understanding of concepts such as microservice architecture, middleware technologies, networking, databases, and observability.
Experience managing large-scale distributed web applications in an SRE role or capacity. A security-first mindset when designing and architecting solutions. Deep understanding of Linux/Unix operating systems, file systems, administration, and networking. Ability to develop and maintain automation scripts using Ansible, Terraform, Python, and Java. Hands-on experience with public cloud technologies (AWS, Azure, and OCI preferred). Proficiency with configuration management tools like Ansible, Puppet, and Terraform. Extensive experience deploying and maintaining applications on Kubernetes using Docker, Jenkins, and Git. Experience using monitoring tools such as Prometheus/Grafana, AppDynamics, and CloudWatch to monitor and improve application availability and performance.

Ability to create documentation and runbooks and to train Tier I/Tier II teams to support day-to-day operations. Ability to adapt to a fast-paced, constantly evolving business and work environment while managing multiple priorities with little supervision. Exceptional debugging and analytical skills. Ability to communicate well and thrive under pressure while collaborating and managing competing demands with tight deadlines.

Collaborate with Engineering teams to support services before they go live through activities such as system design consulting; developing secure, reliable, and highly available software platforms and frameworks; monitoring/alerting; capacity planning; and production readiness and reliability reviews (20%). Guide reliability practices through activities including architecture reviews, code reviews, capacity/scaling planning, and security vulnerability remediation. Participate with other SRE leaders in setting the enterprise strategy for designing and developing resiliency in the application code. Participate in the on-call rotation for the services owned by the SRE team, effectively triaging and resolving production and development issues.
Work in a clean, pleasant, and comfortable office work setting.

Cloud Architect certifications (AWS preferred). Kubernetes certifications (CKA preferred).

Responsibilities

The Staff Site Reliability Engineer will guide reliability practices, participate in on-call rotations, and collaborate with engineering teams to ensure services are reliable and secure. They will also be involved in architecture reviews, capacity planning, and incident resolution.