Site Reliability Engineer at TRINET USA INC

Hyderabad, Telangana, India

Full Time

Start Date

Immediate

Expiry Date

14 Apr, 26

Salary

0.0

Posted On

14 Jan, 26

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Site Reliability Engineering, System Administration, Infrastructure Management, Scripting Languages, Python, Bash, PowerShell, Configuration Management, Ansible, Puppet, Chef, Cloud Platforms, AWS, Azure, GCP, Monitoring Tools

Industry

Human Resources Services

Description

Studies have shown that many potential applicants discourage themselves from applying to jobs unless they meet every single requirement. So if you're excited about this role but your past experience doesn't align perfectly with every single qualification in the job description, nobody's perfect - and we encourage you to apply. You may just be the right candidate for this or other roles.

Bachelor's Degree or equivalent experience.
Typically 2+ years of relevant work experience in Site Reliability Engineering, system administration, or infrastructure management.
Strong understanding of SRE principles, practices, and methodologies.
Proficiency in scripting languages such as Python, Bash, or PowerShell.
Familiarity with configuration management tools like Ansible, Puppet, or Chef.
Experience with cloud platforms such as AWS, Azure, or GCP.
Knowledge of containerization technologies like Docker and orchestration tools like Kubernetes is a plus.
Understanding of networking concepts, load balancing, and distributed systems.
Experience with monitoring and observability tools like Prometheus, Grafana, or ELK stack.
Excellent problem-solving and troubleshooting skills.
Strong attention to detail and the ability to work efficiently in a fast-paced environment.
Effective communication and collaboration skills, with the ability to work well in a team.

System Monitoring and Incident Response: Monitor system health, proactively detect issues, and respond to incidents in a timely manner. Participate in incident response activities, including triage, troubleshooting, and resolution, ensuring minimal disruption to services.

Automation and Tooling: Develop and maintain automation scripts, tools, and utilities to streamline operational tasks, reduce manual effort, and improve system efficiency. Leverage scripting languages and configuration management tools to automate routine tasks.

Performance Optimization: Identify performance bottlenecks, analyze system metrics, and optimize system performance. Collaborate with Development and Operations teams to implement performance tuning measures and ensure optimal resource utilization.

Infrastructure and Configuration Management: Manage infrastructure resources, including cloud platforms, servers, and network devices. Implement and maintain configuration management practices to ensure consistency and reliability across environments.

Capacity Planning: Conduct capacity planning exercises to forecast resource requirements and support scalability. Analyze usage patterns, monitor system performance, and recommend infrastructure adjustments to meet demand.

Incident Analysis and Post-Mortems: Perform root cause analysis for incidents and contribute to post-incident reviews. Identify areas for improvement, implement preventive measures, and update documentation and runbooks accordingly.

System Documentation: Contribute to the development and maintenance of system documentation, runbooks, and standard operating procedures (SOPs). Ensure documentation is accurate, up-to-date, and accessible to the team.

Collaboration and Communication: Collaborate effectively with cross-functional teams, including Development, Operations, and Support, to address system issues, implement changes, and improve system reliability. Communicate updates, findings, and recommendations to stakeholders in a clear and concise manner.

Continuous Improvement: Identify opportunities for automation, process enhancements, and tooling improvements. Drive initiatives to optimize system reliability, streamline workflows, and improve operational efficiency.

Security and Compliance: Collaborate with Security and Compliance teams to ensure adherence to security best practices, regulations, and standards. Participate in security assessments, vulnerability management, and risk mitigation efforts.

Performs other duties as assigned.
Complies with all policies and standards.
Work in a clean, pleasant, and comfortable office work setting.

How To Apply:

In case you would like to apply to this job directly from the source, please click here

Responsibilities

The Site Reliability Engineer will monitor system health, proactively detect issues, and respond to incidents. They will also develop and maintain automation scripts and tools to streamline operational tasks.