ROLE OVERVIEW
As a Site Reliability Engineer (SRE), you will be accountable for maintaining the appropriate service levels (availability, latency, and reliability) to serve our customers’ needs and for reducing the friction of managing change. Your responsibilities will include engaging with DevOps, Engineering, and other teams to understand and support business needs and initiatives. Every SRE is responsible for the availability, scalability, security, performance, cost, and compliance requirements of our services. You will ensure that applications onboarded to SRE are instrumented for full-stack observability and continuous testing, introduce continuous improvement, integrate with IT Service Operations, and share support responsibilities for critical customer journeys, business flows, and applications.
This is a hybrid position located in Frisco, TX. You will be required to be onsite on an as-needed basis, typically 1 to 6 times a month. We are only considering candidates within commuting distance of one of the two locations and are not offering relocation assistance at this time.
ABOUT THE ROLE
- Proactively monitor the mission-critical production environment and respond quickly to trend breaches or issues.
- Troubleshoot, debug, and escalate issues, with proper analysis, to the relevant teams to ensure maximum availability.
- Troubleshoot problems in real time, interacting with DevOps/Engineering and internal support representatives to deliver maximum customer satisfaction.
- Detect and triage all operational incidents and requests.
- Work extensively to help reduce Mean Time to Restore (MTTR) and Mean Time to Detect (MTTD).
- Work across Engineering and Support teams to ensure we meet our goals for service reliability, availability, and efficiency.
- Ensure security events and alerts are addressed in a timely manner.
- Own the availability and performance of mission-critical services, and build automation that prevents problem recurrence and responds to all non-exceptional service conditions.
- Help maintain and improve service operations by following established processes and procedures and by periodically updating SOPs and documentation in Confluence.
- Contribute to day-to-day activities, including Incident Management and Change Management.
- Support automation initiatives to improve Mean Time to Restore (MTTR) and Mean Time to Detect (MTTD).
- Help track Key Performance Indicators (KPIs) to support operational performance and service reliability.
- Participate in incident retrospectives and assist in managing the incident lifecycle.
- Engage in readiness reviews before changes or deployments into production environments.
- Support product engineering teams on SRE-related activities to establish optimal SLAs for all pre-defined activities and provide a high-quality customer experience.
- Provide detailed summaries of all high-priority issues to stakeholders, ensuring the quality of the data provided.
- Participate early in the SDLC to ensure reliability is built in from the beginning, and create plans for successful implementations/launches and a smooth transition to the SRE team.
- Produce accurate root cause analyses of production issues and help provide long-term solutions to fix them.
- Continually evaluate and adopt the latest industry technologies to optimize costs and streamline processes.
- Communicate effectively and present team progress to leadership.