Lead Site Reliability Engineer at KMS Technology
Guadalajara, jalisco, Mexico -
Full Time


Start Date

Immediate

Expiry Date

26 Jun, 26

Salary

0.0

Posted On

28 Mar, 26

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Azure, AKS, Terraform, Bicep, Java, JVM Performance Tuning, SLOs, SLAs, Application Insights, ELK, Grafana, Root Cause Analysis, SOC2, HIPAA, ISO 27001, Kubernetes

Industry

Software Development

Description
Company Description At KMS Technology, we are dedicated to delivering cutting-edge solutions and services that empower businesses to achieve their goals. Our team is composed of highly skilled professionals who are passionate about technology and innovation. We provide a dynamic and collaborative work environment where you can grow your career and make a significant impact. Job Description We are seeking a Lead Site Reliability Engineer to spearhead the reliability, scalability, and performance of our AI-powered property intelligence platform. Operating at the intersection of Geospatial AI and Insurance Technology, you will be responsible for a mission-critical Azure ecosystem supporting high-throughput Java microservices. As a Lead, you will bridge the gap between complex AI model inference and enterprise-grade stability. You will own the "Production Excellence" mandate, mentoring a team of engineers and collaborating with Senior Delivery Directors to ensure our global infrastructure stays ahead of our rapid growth. Key Responsibilities Strategic Infrastructure & Azure Leadership Cloud Architecture: Lead the design of highly available, multi-region architectures on Azure, utilizing AKS (Azure Kubernetes Service), Azure Functions, and Service Bus. IaC Governance: Establish and enforce standards for Infrastructure as Code using Terraform or Bicep, ensuring 100% automated provisioning across all environments. Java Performance Engineering: Partner with Backend squads to optimize JVM performance, garbage collection tuning, and memory management for high-concurrency insurance processing. Reliability & AI Operations (AIOps) Error Budgeting: Define, negotiate, and manage SLIs, SLOs, and SLAs with Product Stakeholders, balancing the velocity of AI feature releases with system stability. Advanced Observability: Architect end-to-end monitoring and distributed tracing using Azure Monitor, Application Insights, and ELK/Grafana. Incident Commander: Act as the ultimate escalation point for high-priority incidents, leading complex Root Cause Analysis (RCA) and driving long-term remediation tasks. Security & Industry Compliance Data Sovereignty: Ensure the platform adheres to insurance-specific data residency requirements and security frameworks (SOC2, HIPAA, or ISO 27001). Automated Governance: Implement Azure Policy and automated security scanning within CI/CD pipelines to ensure a "Secure by Design" infrastructure. Qualifications Technical Leadership: 7+ years in SRE, DevOps, or Cloud Engineering, with at least 2 years in a Lead or Principal capacity. Azure Mastery: Expert-level knowledge of the Azure Well-Architected Framework, specifically around networking (VNet/ExpressRoute) and Compute. Java Ecosystem: Deep proficiency in the Java/Spring Boot stack from an operational perspective (JVM profiling, thread dump analysis). Container Orchestration: Mastery of Kubernetes (AKS), including ingress controllers, service mesh (Istio), and cluster security. Professional Competencies: Strategic Mindset: Ability to translate technical debt and reliability risks into a data-driven business case for leadership. Automation Advocate: Proven track record of eliminating "Toil" through Python, Go, or Java-based automation tooling. Mentorship: Passion for leveling up the engineering organization through workshops, documentation, and pair programming. AI-First Integration: Experience leveraging AI for predictive scaling and automated log summarization to reduce Mean Time to Recovery (MTTR). Additional Information Perks you enjoy at KMS Mexico Mexican law benefits 15 days of PTO (in year zero, from the first year onwards it is 3 days per year). 5 days' leave for the death of immediate family members, negotiable. Major Medical Expenses Insurance with coverage for immediate dependents (spouse and children). Annual performance bonus (≈10% of annualized salary). Annual salary adjustment. Employee Referral Bonus. Paid Certifications / Courses Coursera License. 5% Savings Fund. 5% Grocery Vouchers.
Responsibilities
This role involves leading the design of highly available, multi-region Azure architectures using AKS and IaC governance via Terraform or Bicep, while also defining and managing SLIs/SLOs to balance feature velocity with system stability.
Loading...