Senior Site Reliability Engineer at KMS Technology
Guadalajara, Jalisco, Mexico
Full Time


Start Date

Immediate

Expiry Date

24 Jun, 26

Salary

0.0

Posted On

26 Mar, 26

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

GCP, Kubernetes, GKE, Terraform, Pulumi, Node.js, SLO/SLI Definition, Prometheus, Grafana, Incident Management, MLOps, SOC2 Compliance, GDPR Compliance, Python, Go, GitHub Actions

Industry

Software Development

Description
Company Description

At KMS Technology Mexico, we are passionate about building innovative software solutions that drive impact. As part of an international tech company, we offer a collaborative and inclusive environment where your ideas matter and your growth is our priority.

Job Description

We are looking for a Senior SRE to join our core engineering team in building the next generation of AI-powered property intelligence for the insurance industry. In this role, you will be the guardian of a platform’s availability, latency, and performance. You will work at the heart of a high-demand ecosystem, ensuring that our Node.js microservices and AI/ML pipelines running on Google Cloud Platform (GCP) are resilient, scalable, and secure. This is a "Software Engineering approach to Operations" role, where automation is the default and manual intervention is a last resort.

Key Responsibilities

Infrastructure & Platform Engineering

Cloud Architecture: Design and manage scalable, multi-regional infrastructure on GCP, leveraging GKE (Kubernetes), Cloud Run, and Pub/Sub.
Infrastructure as Code (IaC): Maintain and evolve our infrastructure codebase using Terraform or Pulumi, ensuring environment parity across Staging and Production.
Node.js Optimization: Partner with Fullstack teams to tune Node.js application performance, managing memory limits, event loop bottlenecks, and asynchronous execution in a containerized environment.

Observability & Reliability

SLO/SLI Definition: Define and monitor Service Level Indicators (SLIs) and Objectives (SLOs) to measure the "health" of our property intelligence engine.
Advanced Monitoring: Build comprehensive dashboards and alerting systems using Google Cloud Operations Suite (Stackdriver), Prometheus, or Grafana.
Incident Management: Lead Root Cause Analysis (RCA) for production incidents and implement "Blameless Post-mortems" to prevent recurrence.

AI & Data Operations

MLOps Integration: Support the scaling of AI models by optimizing GPU/TPU utilization and data ingestion pipelines within GCP.
Security & Compliance: Ensure the platform meets the rigorous data privacy standards of the insurance industry, including SOC2 and GDPR compliance.

Qualifications

Technical Requirements

5+ years in an SRE, DevOps, or System Architecture role.
GCP Expertise: Deep experience with Google Cloud Platform, specifically GKE, IAM, Cloud SQL, and VPC networking.
Coding Proficiency: Strong experience with Node.js (backend services) and scripting in Python or Go for automation.
Orchestration: Expert-level knowledge of Kubernetes (GKE), including Helm charts and service meshes (Istio/Anthos).
CI/CD: Experience building high-frequency deployment pipelines with GitHub Actions, GitLab CI, or Google Cloud Build.

Professional Competencies

The "SRE Mindset": A passion for automation and a visceral dislike of repetitive manual tasks ("Toil").
Strategic Communication: Ability to translate complex infrastructure risks into business impact for Stakeholders and Delivery Directors.
AI-First Workflow: Proactive use of AI tools for log anomaly detection, predictive scaling, and automated troubleshooting.

Additional Information

Location: Guadalajara, Jalisco, Mexico (Hybrid)

Benefits and Perks

Perks you enjoy at KMS Mexico:

Mexican law benefits.
15 days of PTO (in year zero, from the first year onwards it is 3 days per year).
5 days' leave for the death of immediate family members, negotiable.
Major Medical Expenses Insurance with coverage for immediate dependents (spouse and children).
Annual performance bonus (≈10% of annualized salary).
Annual salary adjustment.
Employee Referral Bonus.
Paid Certifications / Courses.
Coursera License.
5% Savings Fund.
5% Grocery Vouchers.

Responsibilities
The Senior SRE will be responsible for ensuring the availability, latency, and performance of a platform running Node.js microservices and AI/ML pipelines on GCP, focusing on infrastructure and platform engineering using Infrastructure as Code.