G13 - Operations Support Engineer at FPT Asia Pacific Pte Ltd
Singapore, Singapore
Full Time


Start Date

Immediate

Expiry Date

05 Jul, 26

Salary

0.0

Posted On

06 Apr, 26

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

SRE, Production Operations, AWS, Kubernetes, Elastic Cloud, Terraform, Argo, Python, Bash, Go, Observability, Incident Management, Infrastructure as Code, GitOps, Security Compliance, Prometheus

Industry

IT Services and IT Consulting

Description
Responsibilities:
Design & own service observability usage model: ensure all service metrics, logs, traces flow into Elastic Cloud (authoritative); maintain dashboards & SLOs; evaluate pragmatic use of CloudWatch, AWS Managed Prometheus / Grafana for supplemental or fallback views.
Build proactive, noise-reduced alerting and incident response playbooks; drive post-incident RCA & remediation tracking (closure SLA).
Optimize service performance (profiling, caching layers, autoscaling heuristics, concurrency tuning) meeting latency & throughput targets.
Implement secure supply chain & runtime controls (image scanning, SBOM consumption, secrets management, TLS / mTLS) leveraging shared platform tooling.
Curate operational runbooks, golden dashboards, reliability readiness + production readiness checklists.
Integrate model / guardrail service telemetry (latency, queue depth, GPU/CPU utilization) into unified Elastic Cloud views.
Support compliance & audit evidence collection (access logs, config lineage, change histories) via automated evidence capture fed into Elastic.
Introduce configuration drift detection & policy-as-code guardrails (OPA / Kyverno) at the workload / namespace layer to enforce baseline controls.
Mentor engineers on production readiness, observability patterns, and operational excellence; evolve on-call playbooks.
Participate in (and improve) an equitable on-call rotation focusing on sustainable alert volumes & burnout prevention.

Requirements:
4+ years (or equivalent impact) in SRE / Production Ops / Platform / Reliability for SaaS or high-throughput services.
Working knowledge of AWS & Kubernetes (deployment, troubleshooting, networking concepts) sufficient to collaborate effectively with platform owners (not necessarily owning cluster upgrade orchestration).
Familiarity with Infrastructure as Code & GitOps (Terraform, Argo, etc.) to consume modules, review changes, and enforce policy.
Observability implementation & usage (metrics, logs, traces, profiling) with Elastic Cloud; understanding of Prometheus / OpenTelemetry concepts.
Proven on-call & incident management experience (triage, MTTR reduction, RCA authorship).
Scripting / automation in Python, Bash, or Go for ops tooling.
Security & compliance aware: vulnerability management, image scanning, supply chain controls.
Clear, concise communication of operational risk & trade-offs to technical + non-technical stakeholders.
Responsibilities
The Operations Support Engineer will design and maintain service observability models, including metrics, logs, and traces using Elastic Cloud. They will also optimize service performance, manage incident response playbooks, and enforce security and compliance controls across the infrastructure.