System Reliability Engineer, Consultant at AIA

Kuala Lumpur, Kuala Lumpur, Malaysia

Full Time

Start Date

Immediate

Expiry Date

08 Jun, 26

Salary

0.0

Posted On

10 Mar, 26

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

System Reliability Engineering, Software Engineering Principles, Automation, Monitoring, Dynatrace, Root Cause Analysis, Observability, CI/CD, AWS, Azure, Docker, Kubernetes, Python, Bash, JavaScript, PCI-DSS

Industry

Insurance

Description

At AIA we’ve started an exciting movement to create a healthier, more sustainable future for everyone. As pioneering innovators for over 100 years, we’re now transforming our organisation to be faster, simpler and more connected. Because we want to be even better equipped to develop digital solutions and experiences that help more people live Healthier, Longer, Better Lives. To get there, we need people with tech/digital/analytics expertise and passion to help develop positive, sustainable change through digitally enhanced experiences that will impact the lives of millions of people and create a healthier future for everyone. If you believe in developing a better tomorrow, read on.

About the Role

We are looking for a System / Site Reliability Engineer (SRE) to help ensure the reliability, scalability, and performance of our enterprise systems and services. In this role, you will apply software engineering principles to operations, partner closely with development and infrastructure teams, and build automation that strengthens system stability and efficiency. You will play a pivotal role in bridging the gap between software development and IT operations, driving a culture of resilience, observability, automation, and proactive problem‑solving.

Key Responsibilities

1. Ensure System Reliability & Availability

Monitor and report on application performance, and highlight any deviations or issues. Collaborate with application engineers and developers to identify root causes and implement durable fixes.

2. Incident Management & Root Cause Analysis

Participate as a Subject Matter Advisor during production incidents and outages. Provide insights backed by system monitoring, code review, and database analysis. Support post‑mortem reviews and drive follow‑up actions.

3. Automation & Tooling

Automate operational tasks such as monitoring, alerts, and recovery processes. Build scripts and internal tools to eliminate manual toil and improve operational efficiency.

4. Monitoring & Observability

Implement telemetry and observability practices to track system health, latency, and error rates. Manage the Dynatrace platform and its integrations with application services. Support teams in designing dashboards and visualization setups.

5. Security & Compliance

Work with Security teams to ensure systems comply with regulatory and industry standards (e.g., PCI‑DSS, GDPR). Implement necessary access controls, encryption, and audit capabilities within SRE scope.

6. Capacity Planning & Performance Optimization

Analyze usage trends to forecast demand and support scaling decisions. Contribute to cost‑performance optimization efforts across infrastructure and applications. Collaborate closely with development, QA, and infrastructure teams to embed reliability into the SDLC.

7. Documentation & Knowledge Sharing

Maintain clear and up‑to‑date operational documentation, runbooks, and architecture diagrams. Champion SRE principles across the organization to foster resilience and accountability.

Job Requirements

Education

Bachelor’s degree in Computer Science, Software Engineering, IT, or related fields.

Experience

3–5 years of experience in SRE, DevOps, or Software Engineering roles.
Experience supporting front‑end applications in production environments, ideally within financial services or other regulated industries.

Technical Skills

Strong understanding of front‑end performance monitoring and instrumentation.
Hands‑on experience with Real User Monitoring (RUM), Synthetic Monitoring, and APM tools (e.g., Dynatrace, New Relic, Datadog).
Proficiency in building dashboards and alerts using Dynatrace, Grafana, Prometheus, Elastic Stack, or Splunk.
Familiarity with OpenTelemetry for distributed tracing.
Scripting skills in Python, Bash, or JavaScript.
Experience with CI/CD pipelines (e.g., GitHub Flow).
Practical experience with cloud technologies (AWS or Azure).
Knowledge of Docker and Kubernetes.
Understanding of secure coding practices for front‑end applications.
Awareness of financial compliance standards such as PCI‑DSS.

Why Join Us?

Be part of a high‑impact team shaping system resilience across the enterprise.
Work with modern observability and automation technologies.
Influence engineering culture through SRE best practices.
Opportunities to innovate and drive real improvements in system reliability.

Build a career with us as we help our customers and the community live Healthier, Longer, Better Lives.

You must provide all requested information, including Personal Data, to be considered for this career opportunity. Failure to provide such information may influence the processing and outcome of your application. You are responsible for ensuring that the information you submit is accurate and up-to-date.

At AIA we’ve started an exciting movement to create a healthier, more sustainable future for everyone. It's about finding new ways to not only better people's lives, but to better the communities and environments we live in. As the largest listed company on the Hong Kong Stock Exchange, we’ve been proudly making a difference for people and communities across Asia for over a century. And we build on this every day with our ambition to engage one billion people to live Healthier, Longer, Better Lives by 2030. If you work at AIA, you play an important part in this movement. Which is why we give you every opportunity to learn, grow and shape your career - your way. Inspiring and supporting you to thrive - not just at work, but in life. Believe in better with AIA.

Bring your difference to AIA

Responsibilities

The System Reliability Engineer will ensure the reliability, scalability, and performance of enterprise systems by applying software engineering principles to operations and building automation to strengthen stability. Key duties include managing incidents, performing root cause analysis, implementing observability practices, and collaborating with development teams to embed reliability into the SDLC.