SITE RELIABILITY ENGINEER at Trinity HR Solutions Pte Ltd

Singapore, Southeast, Singapore

Full Time

Start Date

Immediate

Expiry Date

08 Sep, 25

Salary

9900.0

Posted On

09 Jun, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Kubernetes, Computer Engineering, Continuous Integration, Firewalls, Jenkins, Shell Scripting, Computer Science, Kibana, AWS, Organization Skills

Industry

Information Technology/IT

Description

We are looking for highly motivated individuals who are interested in joining our Site Reliability organization, which supports our next generation of Investment and Insurance products, as a Site Reliability Engineer. Successful candidates will work closely with our internal users and development teams to ensure the stability of production systems in a highly competitive environment.

KEY ACCOUNTABILITIES

Manage, monitor and operate the system to ensure all business functions are running smoothly.
Work across teams to continually review systems, provide feedback, and implement best practices to improve system efficiency and drive future innovation.
Manage ongoing changes while retaining high levels of service availability for our customer base.
Pragmatically identify the root cause of production incidents and lead the implementation of actions needed to prevent recurrence.
Drive incident management process and support a blameless post-mortem culture.
Automate system operations to reduce toil and attain a high level of efficiency.

REQUIREMENTS

Bachelor’s Degree/Diploma in Computer Science, Computer Engineering, or Computer Application. Equivalent experience may be considered.
3+ years of experience supporting critical applications using API-driven technologies.
2+ years of hands-on experience in Python development (preferably with RESTful APIs).
2+ years of working with a modern stack (AWS, PCF, containers, or Kubernetes).
2+ years of Continuous Integration and Continuous Delivery experience through Jenkins or equivalent.
Experience with modern observability tools such as Grafana, Kibana, or Prometheus preferred.
Experience working in an Agile (SAFe or Kanban) environment preferred.
Knowledge of and/or experience using SQL and Linux shell scripting.
Basic understanding of firewalls, load balancers, and networking concepts.
Strong communication skills across all levels and a spirit of teamwork are essential.
Proactive, with good analytical and organizational skills.
Ability to work independently and to multitask.

Responsibilities

JOB PURPOSE

Work with our business units, development teams, and other units to help maintain the high quality and service level objectives of our systems.
Optimize the supportability of systems by applying basic SRE principles such as blameless post-mortems, error budgets, and automation.
Provide production support for the application domain when applicable.

RESPONSIBILITIES

Participate in platform operations management and capacity management.
Coordinate and implement platform/infrastructure upgrades and releases with technical and business teams.
React to critical issues immediately: troubleshoot, investigate, and apply appropriate solutions to restore normal system operations.
Provide off-hours and weekend support to ensure the stability of production systems.
Troubleshoot problems across a wide range of technical areas (development, CI/CD, infrastructure, etc.).
Maintain awareness of relevant technical and product trends through self-learning and job shadowing.
Create and maintain operational documents that reflect system changes and upgrades.
Communicate effectively, professionally, and comfortably, both verbally and in writing, across all levels.