DevOps Engineer / Site-Reliability Engineer at TARDIS GROUP SINGAPORE PTE LTD
Singapore, Singapore
Full Time


Start Date

Immediate

Expiry Date

14 Nov 2025

Salary

7000.0

Posted On

15 Aug 2025

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Automation, Computer Science, Logging, Kubernetes, Docker, AWS, Scripting, MySQL, Kafka, Distributed Systems, Elasticsearch, Nginx, DevOps, Azure, Cloud, Infrastructure, Python, Redis

Industry

Information Technology/IT

Description

REQUIRED QUALIFICATIONS

Experience & Education

  • 2+ years of hands-on experience in Systems Operations, DevOps, or Site Reliability Engineering (SRE)
  • Bachelor’s degree in Computer Science, Engineering, or related technical field preferred

Cloud & Infrastructure

  • Experience with public cloud platforms (AWS, Azure, or GCP) is highly valued
  • Strong understanding of large-scale internet architecture and distributed systems
  • Proven experience with infrastructure monitoring, logging, and observability tools

Technical Skills

  • Proficiency in scripting and automation using Shell, Python, or similar languages
  • Strong knowledge of containerization technologies (Kubernetes, Docker)
  • Hands-on experience operating production-grade container clusters and managing CI/CD pipelines
  • Strong familiarity with common infrastructure components: Nginx, MySQL, Redis, Kafka, Elasticsearch

RESPONSIBILITIES

Cluster Operations & Management

  • Manage and maintain container clusters (Kubernetes, Docker) and open-source component clusters (Kafka, Redis, Elasticsearch) across multiple business units
  • Ensure optimal performance, scalability, and reliability of distributed systems

Infrastructure Platform Development

  • Design, build, and enhance infrastructure operation platforms
  • Develop and maintain systems for infrastructure management, CI/CD pipelines, monitoring/alerting, and centralized logging
  • Drive platform standardization and automation initiatives

High Availability & Reliability

  • Ensure maximum uptime for production services through proactive monitoring and incident response
  • Continuously optimize service architecture, deployment strategies, and operational processes
  • Implement and maintain SLA/SLO frameworks and reliability engineering practices

Automation & Process Improvement

  • Lead the development of automated operations and maintenance systems
  • Create self-service tools and workflows to improve team productivity
  • Establish best practices for infrastructure as code and configuration management