Systems Reliability Engineer (Full-remote) at Noesis
Lisboa, Área Metropolitana de Lisboa, Portugal
Full Time


Start Date

Immediate

Expiry Date

18 Jun, 25

Salary

0.0

Posted On

19 Mar, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

C++, Java, Automation, Automation Tools, Python, Infrastructure, Communication Skills, Distributed Systems, Production Systems

Industry

Information Technology/IT

Description

Noesis is looking for candidates with the following profile:

REQUIREMENTS:

  • BSc or MSc in Software Engineering, Computer Science, or a related field;
  • 2+ years of experience in a similar role or experience as a senior systems administrator;
  • Proficiency in at least one high-level programming language (e.g., C++, Python, Java, C#);
  • Experience with automation tools;
  • Experience with monitoring and observability tools such as Grafana, the ELK stack, Prometheus, or similar;
  • Strong troubleshooting and debugging skills;
  • Strong understanding of designing resilient systems;
  • Expertise in debugging complex distributed systems;
  • Fluency in English and excellent communication skills;
  • Willingness to participate in an on-call rotation providing 24/7 support for production systems, under a “Follow the Sun” model.

EXPERIENCE IN ANY OF THE FOLLOWING IS VALUED, BUT NOT REQUIRED:

  • Containerization technologies and orchestration platforms, mainly Kubernetes and EKS (CKA, CKAD, CKS certifications are valued);
  • Familiarity with AWS services;
  • Experience with automation and Infrastructure as Code (IaC) tools, such as AWS CloudFormation or Terraform.

If you meet these conditions and would like to join an innovative organization that continuously invests in training its people, send us your application.

Responsibilities

  • Lead the onboarding of services and teams to the reliability tenets;
  • Establish and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs);
  • Design and implement scalable, reliable, and secure infrastructure, while ensuring cloud-native best practices;
  • Collaborate with software development teams to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant;
  • Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents;
  • Lead incident response efforts, ensuring quick resolution and minimal downtime, and conduct RCA/post-mortems;
  • Automate operational tasks wherever possible, with a special focus on fast incident detection and recovery;
  • Foster a culture of continuous improvement and knowledge sharing;
  • Communicate effectively with stakeholders, providing updates on system reliability and performance.