Systems Reliability Engineer (Full-remote) at Noesis

Lisboa, Área Metropolitana de Lisboa, Portugal

Full Time

Start Date

Immediate

Expiry Date

18 Jun, 25

Salary

0.0

Posted On

19 Mar, 25

Experience

2 years or more

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

C++, Java, Automation, Automation Tools, Python, Infrastructure, Communication Skills, Distributed Systems, Production Systems

Industry

Information Technology/IT

Description

Noesis is looking for candidates with the following profile:

REQUIREMENTS:

BSc or MSc in Software Engineering, Computer Science, or a related field;
2+ years of experience in a similar role, or experience as a senior systems administrator;
Proficiency in at least one high-level programming language (C++, Python, Java, C#, etc.);
Experience with automation tools;
Experience with monitoring and observability tools such as Grafana, the ELK stack, Prometheus, or similar;
Strong troubleshooting and debugging skills;
Strong understanding of how to design resilient systems;
Expertise in debugging complex distributed systems;
Fluency in English and excellent communication skills;
Willingness to participate in an on-call rotation providing 24/7 support for production systems under a “Follow the Sun” model.

EXPERIENCE IN ANY OF THE FOLLOWING IS VALUED, BUT NOT STRICTLY REQUIRED:

Containerization technologies and orchestration platforms, mainly Kubernetes and EKS (CKA, CKAD, and CKS certifications are valued);
Familiarity with AWS services;
Experience with automation and Infrastructure as Code (IaC) tools, such as AWS CloudFormation or Terraform.

If you meet these requirements and would like to join an innovative organization that continuously invests in training its talent, send us your application.

Responsibilities

Lead the onboarding of services and teams to the reliability tenets;
Establish and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs);
Design and implement scalable, reliable, and secure infrastructure, while ensuring cloud-native best practices;
Collaborate with software development teams to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant;
Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents;
Lead incident response efforts, ensuring quick resolution and minimal downtime, and conduct RCA/post-mortems;
Automate every operational task, with a special focus on fast incident detection & recovery;
Foster a culture of continuous improvement and knowledge sharing;
Communicate effectively with stakeholders, providing updates on system reliability and performance.