Senior Site Reliability Engineer at Recorded Future
Somerville, Massachusetts, USA
Full Time


Start Date

Immediate

Expiry Date

28 Nov, 25

Salary

0.0

Posted On

28 Aug, 25

Experience

2 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Distributed Systems, RabbitMQ, Apache Kafka, Kubernetes, Elasticsearch, Microservices, Architecture

Industry

Information Technology/IT

Description

With 1,000 intelligence professionals, over $300M in sales, and serving over 1,900 clients worldwide, Recorded Future is the world’s most advanced, and largest, intelligence company!

PREFERRED QUALIFICATIONS:

  • Knowledge and experience with Kubernetes.
  • Familiarity with message brokers such as RabbitMQ and Apache Kafka.
  • Experience with NoSQL databases, particularly MongoDB and Elasticsearch.
  • Familiarity with OpenTelemetry.
  • Experience with large distributed systems and microservices architecture.
  • Experience with CI/CD pipelines.

How To Apply:

In case you would like to apply to this job directly from the source, please click here

Responsibilities

ABOUT THE ROLE

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join our growing team. In this role, you will be instrumental in ensuring the reliability, scalability, and performance of our critical systems. You will work closely with development teams to build and maintain robust infrastructure, implement automation, and foster a culture of operational excellence. This position requires a strong understanding of cloud environments, observability, and infrastructure as code principles.

WHAT YOU’LL DO:

  • Design, implement, and maintain scalable and reliable infrastructure on AWS.
  • Develop and manage observability solutions using tools such as Grafana, ELK (Elasticsearch, Logstash, Kibana), and Prometheus to monitor system health and performance.
  • Automate infrastructure provisioning and configuration using Terraform and Chef.
  • Participate in a 24/7 on-call rotation to respond to and resolve production incidents.
  • Collaborate with engineering teams to ensure applications are designed for high availability and resilience.
  • Proactively identify and address performance bottlenecks and potential issues.
  • Drive continuous improvement through automation, process optimization, and post-incident reviews.