Resilience Engineer at Vodafone United States

Lisbon, Portugal

Full Time

Start Date

Immediate

Expiry Date

09 Apr, 26

Salary

0.0

Posted On

09 Jan, 26

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Resilience Engineering, Automation, Chaos Engineering, Monitoring, Incident Response, Business Continuity, Disaster Recovery, Scripting, Linux, Site Reliability Engineering, Telemetry, Logging, Alerting, Capacity Planning, Security Principles, Fault Detection

Industry

Telecommunications

Description

Developing and governing resilience strategies across system architecture, deployment, monitoring, and incident response;
Defining and tracking stability KPIs (e.g., MTTD, MTTR, error budgets), partnering with performance and operations teams to meet or exceed targets;
Designing and implementing fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness;
Collaborating with product, infrastructure, architecture, and development teams to redesign services with built-in redundancy, failover, and graceful degradation;
Driving automation and observability improvements to reduce noise, increase fault detection speed, and support predictive failure mitigation;
Contributing to the design and maintenance of our Business Continuity and Disaster Recovery (BCDR) Plan, ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions;
Owning the resilience roadmap and continuously assessing emerging threats, technologies, and architectural shifts to guide the evolution of stability practices;
Evangelizing a culture of resilience through internal communication, workshops, and post-incident learning programs;
Delivering high-quality engineering solutions while continuously strengthening the resilience, scalability, and cost efficiency of our IoT platform;
Consistently meeting or exceeding delivery expectations by prioritizing the highest-leverage resilience initiatives that improve customer experience, business outcomes, and financial performance;
Building trusted, transparent, and outcome-driven relationships by providing clear technical direction and trade-off recommendations to business and engineering stakeholders.
Educated to BSc degree level in Software Engineering or a related discipline with Computer Science;
Strong scripting and automation experience (e.g., Python, Bash, Go, PowerShell), with a demonstrated ability to replace manual processes with reliable, scalable automation;
Proven experience designing and operating high-availability, fault-tolerant systems, including the use of chaos engineering techniques and proactive failure-mitigation strategies;
Experience applying Business Continuity and resilience standards (e.g., ISO 22301) in the context of real-world platform design and operational readiness;
Hands-on experience designing or integrating monitoring, alerting, and automated testing frameworks to support early fault detection and system validation;
Broad experience working with Linux-based platforms across on-premises and cloud environments, with an understanding of how infrastructure choices impact reliability, scalability, and recovery;
Deep expertise in Site Reliability Engineering principles, including SLOs/SLIs, error budgets, observability, toil reduction, and automation, with the ability to apply them at platform and system scale to guide architectural decisions and long-term resilience strategy;
Proven ability to balance long-term platform stability with delivery velocity by making clear, data-driven trade-offs;
Strong understanding of security principles, practices, and standards, and the ability to incorporate them into resilient, real-world technical solutions;
Deep command of telemetry, logging, and alerting ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, Splunk), with the ability to design signals that enable early fault detection and informed decision-making;
Experience defining meaningful SLIs and building dashboards that drive architectural insight, prioritization, and corrective action;
Proven experience leading blameless post-incident reviews, root cause analysis, and systemic improvements across multiple teams;
Expertise in identifying and addressing system bottlenecks, latency issues, and throughput constraints in distributed environments;
Proficiency in forecasting demand, planning capacity, and managing system growth in a cost-efficient and sustainable manner;
Strong track record of partnering with software engineering, infrastructure, product, and business teams to embed resilience into the full development lifecycle;
Fluency in English.

How To Apply:

In case you would like to apply for this job directly from the source, please click here

Responsibilities

The Resilience Engineer will develop and govern resilience strategies across system architecture and incident response, while collaborating with various teams to enhance platform robustness. They will also drive automation improvements and contribute to the Business Continuity and Disaster Recovery Plan.