Lead Site Reliability Engineer at Wells Fargo

Charlotte, North Carolina, USA

Full Time

Start Date

Immediate

Expiry Date

12 Oct, 25

Salary

$217,200.00

Posted On

13 Jul, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Oracle, Operations, Groups, MongoDB, Distributed Systems, Databases, Communication Skills, Training, Splunk, AppDynamics

Industry

Information Technology/IT

Description

PAY RANGE

Reflected is the base pay range offered for this position. Pay may vary depending on factors including but not limited to achievements, skills, experience, or work location. The range listed is just one component of the compensation package offered to candidates.
$111,100.00 - $217,200.00

APPLICANTS WITH DISABILITIES

To request a medical accommodation during the application or interview process, visit Disability Inclusion at Wells Fargo.

WELLS FARGO RECRUITMENT AND HIRING REQUIREMENTS:

a. Third-party recordings are prohibited unless authorized by Wells Fargo.
b. Wells Fargo requires you to directly represent your own experiences during the recruiting and hiring process.

Required Qualifications:

5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of Site Reliability Engineering experience or related experience
5+ years of global support including advanced troubleshooting skills to resolve complex production issues
5+ years of resolving complex issues utilizing fundamental understanding of system components
5+ years of experience in tracing system interactions through various tiers

Desired Qualifications:

Strong understanding of REST APIs
Strong working understanding of troubleshooting tools such as Splunk, AppDynamics, and Elastic APM
Strong experience with API management tools such as Apigee
Working knowledge of databases such as MongoDB and Oracle
Strong foundation in reliability engineering principles and distributed systems behavior
Experience defining and implementing SLOs/SLIs and using them to drive system improvements
Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
Understanding of modern observability practices and experience implementing and maintaining monitoring solutions such as Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK in the cloud
Strong incident response skills with experience leading incident retrospectives and driving improvements
Excellent problem-solving abilities and experience debugging distributed systems
Track record of successfully automating operations and reducing toil
Strong communication skills with the ability to explain complex technical concepts to a range of audiences
Ability to work both independently and collaboratively (in groups) in an energetic environment

Responsibilities

Wells Fargo is seeking a Lead Site Reliability Engineer in Technology as part of Wealth and Investment Management Technology who thinks systematically about reliability, can translate business requirements into technical implementations, and thrives on making complex systems more robust. Learn more about the career areas and lines of business at wellsfargojobs.com.
The Site Reliability Engineering team is fundamental to ensuring our platform delivers consistent, reliable service to our client base. This person will be an individual contributor working at the intersection of software engineering and operations, applying engineering principles to infrastructure challenges. This individual will design and implement scalable systems, create observability solutions that offer actionable insights, and develop automation to improve our platform's reliability.

In this role, you will:

Work alongside developers and business stakeholders to automate acceptance criteria
Maintain high reliability and availability for software applications
Automate mundane tasks to reduce human error
Define SLIs (service level indicators) and SLOs (service level objectives) in collaboration with product owners
Lead incident response efforts and post-mortem analysis to prevent future occurrences.
Write incident root cause analyses that identify the underlying cause of each issue and how to prevent it from recurring
Document procedures, best practices and troubleshooting FAQs.
Debug the system and fix production-related issues.
Escalate and follow up on permanent fixes for development-related issues.
Handle complex operational tasks and recommend process and technology changes.
Provide global support, including troubleshooting production-related issues and performing checkouts.
Lead complex initiatives to develop infrastructure to provide solutions for business applications
Participate in various projects intended to continually improve or upgrade the infrastructure
Evaluate internal and external software solutions which could be leveraged to meet target state architecture goals
Review and analyze high impact outages to ensure the proper processes and procedures are in place to avoid problems in the future
Design, build, deploy and maintain infrastructure solutions through collaborative efforts with the team and third party vendors
Design, code, test, debug and document programs using Agile development practices
Make decisions on technical designs and implementation plans, and identify project risks and resource requirements
Direct the daily risk and control flow of operations, focusing on policies, procedures and work standards to ensure success
Recommend courses of action to maintain cost effectiveness and achieve results
Collaborate and consult with peers, colleagues and managers to resolve issues and achieve goals
Interact with customers and vendors

Job Expectations:

Ability to work weekends
Participate in on-call rotations to ensure 24/7 system availability and support.