Principal Application Reliability/Site Reliability Engineer

at International SOS

Trevose, PA 19053, USA

Start Date: Immediate
Expiry Date: 16 Sep, 2024
Salary: Not Specified
Posted On: 18 Jun, 2024
Experience: 15 year(s) or above
Skills: Good communication skills
Telecommute: No
Sponsor Visa: No
Required Visa Status:
Citizen, GC, US Citizen, Student Visa, H1B, CPT, OPT, H4 Spouse of H1B, GC Green Card
Employment Type:
Full Time, Part Time, Permanent, Independent - 1099, Contract – W2, C2H Independent, C2H W2, Contract – Corp 2 Corp, Contract to Hire – Corp 2 Corp

Description:

International SOS is the world’s leading medical and security services company, with over 12,000 employees working in 1,000 locations in 90 countries. We were founded on the principle of putting our clients’ employees first, and this is still true today. Led by 5,200 medical professionals and 200 security specialists, our teams work night and day to find solutions to protect our clients and their employees in whatever situation they may be facing. We assess, advise and assist from a medical, security and logistical perspective on a global scale to protect and save lives, thereby enabling our clients to achieve their business goals. As we have delivered on this mission over the last 35 years, we have become the market leader in global telehealth services and digital health solutions for an extensive client base of Fortune 500 companies, NGOs and governments around the world.

ABOUT YOU:

  • Thorough, detailed, and careful planning, development, and execution
  • Proactively looking for areas to improve
  • Clear communication with all involved parties
  • Calm under pressure
  • Clear sense of ownership and accountability
  • 15+ years of hands-on experience with Windows and Linux operating systems and databases (SQL and NoSQL)
  • In-depth knowledge and a proven record of building and operating highly available, scalable, large-scale enterprise applications on AWS, Azure, or OpenStack clouds.
  • Experience with distributed storage technologies like NFS, HDFS, Ceph, S3 as well as dynamic resource management frameworks (Mesos, Kubernetes, Yarn)
  • 10+ years of SRE or closely related experience for large-scale cloud SaaS
  • 10+ years of hands-on technical experience in DevOps, Release Management Engineering, or similar areas.
  • Strong experience with monitoring tools: Datadog, Prometheus, Grafana, CloudWatch, ELK, etc.
  • Extensive knowledge of config management systems
  • Strong programming skills: .NET, Java, Python, JavaScript, etc.
  • B.S. in Computer Science or Software Engineering. M.S. in similar fields preferred.
  • Minimal, occasional domestic travel within the US.
  • On-call rotation

Responsibilities:

ABOUT THE ROLE:

We are seeking a Principal ARE/SRE to be responsible for keeping all user-facing services and production systems running smoothly. Application/Site Reliability Engineers are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to environments and the codebase.

KEY RESPONSIBILITIES:

  • Be on rotation for availability incidents and provide support for customer service engineers.
  • Proactively develop scripts and tools to prevent incidents from ever happening.
  • Run infrastructure and applications with modern tools and automation like Puppet, Terraform, Kubernetes, etc.
  • Develop comprehensive monitoring and alerting on symptoms and potential issues to prevent outages.
  • Measure and optimize system performance to push our capabilities forward, get ahead of customer needs, and innovate to continually improve.
  • Provide primary operational support and engineering for multiple large distributed software applications
  • Document every action so findings turn into repeatable actions, and then into automation.

KEY RESPONSIBILITIES CONT:

  • Improve the deployment process to make it as smooth and effortless as possible.
  • Design, build and maintain core infrastructure pieces that allow scaling to support an enterprise-level number of concurrent users.
  • Debug production issues across services and levels of the stack.
  • Plan infrastructure growth and capacity.
  • Provide technical leadership of the SRE team (internal or through Managed Services).
  • Proactively work with development leads, client service leads, solution architects, and infrastructure leads to enhance system reliability, scalability, and robustness.


REQUIREMENT SUMMARY

Min: 15.0, Max: 20.0 year(s)

Information Technology/IT

IT Software - Other

Software Engineering

Graduate

Proficient

1

Trevose, PA 19053, USA