Site Reliability Engineer
at Reach Digital Health
Cape Town, Western Cape, South Africa
Start Date | Expiry Date | Salary | Posted On | Experience | Skills | Telecommute | Sponsor Visa |
---|---|---|---|---|---|---|---|
Immediate | 26 Nov, 2024 | Not Specified | 29 Aug, 2024 | N/A | Puppet, Testing, MySQL, Benchmarking, Documentation, Operating Systems, Web Servers, Kubernetes, Reliability Engineering, AWS, C++, Ruby, Azure, Java, Git, Perl, Capacity Planning, Scripting Languages, Programming Languages, Load Testing, Performance Engineering, Redis | No | No |
Description:
JOB OVERVIEW
The Site Reliability Engineer (SRE) will apply software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.
You’ll be exposed to unique challenges while assisting with the maintenance and stability of our global and local infrastructure, and you'll have the opportunity to contribute to internal and external open-source projects. We believe in choosing the right tools for the job, and we support creativity in solving problems.
QUALIFICATIONS
- A bachelor’s degree in Computer Science, Engineering or related field, or equivalent experience.
SKILLS AND EXPERIENCE REQUIRED
- Proficient in one or more programming languages, such as Python, Go, Java, or C++.
- Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
- Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
- Proficient in one or more UNIX-like operating systems.
- Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
- Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
- Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
- Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
- Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
- Proficient in one or more version control and collaboration tools, such as Git.
- Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
- Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
- Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
- Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
- Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
- Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
- Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.
Responsibilities:
RESPONSIBILITIES AND DUTIES
Your primary responsibilities will include but not be limited to:
- Assisting with the resources that facilitate engineering services, and keeping them operational. This includes continuous integration systems, software deployment, basic troubleshooting of code, and the creation and management of software repositories.
- Ensuring servers are patched promptly against security exploits, managing secure access to servers and repositories for partners and internal staff, and securing the interconnection between systems.
- Ensuring servers are configured in a documented and repeatable way.
- Ensuring system and server architecture is appropriate to the requirements of projects, easily maintainable in the long term, and provides appropriate levels of redundancy.
- Providing timeous uptime assurance, and supporting issue investigation and recovery procedures.
- Designing and developing tools and software that automate and improve the infrastructure and operation of our systems, ensuring adoption of best practices.
- Performing code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team, and assisting other engineering teams with the same.
- Providing general support (problems, password changes, etc.) for office infrastructure and services such as Google Workspace, Slack, and PPM Pro.
- Site load testing, unit testing, disaster recovery testing, and quality assurance on a system level including backend performance, deployment sanity, security, scalability and stability.
- Providing data security expertise within SRE and supporting the organisation and projects with compliance with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.
- Advising on and/or contributing to new or emerging technologies that might be relevant to Reach.
- Committing to writing software that is designed to be testable.
- Working well within cross-functional teams to produce world-class products and programmes that empower end users.
You will primarily be responsible for:
- Infrastructure reliability and performance:
  - Monitoring, measuring, and improving the reliability and performance of our systems
  - Maintenance, upgrades, and security updates
- Automation and tooling:
  - Designing and developing software and scripts that automate and streamline various aspects of infrastructure and operations
  - Assisting other teams with deployment and updates of their applications and services
  - Supporting the internal management of the organisation’s technological infrastructure
- Data management and security:
  - Working with the SRE, Data Security, Legal and project teams to develop and enforce policies and procedures for data collection, storage, and access; ensuring compliance with data privacy regulations; implementing and monitoring security measures to protect sensitive health information; and managing data backups and disaster recovery
REQUIREMENT SUMMARY
Experience: Min: N/A, Max: 5.0 year(s)
Information Technology/IT
IT Software - Network Administration / Security
Software Engineering
Graduate
Computer science engineering or related field or equivalent experience
Cape Town, Western Cape, South Africa