Site Reliability Engineer, Digital Transformation

at Harvard University

Boston, Massachusetts, USA

Start Date: Immediate
Expiry Date: 28 Sep, 2024
Salary: Not Specified
Posted On: 28 Jun, 2024
Experience: 4 year(s) or above
Skills: Jenkins, Infrastructure, Python, Containerization, Automation Tools, Security, Code, AWS, Computer Science, Puppet, Ansible, Versioning, Secondary Education, Critical Systems
Telecommute: No
Sponsor Visa: No

Description:

POSITION DESCRIPTION

Be a pioneer in business, education, and global impact by joining the Harvard Business School (HBS) Digital Transformation team - a “startup with assets,” where you will have the chance to deploy digital- and emerging-technology education solutions. Where else can you make a difference at the intersection of cutting-edge technology, world-class education, noble purpose, and timeless legacy?
We are building educational and research solutions powered by Generative AI (GenAI) that scale across hundreds of courses and to hundreds of thousands of users. Our products assist educators and students alike with intelligent, adaptive capabilities that make education more accessible, engaging, and effective.
As a Site Reliability Engineer at HBS, you will play a crucial role in ensuring the high availability, performance, security, and scalability of our cloud-based generative AI products. You will work closely with DTx’s data science and machine learning engineering teams to build and maintain robust, efficient, and reliable AI products on the AWS platform. You will work at the intersection of software engineering and systems engineering to build and run large-scale, fault-tolerant systems that balance speed of deployment with stability and operate at peak efficiency while also managing costs.

Responsibilities Include:

  • Design, implement, and maintain scalable, reliable, and efficient systems on the AWS platform.
  • Automate the deployment, scaling, and management of applications using AWS services such as EC2, S3, RDS, Lambda, CloudFormation, etc.
  • Monitor system performance, troubleshoot issues, and implement solutions to ensure optimal operation and uptime.
  • Implement solutions that enable running multiple GenAI workflows using shared infrastructure, while ensuring high throughput, low latency, and speed of deployment.
  • Provide a platform for machine learning (and other exciting workloads) allowing developers to move quickly and experiment.
  • Collaborate with development teams to optimize applications for the cloud and implement best practices for cloud-native development.
  • Implement and manage continuous integration and deployment (CI/CD) pipelines.
  • Develop and maintain disaster recovery plans and conduct regular system backups.
  • Ensure security compliance and best practices throughout the AWS infrastructure.
  • Document system configurations, processes, and procedures.
  • Develop runbooks and recipes for on-call support as part of a rotation schedule to resolve critical issues outside of regular business hours.
  • Adhere to standard methodologies in architectural design, testing (unit, integration, visual, and regression), and scrum methodology.
  • Evaluate developer platform designs, technical decisions, and code to ensure all are high quality, efficient, and well documented.
  • Develop and lead all aspects of the Container Orchestration Platform, a diverse ecosystem of multiple applications.
  • Complete other responsibilities as assigned.

BASIC QUALIFICATIONS

  • Minimum of five years’ post-secondary education or relevant work experience

ADDITIONAL QUALIFICATIONS AND SKILLS

Required Qualifications:

  • Bachelor’s degree in computer science or a related technical field, or equivalent combination of education and experience.
  • 5+ years of experience developing and operating mission-critical systems as a Site Reliability Engineer, Sr. DevOps Engineer, or related role.

Additional Preferred Qualifications:

  • Excellent understanding of Linux configuration and administration.
  • Strong experience with Python.
  • 4+ years of experience in software engineering, with a proven understanding of containerization and Infrastructure as Code.
  • Experience with automation tools (Terraform, Ansible, Puppet) and CI/CD pipelines.
  • Familiarity with monitoring and observability tools (Prometheus, Splunk, Grafana, ELK stack).
  • Familiarity with production-level Generative AI workflows, such as retrieval augmented generation, model deployment, versioning, evaluation pipelines, etc.
  • Strong understanding of network protocols and security.
  • Extensive knowledge of and hands-on experience with AWS cloud infrastructure and services, including CI/CD and IaC provisioning tools such as Jenkins, Argo CD, Scalr, Terraform, and GitHub Actions.
  • Experience in AWS and familiarity with running containerized services.
  • Knowledge of best practices in observability and monitoring for Docker or Kubernetes clusters at scale, plus experience with cost optimization tools.

ABOUT US

Founded in 1908 as part of Harvard University, Harvard Business School (www.hbs.edu) is located on a 40-acre campus in Boston. The school offers two full-time MBA and PhD programs, more than 175 Executive Education programs, and certificates and courses through Harvard Business School Online. For more than a century, Harvard Business School faculty have drawn on their research, connection to practice, global expertise, and passion for teaching to educate leaders who make a difference in the world. The school and its curriculum attract the boldest thinkers and the most collaborative learners who will shape the practice of business and entrepreneurship around the globe.

REQUIREMENT SUMMARY

Min: 4.0, Max: 5.0 year(s)

Education Management

IT Software - System Programming

Education

Diploma

Proficient

1

Boston, MA, USA