Software Engineer, Infra/DevOps at Grammarly Inc

Berlin, Berlin, Germany

Full Time

Start Date

Immediate

Expiry Date

17 Sep, 25

Salary

0.0

Posted On

17 Jun, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Skills

AWS, Reliability, Go, Azure, Customer Value, Infrastructure Solutions, Software Development

Industry

Information Technology/IT

Description

Grammarly offers a dynamic hybrid working model for this role. This flexible approach gives team members the best of both worlds: plenty of focus time along with in-person collaboration that helps foster trust, innovation, and a strong team culture.

THE OPPORTUNITY

To achieve our ambitious goals, we’re looking for a Software Engineer to join our Reliability Engineering team as part of the wider Engineering Platform team. In this role, you will build world-class, secure, and reliable cloud-native infrastructure solutions for Grammarly engineers that will scale with our user base.

Grammarly’s engineers and researchers have the freedom to innovate and uncover breakthroughs and, in turn, influence our product roadmap. The complexity of our technical challenges is growing rapidly as we scale our interfaces, algorithms, and infrastructure. You can hear more from our team on our technical blog.

As a Software Engineer, Infra, you will be a key player in building and enhancing the reliability and observability of our services across the engineering organization. You will be part of a centralized team focused on improving incident management, introducing auto-scaling, resilience, and self-healing mechanisms, and conducting chaos testing. Your work will be instrumental in establishing a center of excellence for reliability, defining and evangelizing best practices, and developing tools to scale these practices across all engineering teams.

In this role, you will:

Use modern infrastructure management tools and services like AWS to build a massively scalable platform for Grammarly’s services.
Be an ambassador for Operational Excellence by building and continually improving incident management tooling and processes.
Implement proactive improvements that reduce manual intervention and increase reliability. This includes automated deployment improvements, canary analysis, self-healing mechanisms, and autoscaling.
Manage cloud-native infrastructure solutions, such as cross-service infrastructure, Kubernetes clusters and deployments, auto-scaling tool sets, and service discovery.
Build solutions and frameworks to spin up, test, deploy, and observe Grammarly’s service reliability.
Participate in on-call incident response and escalation procedures.

QUALIFICATIONS

Has a minimum of 5 years of experience managing live production SaaS environments with high load.
Is experienced in working on a centralized reliability or SRE team.
Has hands-on experience with cloud-native infrastructure solutions such as container orchestration and service discovery in Kubernetes-based environments.
Has a background in software development or engineering roles with a focus on reliability.
Is knowledgeable about reliability practices and how to scale them across the engineering organization.
Is knowledgeable about AWS, or has deep expertise in Azure or GCP and is willing to learn AWS quickly.
Can deliver maintainable and high-quality code in Go or other languages.
Can communicate well and collaborate effectively, empathetically, and proactively on a tightly integrated team.
Embodies our EAGER values—is ethical, adaptable, gritty, empathetic, and remarkable.
Is inspired by our MOVE principles, which are the blueprint for how things get done at Grammarly: move fast and learn faster, obsess about creating customer value, value impact over activity, and embrace healthy disagreement rooted in trust.
