Senior Site Reliability Engineer - Data (REMOTE) at Discogs

Portland, Oregon, USA

Full Time

Start Date

Immediate

Expiry Date

25 Jul, 25

Salary

140000.0

Posted On

25 Apr, 25

Experience

1 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Code, Ops, Python, Computer Science, DevOps, Kafka, Collaboration, GraphQL, Scripting, MySQL, AWS, Kubernetes, RDBMS, Cloud Development

Industry

Information Technology/IT

Description

The Discogs Platform team is focused on several objectives: building and supporting performant, cost-effective, reliable infrastructure; developer experience tooling and mentorship; and creating “golden paths” for organization-wide standards and velocity. As a key member of the Platform team, the Senior Site Reliability Engineer - Data will be working closely with other Discogs engineering squads to develop and optimize scalable, well-planned relational database architectures, drive best practices and stability for our use of Kafka and change data capture, and contribute to the Platform team’s operations.

WHO WE ARE

We are dedicated to supporting a global community of music fans and collectors who share the value, culture, connection, and joy of record collecting. Fostering the exchange of knowledge, records, and curation, we help people help each other deepen their relationship with music. Leveraging the power of community, we are committed to enabling people to explore artists and their recorded works through the world’s definitive music discography, stay informed with record collection and sales history data, get organized with specialized collection management tools, and stay connected to a global community of fellow record collectors and sellers. Providing this essential set of resources, tools, and access, we aim to unleash boundless opportunities for people to dig into the depths of their musical interests, build and fortify their record collections, cultivate and bridge communities, and elevate their connection to music and record collecting.

MINIMUM EDUCATION AND EXPERIENCE

A Bachelor’s Degree in Computer Science or similar area of focus, or equivalent relevant work experience.
5+ years of experience working with Kafka and relational database management systems (RDBMS).
6+ years of experience in Ops, DevOps, Site Reliability, Platform, or other systems roles.

REQUIRED SKILLS & ABILITIES:

Relational database schema design, query performance optimization, administration (MySQL, Percona Server, AWS RDS)
Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
CI/CD (GitHub Actions)
GitOps (ArgoCD)
Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
AWS and cloud development (VPC, EKS, RDS, S3)
Observability (Datadog, Sentry)
Scripting (Shell, Python)
Track record of collaboration and mentorship
Excellent written communication and documentation skills
Continuous learning
Ownership and proactive approach to solving large problems

Preferred:

Infrastructure-as-code (Terraform)
Elasticsearch (ECK administration, scaling, performance)
Python (SQLAlchemy, FastAPI)
GraphQL (schema design, Apollo Federation)

How To Apply:

In case you would like to apply to this job directly from the source, please click here

Responsibilities

Stewarding Discogs’ data stores as a key subject matter expert
Leading efforts on the reliability and design patterns of our Kafka and Kafka Connect implementations
Establishing data contracts and clear communication standards between CDC producers and consumers
Working closely with engineering squads to refactor and re-architect MySQL database schema and indexing for long-term scalability, performance, and cost effectiveness
Mentoring engineering squads on Platform best practices for MySQL, Kafka, and other software development lifecycle areas
Writing documentation and runbooks that contribute to the engineering organization’s knowledge base
Working in a containerized, orchestrated environment
Contributing to the Platform team’s disciplines of site reliability and operations, supporting both our squads and Platform’s central infrastructure
Participating in the on-call rotation, responding to incidents, and troubleshooting data and other operations issues