Data Engineer at Wikimedia Foundation
London, England, United Kingdom
Full Time


Start Date

Immediate

Expiry Date

13 Nov 2025

Salary

115,334

Posted On

13 Aug 2025

Experience

3+ years

Remote Job

Yes

Sponsor Visa

No

Skills

Airflow, Cassandra, Data Governance, Data Pipelines, Hive, Kafka, Kubernetes, Presto, Python, Spark, SQL

Industry

Information Technology/IT

Description

DATA ENGINEER

Wikipedia and its sister projects reach billions of users monthly across 300+ languages and are powered by 200,000+ volunteer contributors. The Wikimedia Foundation’s Data Platform team enables this global knowledge ecosystem through robust data capabilities that serve both our internal teams and the public through products like Wikistats, Wikimedia Analytics APIs, and Data Snapshots. Our vision is a world in which every single human being can freely share in the sum of all human knowledge, including access to data for research, feature development, and advancing artificial intelligence responsibly.
As a Data Engineer for our Data Platform, you will shape the future of how Wikimedia’s vast data ecosystem serves both our internal teams and the global community. You will contribute to the Data Platform Engineering team’s effort to unify data systems across the Wikimedia Foundation and deliver scalable solutions that support the open knowledge movement.

EXPERIENCE

  • 3+ years of data engineering experience, with exposure to on-premises systems (e.g., Spark, Hadoop, HDFS).
  • Understanding of engineering best practices with a strong emphasis on writing maintainable and reliable code.
  • Hands-on experience in troubleshooting systems and pipelines for performance and scaling.
  • Desirable: Exposure to architectural/system design or technical ownership.
  • Desirable: Experience in data governance, data lineage, and data quality initiatives.

CORE TECHNICAL SKILLS

  • Working experience with data pipeline tools like Airflow, Kafka, Spark, and Hive (a brief sketch follows this list).
  • Proficient in Python or Java/Scala, with working knowledge of the associated development tools and ecosystems.
  • Knowledge of SQL and experience with various database/query dialects (e.g., MariaDB, HiveQL, CassandraQL, Spark SQL, Presto).
  • Working knowledge of CI/CD processes and software containerization.
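
As an illustration of the pipeline tooling named above, here is a minimal sketch of an Airflow DAG that submits a Spark job and then runs a validation step. It is a hypothetical example, not Wikimedia's actual code: the DAG id, script path, and connection id are placeholders, and it assumes Airflow 2.4+ with the apache-spark provider installed.

    # Hypothetical Airflow DAG: run a daily Spark aggregation, then validate it.
    # All names and paths below are illustrative placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    def check_output_partition(**context):
        # Placeholder quality gate; a real check would query Hive/Presto
        # and raise if the day's partition is missing or anomalous.
        print("validating output partition for", context["ds"])

    with DAG(
        dag_id="daily_pageview_aggregation",  # placeholder name
        start_date=datetime(2025, 1, 1),
        schedule="@daily",                    # Airflow 2.4+ syntax
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    ) as dag:
        aggregate = SparkSubmitOperator(
            task_id="aggregate_pageviews",
            application="/jobs/aggregate_pageviews.py",  # illustrative path
            conn_id="spark_default",
        )
        validate = PythonOperator(
            task_id="validate_partition",
            python_callable=check_output_partition,
        )
        aggregate >> validate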

BONUS SKILLS

  • Familiarity with additional technologies such as Kubernetes, Flink, Iceberg, Druid, Presto, Cassandra.
  • Working knowledge of AI development tooling and AI applications in software engineering.

OTHER SKILLS

  • Familiarity with stream processing frameworks like Spark Streaming or Flink (a brief sketch follows this list).
  • Good communication and collaboration skills to interact effectively within and across teams.
  • Ability to produce clear, well-documented technical designs and articulate ideas to both technical and non-technical stakeholders.
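
For the stream processing item above, a minimal PySpark Structured Streaming sketch might look like the following; the broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is available.

    # Hypothetical streaming sketch: count events per minute from a Kafka topic.
    # Broker and topic are placeholders; requires the spark-sql-kafka package.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "edit-events")                # placeholder topic
        .load()
    )

    # The Kafka source exposes a `timestamp` column; bucket it into 1-minute windows.
    counts = events.groupBy(F.window(F.col("timestamp"), "1 minute")).count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()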

RESPONSIBILITIES

  • Designing and Building Data Pipelines: Develop scalable, robust infrastructure and processes using tools such as Airflow, Spark, and Kafka.
  • Monitoring and Alerting for Data Quality: Implement systems to detect and address potential data issues promptly (a brief sketch follows this list).
  • Supporting Data Governance and Lineage: Assist in designing and implementing solutions to track and manage data across pipelines.
  • Evolving the Shared Data Platform: Collaborate with peers to improve the platform, enabling use cases like product analytics, bot detection, and image classification.
  • Enhancing Operational Excellence: Identify and implement improvements in system reliability, maintainability, and performance.
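
To make the data-quality monitoring responsibility concrete, here is a minimal, hypothetical sketch of one such check: comparing a day's partition row count against a trailing average and alerting on large drops. The names, thresholds, and alert hook are illustrative, not Wikimedia's actual tooling.

    # Hypothetical data-quality check: flag a partition whose row count drops
    # sharply below the trailing average. All numbers below are made up.
    from dataclasses import dataclass

    @dataclass
    class PartitionStats:
        day: str
        row_count: int

    def row_count_dropped(history: list[PartitionStats], today: PartitionStats,
                          max_drop_ratio: float = 0.5) -> bool:
        """True if today's count fell more than max_drop_ratio below the
        trailing average of prior days."""
        if not history:
            return False  # nothing to compare against yet
        baseline = sum(p.row_count for p in history) / len(history)
        return today.row_count < baseline * (1.0 - max_drop_ratio)

    history = [PartitionStats("2025-08-10", 1_000_000),
               PartitionStats("2025-08-11", 980_000),
               PartitionStats("2025-08-12", 1_020_000)]
    today = PartitionStats("2025-08-13", 400_000)

    if row_count_dropped(history, today):
        print(f"ALERT: row count drop on {today.day}")  # stand-in for real alerting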