Data Engineer - PySpark

at Virtusa

Dubai, United Arab Emirates

Start Date: Immediate
Expiry Date: 27 Apr, 2025
Salary: Not Specified
Posted On: 28 Jan, 2025
Experience: 3 year(s) or above
Skills: Hadoop, Routine Maintenance, Kafka, Code, Collaboration, Design, Performance Tuning, Processing, Orchestration, Maintenance, Information Systems, File Systems, Relational Databases, CDP, Validation, Spark, Data Warehousing, Hive, Data Quality, Optimization Techniques
Telecommute: No
Sponsor Visa: No

Description:

SKILL: PYSPARK, BIG DATA – HADOOP, HIVE, SPARK, KAFKA

Responsibilities:
· Data Pipeline Development: Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity and accuracy.
· Data Ingestion: Implement and manage data ingestion processes from a variety of sources (e.g., relational databases, APIs, file systems) to the data lake or data warehouse on CDP.
· Data Transformation and Processing: Use PySpark to process, cleanse, and transform large datasets into meaningful formats that support analytical needs and business requirements.
· Performance Optimization: Conduct performance tuning of PySpark code and Cloudera components, optimizing resource utilization and reducing runtime of ETL processes.
· Data Quality and Validation: Implement data quality checks, monitoring, and validation routines to ensure data accuracy and reliability throughout the pipeline.
· Automation and Orchestration: Automate data workflows using Apache Oozie, Apache Airflow, or similar orchestration tools within the Cloudera ecosystem.
· Monitoring and Maintenance: Monitor pipeline performance, troubleshoot issues, and perform routine maintenance on the Cloudera Data Platform and associated data processes.
· Collaboration: Work closely with fellow data engineers, analysts, product managers, and other stakeholders to understand data requirements and support data-driven initiatives.
· Documentation: Maintain thorough documentation of data engineering processes, code, and pipeline configurations.

EDUCATION AND EXPERIENCE

· Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Systems, or a related field.
· 3+ years of experience as a Data Engineer, with a strong focus on PySpark and the Cloudera Data Platform.

TECHNICAL SKILLS

· PySpark: Advanced proficiency in PySpark, including working with RDDs, DataFrames, and optimization techniques.
· Cloudera Data Platform: Strong experience with Cloudera Data Platform (CDP) components, including Cloudera Manager, Hive, Impala, HDFS, and HBase.
· Data Warehousing: Knowledge of data warehousing concepts, ETL best practices, and experience with SQL-based tools (e.g., Hive, Impala).
· Big Data Technologies: Familiarity with Hadoop, Kafka, and other distributed computing tools.

REQUIREMENT SUMMARY

Min: 3.0 year(s), Max: 8.0 year(s)

Information Technology/IT

IT Software - DBA / Datawarehousing

Software Engineering

Graduate

Computer Science, Engineering, Information Systems

Proficient

1

Dubai, United Arab Emirates