Data Engineer - Pyspark
at Virtusa
Dubai, United Arab Emirates
Start Date: Immediate
Expiry Date: 27 Apr, 2025
Salary: Not Specified
Posted On: 28 Jan, 2025
Experience: 3 year(s) or above
Skills: Hadoop, Routine Maintenance, Kafka, Code, Collaboration, Design, Performance Tuning, Processing, Orchestration, Maintenance, Information Systems, File Systems, Relational Databases, CDP, Validation, Spark, Data Warehousing, Hive, Data Quality, Optimization Techniques
Telecommute: No
Sponsor Visa: No
Description:
SKILLS: PYSPARK, BIG DATA – HADOOP, HIVE, SPARK, KAFKA
Responsibilities:
· Data Pipeline Development: Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity and accuracy (an illustrative sketch of such a pipeline follows this list).
· Data Ingestion: Implement and manage data ingestion processes from a variety of sources (e.g., relational databases, APIs, file systems) to the data lake or data warehouse on CDP.
· Data Transformation and Processing: Use PySpark to process, cleanse, and transform large datasets into meaningful formats that support analytical needs and business requirements.
· Performance Optimization: Conduct performance tuning of PySpark code and Cloudera components, optimizing resource utilization and reducing runtime of ETL processes.
· Data Quality and Validation: Implement data quality checks, monitoring, and validation routines to ensure data accuracy and reliability throughout the pipeline.
· Automation and Orchestration: Automate data workflows using tools like Apache Oozie, Airflow, or similar orchestration tools within the Cloudera ecosystem.
· Monitoring and Maintenance: Monitor pipeline performance, troubleshoot issues, and perform routine maintenance on the Cloudera Data Platform and associated data processes.
· Collaboration: Work closely with other data engineers, analysts, product managers, and other stakeholders to understand data requirements and support various data-driven initiatives.
· Documentation: Maintain thorough documentation of data engineering processes, code, and pipeline configurations.
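The responsibilities above center on PySpark ETL work on CDP. As a rough illustration of the kind of job involved, here is a minimal sketch assuming a Parquet landing zone and a Hive metastore; the path, table, and column names are hypothetical and not taken from this posting.

```python
# Minimal illustrative sketch -- paths, tables, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("orders_etl_sketch")     # hypothetical job name
    .enableHiveSupport()              # assumes a Hive metastore is reachable
    .getOrCreate()
)

# Ingestion: read raw files landed in the data lake (path is an assumption)
raw = spark.read.parquet("/data/raw/orders")

# Transformation and cleansing: drop malformed rows, normalize types, dedupe
clean = (
    raw.dropna(subset=["order_id", "order_ts"])
       .withColumn("order_date", F.to_date("order_ts"))
       .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .dropDuplicates(["order_id"])
)

# Simple data-quality gate before publishing
if clean.filter(F.col("amount") < 0).count() > 0:
    raise ValueError("negative amounts found; failing the load")

# Load: write partitioned output to a Hive-managed table
(clean.write
      .mode("overwrite")
      .partitionBy("order_date")
      .saveAsTable("analytics.orders_clean"))   # hypothetical target table
```

In practice a job like this would be parameterized and scheduled through an orchestrator such as Oozie or Airflow, as the automation responsibility above notes.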
EDUCATION AND EXPERIENCE
· Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Systems, or a related field.
· 3+ years of experience as a Data Engineer, with a strong focus on PySpark and the Cloudera Data Platform.
TECHNICAL SKILLS
· PySpark: Advanced proficiency in PySpark, including working with RDDs, DataFrames, and optimization techniques (one such technique is sketched after this list).
· Cloudera Data Platform: Strong experience with Cloudera Data Platform (CDP) components, including Cloudera Manager, Hive, Impala, HDFS, and HBase.
· Data Warehousing: Knowledge of data warehousing concepts, ETL best practices, and experience with SQL-based tools (e.g., Hive, Impala).
· Big Data Technologies: Familiarity with Hadoop, Kafka, and other distributed computing tools.
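As one concrete example of the optimization techniques mentioned under PySpark, the sketch below uses a broadcast join and caching on the DataFrame API; the table and column names are assumptions for illustration only.

```python
# Illustrative optimization pattern: broadcast a small dimension table and cache
# a DataFrame that is reused downstream. Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join_tuning_sketch").getOrCreate()

orders = spark.table("analytics.orders_clean")    # large fact table (assumed)
countries = spark.table("reference.countries")    # small dimension table (assumed)

# Broadcasting the small side avoids shuffling the large fact table
enriched = orders.join(broadcast(countries), on="country_code", how="left")

# Cache once when several aggregations read the same intermediate result
enriched.cache()

daily_totals = (
    enriched.groupBy("order_date", "country_name")
            .agg(F.sum("amount").alias("total_amount"))
)

# explain() surfaces the physical plan, e.g. BroadcastHashJoin vs. SortMergeJoin
daily_totals.explain()
```

Broadcast joins only pay off when the smaller table comfortably fits in executor memory; otherwise Spark's default sort-merge join is usually the safer choice.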
REQUIREMENT SUMMARY
Experience: 3.0 to 8.0 year(s)
Industry: Information Technology/IT
Category: IT Software - DBA / Data Warehousing
Role: Software Engineering
Education: Graduate
Specialization: Computer Science, Engineering, Information Systems
Proficiency: Proficient
Vacancies: 1
Location: Dubai, United Arab Emirates