Data Engineer at NovaGigs
McLean, Virginia, USA - Full Time


Start Date

Immediate

Expiry Date

29 Jul, 25

Salary

0.0

Posted On

30 Apr, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

SQL, Design Principles, Algebra, Pandas, Computer Science, Statistics, Econometrics, SAS, Integration

Industry

Computer Software/Engineering

Description

ID: 2021-006-py-data-eng
Title: Data Engineer
Job Location: McLean, VA
Posted Date: 06/07/2021
Job Description
Seeking a Data Engineer to assist in modernizing modeling processes for one of our large financial clients. You will port SAS code to Python (Pandas and PySpark) using AWS EMR.
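
For illustration only, the kind of porting involved might look like the following minimal Pandas sketch of a SAS-style filter-and-summarize step (the dataset, file, and column names are assumptions, not part of this posting):

  # Hypothetical sketch: a SAS "filter, derive, summarize" step ported to Pandas.
  # All file, dataset, and column names are assumptions for illustration only.
  import pandas as pd

  # Roughly equivalent to:
  #   data work.active; set raw.loans; where status = 'ACTIVE';
  #     balance_k = balance / 1000; run;
  loans = pd.read_csv("loans.csv")                      # raw.loans
  active = loans[loans["status"] == "ACTIVE"].copy()
  active["balance_k"] = active["balance"] / 1000

  # Roughly equivalent to:
  #   proc summary data=work.active nway; class region; var balance_k;
  #     output out=work.summary mean=avg_balance_k; run;
  summary = (
      active.groupby("region", as_index=False)["balance_k"]
            .mean()
            .rename(columns={"balance_k": "avg_balance_k"})
  )
  summary.to_csv("summary_python.csv", index=False)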

Responsibilities

  • Translate existing SAS code into Python code. You must know both Pandas and PySpark DataFrames.
  • Verify that the Python version of the process is equivalent to the SAS original. This involves running both processes, comparing the output, and resolving any differences (a minimal sketch of such a comparison follows this list).
  • Leverage PySpark and AWS EMR to parallelize the process and reduce the runtime.
  • Optimize the Python code to reduce the runtime.
  • Enhance the Python process to be fault-tolerant and contain checkpoints to make rerunning a subset of the process more efficient.
  • Write automated tests for Python code.
  • Peer-review code and automated tests, and help team members with design and implementation challenges.
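
A minimal sketch of the equivalence check described in the second bullet above, assuming both the SAS and the Python processes export their results to CSV (the paths, key column, and tolerances are placeholders):

  # Minimal sketch of an output-equivalence check between a SAS run and the
  # ported Python run. Assumes both processes export CSV; the paths and the
  # key column are placeholders.
  import pandas as pd
  from pandas.testing import assert_frame_equal

  KEY_COLS = ["region"]  # assumed business key used to align rows

  sas_out = pd.read_csv("summary_sas.csv")
  py_out = pd.read_csv("summary_python.csv")

  # Align row and column order before comparing, since SAS and Pandas may
  # emit rows and columns in different orders.
  sas_out = sas_out.sort_values(KEY_COLS).reset_index(drop=True)[sorted(sas_out.columns)]
  py_out = py_out.sort_values(KEY_COLS).reset_index(drop=True)[sorted(py_out.columns)]

  # Tolerate tiny floating-point differences between SAS and NumPy arithmetic.
  assert_frame_equal(sas_out, py_out, check_dtype=False, rtol=1e-9, atol=1e-12)
  print("Outputs match within tolerance.")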

Qualifications

  • At least 3 years of experience developing production Python code.
  • A strong understanding of Pandas and PySpark.
  • A strong understanding of SQL.
  • Experience with SAS.
  • Solid understanding of software design principles.

Preferred Skills

  • BS in Computer Science or equivalent experience.
  • Experience with cloud computing and storage services, particularly AWS EMR.
  • Experience writing automated unit, integration, regression, performance, and acceptance tests (a brief pytest-style sketch follows this list).
  • Strong quantitative skills (statistics, econometrics, linear algebra).
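
As a sketch of the automated-test expectation above (a pytest-style unit test; the function under test and the sample data are hypothetical, not from this posting):

  # Sketch of a pytest-style unit test for a ported transformation.
  # `summarize_active_loans` and the sample data are hypothetical.
  import pandas as pd
  from pandas.testing import assert_frame_equal


  def summarize_active_loans(loans: pd.DataFrame) -> pd.DataFrame:
      # Hypothetical ported step: mean balance (in thousands) per region
      # for active loans.
      active = loans[loans["status"] == "ACTIVE"].copy()
      active["balance_k"] = active["balance"] / 1000
      return (
          active.groupby("region", as_index=False)["balance_k"]
                .mean()
                .rename(columns={"balance_k": "avg_balance_k"})
      )


  def test_summarize_active_loans():
      loans = pd.DataFrame(
          {
              "region": ["EAST", "EAST", "WEST"],
              "status": ["ACTIVE", "ACTIVE", "CLOSED"],
              "balance": [1000.0, 3000.0, 5000.0],
          }
      )
      expected = pd.DataFrame({"region": ["EAST"], "avg_balance_k": [2.0]})
      assert_frame_equal(summarize_active_loans(loans), expected)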