Data Engineer at NovaGigs
McLean, Virginia, USA - Full Time


Start Date

Immediate

Expiry Date

29 Jul, 25

Salary

0.0

Posted On

30 Apr, 25

Experience

3 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

SQL, Design Principles, Algebra, Pandas, Computer Science, Statistics, Econometrics, SAS, Integration

Industry

Computer Software/Engineering

Description

ID: 2021-006-py-data-eng
Title: Data Engineer
Job Location: McLean, VA
Posted Date: 06/07/2021
Job Description
Seeking a Data Engineer to assist in modernizing modeling processes for one of our large financial clients. You will port SAS code to Python (Pandas and PySpark) using AWS EMR.
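
For illustration only, the kind of porting involved might look like the following minimal Pandas sketch of a SAS-style filter-and-summarize step (the dataset, file, and column names are assumptions, not part of this posting):

  # Hypothetical sketch: a SAS "filter, derive, summarize" step ported to Pandas.
  # All file, dataset, and column names are assumptions for illustration only.
  import pandas as pd

  # Roughly equivalent to:
  #   data work.active; set raw.loans; where status = 'ACTIVE';
  #     balance_k = balance / 1000; run;
  loans = pd.read_csv("loans.csv")                      # raw.loans
  active = loans[loans["status"] == "ACTIVE"].copy()
  active["balance_k"] = active["balance"] / 1000

  # Roughly equivalent to:
  #   proc summary data=work.active nway; class region; var balance_k;
  #     output out=work.summary mean=avg_balance_k; run;
  summary = (
      active.groupby("region", as_index=False)["balance_k"]
            .mean()
            .rename(columns={"balance_k": "avg_balance_k"})
  )
  summary.to_csv("summary_python.csv", index=False)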

Responsibilities

  • Translate existing SAS code into Python code. You must know both Pandas and PySpark DataFrames.
  • Verify that the Python version of the process is equivalent to the SAS original. This involves running both processes, comparing the output, and resolving any differences (a minimal sketch of such a comparison follows this list).
  • Leverage PySpark and AWS EMR to parallelize the process and reduce the runtime.
  • Optimize the Python code to reduce the runtime.
  • Enhance the Python process to be fault-tolerant and contain checkpoints to make rerunning a subset of the process more efficient.
  • Write automated tests for Python code.
  • Peer-review code and automated tests, and help team members with design and implementation challenges.
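
A minimal sketch of the equivalence check described in the second bullet above, assuming both the SAS and the Python processes export their results to CSV (the paths, key column, and tolerances are placeholders):

  # Minimal sketch of an output-equivalence check between a SAS run and the
  # ported Python run. Assumes both processes export CSV; the paths and the
  # key column are placeholders.
  import pandas as pd
  from pandas.testing import assert_frame_equal

  KEY_COLS = ["region"]  # assumed business key used to align rows

  sas_out = pd.read_csv("summary_sas.csv")
  py_out = pd.read_csv("summary_python.csv")

  # Align row and column order before comparing, since SAS and Pandas may
  # emit rows and columns in different orders.
  sas_out = sas_out.sort_values(KEY_COLS).reset_index(drop=True)[sorted(sas_out.columns)]
  py_out = py_out.sort_values(KEY_COLS).reset_index(drop=True)[sorted(py_out.columns)]

  # Tolerate tiny floating-point differences between SAS and NumPy arithmetic.
  assert_frame_equal(sas_out, py_out, check_dtype=False, rtol=1e-9, atol=1e-12)
  print("Outputs match within tolerance.")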

Qualifications

  • At least 3 years of experience developing production Python code.
  • A strong understanding of Pandas and PySpark.
  • A strong understanding of SQL.
  • Experience with SAS.
  • Solid understanding of software design principles.

Preferred Skills

  • BS in Computer Science or equivalent experience.
  • Experience with cloud computing and storage services, particularly AWS EMR.
  • Experience writing automated unit, integration, regression, performance, and acceptance tests (a brief pytest-style sketch follows this list).
  • Strong quantitative skills (statistics, econometrics, linear algebra).
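
As a sketch of the automated-test expectation above (a pytest-style unit test; the function under test and the sample data are hypothetical, not from this posting):

  # Sketch of a pytest-style unit test for a ported transformation.
  # `summarize_active_loans` and the sample data are hypothetical.
  import pandas as pd
  from pandas.testing import assert_frame_equal


  def summarize_active_loans(loans: pd.DataFrame) -> pd.DataFrame:
      # Hypothetical ported step: mean balance (in thousands) per region
      # for active loans.
      active = loans[loans["status"] == "ACTIVE"].copy()
      active["balance_k"] = active["balance"] / 1000
      return (
          active.groupby("region", as_index=False)["balance_k"]
                .mean()
                .rename(columns={"balance_k": "avg_balance_k"})
      )


  def test_summarize_active_loans():
      loans = pd.DataFrame(
          {
              "region": ["EAST", "EAST", "WEST"],
              "status": ["ACTIVE", "ACTIVE", "CLOSED"],
              "balance": [1000.0, 3000.0, 5000.0],
          }
      )
      expected = pd.DataFrame({"region": ["EAST"], "avg_balance_k": [2.0]})
      assert_frame_equal(summarize_active_loans(loans), expected)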