Datalake Etl Pipeline

A simplified ETL process in Hadoop using Apache Spark. It provides a complete ETL pipeline for a data lake, including SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

Project README

Datalake ETL Pipeline

Data transformation simplified for any Data platform.

Features: The package provides a complete ETL process:

Uses metadata, transformation, and data model information to design the ETL pipeline
Builds the target transformation as SparkSQL and Spark DataFrames
Builds source & target Hive DDLs
Validates DataFrames, extends core classes, defines DataFrame transformations, and provides UDF SQL functions.
Supports the following fundamental transformations for the ETL pipeline:
- Filters on source & target dataframes
- Grouping and Aggregations on source & target dataframes
- Heavily nested queries / dataframes
Has a parser for complex, heavily nested XML, JSON, Parquet & ORC data, to the nth level of nesting
Has unit test cases designed at the function/method level and measures source code coverage
Has information about deploying to higher environments
Has API documentation for customization & enhancement
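
To make the metadata-driven idea in the feature list more concrete, here is a minimal sketch in plain Python of how table metadata might be turned into a Hive DDL statement and a SparkSQL transformation with a filter and an aggregation. It is not the project's actual API: `ColumnSpec`, `TableSpec`, `build_hive_ddl`, `build_transform_sql`, and the table and column names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ColumnSpec:
    """Hypothetical column metadata: name, Hive type, optional SQL expression."""
    name: str
    hive_type: str
    expression: Optional[str] = None  # e.g. "SUM(amount)"; None means pass-through


@dataclass
class TableSpec:
    """Hypothetical table metadata used to drive DDL and SQL generation."""
    name: str
    columns: List[ColumnSpec]
    stored_as: str = "PARQUET"


def build_hive_ddl(table: TableSpec) -> str:
    """Render a CREATE TABLE statement from the table metadata."""
    cols = ",\n  ".join(f"`{c.name}` {c.hive_type}" for c in table.columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table.name} (\n  {cols}\n)\n"
        f"STORED AS {table.stored_as}"
    )


def build_transform_sql(
    source: TableSpec,
    target: TableSpec,
    filters: Sequence[str] = (),
    group_by: Sequence[str] = (),
) -> str:
    """Render a SELECT that maps source rows onto the target columns."""
    select_items = [f"{c.expression or c.name} AS {c.name}" for c in target.columns]
    sql = "SELECT " + ", ".join(select_items) + f" FROM {source.name}"
    if filters:
        sql += " WHERE " + " AND ".join(f"({f})" for f in filters)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


# Illustrative metadata (assumed, not taken from the project).
source = TableSpec(
    "raw.orders",
    [
        ColumnSpec("customer_id", "BIGINT"),
        ColumnSpec("amount", "DOUBLE"),
        ColumnSpec("status", "STRING"),
    ],
)
target = TableSpec(
    "mart.customer_totals",
    [
        ColumnSpec("customer_id", "BIGINT"),
        ColumnSpec("total_amount", "DOUBLE", "SUM(amount)"),
    ],
)

ddl = build_hive_ddl(target)
sql = build_transform_sql(
    source, target, filters=["status = 'COMPLETE'"], group_by=["customer_id"]
)
print(ddl)
print(sql)
```

In a Spark job, strings like these could then be passed to `spark.sql(...)` on an existing `SparkSession`. Keeping generation separate from execution means the generated statements can be unit tested without a cluster.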

Enhancements (in progress):

Integrate audit and logging: define error codes, log process failures, and audit progress and runtime information

Open Source Agenda is not affiliated with "Datalake Etl Pipeline" Project. README Source: vim89/datapipelines-essentials-python

Last Commit

11 months ago

Repository

vim89/datapipelines-essentials-python

License

Apache-2.0
