End-to-end data engineering project to get insights from PyPI using Python and DuckDB
This project is a collection of pipelines to get insights into your Python project's PyPI downloads. It also serves an educational purpose (YouTube videos and blog posts), showing how to build data pipelines with Python, SQL & DuckDB.
The project is composed of a series of 3 parts:
The project requires:
There's also a devcontainer for VSCode.
Finally, a Makefile is available to run common tasks.
A `.env` file is required to run the project. You can copy the `.env.example` file and fill in the required values:
```
TABLE_NAME=pypi_file_downloads # output table name
S3_PATH=s3://my-s3-bucket # output S3 path
AWS_PROFILE=default # AWS profile to use
GCP_PROJECT=my-gcp-project # GCP project to use
START_DATE=2023-04-01 # start date of the data to ingest
END_DATE=2023-04-03 # end date of the data to ingest
PYPI_PROJECT=duckdb # PyPI project to ingest
GOOGLE_APPLICATION_CREDENTIALS=/path/to/my/creds # path to GCP credentials
motherduck_token=123123 # MotherDuck token
TIMESTAMP_COLUMN=timestamp # timestamp column name, used for partitioning on S3
DESTINATION=local,s3,md # destinations to push data to, can be one or more
TRANSFORM_S3_PATH_INPUT=s3://my-input-bucket/pypi_file_downloads/*/*/*.parquet # for the transform pipeline, input source data
TRANSFORM_S3_PATH_OUTPUT=s3://my-output-bucket/ # for the transform pipeline, output path when pushing data to S3
```
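These values are read by the pipelines at runtime. As a minimal, illustrative sketch (the actual loading code in /ingestion may differ), they could be loaded with python-dotenv and gathered into a small config object:

```python
# Illustrative only: the project's own code may load its settings differently.
import os
from dataclasses import dataclass

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # populate the process environment from the .env file


@dataclass
class IngestionConfig:
    table_name: str = os.getenv("TABLE_NAME", "pypi_file_downloads")
    s3_path: str = os.getenv("S3_PATH", "")
    pypi_project: str = os.getenv("PYPI_PROJECT", "duckdb")
    start_date: str = os.getenv("START_DATE", "2023-04-01")
    end_date: str = os.getenv("END_DATE", "2023-04-03")
    timestamp_column: str = os.getenv("TIMESTAMP_COLUMN", "timestamp")
    # DESTINATION is a comma-separated list, e.g. "local,s3,md"
    destinations: tuple = tuple(os.getenv("DESTINATION", "local").split(","))


config = IngestionConfig()
print(config.pypi_project, config.destinations)
```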
You will also need AWS credentials (stored in the default ~/.aws/credentials path) with write access to the bucket.

Once you have filled in your `.env` file, do the following:
- `make install`: install the dependencies
- `make pypi-ingest`: run the ingestion pipeline (a sketch of what this step does follows the list)
- `make pypi-ingest-test`: run the unit tests located in /ingestion/tests
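For context, `make pypi-ingest` boils down to pulling download events for the configured package and date range from the public PyPI dataset on BigQuery, then writing them out with DuckDB. The real implementation lives in /ingestion; the sketch below is only illustrative, and the selected columns, local output path, and partitioning scheme are assumptions:

```python
# Illustrative sketch, not the project's actual ingestion code.
import datetime
import os

import duckdb
from google.cloud import bigquery  # authenticates via GOOGLE_APPLICATION_CREDENTIALS


def fetch_pypi_downloads(pypi_project: str, start: str, end: str):
    """Query the public PyPI file_downloads table on BigQuery for one package."""
    client = bigquery.Client(project=os.environ["GCP_PROJECT"])
    query = """
        SELECT timestamp, country_code, url, project,
               file.version AS file_version, details.python AS python_version
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project = @pypi_project
          AND DATE(timestamp) BETWEEN @start_date AND @end_date
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("pypi_project", "STRING", pypi_project),
            bigquery.ScalarQueryParameter("start_date", "DATE", datetime.date.fromisoformat(start)),
            bigquery.ScalarQueryParameter("end_date", "DATE", datetime.date.fromisoformat(end)),
        ]
    )
    return client.query(query, job_config=job_config).to_dataframe()


df = fetch_pypi_downloads(
    os.environ["PYPI_PROJECT"], os.environ["START_DATE"], os.environ["END_DATE"]
)

# Write the result as Parquet with DuckDB, hive-partitioned on the timestamp's date.
# The real pipeline pushes to local, S3 and/or MotherDuck depending on DESTINATION.
con = duckdb.connect()
con.register("downloads", df)
timestamp_col = os.environ.get("TIMESTAMP_COLUMN", "timestamp")
table_name = os.environ.get("TABLE_NAME", "pypi_file_downloads")
con.sql(
    f"""
    COPY (SELECT *, CAST({timestamp_col} AS DATE) AS download_date FROM downloads)
    TO '{table_name}'
    (FORMAT PARQUET, PARTITION_BY (download_date))
    """
)
```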
You can choose to push the data of the transform pipeline either to AWS S3 or to MotherDuck. Both pipelines rely on source data stored on AWS S3 (see the Ingestion section for more details). You can use a public sample dataset for this part of the tutorial, located at s3://us-prd-motherduck-open-datasets/pypi/sample_tutorial/pypi_file_downloads/*/*/*.parquet.
For AWS S3, you would need:
- AWS credentials (stored in the default ~/.aws/credentials path) with read access to the source bucket and write access to the destination bucket
For MotherDuck, you would need:
- A MotherDuck account and its token (the motherduck_token value below)

Fill your `.env` file with the following variables. Note that you can use the TRANSFORM_S3_PATH_INPUT value below for the tutorial; it points to a public bucket containing some sample data:
```
motherduck_token=123123
TRANSFORM_S3_PATH_INPUT=s3://us-prd-motherduck-open-datasets/pypi/sample_tutorial/pypi_file_downloads/*/*/*.parquet
TRANSFORM_S3_PATH_OUTPUT=s3://my-output-bucket/
```
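Before running the transform, you can sanity-check this sample data straight from DuckDB. A minimal sketch, assuming a recent DuckDB (the CREATE SECRET syntax needs 0.10+) and AWS credentials discoverable through the standard credential chain (environment variables or ~/.aws/credentials):

```python
import duckdb

con = duckdb.connect()
for stmt in ("INSTALL httpfs", "LOAD httpfs", "INSTALL aws", "LOAD aws"):
    con.sql(stmt)
# Let DuckDB pick up AWS credentials from the environment / ~/.aws/credentials.
con.sql("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)")

sample = (
    "s3://us-prd-motherduck-open-datasets/pypi/sample_tutorial/"
    "pypi_file_downloads/*/*/*.parquet"
)
con.sql(f"SELECT count(*) AS n_rows FROM read_parquet('{sample}')").show()
con.sql(f"DESCRIBE SELECT * FROM read_parquet('{sample}')").show()
```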
You can then run the following commands:
- `make install`: install the dependencies
- `make pypi-transform START_DATE=2023-04-05 END_DATE=2023-04-07 DBT_TARGET=dev`: example of a run reading from AWS S3 and writing to AWS S3
- `make pypi-transform START_DATE=2023-04-05 END_DATE=2023-04-07 DBT_TARGET=prod`: example of a run reading from AWS S3 and writing to MotherDuck (see the sketch below for a quick way to check the result)
- `make pypi-transform-test`: run the unit tests located in /transform/pypi_metrics/tests
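After a prod run, you can check what landed in MotherDuck directly from Python. A minimal sketch, assuming the motherduck_token environment variable from your `.env` is set; the database and table names depend on your dbt configuration:

```python
import duckdb

# Connecting to "md:" authenticates with the motherduck_token environment variable.
con = duckdb.connect("md:")
con.sql("SHOW ALL TABLES").show()
# Then inspect whichever table your dbt models created, e.g. (hypothetical name):
# con.sql("SELECT * FROM pypi_daily_downloads LIMIT 10").show()
```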