Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
This project is a case study for a start-up that recommends movies to users as a service and investigates factors contributing to the success of movies.
The aim of this project is to perform Extract, Transform, Load (ETL) on movies data, to answer questions the business may have about its users, such as:
The movies data and metadata come from MovieLens, extracted from a Kaggle dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset. The data contains 26 million ratings from over 270,000 users on a collection of over 45,000 movies.
In addition, the Consumer Price Index of Admission to Movies, Theaters, and Concerts in U.S. City Average is extracted from https://fred.stlouisfed.org/series/CUSR0000SS62031. This helps us normalize box office earnings against inflation over the years.
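As a rough illustration of how the CPI series can be used for this adjustment, here is a minimal sketch; the function name, base period and figures are illustrative only, not part of the project code:

```python
# Sketch: scale a nominal box-office figure to base-period dollars using CPI.
def adjust_for_inflation(nominal_revenue, cpi_at_release, base_cpi):
    """Convert a nominal dollar amount into base-period (e.g. latest-month) dollars."""
    return nominal_revenue * (base_cpi / cpi_at_release)

# Example: a movie earned $100M when the CPI was 180; the base-period CPI is 270.
adjust_for_inflation(100_000_000, cpi_at_release=180.0, base_cpi=270.0)  # -> 150,000,000.0
```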
The technical architecture for this project is as shown below:
Install the kaggle package by running
pip install kaggle
To download the MovieLens dataset, run
kaggle datasets download -d "rounakbanik/the-movies-dataset"
For the St. Louis Fed's FRED Consumer Price Index dataset, run (quote the URL so the shell does not interpret the & characters)
wget "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CUSR0000SS62031&scale=left&cosd=1999-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-05-31&revision_date=2020-05-31&nd=1999-01-01"
Next, copy the downloaded files from the EC2 instance to S3. Make sure the instance has the AWS CLI installed. Run aws configure
and then aws s3 cp {FILE} s3://{S3_BUCKET}/{S3_FOLDER}/
to transfer the files to S3. Note that if the situation changes such that this becomes a daily job, we can write a shell script containing these commands and add a task that runs this script to our Airflow data pipeline; sketches of both ideas are shown below.
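For illustration, the same upload could also be done from Python instead of the AWS CLI. This is only a sketch: boto3 is not a declared dependency of this project, and the file names are placeholders.

```python
# Sketch: programmatic equivalent of `aws s3 cp {FILE} s3://{S3_BUCKET}/{S3_FOLDER}/`.
# Assumes AWS credentials are already configured (e.g. via `aws configure`).
import boto3

s3 = boto3.client("s3")
for filename in ["the-movies-dataset.zip", "fredgraph.csv"]:  # placeholder file names
    s3.upload_file(filename, "{S3_BUCKET}", "{S3_FOLDER}/" + filename)
```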
Run the ETL pipeline, scheduled using Airflow. Data processing is done using Spark, and the data is eventually ingested into Redshift.
For data processing (the data transformation step), Spark is chosen because of its parallel processing capabilities. Should the amount of data grow 100x, more worker nodes can be added to the Spark cluster to scale out.
For orchestrating the steps in the pipeline, Airflow is chosen as it allows data pipelines to be built in a straightforward and modular way. Airflow lets tasks be defined in a Directed Acyclic Graph (DAG) with dependencies between tasks, which allows task execution to be optimized. It also enables the pipeline to be run on a schedule (e.g., daily) should the need arise. Finally, it has an intuitive UI that lets users inspect the steps in the data pipeline should any part of it fail. A minimal DAG sketch is shown below.
Redshift is chosen as the cloud Data Warehouse as it is highly scalable. Should our data grow in size, we can provision more nodes to scale out, or larger nodes to scale up, to handle the larger volume of data.
Docker is used to encapsulate the code's package dependencies, allowing the code to run on any machine.
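As mentioned above, tasks are defined in a DAG with dependencies between them, and the S3 upload script can be wired in as one of those tasks. Below is a minimal sketch using Airflow 1.10-style imports; the DAG id, task ids, schedule, and script path are illustrative and not this project's actual definitions:

```python
# Sketch: a DAG skeleton with explicit task dependencies.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("movies_etl_pipeline",
          start_date=datetime(2020, 1, 1),
          schedule_interval="@daily",
          catchup=False)

upload_to_s3 = BashOperator(
    task_id="upload_raw_files_to_s3",
    # Trailing space stops Airflow from treating the ".sh" path as a Jinja template file.
    bash_command="bash /path/to/upload_to_s3.sh ",
    dag=dag,
)
transform = DummyOperator(task_id="transform_with_spark", dag=dag)  # stand-in for the Spark job
load = DummyOperator(task_id="load_to_redshift", dag=dag)           # stand-in for the Redshift load

# Dependencies: upload before transform, transform before load.
upload_to_s3 >> transform >> load
```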
The approach taken is to normalize the data. This leads to more efficient UPDATEs and DELETEs as and when required.
The ETL process runs through an Airflow DAG:
The process is as follows:
Setting up
Build or pull the Docker image
# Build the docker image
docker build -t {IMAGE_NAME} .
Alternatively, pull the docker image I have pushed to Dockerhub:
# Pull the docker image
docker pull alanchn31/alanchn31-capstone-udacity-de-nd:1
Replace the webserver image in docker-compose-LocalExecutor.yml with the name of the Docker image you have built or pulled.
webserver:
image: alanchn31/alanchn31-capstone-udacity-de-nd:1
Run docker-compose -f docker-compose-LocalExecutor.yml up -d. Your Airflow server should now be up and running. Visit http://{your ec2 ip address}:8080 to view the Airflow UI.
Add necessary connections and variables in Airflow UI
There are 4 variables to be defined:
movie_s3_config
It is defined in JSON format as follows:
{
    "aws_key": "{AWS_KEY}",
    "aws_secret_key": "{AWS_SECRET_KEY}",
    "s3_bucket": "{AWS_S3_BUCKET}",
    "s3_key": "{AWS_S3_KEY}"
}
db_user
(username of the user with access to the Redshift database)
db_pass
(password of the user with access to the Redshift database)
redshift_conn_string
(Redshift JDBC connection string used when writing the Spark dataframe to Redshift)
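For reference, here is a sketch of how these variables might be read inside the DAG code using Airflow's Variable API; the JDBC URL format and the commented-out write call are assumptions that depend on the JDBC driver and table names used:

```python
# Sketch: read the Airflow Variables defined above from within DAG/task code.
from airflow.models import Variable

# movie_s3_config is stored as JSON, so deserialize it into a dict.
s3_config = Variable.get("movie_s3_config", deserialize_json=True)
s3_bucket = s3_config["s3_bucket"]
s3_key = s3_config["s3_key"]

db_user = Variable.get("db_user")
db_pass = Variable.get("db_pass")
# e.g. "jdbc:redshift://{cluster-endpoint}:5439/{database}", depending on the JDBC driver.
redshift_conn_string = Variable.get("redshift_conn_string")

# The connection string is then handed to Spark's JDBC writer, roughly:
# df.write.jdbc(url=redshift_conn_string, table="movies", mode="append",
#               properties={"user": db_user, "password": db_pass})
```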
In addition, define the Connection used to hook into Redshift (a usage sketch follows this list):
Conn Id: `redshift`.
Conn Type: `Postgres`.
Host: Enter the endpoint of your Redshift cluster, excluding the port at the end.
Schema: This is the Redshift database you want to connect to.
Login: Enter Redshift user
Password: Enter Redshift password
Port: Enter `5439`.
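Once defined, the `redshift` connection can be used from task code via a Postgres hook, for example for a simple data quality check. A minimal sketch with an Airflow 1.10-style import; the table name is illustrative:

```python
# Sketch: use the `redshift` connection via PostgresHook for a row-count check.
from airflow.hooks.postgres_hook import PostgresHook

def check_table_not_empty(table="movies"):  # table name is illustrative
    hook = PostgresHook(postgres_conn_id="redshift")
    records = hook.get_records("SELECT COUNT(*) FROM {}".format(table))
    if not records or records[0][0] == 0:
        raise ValueError("Data quality check failed: {} is empty".format(table))
```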
After configuring, visit the Airflow UI and enable the DAG to start the data pipeline.
Many thanks to: