Data Engineering Projects

Personal Data Engineering Projects

Description

This repo contains projects that apply data engineering principles.
Notes taken during the course can be found in the folder 0. Back to Basics.

Projects

Postgres ETL :heavy_check_mark:

This project models data for a fictitious music startup, Sparkify. It applies a star schema when ingesting data, which simplifies the queries that answer the product owner's business questions.
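As a rough illustration of why a star schema keeps such queries simple, the sketch below builds one fact table and two dimension tables. It uses Python's built-in SQLite in place of Postgres so it runs without a server, and the table names, column names, and sample rows are assumptions made up for this example, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; table and column names are
# illustrative assumptions, not the project's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, level TEXT);
    CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT);
    -- Fact table: one row per song play, with foreign keys to the dimensions.
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        song_id TEXT REFERENCES songs(song_id),
        start_time TEXT
    );
""")

# Made-up sample rows, only there to make the query return something.
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Ada", "paid"), (2, "Lin", "free")])
conn.executemany("INSERT INTO songs VALUES (?, ?, ?)",
                 [("S1", "Blue Sky", "The Owls"), ("S2", "Night Train", "Rails")])
conn.executemany(
    "INSERT INTO songplays (user_id, song_id, start_time) VALUES (?, ?, ?)",
    [(1, "S1", "2018-11-01T10:00:00"),
     (2, "S1", "2018-11-01T11:00:00"),
     (1, "S2", "2018-11-02T09:30:00")])

def plays_per_song(connection):
    """Return (title, play_count) rows, most played first."""
    return connection.execute("""
        SELECT s.title, COUNT(*) AS plays
        FROM songplays sp
        JOIN songs s ON s.song_id = sp.song_id
        GROUP BY s.title
        ORDER BY plays DESC, s.title
    """).fetchall()

top_songs = plays_per_song(conn)
```

A business question such as "which songs are played most?" then needs only a single join from the fact table to one dimension.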

Cassandra ETL :heavy_check_mark:

In the realm of big data, Cassandra helps ingest large amounts of data in a NoSQL context. This project takes a query-centric approach: data is ingested into Cassandra tables designed to answer specific business questions about a music app.
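The query-centric idea can be sketched without a running cluster. The example below is plain Python, not Cassandra driver code: each dictionary plays the role of a table built for exactly one query, keyed by what would be the partition key, with rows sorted by what would be the clustering column. The event fields and the two questions are hypothetical.

```python
from collections import defaultdict

# Made-up listening events; field names are assumptions for illustration.
events = [
    {"session_id": 1, "item_in_session": 1, "user_id": 10, "song": "Night Train"},
    {"session_id": 1, "item_in_session": 0, "user_id": 10, "song": "Blue Sky"},
    {"session_id": 2, "item_in_session": 0, "user_id": 11, "song": "Blue Sky"},
]

# Query 1: "which songs were played in a given session, in order?"
# Partition key: session_id; clustering column: item_in_session.
songs_by_session = defaultdict(list)

# Query 2: "which users listened to a given song?"
# Partition key: song; clustering column: user_id.
users_by_song = defaultdict(list)

# The same event is written once per query table (deliberate duplication).
for event in events:
    songs_by_session[event["session_id"]].append(
        (event["item_in_session"], event["song"]))
    users_by_song[event["song"]].append(event["user_id"])

# Rows inside a partition are kept ordered by the clustering column.
for rows in songs_by_session.values():
    rows.sort()
for rows in users_by_song.values():
    rows.sort()

session_1 = [song for _, song in songs_by_session[1]]
blue_sky_listeners = users_by_song["Blue Sky"]
```

Each question is answered by a single key lookup, at the cost of storing the same data more than once.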

Web Scraping using Scrapy, MongoDB ETL :heavy_check_mark:

One way to store semi-structured data is as documents. MongoDB makes this possible: a collection holds related documents, and each document contains fields that can be queried.
In this project, data is scraped from a book listing website using Scrapy. The fields of each book, such as its price, rating, and availability, are stored in a document in the books collection in MongoDB.
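To make the document shape concrete, here is a minimal sketch of turning scraped strings into a typed document. The field names, the `to_document` helper, and the sample values are all hypothetical, and the filter is evaluated in plain Python so the example runs without a MongoDB server; the comments show roughly what the equivalent pymongo calls would look like.

```python
# Field names and values below are made up for illustration and may differ
# from what the project's spider actually extracts.
def to_document(raw):
    """Convert scraped strings into typed fields suitable for querying."""
    return {
        "title": raw["title"].strip(),
        "price": float(raw["price"].lstrip("$")),
        "rating": int(raw["rating"]),
        "available": raw["availability"].strip().lower().startswith("in stock"),
    }

raw_items = [
    {"title": " Book A ", "price": "$12.50", "rating": "4",
     "availability": "In stock"},
    {"title": "Book B", "price": "$30.00", "rating": "2",
     "availability": "Out of stock"},
]
books = [to_document(item) for item in raw_items]

# With pymongo this would be roughly:
#   collection.insert_many(books)
#   collection.find({"available": True, "price": {"$lt": 20}})
# Here the same filter is applied to the in-memory list instead.
cheap_available = [b["title"] for b in books
                   if b["available"] and b["price"] < 20]
```

Converting prices and ratings to numbers before storing them is what makes range queries on those fields possible later.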

Data Warehousing with AWS Redshift :heavy_check_mark:

This project creates a data warehouse in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer business questions based on requirements.

Data Lake with Spark & AWS S3 :heavy_check_mark:

This project creates a data lake in AWS S3 using Spark.
Why create a data lake? A data lake provides a reliable store for large amounts of data, whether unstructured, semi-structured, or structured. In this project, we ingest JSON files, denormalize them into fact and dimension tables, and upload them to an AWS S3 data lake as Parquet files.

Data Pipelining with Airflow :heavy_check_mark:

This project schedules data pipelines that perform ETL from JSON files in S3 to Redshift using Airflow.
Why use Airflow? Airflow allows workflows to be defined as code, which makes them more maintainable, versionable, testable, and collaborative.
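The "workflows as code" idea can be sketched with the standard library alone. This is not Airflow's API: it only shows a pipeline declared as a dependency graph in Python and executed in an order that respects those dependencies, which is the core of what a DAG definition expresses. The task names and the `run_pipeline` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Task names are hypothetical; each maps to the set of tasks it depends on.
dependencies = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact_table": {"stage_events", "stage_songs"},
    "load_dimension_tables": {"load_fact_table"},
    "run_quality_checks": {"load_dimension_tables"},
}

def run_pipeline(deps, run_task):
    """Run every task after all of its upstream tasks; return the order used."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        run_task(name)
    return order

# Recording the task names stands in for doing real work.
executed = []
order = run_pipeline(dependencies, executed.append)
```

Because the pipeline is ordinary code, it can be reviewed, version-controlled, and unit-tested like any other module.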

Capstone Project :heavy_check_mark:

This project is the finale of Udacity's Data Engineering Nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project.
My project builds a movies data warehouse, which can be used to build a movie recommendation system and to predict box-office earnings. View the project here: Movies Data Warehouse

Open Source Agenda is not affiliated with "Data Engineering Projects" Project. README Source: alanchn31/Data-Engineering-Projects
