New Generation Opensource Data Stack Demo
This repository contains a stock market analysis demo of the ngods data stack. The demo downloads stock market data, transforms it through the bronze, silver, and gold medallion stages, publishes it through a semantic model and self-service dashboards, and runs a machine-learning prediction on top of it.
The demo is packaged as docker-compose script that downloads, installs, and runs all components of the data stack.
ngods stands for New Generation Opensource Data Stack. It includes the following components: Apache Spark, Trino, Postgres, Apache Iceberg, Minio, Dagster, dbt, DataHub, cube.dev, Metabase, and Jupyter Notebooks.
ngods is open-sourced under a BSD license and is distributed as a docker-compose script that supports Intel and ARM architectures. It requires a machine with at least 16 GB of RAM and an Intel or ARM64 CPU running Docker with docker-compose.
Clone the repository and start the data stack with the docker-compose up command:

```bash
git clone https://github.com/zsvoboda/ngods-stocks.git
cd ngods-stocks
docker-compose up -d
```
NOTE: This can take quite a long time depending on your network speed.
Stop the data stack with the docker-compose down command:

```bash
docker-compose down
```
Cut and paste the content of the e2e.yaml file into the Dagster UI console at http://localhost:3070/ and start the data pipeline by clicking the Launch Run button.
NOTE: You can customize the list of stock symbols that will be downloaded.
The semantic model is defined in cube.dev. See the cube.dev documentation for more information.
Log in to Metabase with username [email protected] and password metabase1. You can create your own data visualizations and dashboards; see the Metabase documentation for more information.
The demo also includes a machine-learning example that analyzes the Apple:AAPL stock data and predicts the next month.

Download the DBeaver SQL tool and connect to the Postgres database that contains the gold stage data. Use the jdbc:postgresql://localhost:5432/ngods JDBC URL with username ngods and password ngods.
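If you prefer a programmatic connection, here is a minimal sketch that queries the same database from Python, assuming the psycopg2 package is installed; the gold.stock_prices table is a hypothetical placeholder, not the demo's actual schema:

```python
# Minimal sketch: query the gold stage in Postgres.
# Connection details come from the JDBC URL above.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="ngods",
    user="ngods",
    password="ngods",
)
with conn.cursor() as cur:
    # Hypothetical table name; substitute a table from your gold schema.
    cur.execute("SELECT * FROM gold.stock_prices LIMIT 10")
    for row in cur.fetchall():
        print(row)
```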
You can also connect to Trino, which exposes the bronze, silver, and gold schemas of the warehouse database. Use the jdbc:trino://localhost:8060 JDBC URL with username trino and password trino. The Spark engine is available at the jdbc:hive2://localhost:10009 JDBC URL with no username or password.

This chapter contains useful information for customizing the demo.
Here are a few of the distribution's directories that you may need to customize:
- conf: configuration of all data stack components
- cube: cube.dev schema (semantic model definition)
- data: main data directory
  - minio: root data directory (contains buckets and file data)
  - spark: Jupyter notebooks
  - stage: file stage data; Spark can access this directory via the /var/lib/ngods/stage path
- projects: dbt, Dagster, and DataHub projects
  - dagster: Dagster orchestration project
  - dbt: dbt transformations (one project per medallion stage: bronze, silver, and gold)

The data stack has the following endpoints:
- Spark: jdbc:hive2://localhost:10009 JDBC URL (no username / password)
- Trino: jdbc:trino://localhost:8060 JDBC URL (username trino / no password)
- Postgres: jdbc:postgresql://localhost:5432/ngods JDBC URL (username ngods / password ngods)
- cube.dev SQL API: jdbc:postgresql://localhost:3245/cube JDBC URL (username cube / password cube)
- Metabase: username [email protected] / password metabase1
- Minio: username minio / password minio123

The ngods stack includes three database engines: Spark, Trino, and Postgres. Both Spark and Trino have access to the Iceberg tables in the warehouse.bronze and warehouse.silver schemas. The Trino engine can also access the analytics.gold schema in Postgres, and it can federate queries between the Postgres and Iceberg tables.
The Spark engine is configured for ELT and PySpark data transformations.
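Here is a minimal PySpark sketch of a bronze-to-silver transformation, assuming it runs inside the stack's Spark environment where the warehouse Iceberg catalog is configured; the table and column names are hypothetical placeholders:

```python
# Minimal sketch: promote raw bronze rows into a cleaned silver table.
# Assumes the `warehouse` Iceberg catalog is configured in this Spark session.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ngods-elt-sketch").getOrCreate()

# Hypothetical table names; substitute the demo's actual tables.
bronze = spark.table("warehouse.bronze.stock_prices")
silver = (
    bronze
    .withColumn("close", F.col("close").cast("double"))
    .dropna(subset=["symbol", "close"])
)
silver.writeTo("warehouse.silver.stock_prices").createOrReplace()
```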
The Trino engine is configured for data federation between the Iceberg and Postgres tables. Additional catalogs can be configured as needed.
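For example, a single federated query can join an Iceberg table with a Postgres table. The sketch below uses the trino Python client (pip install trino); the table and column names are hypothetical placeholders:

```python
# Minimal sketch: a federated Trino query joining Iceberg and Postgres data.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8060,   # Trino endpoint from the list above
    user="trino",
)
cur = conn.cursor()
# warehouse.* lives in Iceberg, analytics.* in Postgres; the table and
# column names below are hypothetical placeholders.
cur.execute("""
    SELECT s.symbol, s.close, g.monthly_avg
    FROM warehouse.silver.stock_prices AS s
    JOIN analytics.gold.monthly_summary AS g
      ON s.symbol = g.symbol
""")
for row in cur.fetchall():
    print(row)
```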
The Postgres database has access only to the analytics.gold schema and is used for executing analytical queries over the gold data.
The demo data pipeline uses the medallion architecture with bronze, silver, and gold data stages: raw data lands in the bronze stage, is cleaned and transformed into the silver stage, and is aggregated into the gold stage for analytics.
All data pipeline phases are orchestrated by the Dagster framework. Dagster operations, resources, and jobs are defined in the Dagster project. The pipeline is executed by running the e2e job from the Dagster console at http://localhost:3070/ using the e2e.yaml config file.
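For orientation, here is a minimal sketch of how a Dagster job of this shape is wired together, assuming the dagster package; the op names and bodies are hypothetical placeholders, not the demo's actual code:

```python
# Minimal Dagster sketch: two ops chained into a job, mirroring the
# download-then-transform shape of the demo pipeline.
from dagster import job, op

@op
def download_stock_data() -> str:
    # Placeholder: fetch raw stock data into the bronze stage.
    return "bronze loaded"

@op
def run_dbt_transformations(bronze_status: str) -> str:
    # Placeholder: run the dbt projects that build silver and gold.
    return f"transformed after: {bronze_status}"

@job
def e2e():
    run_dbt_transformations(download_stock_data())
```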
ngods includes cube.dev for the semantic data model and Metabase for self-service analytics (dashboards, reports, and visualizations). The analytical (semantic) model is defined in cube.dev and is used for executing analytical queries over the gold data. Metabase is connected to cube.dev via its SQL API, so end users can create dashboards, reports, and data visualizations on their own. Metabase is also directly connected to the gold schema in the Postgres database.
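Because the cube.dev SQL API speaks the Postgres wire protocol (see the endpoint list above), any Postgres client can query it. Here is a minimal sketch, assuming psycopg2 and a hypothetical stocks cube:

```python
# Minimal sketch: query the cube.dev SQL API as if it were Postgres.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=3245,
    dbname="cube", user="cube", password="cube",
)
with conn.cursor() as cur:
    # The `stocks` cube and its columns are hypothetical placeholders.
    cur.execute("SELECT symbol, avg_close FROM stocks LIMIT 10")
    print(cur.fetchall())
```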
Jupyter Notebooks with Scala, Java, and Python backends can be used for machine learning.
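As a sketch of what such a notebook might do, the following fits a naive trend model on gold-stage data, assuming pandas, SQLAlchemy, and scikit-learn are available; the table and column names are hypothetical placeholders:

```python
# Minimal sketch: fit a naive linear trend on AAPL closing prices
# loaded from the gold stage and extrapolate one month ahead.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical table and column names; connection details come from
# the Postgres endpoint listed above.
df = pd.read_sql(
    "SELECT trade_date, close FROM gold.stock_prices "
    "WHERE symbol = 'AAPL' ORDER BY trade_date",
    "postgresql://ngods:ngods@localhost:5432/ngods",
)

X = df.index.values.reshape(-1, 1)
model = LinearRegression().fit(X, df["close"])

# Roughly 21 trading days ahead.
print(model.predict([[len(df) + 21]]))
```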
Create a GitHub issue if you have any questions.