📙 Awesome Data Catalogs and Observability Platforms.

Awesome Data Discovery and Observability

This repository contains a curated list of awesome data catalogs and observability platforms that help you discover, manage, and observe data in your organization.

Contents: Existing Data Discovery and Observability Solutions

OSS Data Catalogs	Proprietary Monocloud DCs	Proprietary Observability Tools	Other Proprietary DCs
📙 Amundsen	📒 Google DC	🔍 Monte Carlo	📕 Alation
📙 DataHub	📒 Azure DC	🔍 Databand	📕 Atlan
📙 Marquez		🔍 Datafold	📕 Collibra
📙 Atlas		🔍 Ataccama	📕 DataGalaxy
📙 CKAN			📕 Informatica
📙 Magda			📕 Stemma
📙 OpenDataDiscovery			📕 Talend
📙 OpenMetadata			📕 Select Star
📙 Meta#Grid
📙 Grai

High-Level Feature Comparison

Tool	Specification-based	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
Alation	❌	✔️	❌	✔️	❌	❌	✔️	❌	❌	❌	❌
Amundsen	❌	✔️	✔️	✔️	❌	❌	❌	❌	❌	❌	❌
Ataccama	❌	✔️	❌	✔️	❌	❌	✔️	❌	❌	❌	❌
Atlan	❌	✔️	❌	✔️	❌	❌	✔️	❌	❌	✔️	✔️
Atlas	❌	✔️	❌	✔️	❌	❌	❌	❌	❌	❌	❌
Azure DC	❌	✔️	?	✔️	❌	❌	?	❌	❌	❌	❌
CKAN	❌	✔️	❌	❌	✔️	❌	❌	❌	❌	❌	❌
Collibra	❌	✔️	?	✔️	❌	❌	?	❌	❌	❌	❌
DataGalaxy	❌	✔️	✔️	✔️	❌	❌	❌	✔️	✔️	?	?
Databand	❌	?	?	?	❌	?	?	?	✔️	❌	❌
Datafold	❌	✔️	✔️	✔️	❌	❌	✔️	❌	✔️	❌	❌
DataHub	✔️ details	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌	✔️	❌
Google DC	❌	✔️	❌	✔️	❌	❌	?	❌	❌	❌	❌
Informatica	❌	✔️	✔️	✔️	❌	❌	✔️	❌	❌	?	❌
Magda	❌	✔️	❌	❌	✔️	❌	❌	❌	❌	❌	❌
Marquez	OpenLineage	✔️	❌	✔️	?	❌	❌	❌	❌	✔️	❌
Monte Carlo	❌	✔️	❌	✔️	❌	❌	✔️	❌	✔️	❌	❌
Select Star	❌	✔️	✔️	✔️	✔️	❌	❌	✔️	❌	✔️	✔️
OpenDataDiscovery	ODD Specification	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌	✔️
OpenMetadata	JSON Schema	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌	✔️	✔️
Stemma	❌	✔️	✔️	✔️	❌	❌	?	✔️	❌	❌	❌
Talend	❌	✔️	?	✔️	❌	❌	✔️	❌	❌	❌	❌
Meta#Grid	❌	✔️	❌	✔️	❌	❌	not yet	❌	❌	❌	✔️
Grai	Grai Schemas	✔️	❌	✔️	❌	✔️	✔️	❌	❌	✔️	✔️
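
The matrix above can also be queried programmatically. The following is a minimal sketch, assuming a few rows have been copied by hand into a tab-separated string. The `parse_matrix`, `supports`, and `tools_with` helpers are hypothetical names, not part of any tool listed here. Cells such as `?` or `not yet` are treated as unconfirmed, and a named specification (for example `OpenLineage`) counts as a yes in the Specification-based column.

```python
# Column names after the tool name, in the order used by the table above.
FEATURES = [
    "Specification-based", "Search-based", "Network-based", "Lineage-based",
    "Federation", "ML 1st Citizen", "Data Quality", "End-to-end Lineage",
    "Observability", "Column-level lineage", "Data collaboration",
]

# Three rows copied from the table above (tab-separated).
ROWS = (
    "Amundsen\t❌\t✔️\t✔️\t✔️\t❌\t❌\t❌\t❌\t❌\t❌\t❌\n"
    "Marquez\tOpenLineage\t✔️\t❌\t✔️\t?\t❌\t❌\t❌\t❌\t✔️\t❌\n"
    "Meta#Grid\t❌\t✔️\t❌\t✔️\t❌\t❌\tnot yet\t❌\t❌\t❌\t✔️\n"
)


def parse_matrix(text):
    """Map each tool name to a {feature: raw cell text} dictionary."""
    matrix = {}
    for line in text.strip().splitlines():
        name, *cells = line.split("\t")
        matrix[name] = dict(zip(FEATURES, cells))
    return matrix


def supports(feature, cell):
    """Decide whether a raw cell counts as a confirmed yes."""
    # A check mark may or may not carry the emoji variation selector (U+FE0F).
    cell = cell.replace("\ufe0f", "").strip()
    if feature == "Specification-based":
        # This column may name the specification instead of showing a check mark.
        return cell not in ("", "?", "\u274c")
    return cell == "\u2714"


def tools_with(matrix, *features):
    """Return the tools that have a confirmed yes for every requested feature."""
    return sorted(
        name for name, row in matrix.items()
        if all(supports(f, row.get(f, "")) for f in features)
    )


matrix = parse_matrix(ROWS)
print(tools_with(matrix, "Lineage-based", "Column-level lineage"))  # ['Marquez']
```

Treating `?` and `not yet` as "no" keeps the filter conservative: a tool only appears in the result when the table explicitly confirms the feature.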

Definitions:

Specification-based - uses an open standard for collecting metadata, which shortens time-to-discovery and makes it possible to federate data catalogs.
Search-based - lets users search for data assets.
Network-based - provides rich context about data asset ownership.
Lineage-based - provides lineage for all entities the solution operates on.
Federation - the ability to map multiple data catalogs into a single UI to avoid repeated data collection.
ML 1st citizen - treats ML entities as first-class objects, so you can use them like any other data asset.
Data Quality - includes mature data quality assurance tools.
End-to-end lineage - data lineage that includes all data assets used in the organization across all its data catalogs and ML tools.
Column-level lineage - data lineage with column-level granularity.
Data collaboration - makes it possible to bring together data from various internal and external sources to unlock combined insights.
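
To make the difference between table-level and column-level lineage concrete, here is a minimal sketch, assuming lineage is stored as a mapping from each downstream column to the upstream columns it is derived from. The table and column names (`raw.orders`, `analytics.revenue`, and so on) and the `upstream_columns` and `upstream_tables` helpers are hypothetical and do not come from any tool listed here.

```python
# Hypothetical column-level lineage: downstream column -> direct upstream columns.
COLUMN_LINEAGE = {
    "staging.orders.amount_usd": ["raw.orders.amount", "raw.fx_rates.rate"],
    "analytics.revenue.total": ["staging.orders.amount_usd"],
    "analytics.revenue.day": ["raw.orders.created_at"],
}


def upstream_columns(column, lineage=COLUMN_LINEAGE):
    """Return every column that `column` depends on, directly or transitively."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def table_of(column):
    # "schema.table.column" -> "schema.table"
    return column.rsplit(".", 1)[0]


def upstream_tables(table, lineage=COLUMN_LINEAGE):
    """Table-level lineage, derived by collapsing the column-level graph."""
    tables = set()
    for column in lineage:
        if table_of(column) == table:
            tables |= {table_of(c) for c in upstream_columns(column, lineage)}
    tables.discard(table)
    return tables


print(sorted(upstream_columns("analytics.revenue.total")))
print(sorted(upstream_tables("analytics.revenue")))
```

At table level, `analytics.revenue` depends on `raw.fx_rates`. Column-level lineage narrows that dependency to the `total` column only, because `day` is derived from `raw.orders.created_at` alone.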

📙 Open-Source Data Catalogs

Amundsen

Website | GitHub

A popular open-source data catalog for metadata management and data discovery that originated at Lyft. Stemma, created by the Amundsen maintainers, provides a managed enterprise data catalog inspired by Amundsen.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	✔️	✔️	❌	❌	❌	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: Yes
Schemas, Description: Yes
Complex schemas: No
Data preview: Yes
Column statistics: Yes
Data owner: Yes
Top data users: Yes
Change notifications: No
Change feed: No
Deployment:
Supported data sources: Hive, Redshift, Druid, RDBMS, Presto, Snowflake

DataHub

Website | GitHub

DataHub is an open-source data catalog for data discovery, data observability, and federated governance. It originated at LinkedIn and is offered commercially by Acryl Data as a cloud-hosted SaaS.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
✔️ details	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌

More features

Strategy: Push, Pull
Customizable metadata model: Yes. The metadata model can be declared using the open-source Pegasus language, and is interoperable with JSONSchema and Avro
Rich data profiling: Yes
Recommendations: Yes
Schemas, Description: Yes
Complex schemas: Yes
Data preview: Yes
Column statistics: Yes
Data owner: Yes
Top data users: Yes
Lineage impact analysis: Yes
Change notifications: Yes
Change feed: No
Automation: Yes
UX personalization: No
Deployment: docker-compose / Kubernetes with Helm, or using Acryl Data's SaaS offering
Supported data sources:
- Snowflake
- BigQuery
- Redshift
- Hive
- Athena
- Postgres
- MySQL
- SQL Server
- Trino
- Delta Lake
- S3
- Looker
- PowerBI
- Tableau
- Mode
- Metabase
- Redash
- Superset
- Airflow
- Great Expectations
- dbt
- Feast
- SageMaker
- Glue
- Kafka
- NiFi
- Okta
- LDAP
- Slack
- There are 50+ integrations; see the docs for the latest list.

Marquez

Website | GitHub

Marquez is an open-source data catalog, originated at WeWork, for the collection, aggregation, and visualization of a data ecosystem's metadata.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
OpenLineage	✔️	❌	✔️	?	❌	❌	❌	❌	✔️	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: No
Data preview: Yes
Column statistics: No
Data owner: Yes
Top data users: ?
Change notifications: No
Change feed: No
Deployment:
Supported data sources: S3, Kafka

Atlas

Website | GitHub

Apache Atlas is an open-source data catalog for metadata collection, governance, and data democratization.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	❌	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: No
Data preview: No
Column statistics: No
Data owner: No
Top data users: ?
Change notifications: Yes
Change feed: No
Deployment:
Supported data sources: HBase, Hive, Sqoop, Kafka, Storm

CKAN

Website | GitHub

CKAN is an open-source data catalog for data management, powering data portals for governments and enterprises.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	❌	✔️	❌	❌	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: ?
Schemas, Description: ?
Complex schemas: ?
Data preview: ?
Column statistics: ?
Data owner: ?
Top data users: ?
Change notifications: ?
Change feed: ?
Deployment:
Supported data sources:

Magda

Website | GitHub

Magda is an open-source data catalog that features data discovery, metadata enrichment, and federation, focused on geodata.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	❌	✔️	❌	❌	❌	❌	❌	❌

More features

Strategy: Push via UI
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: No
Data preview: Yes
Column statistics: No
Data owner: Yes
Top data users: ?
Change notifications: No
Change feed: No
Deployment:
Supported data sources: Mostly geodata

OpenDataDiscovery

Website | GitHub

The first open-source data discovery and observability platform. The ODD Platform is based on the ODD Specification.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌	✔️

More features

Strategy: Push/Pull
UX personalization: No
Rich data profiling: Yes
Data collaboration: Yes
Schemas, Description: Yes
Complex schemas: Yes
Data preview: Yes
Column statistics: Yes
Data owner: Yes
Change notifications: Yes
Change feed: Yes
Metadata versioning: Yes
SaaS: Yes
Third-party integrations: dbt, Great Expectations, and Prefect
Supported data sources: Airflow, Athena, AzureSQL, BigQuery, ClickHouse, Databricks, Delta Lake, Druid, DynamoDB, Fivetran, Glue, Hive, Kafka, Looker, MariaDB, MLflow, MSSQL, MySQL, Oracle, Postgres, Presto, Redash, Redpanda, Redshift, Snowflake, Tableau, and Vertica

OpenMetadata

Website | GitHub

OpenMetadata is the all-in-one platform for data collaboration, discovery, governance, lineage, and quality that lets you focus on building and analyzing.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
✔️	✔️	✔️	✔️	✔️	✔️	✔️	✔️	❌	✔️	✔️

More features

Strategy: Push/Pull
UX personalization: No
Rich data profiling: Yes
Data collaboration: Yes
Schemas, Description: Yes
Complex schemas: Yes
Data preview: Yes
Column statistics: Yes
Data owner: Yes
Change notifications: Yes
Change feed: Yes
Metadata versioning: Yes
SaaS: Yes
Third-party integrations: dbt, Great Expectations, and Prefect
Supported data sources: Airbyte, Airflow, Athena, AzureSQL, BigQuery, ClickHouse, Dagster, Databricks, DB2, Delta Lake, Druid, DynamoDB, Fivetran, Glue, Hive, Kafka, Looker, MariaDB, Metabase, MLflow, Mode, MSSQL, MySQL, NiFi, Oracle, Postgres, PowerBI, Presto, Redash, Redpanda, Redshift, Salesforce, SingleStore, Snowflake, Superset, Tableau, Trino, and Vertica

Meta#Grid

Website | GitHub | Docs

Meta#Grid is an open-source data catalog for metadata management. It is designed to help small and large organizations create an inventory of their data silos and connect different technologies. With its multi-client system and granular permissions, Meta#Grid can be used in consulting companies (with diverse clients and projects) as well as in data mesh organizations. It scales as requirements grow.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	not yet	❌	❌	❌	✔️

More features

Strategy: Push, Pull
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: Yes
Schemas, Description: Yes
Complex schemas: Yes
Data preview: No
Column statistics: No
Data owner: Yes
Top data users: No
Change notifications: Yes
Change feed: Yes
Deployment:
Supported data sources: Hive, Redshift, Druid, RDBMS, Presto, Snowflake

Grai

Website | GitHub | Docs

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
Grai Schemas	✔️	❌	✔️	❌	✔️	✔️	❌	❌	✔️	✔️

More features

Strategy: Push, Pull
Customizable metadata model: Yes. The metadata model can be flexibly extended or modified as needed.
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: Yes
Data preview: No
Column statistics: No
Data owner: Yes
Top data users: No
CI Integration: Yes
Lineage impact analysis: Yes
Change notifications: Yes
Change feed: Yes
Automation: Yes
UX personalization: Yes
Deployment: docker-compose / Kubernetes with Helm, or using Grai SaaS offering
Supported data sources:
- Snowflake
- BigQuery
- Redshift
- Postgres
- MySQL
- dbt
- Slack
- ... and many others; see the docs for a full list.

📕 Proprietary Data Catalogs

Collibra

Website | GitHub

Collibra is an enterprise data catalog that helps to discover and understand data that matters and drive impactful insights from it.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	?	✔️	❌	❌	?	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: Yes
AI autowiring: ?
Network-based: No
Rich data profiling: ?
Supported data sources:

Informatica

Website | GitHub

Informatica is an enterprise data catalog that provides an AI-powered data discovery engine to scan and catalog data assets.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	✔️	✔️	❌	❌	✔️	❌	❌	?	❌

More features

Strategy: Push
UX personalization: ?
AI autowiring: ?
Network-based: Yes
Rich data profiling: Yes
Supported data sources:

Alation

Website | GitHub

Alation is a collaborative data catalog that helps companies to drive value and business impact from their data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	✔️	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: Yes
AI autowiring: No
Network-based: No
Rich data profiling: No
Supported data sources:

Atlan

Website | GitHub

Atlan is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	✔️	❌	❌	✔️	✔️

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: No
Rich data profiling: ?
Supported data sources: Presto, Deequ, Atlas, Airflow, Hudi

DataGalaxy

Website | GitHub

DataGalaxy is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	✔️	✔️	❌	❌	❌	✔️	✔️	?	?

More features

Strategy: Pull & Push
UX personalization: Yes
AI autowiring: Yes
Network-based: Yes
Rich data profiling: Yes
Supported data sources: [Available connectors](https://www.datagalaxy.com/fr/integrations-connecteurs/)

Stemma

Website

Stemma is a fully managed data catalog powered by the open-source data catalog Amundsen that helps data teams have total trust in their data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	✔️	✔️	❌	❌	?	✔️	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Network-based: No
Rich data profiling: No
Supported data sources:

Talend

Website | GitHub

Talend is a data catalog that helps enterprises power critical business decisions with trusted data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	?	✔️	❌	❌	✔️	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: Yes
AI autowiring: ?
Network-based: ?
Rich data profiling: Yes
Supported data sources:

Select Star

Website

Select Star is an intelligent data discovery platform that automatically analyzes and documents your data. It provides an easy-to-use data portal where anyone can find and understand data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	✔️	✔️	✔️	❌	❌	✔️	❌	✔️	✔️

More features

Strategy: Pull
AI autowiring: Yes
Network-based: Yes
Rich data profiling: No
ER Diagram generation: Yes
Role & Policy based access control: Yes
Popularity & usage: Yes
Description & Tag propagation: Yes
Data preview: Yes
Data owners: Yes
Top data users: Yes
UX personalization: No
Supported data sources:
- Snowflake
- BigQuery
- Redshift
- Postgres
- Looker
- PowerBI
- Tableau
- Mode
- Sigma
- Sisense
- Metabase
- Looker Studio
- dbt Cloud & Core
- Slack

📒 Monocloud Data Catalogs

Google Cloud Data Catalog

Website | GitHub

Google Cloud Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	?	❌	❌	❌	❌

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: No
Rich data profiling: No
Supported data sources:

Azure Data Catalog

Website

Azure Data Catalog is a fully managed, enterprise-wide metadata catalog that makes data asset discovery straightforward.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	?	✔️	❌	❌	?	❌	❌	❌	❌

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources:

🔍 Data Observability Platforms

Monte Carlo

Website

Monte Carlo is a data observability tool that helps to increase trust in data by eliminating or preventing data downtime.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	✔️	❌	✔️	❌	❌

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources: Snowflake, Hive, Kafka, Looker, Redshift, Tableau, BigQuery, Airflow, Fivetran, Presto, Mode, Periscope, Databricks, Glue, dbt, Chartio, Spark, AWS, S3, data.world, Google Cloud Platform

Databand

Website | GitHub

Databand is an observability platform that helps data engineers identify and troubleshoot pipeline issues and data quality problems.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	?	?	?	❌	?	?	?	✔️	?	?

More features

Strategy: Push
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources:

Datafold

Website | GitHub

Datafold is a data monitoring and observability platform that gives you confidence in your data quality through diffs, profiling, and anomaly detection.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	✔️	✔️	❌	❌	✔️	❌	✔️	?	?

More features

Strategy: Push
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources:

Ataccama

Website | GitHub

Ataccama is an enterprise data catalog and observability tool featuring data profiling and data quality management, designed for data professionals.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability	Column-level lineage	Data collaboration
❌	✔️	❌	✔️	❌	❌	✔️	❌	❌	❌	❌

More features

Strategy: Pull
UX personalization: Yes
AI autowiring: No
Network-based: No
Rich data profiling: Yes
Supported data sources:

Open Source Agenda is not affiliated with "Awesome Data Catalogs" Project. README Source: opendatadiscovery/awesome-data-catalogs

Repository

opendatadiscovery/awesome-data-catalogs

License

MIT
