DataHub Versions

The Metadata Platform for your Data Stack

v0.10.4

11 months ago

Release Highlights

User Experience

  • You can now create and assign Custom Ownership types within DataHub; plus, we now display the owner type on an Entity Page

  • Various bug fixes to Column Level Lineage visualization

Metadata ingestion

  • You can now define column-level lineage (aka fine-grained lineage) via our file-based lineage source
  • Looker: Ingest Looks that are not part of a Dashboard
  • Glue: Error reporting now includes lineage failures
  • BigQuery: Now supports deduplicating LogEntries based on insertId, timestamp, and logName
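
The BigQuery deduplication mentioned above can be pictured with a minimal sketch. This is not the connector's implementation; it only illustrates keying entries on the three fields named in the note. The helper name dedupe_log_entries and the sample entries are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Set, Tuple


def dedupe_log_entries(entries: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield each log entry once, keyed on (insertId, timestamp, logName)."""
    seen: Set[Tuple[Any, Any, Any]] = set()
    for entry in entries:
        key = (entry.get("insertId"), entry.get("timestamp"), entry.get("logName"))
        if key in seen:
            continue  # same entry delivered again; skip the repeat
        seen.add(key)
        yield entry


# Hypothetical sample data: the first two entries are identical on all three keys.
entries = [
    {"insertId": "a1", "timestamp": "2023-06-01T00:00:00Z", "logName": "projects/p/logs/data_access"},
    {"insertId": "a1", "timestamp": "2023-06-01T00:00:00Z", "logName": "projects/p/logs/data_access"},
    {"insertId": "a1", "timestamp": "2023-06-01T00:00:05Z", "logName": "projects/p/logs/data_access"},
]
unique_entries = list(dedupe_log_entries(entries))
```

Using all three fields as the key means two entries that share an insertId but differ in timestamp are both kept.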

Docs

  • CSV Enricher: improvements to sample CSV and recipe
  • Guide for changing default DataHub credentials
  • Updated guide to apply time-based filters on Lineage

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.3...v0.10.4

v0.10.3

1 year ago

Release Highlights

User Experience

  • Define Data Products via YAML and manage associated entities within a Domain
  • Search experience: quickly apply a filter at time of search
  • Form-based PowerBI ingestion

Developer Experience

  • Progress toward Removing Confluent Schema Registry requirement -- Helm & Quickstart simplifications to follow
    • NOTE: this will only work for new deployments of DataHub; if you have already deployed DataHub with Confluent Schema Registry, you will not be able to disable it
  • Delete CLI - correctly handles deleting timeseries aspects
  • Ongoing improvements to Quickstart stability
  • Support entity types filter in get_urns_by_filter
  • Search customization
    • regex based query matching
    • full control over scoring functions (usable on any document field, e.g. tags, deprecated flags, etc.)
    • enable/disable fuzzy, prefix, exact match queries

Ingestion

  • BigQuery - Improve ingestion disk usage & speed; extract dataset usage from Views
  • Unity Catalog - Capture create/last modified timestamps; extract usage; data profiling support
  • PowerBI - Update workspace concept mapping; support modified_since, extract_dataset_schema, and more
  • Superset – support stateful ingestion
  • Business Glossary – Simplify ingestion source
  • Kafka – Add description in dataset properties
  • S3 – Support stateful ingestion & last_updated
  • CSV Enricher – Support updating more types
  • PII Classification - Configurable sample size
  • Nifi - Support Kerberos authentication

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.2...v0.10.3

v0.10.2

1 year ago

Known Issues

  • Postgresql: In release v0.10.1 the default value for max_threads was increased in the CLI from 1 to 15. This creates an issue with Postgresql transactions. The recommended workaround is to decrease the max_threads in your ingestion recipes to 1 if running Postgresql for the GMS backend.
  • BigQuery: The BigQuery connector depends on a bad version of SQLParse, which manifests as a "str object is not callable" error. This has since been fixed in CLI release version v0.10.2.2.
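
As a rough sketch of the Postgresql workaround above, the helper below forces max_threads to 1 in a recipe held as a Python dict. Placing max_threads under the sink's config block is an assumption here, as are the sample recipe values and the helper name; check your sink's documentation for where the setting belongs.

```python
import copy
from typing import Any, Dict


def with_single_threaded_sink(recipe: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the recipe with max_threads set to 1 on the sink."""
    patched = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    sink = patched.setdefault("sink", {})
    # Assumption: max_threads is read from the sink's config block.
    sink.setdefault("config", {})["max_threads"] = 1
    return patched


# Hypothetical recipe with placeholder connection values.
recipe = {
    "source": {"type": "postgres", "config": {"host_port": "localhost:5432"}},
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
patched_recipe = with_single_threaded_sink(recipe)
```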

Release Highlights

Metadata Ingestion

New

  • [redshift] Redshift Combining Usage and Metadata Extraction
  • [bigquery] Cross-Project Usage Support (using File System)
  • [snowflake] Push down Lineage Extraction to Snowflake Access History API
  • [azure-ad] Support stateful ingestion - Automatically remove groups and users when they are removed in Azure.
  • [okta] Support stateful ingestion - Automatically remove groups and users when they are removed in Okta.
  • [tableau] Extract lineage from CSQL queries in Tableau ingestion
  • [snowflake] Better error message on key pair authentication
  • [sdk] Support executing GraphQL Queries via DataHubGraph
  • [unity] Support extracting ownership
  • [postgres] Support extracting metadata from all databases in a single recipe

Bug Fixes

  • [bigquery] Capture all operation types when ingesting operational stats
  • [bigquery] Fix and refactor exported audit logs query
  • [redshift] Fix SQL for extracting lineage from insert queries

User Experience

New

  • Auto-Complete UX Refresh - Quickly filter search results inside autocomplete experience
  • View: Support views on the Auto-Complete Search Bar

Bug Fixes

  • Fix bug where Tag names do not render properly in search previews
  • Fix bug where Tag color does not render properly in search autocomplete
  • Fix bug when adding Tags and Glossary Terms to nested schema fields
  • Fix bug where DataHub would redirect you when clicking to navigate back home
  • Fix bug where Metadata Tests results did not show if they were all passing

Documentation

Developer Experience

  • Add performance testing framework for BigQuery usage

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.1...v0.10.2

v0.10.1

1 year ago

v0.10.0

1 year ago

Release Highlights

Potential Downtime

This release introduces substantial improvements to search functionality which require reindexing indices.

During the reindexing:

  • a system-update job will set indices to read-only and create a backup/clone of each index
  • new components will be prevented from start-up until the reindex completes
  • Helm deployments will go into read-only mode and new ingestion runs will fail

This process can take anywhere from 5 minutes to multiple hours; as a rough estimate, expect it to take 1 hour for every 2.3 million entities. After the reindex is complete, check your ingestion runs and re-run any that did not complete.
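
To turn the rough estimate above into a number for your own deployment, a small calculation is enough. The sketch below assumes the duration scales linearly with entity count, which is a simplification; it ignores any fixed overhead, so very small deployments may still take the few minutes mentioned above.

```python
ENTITIES_PER_HOUR = 2_300_000  # rough figure quoted in the release note


def estimate_reindex_hours(entity_count: int) -> float:
    """Linear estimate of the reindex duration, in hours."""
    if entity_count < 0:
        raise ValueError("entity_count must be non-negative")
    return entity_count / ENTITIES_PER_HOUR


# Example: a deployment with 4.6 million entities.
hours = estimate_reindex_hours(4_600_000)
```

For 4.6 million entities this yields an estimate of about two hours.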

If you are deploying containers yourself

If you're deploying the Docker containers yourself (without Helm or Docker-Compose Quickstart), then you'll need to ensure that you first run the acryldata/datahub-upgrade docker image (v0.10.0 tag) with the following environment variables enabled.

Then run the container with this command:

docker run acryldata/datahub-upgrade:v0.10.0 -u SystemUpdate

For the full set of environment variables required, check out the default docker.env provided for Docker Compose deployments.

This will run the required reindex against your elasticsearch instance, after which other DataHub components should start correctly. If you do not run the datahub-upgrade container successfully, other components in the stack will fail to start correctly.

User Experience

We have some really exciting improvements to the DataHub user experience in this release!

Improved documentation editor, contributed by @ngamanda and the Grab Team. This work provides a much more intuitive documentation editing experience within the UI, providing “what you see is what you get” formatting & removing the need for markdown expertise.

Additionally, you can easily:

  • Add links to other entities/users within DataHub
  • embed and resize tables & images
  • toggle between font sizes and formats
  • embed syntax-highlighted code blocks

Filter lineage graphs based on time windows: You can now easily see the full lineage graph of an entity at a specific point in time. This makes it much easier to understand how interdependencies have evolved over time and to troubleshoot data issues in the past.

Improvements in Search: As noted above, we have rolled out substantial improvements to Search functionality, making it easier than ever for end users to find the entities that matter most. This release includes:

  • Stemming & Synonyms
  • Search by full or partial URN
  • Autocomplete improvements
  • Quoted search analyzer for exact & prefix match

Metadata Ingestion

Here are some of the most notable ingestion-related improvements:

  • Redshift: You can now extract lineage information from unload queries – thanks for the contrib, @mmmeeedddsss
  • PowerBI: Ingestion now maps Workspaces to DataHub Containers – thanks for the contrib, @looppi
  • BigQuery: You can now extract lineage metadata from the Catalog API – thanks for the contrib, @PatrickfBraz
  • Glue: Ingestion now uses table name as the human-readable name – thanks for the contrib, @danielcmessias

Developer Experience

  • This release introduces DataHub Lite - a new experimental lightweight implementation of DataHub. It is intended to enable local developer tooling use-cases such as simple access to metadata for scripts and other tools. DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports. Check out the docs here.

Breaking Changes

#7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics in the kafka-setup docker image have been updated to be in line with other DataHub components; for more info, see our docs on Configuring Kafka in DataHub. These variables were previously suffixed with _TOPIC; the correct suffix is now _TOPIC_NAME. This change should not affect any user who is using default Kafka topic names.
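
If you keep custom topic settings in a dict or env file, the suffix change can be sketched as a simple key rename. The variable name EXAMPLE_EVENT_TOPIC below is hypothetical, and treating every key that ends in _TOPIC as affected is an assumption; confirm the real variable names against the Configuring Kafka docs before applying anything like this.

```python
from typing import Dict


def rename_topic_env_vars(env: Dict[str, str]) -> Dict[str, str]:
    """Rename keys ending in _TOPIC so that they end in _TOPIC_NAME."""
    renamed: Dict[str, str] = {}
    for name, value in env.items():
        if name.endswith("_TOPIC"):
            renamed[name + "_NAME"] = value  # X_TOPIC becomes X_TOPIC_NAME
        else:
            renamed[name] = value  # unrelated settings pass through unchanged
    return renamed


# Hypothetical settings; only the first key matches the old suffix.
old_env = {
    "EXAMPLE_EVENT_TOPIC": "custom_events_v1",
    "KAFKA_BOOTSTRAP_SERVER": "broker:29092",
}
new_env = rename_topic_env_vars(old_env)
```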

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.9.6...v0.10.0

v0.9.6.1

1 year ago

Release Highlights

Please upgrade from 0.9.6 ASAP to avoid ongoing issues creating and using secrets.

Important Release Notes

With this release, if you are using Neo4J as your graph implementation, you need to set GRAPH_SERVICE_DIFF_MODE_ENABLED=false for GMS (or for the MAE Consumer in standalone mode).
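
One way to catch a missed Neo4J setting before rollout is a small pre-flight check over the environment you plan to pass to GMS. This is only a sketch: the helper name is hypothetical, and treating an unset variable as "not disabled" is an assumption about the default.

```python
from typing import Mapping


def diff_mode_disabled(env: Mapping[str, str]) -> bool:
    """True only when GRAPH_SERVICE_DIFF_MODE_ENABLED is explicitly false."""
    value = env.get("GRAPH_SERVICE_DIFF_MODE_ENABLED", "")
    return value.strip().lower() == "false"


explicitly_disabled = diff_mode_disabled({"GRAPH_SERVICE_DIFF_MODE_ENABLED": "false"})
unset = diff_mode_disabled({})
```

Passing os.environ (or a parsed env file) as the mapping lets the same check run inside a deployment script.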

Bug fix for secrets encryption

  • Prevents decryption errors for existing secrets
  • Affects reading ingestion secrets created with a previous release
  • Affects native user password validation

What's Changed

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.9.6...v0.9.6.1

v0.9.6

1 year ago

Warning: This release has been patched. Please upgrade to 0.9.6.1.

As of January 19th, 2023, 0.9.6.1 is the official release build and should be used instead of 0.9.6. Upgrade to 0.9.6.1 when possible to avoid issues creating and using secrets.

Release Highlights

Important Release Notes

With this release, if you are using Neo4J as your graph implementation, you need to set GRAPH_SERVICE_DIFF_MODE_ENABLED=false for GMS (or for the MAE Consumer in standalone mode).

User Experience

  • We now support embedding Dashboards, Charts, and Datasets. This allows us to do things like directly embed Looker / Tableau / Mode / Redash Looks, Dashboards, Explores into the Dataset pages themselves.

  • [Experimental] You can now customize the number of queries displayed on the Query tab of a Dataset entity

  • Improved error messaging for bulk editing via the UI

Metadata Ingestion

  • Update to data profiling to allow configurable number of sample values to be returned
  • Postgres ingestion now supports emitting lineage edges for Views - shoutout to @LucasRoesler for the contribution!
  • Snowflake ingestion now supports extracting tags - shoutout to @frsann for the contribution!
  • Vertica ingestion now supports projections and lineage - thanks for the contribution, @vishalkSimplify!
  • Glue ingestion now emits an s3 lineage edge when data was written with an s3a/s3n client - thanks for the contribution, @danielli-ziprecruiter!

Developer Experience

  • Fixes quickstart/docker compose issues for M1 machines
  • Improvements in reliability and performance of the Restli Service endpoints for ingestion:
    • Scale Restli Service thread pool based on CPU
    • Add retry (exp backoff) to Restli Entity Client
    • MCE no longer relies on GMS for Restli service
    • Converted Restli Service from standalone servlet to Spring injectable
    • Docker build externalized (significantly faster on m1, <7 minute build times, based on this)
    • Frontend asset generation refactor (causing tests to fail intermittently)
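
The "retry (exp backoff)" item above refers to a general pattern that can be sketched in a few lines. This is a generic Python illustration, not the Restli Entity Client's code; the function name, default attempt count, and base delay are all assumptions chosen for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            sleep(base_delay_s * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")


# Demo: an operation that fails twice, then succeeds. Delays are recorded
# instead of slept so the example finishes immediately.
delays: List[float] = []
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
```

Injecting the sleep function keeps the helper easy to test; production code would normally also add jitter and retry only on errors known to be transient.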

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.9.5...v0.9.6

v0.9.5

1 year ago

Release Highlights

Notice: This release includes a fix for a Single Sign-On (OIDC) issue that was introduced in the previous release, v0.9.4.

Important Release Notes

With this release, if you are using Neo4J as your graph implementation, you need to set GRAPH_SERVICE_DIFF_MODE_ENABLED=false for GMS (or for the MAE Consumer in standalone mode).

User Experience

  • Manual Lineage is LIVE! You can now add and remove lineage between entities in the Lineage Visualization screen, making it easier than ever to manage the complex relationships between your data resources.

  • Our new Views feature makes it easy to create curated sets of Entities within DataHub. This is a great way to start to isolate the entities that matter most, and provide your DataHub end-users with a streamlined view of the assets that are relevant to their use cases. See the original demo video.

  • In-App Product Tours are here! When logging into DataHub and/or visiting a new page type for the first time, new users will be prompted with a helpful walkthrough of core functionality to get them familiar with the platform. We’ll continue to add modules as we roll out new features!

  • Automatically send updates to Slack and/or Microsoft Teams when changes are made within DataHub by leveraging our new Slack and Teams Actions.

Metadata Ingestion

We’re continuing to improve the user experience for UI-based ingestion for the following sources:

  • DataBricks Unity Catalog
  • dbt Cloud
  • MySQL
  • Trino/Presto
  • Microsoft SQL Server
  • MariaDB

If you’re just getting started with UI-based Ingestion, check out our new BigQuery & Snowflake guides.

Stateful ingestion is now supported for Iceberg (thanks for the contrib, @cccs-Dustin!) and LDAP (thanks for the contrib, @bda618!)

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.9.4...v0.9.5

v0.9.4

1 year ago

Known Issues

In this release, our OIDC SSO library received a major version upgrade. There is an issue with how the newer version of the library interacts with OIDC providers. We have addressed this issue in v0.9.5. We recommend against upgrading to this version if your organization is actively using OIDC to manage user authentication.

Important Release Notes

With this release, if you are using Neo4J as your graph implementation, you need to set GRAPH_SERVICE_DIFF_MODE_ENABLED=false for GMS (or for the MAE Consumer in standalone mode).

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.9.3...v0.9.4

v0.9.3

1 year ago

Release Highlights

Important Release Notes

With this release, if you are using Neo4J as your graph implementation, you need to set GRAPH_SERVICE_DIFF_MODE_ENABLED=false for GMS (or for the MAE Consumer in standalone mode).

User Experience

  • Column Level Lineage Impact Analysis is live! Read more about it here
  • You can now sort Dataset field names alphabetically - this is super handy for finding columns within wide datasets that may not have an easy-to-follow order by default

  • New - an “Explore All” button on the home page, making it easier to jump into the search experience

  • Plus! We now have a “Share” button on entity pages, making it easier for you to share DataHub links with others

  • [Community Contribution] You can now assign the same user as different owner types - thanks for the contrib, @rtekal!

  • [Community Contribution] You can now see recommendations for Recently Edited entities on the homepage! - thanks for the contrib, @CorentinDuhamel

Metadata Ingestion

  • Snowflake Automated PII Classification is here! We’re eager for feedback on the utility of this feature - check out this guide, take it for a spin, and let us know what you think!
  • NEW! dbt Cloud ingestion is ready for ya - check out the module details here
  • We’ve simplified the configs required to add stateful ingestion to an ingestion source - check out the updated docs here
  • Speaking of stateful ingestion, it’s now available with:
    • Looker & LookML ingestion sources
    • [Community Contribution] Container-level ingestion – thanks for the contrib, @wangsaisai!
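
To give a feel for what a simplified stateful ingestion setup might look like, the sketch below patches a recipe held as a Python dict. The keys pipeline_name and stateful_ingestion.enabled are assumptions based on common recipe layouts, and the helper name and sample values are hypothetical; the updated docs referenced above are the authority on the exact fields.

```python
import copy
from typing import Any, Dict


def enable_stateful_ingestion(recipe: Dict[str, Any], pipeline_name: str) -> Dict[str, Any]:
    """Return a copy of the recipe with stateful ingestion switched on."""
    patched = copy.deepcopy(recipe)  # do not mutate the caller's recipe
    # Assumption: state is tracked per pipeline, identified by a stable name.
    patched["pipeline_name"] = pipeline_name
    source_config = patched.setdefault("source", {}).setdefault("config", {})
    # Assumption: the source config accepts a stateful_ingestion.enabled flag.
    source_config.setdefault("stateful_ingestion", {})["enabled"] = True
    return patched


# Hypothetical Looker recipe with its connection settings omitted.
looker_recipe = {"source": {"type": "looker", "config": {}}}
stateful_recipe = enable_stateful_ingestion(looker_recipe, "looker_prod_daily")
```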

Developer Experience

  • [Community Contribution] For those of you deploying DataHub with Neo4j, we now support Lineage Impact analysis via Neo4j multihop functionality. Thanks for the contrib, @djordje-mijatovic!
  • We’ve loosened our SQLAlchemy dependencies to support Airflow 2.3+

What's Changed

New Contributors

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.9.2...v0.9.3