DataHub Versions

The Metadata Platform for your Data Stack

v0.13.2

1 month ago

Hotfix Release

Fixes an MCL (Metadata Change Log) message deserialization bug that occurs when using the internal schema registry while running either of the following upgrade jobs:

policyFields (enabled by default): BOOTSTRAP_SYSTEM_UPDATE_POLICY_FIELDS_ENABLED:true

dataJobNodeCLL (disabled by default): BOOTSTRAP_SYSTEM_UPDATE_DATA_JOB_NODE_CLL_ENABLED:false

Example Error:

Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 1
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 13 out of bounds for length 2
        at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:460)
        at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:283)
        at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
        at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)

Recovery Directions:

If you are currently affected, delete the topic before upgrading to v0.13.2 to remove the corrupted message. The default topic name is MetadataChangeLog_Versioned_v1; if you have customized the topic name, delete that topic instead.

If you run Kafka per the example Helm chart for prerequisites, the following command deletes the topic.

kubectl exec -it prerequisites-kafka-broker-0 -c kafka -- kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic MetadataChangeLog_Versioned_v1
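
If you have customized the topic name, the same command can be parameterized. The Python sketch below only assembles the argument list from the command above and prints it for review. The helper name and its defaults are illustrative, and the pod and container names assume the example prerequisites Helm chart.

```python
import shlex
from typing import List


def build_delete_topic_command(
    topic: str = "MetadataChangeLog_Versioned_v1",
    pod: str = "prerequisites-kafka-broker-0",
    container: str = "kafka",
    bootstrap_server: str = "localhost:9092",
) -> List[str]:
    """Return the kubectl argument list that deletes `topic` (hypothetical helper)."""
    return [
        "kubectl", "exec", "-it", pod, "-c", container, "--",
        "kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--delete", "--topic", topic,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, because deleting a topic is destructive.
    print(shlex.join(build_delete_topic_command()))
```

Pass `topic="<your topic name>"` to target a customized topic, and check the printed command before running it.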

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.13.1...v0.13.2

v0.13.1

1 month ago

DataHub Release Notes

User Experience

  • Capture and Manage Common Joins between Datasets: Users can now view and manage common join relationships between datasets, making it easier than ever to capture best practices and bespoke join logic. Watch the walkthrough here! 8325
    • Heads up: you'll need to enable the ER_MODEL_RELATIONSHIP_FEATURE_ENABLED env variable to use this feature!
  • Enhanced UI Interactions: Users can now enjoy an improved markdown editor and filter policies by active/inactive statuses, resulting in a more intuitive and manageable interface. 9949, 9958
  • Visual Context for Groups: You can now include picture links for groups in the UI, adding a richer visual context and enhancing the navigational experience. 9882
  • Improved Error Visibility: The UI now displays error messages related to data size limitations, allowing for better troubleshooting and user experience. 10038

Developer Experience

  • Enhanced Kafka Compatibility: Updated client version for Kafka setup ensures better compatibility and functionality for developers. 9962
  • Optimized Docker Build: Docker setups now respect pip mirrors, optimizing the build process especially in restricted network environments. 9963
  • Advanced Error Handling: New error handling for duplicate class names and improved fspath lint error management enhance the code reliability and quality. 9960, 9976
  • Latest OpenSearch Image: Incorporation of OpenSearch image version 2.11.0 aligns with the latest stable releases, boosting performance and security. 9984

Metadata Ingestion

  • NEW: Dagster Integration: You can now seamlessly ingest your Dagster Pipelines, Jobs, Ops, and lineage into DataHub. 10071
  • Expanded Field Classification Support: This release introduces support for field-level classification during ingestion for Redshift, BigQuery, DynamoDB, and SQL Sources. 10013, 10031
  • Enhanced Ingestion Capabilities: DataHub now offers stateful ingestion by default, optimizing routines for REST sinks and improving metadata accuracy across diverse sources like dbt and BigQuery. 9934, 10158, 10080
  • Better Data Lineage: This release introduces support for OpenLineage in service of the Spark Lineage Beta Plugin; additionally, we now support incremental Column-Level Lineage, improving the accuracy of detecting column-level relationships during ingestion. 9870, 9967, 10090
  • Schema Clarity: New descriptions support for JSON schema arrays and a mechanism to escape special characters in BigQuery table descriptions aid in clearer schema validation and ingestion processes. Databricks ingestion now supports Hive Metastore schemas with special characters. 9757, 9932, 10049

Version Upgrades

  • Kafka client and OpenSearch image were updated to the latest versions.

Breaking Changes

This release introduces default settings for stateful ingestion and updates in handling dbt ingestion. For details on all breaking changes, view the full documentation here.

Contributors

MASSIVE shoutout to our contributors!

First-Time Contributors

akarsh991, alexs-101, AvaniSiddhapuraAPT, diegmonti, dushayntAW, filipe-caetano-ovo, HuanjieGuo, jayacryl, k7ragav, kopax-polyconseil, LePuppy, Nelvin73, pinakipb2, poorvi767, rae89, trialiya, valeral.

Repeat Contributors

ANich, shubhamjagtap639, sgomezvillamor, siladitya2, skrydal, sumitappt, Masterchen09, mayurinehate, ngamanda, gaurav2733, githendrik, jayasimhankv.

DataHub Maintainers

anshbansal, asikowitz, chriscollins3456, darnaut, david-leifker, eboneil, ethan-cartwright, gabe-lyons, hsheth2, pedro93, RyanHolstien, treff7es, yoonhyejin.

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.13.0...v0.13.1

v0.13.0

2 months ago

v0.12.1

5 months ago

Release Highlights

New Features

  • SQLAlchemy Source Enhancements: Support for view lineage across all SQLAlchemy sources (PR #9039).
  • Airflow Integration: Retry callback and support for ExternalTaskSensor subclasses added (PR #8514).
  • Kafka Enhancements: Increased Kafka message size and enabled compression (PR #9038).
  • JSONSchema Ingestion: Enabled schema-aware JsonSchemaTranslator (PR #8971).
  • Search Bar Improvements: Added a flag to hide/display the autocomplete query (PR #9104).
  • SQL Parser Performance: Enhancements and asyncio fixes (PR #9119).
  • MongoDB Ingestion: Support for stateful ingestion and improved schema inference for lists (PR #9118, PR #9145).
  • Policy Engine Updates: Refactoring and enhancements, including support for 10k+ policies (PR #9163, PR #9177).
  • UI Enhancements: Numerous improvements including command-k icons in the search bar, updated Apollo cache, and auto-complete debounce in the search bar (PR #9194, PR #9193, PR #9205).
  • Fivetran Integration: Connector integration for Fivetran (PR #9018).
  • Neo4j Database Support: Connection to specific Neo4j databases now supported (PR #9179).
  • Chart Subtypes in UI: Support for chart subtypes (PR #9186).

Fixes and Improvements

  • BigQuery Fixes: Resolved issues with lineage filter query, and fixed extracting comments from complex types (PR #9114, PR #8950).
  • MongoDB Refactoring: Platform instance addition to MongoDB (PR #8663).
  • Kafka Setup: Adjusted truststore settings for PEM files (PR #8656).
  • REST API Authorization: Fixed rollback failure when authorization is enabled (PR #9092).
  • Java Exception Handling: Addressed java.util.ConcurrentModificationException (PR #9090).
  • UI and Documentation: Fixed filtering logic in UI, corrected documentation errors, and added feature guides (PR #9116, PR #9125, PR #9124, PR #9126, PR #9134, PR #9137, PR #9122, PR #9068).
  • SQL Server and Snowflake Ingestion: Updated queries and fixed missing view downstream call (PR #9127, PR #8966).
  • ClickHouse and DB2 Ingestion: Addressed column reflection regression and table properties handling (PR #9143, PR #9128).
  • Ingestion Improvements: Numerous fixes and enhancements across various ingestion sources (PR #9153, PR #9155, PR #9141, PR #9157, PR #9123).
  • CI and Build Process: Tweaked workflows, increased gradle retries, and addressed CI errors (PR #9052, PR #9091, PR #9160).
  • Security Updates: Addressed a zookeeper CVE and other security concerns (PR #9190).
  • UI Refactoring: Improved entity page loading indicators and renamed button texts (PR #9195, PR #9196).
  • Policy and Auth Enhancements: Refactored policy locking and added roles to policy engine validation logic (PR #9178).

Testing and Continuous Integration

  • API Testing: Added tests for managing secrets, access token privilege, and flaky tests fix (PR #9121, PR #9167, PR #9132, PR #9175).
  • Cypress Test Fixes: Addressed glossary navigation and download_lineage_results tests (PR #9175, PR #9132).

Cleanup and Refactoring

  • Ingestion Cleanup: Removed legacy memory_leak_detector and refactored ingestion sources (PR #9158, PR #9135, PR #9120, PR #9105).
  • PDL Refactoring: Refactored Assertion model enums (PR #9191).

Build and Deployment

  • Release Preparation: Updated files for the 0.12.0 release (PR #9130).

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.12.0...v0.12.1

v0.12.1rc2

5 months ago

What's Changed

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.12.1...v0.12.1rc2

v0.12.0

6 months ago

v0.12.0 Release Highlights

User Experience

Nested Domains

Nested Domains are here! This provides flexibility in organizing your entities within Domains to match the unique organizational structure of your company.

DataHub Chrome Extension Improvements

The Acryl DataHub Chrome extension now supports PowerBI! This is a super powerful way for your business users to gain DataHub-specific insights directly in the BI tools they use most. Additionally, we now support making edits back to DataHub Entities directly from the Chrome extension.

Access Management Tab for Datasets

Shoutout to @Ramendra761 from the PayPal Team for contributing a new Access Management tab in Dataset Entity pages! The aim of this feature is to enable users to view the required roles for accessing the Dataset, as defined by Roles and/or Policies in the organization’s Access Management System. It also introduces the ability to request access directly from the page.

Metadata Ingestion

Miscellaneous Improvements

  • Sampling-Based Profiling: You can now configure sampling-based profiling to address query performance concerns in Snowflake and BigQuery
  • Kafka Connect > Snowflake: We now support automatically defining lineage between the two platforms
  • Athena: Support for complex and nested schemas

Column-Level Lineage

We are incubating CLL support for the following:

  • Airflow plugin v2 now supports automatic extraction of CLL for certain operators, removing the need to annotate DAGs
  • dbt
  • Redshift
  • PowerBI (support for Column-Level Lineage for M-Query)

Incubating Sources

  • MLflow
  • Teradata
  • Unity Catalog Notebooks
  • DynamoDB

Developer Experience

  • Data Contracts: v0.12.0 introduces underlying models and CLI; UI support to follow
  • We now support creating custom models without requiring a fork of the main DataHub project
  • Updates to support OpenSearch 2.x and alternate Postgres db in postgres-setup

Other Notable Changes

  • Session token configuration has changed: all previously created session tokens will be invalid and users will be prompted to log in again. The expiration time has also been shortened, which may result in more login prompts with the default settings. There should be no other interruption due to this change.

Breaking Changes

Find full details here

  • #9044 - GraphQL APIs for adding ownership now expect either an ownershipTypeUrn referencing a custom ownership type or a (deprecated) type. Adding an ownership without a concrete type was previously allowed; this is no longer the case. For simplicity, you can use the type parameter, which is translated to a custom ownership type internally if one exists for the type being added.
  • #9010 - In the Redshift source's config, incremental_lineage now defaults to off.
  • #8810 - Removed support for SQLAlchemy 1.3.x. Only SQLAlchemy 1.4.x is supported now.
  • #8942 - Removed urn:li:corpuser:datahub owner for the Measure, Dimension and Temporal tags emitted by Looker and LookML source connectors.
  • #8853 - The Airflow plugin no longer supports Airflow 2.0.x or Python 3.7. See the docs for more details.
  • #8853 - Introduced the Airflow plugin v2. If you're using Airflow 2.3+, the v2 plugin will be enabled by default, and so you'll need to switch your requirements to include pip install 'acryl-datahub-airflow-plugin[plugin-v2]'. To continue using the v1 plugin, set the DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN environment variable to true.
  • #8943 - The Unity Catalog ingestion source has a new option include_metastore, which will cause all urns to be changed when disabled. This is currently enabled by default to preserve compatibility, but will be disabled by default and then removed in the future. If stateful ingestion is enabled, simply setting include_metastore: false will perform all required cleanup. Otherwise, we recommend soft deleting all databricks data via the DataHub CLI: datahub delete --platform databricks --soft and then reingesting with include_metastore: false.
  • #8846 - Changed enum values in resource filters used by policies. RESOURCE_TYPE became TYPE and RESOURCE_URN became URN. Any existing policies using these filters (i.e. defined for particular urns or types such as dataset) need to be upgraded manually, for example by retrieving their respective dataHubPolicyInfo aspect and changing the filter section from
   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "RESOURCE_TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }
   }

into

   "resources": {
     "filter": {
       "criteria": [
         {
           "field": "TYPE",
           "condition": "EQUALS",
           "values": [
             "dataset"
           ]
         }
       ]
     }
   }

for example, by using the datahub put command. Policies can also be removed and re-created via the UI.
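
The rename described in #8846 can also be applied programmatically once you have the aspect as a dictionary. The following is a minimal Python sketch under that assumption: the function name is hypothetical, it only touches the "field" values shown in the example above, and fetching or writing the aspect back is left out.

```python
import copy
from typing import Any, Dict

# Old enum value -> new enum value, as listed in the #8846 note.
FIELD_RENAMES = {"RESOURCE_TYPE": "TYPE", "RESOURCE_URN": "URN"}


def upgrade_policy_filter(policy_info: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a dataHubPolicyInfo-style dict with renamed filter fields."""
    upgraded = copy.deepcopy(policy_info)  # leave the caller's dict untouched
    criteria = upgraded.get("resources", {}).get("filter", {}).get("criteria", [])
    for criterion in criteria:
        old_field = criterion.get("field")
        if old_field in FIELD_RENAMES:
            criterion["field"] = FIELD_RENAMES[old_field]
    return upgraded


old_policy = {
    "resources": {
        "filter": {
            "criteria": [
                {"field": "RESOURCE_TYPE", "condition": "EQUALS", "values": ["dataset"]}
            ]
        }
    }
}
new_policy = upgrade_policy_filter(old_policy)
print(new_policy["resources"]["filter"]["criteria"][0]["field"])  # TYPE
```

Criteria that already use the new values, or that use other fields, pass through unchanged.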

  • #9077 - The BigQuery ingestion source by default sets match_fully_qualified_names: true. This means that any dataset_pattern or schema_pattern specified will be matched on the fully qualified dataset name, i.e. <project_name>.<dataset_name>. We attempt to support the old pattern format by prepending .*\\. to dataset patterns lacking a period, so in most cases this should not cause any issues. However, if you have a complex dataset pattern, we recommend you manually convert it to the fully qualified format to avoid any potential issues.
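
To see why an old dataset-only pattern can stop matching, here is a small Python sketch using the standard `re` module. It is not the connector's code: the helper name, the example names, and the choice to keep a leading `^` in front of the prefix are assumptions for illustration, and the check for which patterns "lack a period" is left out.

```python
import re


def add_project_wildcard(pattern: str) -> str:
    """Prefix a dataset-only regex with '.*\\.' so it can match '<project>.<dataset>'.

    Keeping a leading '^' anchor in front of the prefix is an assumption of this sketch.
    """
    anchor = "^" if pattern.startswith("^") else ""
    return anchor + r".*\." + pattern[len(anchor):]


fully_qualified_name = "my-project.sales_eu"  # hypothetical <project_name>.<dataset_name>
old_pattern = r"^sales_.*"                    # written against the dataset name only

# The old pattern no longer matches once the project name is part of the string.
print(re.match(old_pattern, fully_qualified_name))  # None
# The wildcard-prefixed form matches the dataset in any project.
print(bool(re.match(add_project_wildcard(old_pattern), fully_qualified_name)))  # True
# A manually converted pattern pins the project explicitly.
print(bool(re.match(r"^my-project\.sales_.*", fully_qualified_name)))  # True
```

The manually converted form is the more precise of the two, since the wildcard prefix accepts the dataset under every project.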

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.11.0...v0.12.0

v0.11.0

8 months ago

Release Highlights

Potential Downtime

This release introduces substantial improvements to search ranking which require reindexing indices.

During the reindexing:

  • a system-update job will set indices to read-only and create a backup/clone of each index
  • new components will be prevented from start-up until the reindex completes
  • Helm deployments will go into read-only mode and new ingestion runs will fail

This process can take anywhere from 5 minutes to multiple hours; as a rough estimate, please expect it to take 1 hour for every 2.3 million entities. After the reindex is complete, please check your ingestion run to re-run any that did not complete.
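
As a worked version of the rough estimate above (1 hour per 2.3 million entities), the Python sketch below converts an entity count into hours. The function name is hypothetical, and the result is only the linear rule of thumb from this note, not a measured figure.

```python
ENTITIES_PER_HOUR = 2_300_000  # rough rate quoted in the release note


def estimate_reindex_hours(entity_count: int) -> float:
    """Linear estimate of reindex duration in hours (rule of thumb only)."""
    if entity_count < 0:
        raise ValueError("entity_count must be non-negative")
    return entity_count / ENTITIES_PER_HOUR


# Example: a deployment with 4.6 million entities.
print(estimate_reindex_hours(4_600_000))  # 2.0
```

Small deployments will still take at least the few minutes mentioned above, so treat very small results as a lower bound.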

User Experience

New Search and Browse Experience

We have some really exciting improvements to the DataHub user experience in this release! The new search and browse experience, which was first made available in the previous release behind a feature flag, is now on by default. Check out our release notes for v0.10.5 to get more information and documentation on this new Browse experience.

In addition to the ranking changes mentioned above, this release changes how search results are highlighted so you can see why they match your query. You can also sort your results alphabetically or by last updated time, in addition to relevance. We also now suggest a correction if your query contains a typo.

Manage Home Page Posts

In this release we now enable you to create and delete pinned announcements on your DataHub homepage! If you have the “Manage Home Page Posts” platform privilege you’ll see a new section in settings called “Home Page Posts” where you can create and delete text posts and link posts that your users see on the home page.

OpenAPI Endpoints Expanded

OpenAPI entity and aspect endpoints have been expanded to improve the developer experience when using this API, with additional aspects to be added in the near future.

Metadata ingestion

Added support for the Confluent S3 Sink Connector, extracting stored procedures and jobs from MSSQL, and Snowflake shares. Additionally, the SQL parsing source now converts query logs into CLL and usage.

Developer Experience

The CLI now supports recursive deletes.

Versioned documentation

Starting from this release, we support versioned documentation on the datahub docs site! Select the version you’re on and browse docs specifically at that version.

Performance Improvements

  • Batching of default aspects on initial ingestion (SQL)
  • Improvements to multi-threading. Ingestion recipes, if previously reduced to 1 thread, can be restored to the 15 thread default.
  • Gradle 7 upgrade moderately improves build speed
  • DataHub Ingestion slim images reduced in size by 2GB+

Important Bug Fixes

  • Glue Schema Registry fixed

Deprecation Notice

  • MAE Events are no longer produced. MAE events have been deprecated for over a year.

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.5...v0.11.0

v0.10.5

9 months ago

Release Highlights

NEW: Unified Search and Browse Experience

It’s here, it’s here! We are incredibly excited to roll out our re-designed, streamlined Search and Browse experience. End-users now have a one-stop-shop to search for specific data entities and browse across systems, making it easier than ever to find the most relevant and meaningful resources within DataHub.

Check out the full walk-through in this video!

User Experience

  • Column-Level Lineage (CLL) visualization update: you can now visualize CLL relationships through DataJobs (i.e. Airflow DAGs)
  • Unique Glossary Terms: We now prevent creating duplicate Glossary Term names within a Term Group
  • Domains: You can now configure the Documentation tab to be the default landing page within a Domain
  • Formatting updates to Row Count to make large numbers more human-readable (e.g. 3283337 > 3.2M)
  • Stats Tab: Y-axis scale now dynamically set to reflect the minimum & maximum values, improving readability
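
The Row Count formatting in the list above can be illustrated with a short Python sketch. This is not the UI's implementation: the function name and the K/M/B suffixes are assumptions, and truncating (rather than rounding) to one decimal is inferred only from the 3283337 > 3.2M example.

```python
def abbreviate_count(n: int) -> str:
    """Abbreviate a non-negative count to one truncated decimal plus a suffix."""
    for threshold, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
        if n >= threshold:
            tenths = n * 10 // threshold  # integer math, so the decimal is truncated
            return f"{tenths // 10}.{tenths % 10}{suffix}"
    return str(n)  # small numbers are shown as-is


print(abbreviate_count(3283337))  # 3.2M
```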

Metadata ingestion

Ingestion Enhancements:

  • BigQuery: Set platform_instance using project_id
  • PowerBI: Ingest datasets not used in visualizations (tiles/pages)
  • Kafka Connect: Ability to set platform_instance
  • Nifi: Support for basic auth
  • Presto on Hive: Extract all table properties from Hive Metastore
  • Elasticsearch: Support for basic profiling
  • Add advanced configuration for LDAP manager ingestion

Lineage Improvements:

  • Schema-aware SQL parsing to derive column-level lineage
  • Column-level lineage support for BigQuery, Tableau, and Snowflake View definitions
  • Snowflake: Extract Snowpipe S3 lineage

Developer Experience

  • Fine-grained ownership policies
  • PATCH support for DataJob Inputs/Outputs
  • New endpoints to extract size of time-series indices and truncate/cleanup time-series indices in Elasticsearch; support for bulk-deletes
  • Initial support for exception reporting via Sentry
  • New OpenAPI endpoint to get Task Status
  • SDK: Easily generate container URNs

Docs

  • Improvements to our File-Based Lineage doc, specifically focused on Fine-Grained Lineage config components (link)
  • Code examples of how to manage Posts within DataHub (link)
  • Guide to generating custom browse paths for the new search experience (link)

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.4...v0.10.5

v0.10.4

11 months ago

Release Highlights

User Experience

  • You can now create and assign Custom Ownership types within DataHub; plus, we now display the owner type on an Entity Page

  • Various bug fixes to Column Level Lineage visualization

Metadata ingestion

  • You can now define column-level lineage (aka fine-grained lineage) via our file-based lineage source
  • Looker: Ingest Looks that are not part of a Dashboard
  • Glue: Error reporting now includes lineage failures
  • BigQuery: Now support deduplicating LogEntries based on insertId, timestamp, and logName
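
The BigQuery deduplication item above keys each log entry on insertId, timestamp, and logName. A minimal Python sketch of that idea follows, assuming each entry is a plain dictionary carrying those three keys. The function name and sample data are hypothetical, and this is not the connector's code.

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

LogKey = Tuple[Optional[str], Optional[str], Optional[str]]


def dedupe_log_entries(entries: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each entry once, keyed on (insertId, timestamp, logName)."""
    seen: Set[LogKey] = set()
    for entry in entries:
        key = (entry.get("insertId"), entry.get("timestamp"), entry.get("logName"))
        if key in seen:
            continue  # same entry delivered again, so skip it
        seen.add(key)
        yield entry


sample = [
    {"insertId": "a1", "timestamp": "2023-07-01T00:00:00Z", "logName": "audit"},
    {"insertId": "a1", "timestamp": "2023-07-01T00:00:00Z", "logName": "audit"},
    {"insertId": "a1", "timestamp": "2023-07-01T00:00:05Z", "logName": "audit"},
]
print(len(list(dedupe_log_entries(sample))))  # 2
```

The first occurrence of each key is kept, so the original order of entries is preserved.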

Docs

  • CSV Enricher: improvements to sample CSV and recipe
  • Guide for changing default DataHub credentials
  • Updated guide to apply time-based filters on Lineage

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.3...v0.10.4

v0.10.3

11 months ago

Release Highlights

User Experience

  • Define Data Products via YAML and manage associated entities within a Domain
  • Search experience: quickly apply a filter at time of search
  • Form-based PowerBI ingestion

Developer Experience

  • Progress toward removing the Confluent Schema Registry requirement -- Helm & Quickstart simplifications to follow
    • NOTE: this will only work for new deployments of DataHub; if you have already deployed DataHub with Confluent Schema Registry, you will not be able to disable it
  • Delete CLI - correctly handles deleting timeseries aspects
  • Ongoing improvements to Quickstart stability
  • Support entity types filter in get_urns_by_filter
  • Search customization
    • regex based query matching
    • full control over scoring functions (usable on any document field, e.g. tags, deprecated flags, etc.)
    • enable/disable fuzzy, prefix, exact match queries

Ingestion

  • BigQuery - Improve ingestion disk usage & speed; extract dataset usage from Views
  • Unity Catalog - Capture create/last modified timestamps; extract usage; data profiling support
  • PowerBI - Update workspace concept mapping; support modified_since, extract_dataset_schema, and more
  • Superset - Support stateful ingestion
  • Business Glossary - Simplify ingestion source
  • Kafka - Add description in dataset properties
  • S3 - Support stateful ingestion & last_updated
  • CSV Enricher - Support updating more types
  • PII Classification - Configurable sample size
  • Nifi - Support Kerberos authentication

Full Changelog: https://github.com/datahub-project/datahub/compare/v0.10.2...v0.10.3