delta-io/delta Versions

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs for multiple programming languages

v3.1.0

2 months ago

We are excited to announce the release of Delta Lake 3.1.0. This release includes several exciting new features.

Highlights

  • Delta-Spark: Support for merge with deletion vectors to reduce the write overhead of merge operations. This feature improves merge performance severalfold.
  • Delta-Spark: Support for optimizing min/max aggregation queries using table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
  • Delta-Spark: Support for querying tables shared through the Delta Sharing protocol.
  • Kernel: Support for data skipping for given query predicates to reduce the number of files read during a table scan.
  • Uniform: Enhanced Iceberg support for Delta tables, enabling MAP and LIST types, along with ease-of-use improvements for enabling Uniform on a Delta table.
  • Delta-Flink: Flink write job startup latency improvement using Kernel.

Details for each component follow.

Delta Spark

Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features of this release are:

  • Support for merge with deletion vectors to reduce the write overhead of merge operations. This feature improves merge performance severalfold. Refer to the documentation on deletion vectors for more information.
  • Support for optimizing min/max aggregation queries using table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
  • (Preview) Liquid clustering for better table layout. Delta now allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature. See the documentation and example for how to try out this feature.
  • Support for DEFAULT value columns. Delta supports defining default expressions for columns in Delta tables. Delta generates default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for such a column. See the documentation on how to enable this feature, and the sketch after this list.
  • Support for Hive Metastore schema sync. Adds a mechanism for syncing the table schema to HMS. External tools can now consume the schema directly from HMS instead of accessing it from the Delta table directory. See the documentation on how to enable this feature.
  • Auto compaction to address the small-files problem during table writes. Auto compaction, which runs at the end of a write query, combines small files within partitions into larger files to reduce metadata size and improve query performance. See the documentation for details on how to enable this feature.
  • Optimized write, an optimization that repartitions and rebalances data before writing it out to a Delta table. Optimized writes improve file sizes, reduce the small-files problem as data is written, and benefit subsequent reads on the table. See the documentation for details on how to enable this feature.
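
A minimal SQL sketch of a few of these features follows. The table name events is hypothetical, and the property names (delta.feature.allowColumnDefaults, delta.autoOptimize.autoCompact, delta.autoOptimize.optimizeWrite) reflect our reading of the Delta documentation, so treat this as a sketch rather than authoritative syntax.

CREATE TABLE events (id BIGINT, region STRING) USING DELTA
TBLPROPERTIES (
  'delta.feature.allowColumnDefaults' = 'supported',  -- allow DEFAULT expressions on columns
  'delta.autoOptimize.autoCompact' = 'true',          -- auto compaction at the end of writes
  'delta.autoOptimize.optimizeWrite' = 'true'         -- rebalance data before writing
);

-- define a default expression for an existing column
ALTER TABLE events ALTER COLUMN region SET DEFAULT 'unknown';

-- region is filled with 'unknown' when omitted or when DEFAULT is given explicitly
INSERT INTO events (id) VALUES (1);
INSERT INTO events (id, region) VALUES (2, DEFAULT);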

Other notable changes include:

  • Performance improvement by removing redundant jobs when performing DML operations with deletion vectors.
  • The update command now writes deletion vectors by default when the table has deletion vectors enabled; see the sketch after this list for enabling them.
  • Support for writing partition columns to data files.
  • Support for phase-out of the v2 checkpoint table feature.
  • Fix an issue with case-sensitive column names in Merge.
  • Make the VACUUM command Delta protocol-aware so that it only vacuums tables whose protocol it supports.
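
Both the merge and update improvements above apply to tables that have deletion vectors enabled. A minimal sketch of enabling them, assuming an existing Delta table named my_table (the property name follows the Delta documentation):

-- opt the table in to deletion vectors so DML writes them instead of rewriting files
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');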

Delta Sharing Spark

This release of Delta adds a new module called delta-sharing-spark, which enables reading Delta tables shared using the Delta Sharing protocol in Apache Spark™. It has been migrated from the https://github.com/delta-io/delta-sharing/tree/main/spark repository to https://github.com/delta-io/delta/tree/master/sharing. The last release of delta-sharing-spark from the previous location is 1.0.4; the next release ships with the current release of Delta, 3.1.0.

Supported read types are: reading a snapshot of the table, incrementally reading the table using streaming, and reading the changes (Change Data Feed) between two versions of the table.

“Delta Format Sharing” is newly introduced in delta-sharing-spark 3.1; it supports reading shared Delta tables with advanced Delta features such as deletion vectors and column mapping.

Below is an example of reading a Delta table shared using the Delta Sharing protocol in a Spark environment. For more examples refer to the documentation.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("...")
  .master("...")
  .config(
    "spark.sql.extensions",
    "io.delta.sql.DeltaSparkSessionExtension")
  .config(
    "spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"

// Batch query
spark.read
  .format("deltaSharing")
  .option("responseFormat", "delta")
  .load(tablePath)
  .show(10)

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) allows you to read Delta tables from Iceberg and Hudi (coming soon) clients. Delta 3.1.0 provides the following improvements:

  • Enhanced Iceberg support through IcebergCompatV2. IcebergCompatV2 adds support for LIST and MAP data types and improves compatibility with popular Iceberg reader clients.
  • Easier retrieval of the Iceberg metadata file location via the familiar SQL syntax DESCRIBE EXTENDED TABLE.
  • A new SQL command, REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)), to enable UniForm on existing Delta tables; see the sketch after this list. See the documentation for details.
  • Delta file statistics conversion to Iceberg, including max/min/rowCount/nullCount, which enables efficient data skipping when the tables are read as Iceberg in queries containing predicates.
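
A minimal sketch of upgrading an existing table and locating the Iceberg metadata, assuming a Delta table named my_table; the REORG syntax is as given above, and the DESCRIBE output field follows our reading of the UniForm documentation:

-- enable UniForm (IcebergCompatV2) on an existing Delta table
REORG TABLE my_table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2));

-- the extended output includes the location of the latest Iceberg metadata file
DESCRIBE EXTENDED my_table;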

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.

  • Delta 3.0.0 released the first version of Kernel. In this release, read support is further enhanced and the APIs are solidified, taking into account feedback from connectors that tried out the first version of Kernel in Delta 3.0.0.
  • Support for data skipping for given query predicates. Kernel can now prune the list of files to scan for a given query predicate using the file-level statistics stored in the Delta metadata. This helps connectors read less data than usual.
  • Improved Delta table reconstruction latency. Kernel can now load the initial protocol and metadata several times faster due to improved table state reconstruction.
  • Support for column mapping id mode. Tables with column mapping id mode can now be read by Kernel.
  • Support for slf4j logging.

For more information, refer to:

  • User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default TableClient API Java documentation

Delta Flink

Delta-Flink 3.1.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Flink write job startup latency improvement using Kernel. In this version, Flink has an option to use Kernel to load the Delta table metadata (i.e., the table schema), which helps reduce the startup time by up to 45x. To enable this, set io.delta.flink.kernel.enabled to true in the Hadoop configuration you pass when creating the Flink Sink.

Delta Standalone

There are no updates to Standalone in this release.

Credits

Ala Luszczak, Allison Portis, Ami Oka, Amogh Akshintala, Andreas Chatzistergiou, Bart Samwel, BjarkeTornager, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Dhruv Arya, EJ Song, Eric Maynard, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Haejoon Lee, Hao Jiang, Jared Wang, Jiaheng Tang, Jing Wang, Johan Lasperas, Kaiqi Jin, Kam Cheung Ting, Lars Kroll, Li Haoyi, Lin Zhou, Lukas Rupprecht, Mark Jarvin, Max Gekk, Ming DAI, Nick Lanham, Ole Sasse, Paddy Xu, Patrick Leahey, Peter Toth, Prakhar Jain, Renan Tomazoni Pinzon, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Thang Long Vu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wei Luo, Wenchen Fan, Xin Zhao, jintao shen, panbingkun

v3.1.0rc3

2 months ago

Delta Lake 3.1.0

We are excited to announce the preview release of Delta Lake 3.1.0. This release includes several exciting new features.

Delta Spark

Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features of this release are:

  • Support for merge with deletion vectors to reduce the write overhead of merge operations. This feature improves merge performance severalfold. Refer to the documentation on deletion vectors for more information.
  • Support for optimizing min/max aggregation queries using table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
  • (Experimental) Liquid clustering for better table layout. Delta now allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature. See the documentation and example for how to try out this feature.
  • Support for DEFAULT value columns. Delta supports defining default expressions for columns in Delta tables. Delta generates default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for such a column. See the documentation on how to enable this feature and try it out.
  • Support for Hive Metastore schema sync. Adds a mechanism for syncing the table schema to HMS. External tools can now consume the schema directly from HMS instead of accessing it from the Delta table directory. See the documentation on how to enable this feature.
  • Auto compaction to address the small-files problem during table writes. Auto compaction, which runs at the end of a write query, combines small files within partitions into larger files to reduce metadata size and improve query performance.
  • Optimized write, an optimization that repartitions and rebalances data before writing it out to a Delta table. Optimized writes improve file sizes, reduce the small-files problem as data is written, and benefit subsequent reads on the table.
  • Other notable changes include:
    • Performance improvement by removing redundant jobs when performing DML operations with deletion vectors.
    • The update command now writes deletion vectors by default when the table has deletion vectors enabled.
    • Support for writing partition columns to data files.
    • Support for phase-out of the v2 checkpoint table feature.
    • Fix an issue with case-sensitive column names in Merge.
    • Make the VACUUM command Delta protocol-aware so that it only vacuums tables whose protocol it supports.

Delta Sharing Spark

This release of Delta adds a new module called delta-sharing-spark, which enables reading Delta tables shared using the Delta Sharing protocol in Apache Spark™. It has been migrated from the https://github.com/delta-io/delta-sharing/tree/main/spark repository to https://github.com/delta-io/delta/tree/master/sharing. The last release of delta-sharing-spark from the previous location is 1.0.4; the next release ships with the current release of Delta, 3.1.0.

Supported read types are: reading a snapshot of the table, incrementally reading the table using streaming, and reading the changes (Change Data Feed) between two versions of the table.

“Delta Format Sharing” is newly introduced in delta-sharing-spark 3.1; it supports reading shared Delta tables with advanced Delta features such as deletion vectors and column mapping.

Below is an example of reading a Delta table shared using the Delta Sharing protocol in a Spark environment. For more examples refer to the documentation.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("...")
  .master("...")
  .config(
    "spark.sql.extensions",
    "io.delta.sql.DeltaSparkSessionExtension")
  .config(
    "spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"

// Batch query
spark.read
  .format("deltaSharing")
  .option("responseFormat", "delta")
  .load(tablePath)
  .show(10)

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) allows you to read Delta tables from Iceberg and Hudi (coming soon) clients. Delta 3.1.0 provides the following improvements:

  • Enhanced Iceberg support through IcebergCompatV2. IcebergCompatV2 adds support for LIST and MAP data types and improves compatibility with popular Iceberg reader clients.
  • Easier retrieval of the Iceberg metadata file location via the familiar SQL syntax DESCRIBE EXTENDED TABLE.
  • A new SQL command, REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)), to enable UniForm on existing Delta tables.
  • Delta file statistics conversion to Iceberg, including max/min/rowCount/nullCount, which enables efficient data skipping when the tables are read as Iceberg in queries containing predicates.

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.

Delta 3.0.0 released the first version of Kernel. In this release, read support is further enhanced and the APIs are solidified, taking into account feedback from connectors that tried out the first version of Kernel in Delta 3.0.0.

  • Support for data skipping for given query predicates. Kernel can now prune the list of files to scan for a given query predicate using the file-level statistics stored in the Delta metadata. This helps connectors read less data than usual.
  • Improved Delta table snapshot reconstruction latency. Kernel can now load the initial protocol and metadata a lot faster due to improved table state reconstruction.
  • Support for column mapping id mode. Tables with column mapping id mode can now be read by Kernel.
  • Misc. API changes and bug fixes.

For more information, refer to:

  • User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default TableClient API Java documentation

Delta Flink

Delta-Flink 3.1.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Flink write job startup latency improvement using Kernel. In this version, Flink has an option to use Kernel to load the Delta table metadata (i.e., the table schema), which helps reduce the startup time by up to 45x. To enable this, set io.delta.flink.kernel.enabled to true in the Hadoop configuration you pass when creating the Flink Sink.

Delta Standalone

There are no updates to Standalone in this release.

Credits

Ala Luszczak, Allison Portis, Ami Oka, Amogh Akshintala, Andreas Chatzistergiou, Bart Samwel, BjarkeTornager, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Dhruv Arya, EJ Song, Eric Maynard, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Haejoon Lee, Hao Jiang, Jared Wang, Jiaheng Tang, Jing Wang, Johan Lasperas, Kaiqi Jin, Kam Cheung Ting, Lars Kroll, Li Haoyi, Lin Zhou, Lukas Rupprecht, Mark Jarvin, Max Gekk, Ming DAI, Nick Lanham, Ole Sasse, Paddy Xu, Patrick Leahey, Peter Toth, Prakhar Jain, Renan Tomazoni Pinzon, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Thang Long Vu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wei Luo, Wenchen Fan, Xin Zhao, jintao shen, panbingkun

How to use the preview release

Delta-Spark

Download Spark 3.5.0 from https://spark.apache.org/downloads.html

For this preview, we have published the artifacts to a staging repository. Here’s how you can use them:

spark-submit

Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1133 to the command line arguments. Example:

spark-submit --packages io.delta:delta-spark_2.12:3.1.0 \
  --repositories https://oss.sonatype.org/content/repositories/iodelta-1133 \
  examples/examples.py

Currently, Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.1.0 by just providing the --packages io.delta:delta-spark_2.12:3.1.0 argument.

Spark-shell

bin/spark-shell --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1133 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Spark-SQL

bin/spark-sql --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1133 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Maven project

<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1133</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-spark_2.12</artifactId>
  <version>3.1.0</version>
</dependency>

SBT project

libraryDependencies += "io.delta" %% "delta-spark" % "3.1.0"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1133

Delta-spark on PyPI:

  • Download the two artifacts from the pre-release here. Artifacts to download are:
    • delta-spark-3.1.0.tar.gz
    • delta_spark-3.1.0-py3-none-any.whl
  • Keep them in one directory. Let's call it ~/Downloads.
  • pip install ~/Downloads/delta_spark-3.1.0-py3-none-any.whl
  • pip show delta-spark should show output similar to the below:
Name: delta-spark
Version: 3.1.0
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: [email protected]
License: Apache-2.0
Location: <user-home>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark

v3.1.0rc2

2 months ago

Delta Lake 3.1.0

We are excited to announce the preview release of Delta Lake 3.1.0. This release includes several exciting new features.

Delta Spark

Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features of this release are:

  • Support for merge with deletion vectors to reduce the write overhead of merge operations. This feature improves merge performance severalfold.
  • Support for optimizing min/max aggregation queries using table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
  • (Experimental) Liquid clustering for better table layout. Delta now allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature. See the documentation and example for how to try out this feature.
  • Support for DEFAULT value columns. Delta supports defining default expressions for columns in Delta tables. Delta generates default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for such a column. See the documentation on how to enable this feature and try it out.
  • Support for Hive Metastore schema sync. Adds a post-commit hook for syncing the table schema and properties to HMS (or HMS-compatible services such as AWS Glue) whenever they change. See the documentation on how to enable this feature.
  • Query Delta Sharing tables from Delta-Spark. Delta-Spark now allows querying Delta tables shared using the Delta Sharing protocol. Queries include batch queries, streaming queries, and CDF queries. Delta tables with deletion vectors or column mapping enabled can also be shared and read in Delta-Spark. See the documentation for further details.
  • Auto compaction to address the small-files problem during table writes. Auto compaction, which runs at the end of a write query, combines small files within partitions into larger files to reduce metadata size and improve query performance.
  • Optimized write, an optimization that repartitions and rebalances data before writing it out to a Delta table. Optimized writes improve file sizes as data is written and benefit subsequent reads on the table.
  • Other notable changes include:
    • Support for writing partition columns to data files.
    • Support for phase-out of the v2 checkpoint table feature.
    • Fix an issue with case-sensitive column names in Merge.
    • Fix an issue with the VACUUM command to make it Delta protocol compliant so that it garbage collects only files that are truly no longer needed.

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Delta 3.1.0 provides the following improvements:

  • Enhanced Iceberg support called IcebergCompatV2, which supports LIST and MAP types and also improves compatibility by writing timestamps as int64 per the Iceberg spec.
  • A new SQL command REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)) to upgrade existing Delta tables to UniForm.
  • Delta file statistics conversion to Iceberg, including max/min/rowCount/nullCount.

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.

Delta 3.0.0 released the first version of Kernel. In this release, read support is further enhanced and the APIs are solidified, taking into account feedback from connectors that tried out the first version of Kernel in Delta 3.0.0.

  • Support for data skipping for given query predicates. Kernel can now prune the list of files to scan for a given query predicate using the file-level statistics stored in the Delta metadata. This helps connectors read less data than usual.
  • Improved Delta table snapshot reconstruction latency. Kernel can now load the initial protocol and metadata a lot faster due to improved table state reconstruction.
  • Support for column mapping id mode. Tables with column mapping id mode can now be read by Kernel.
  • Misc. API changes and bug fixes.

For more information, refer to:

  • User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default TableClient API Java documentation

Delta Flink

Delta-Flink 3.1.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Flink write job startup latency improvement using Kernel. In this version, Flink has an option to use Kernel to load the Delta table metadata (i.e., the table schema), which helps reduce the startup time by up to 45x.

Delta Standalone

There are no updates to Standalone in this release.

Credits

Ala Luszczak, Allison Portis, Ami Oka, Amogh Akshintala, Andreas Chatzistergiou, Bart Samwel, BjarkeTornager, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Dhruv Arya, EJ Song, Eric Maynard, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Haejoon Lee, Hao Jiang, Jared Wang, Jiaheng Tang, Jing Wang, Johan Lasperas, Kaiqi Jin, Kam Cheung Ting, Lars Kroll, Li Haoyi, Lin Zhou, Lukas Rupprecht, Mark Jarvin, Max Gekk, Ming DAI, Nick Lanham, Ole Sasse, Paddy Xu, Patrick Leahey, Peter Toth, Prakhar Jain, Renan Tomazoni Pinzon, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Thang Long Vu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wei Luo, Wenchen Fan, Xin Zhao, ericm-db, jintao shen, panbingkun

How to use the preview release

Delta-Spark

Download Spark 3.5.0 from https://spark.apache.org/downloads.html

For this preview, we have published the artifacts to a staging repository. Here’s how you can use them:

spark-submit

Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1132 to the command line arguments. Example:

spark-submit --packages io.delta:delta-spark_2.12:3.1.0 \
  --repositories https://oss.sonatype.org/content/repositories/iodelta-1132 \
  examples/examples.py

Currently, Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.1.0 by just providing the --packages io.delta:delta-spark_2.12:3.1.0 argument.

Spark-shell

bin/spark-shell --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1132 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Spark-SQL

bin/spark-sql --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1132 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Maven project

<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1132</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-spark_2.12</artifactId>
  <version>3.1.0</version>
</dependency>

SBT project

libraryDependencies += "io.delta" %% "delta-spark" % "3.1.0"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1132

Delta-spark on PyPI:

  • Download the two artifacts from the pre-release here. Artifacts to download are:
    • delta-spark-3.1.0.tar.gz
    • delta_spark-3.1.0-py3-none-any.whl
  • Keep them in one directory. Let's call it ~/Downloads.
  • pip install ~/Downloads/delta_spark-3.1.0-py3-none-any.whl
  • pip show delta-spark should show output similar to the below:
Name: delta-spark
Version: 3.1.0
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: [email protected]
License: Apache-2.0
Location: <user-home>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark

v3.1.0rc1

3 months ago

Delta Lake 3.1.0

We are excited to announce the preview release of Delta Lake 3.1.0. This release includes several exciting new features.

Delta Spark

Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features of this release are:

  • Support for merge with deletion vectors to reduce the write overhead of merge operations. This feature improves merge performance severalfold.
  • Support for optimizing min/max aggregation queries using table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
  • (Experimental) Liquid clustering for better table layout. Delta now allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature. See the documentation and example for how to try out this feature.
  • Support for DEFAULT value columns. Delta supports defining default expressions for columns in Delta tables. Delta generates default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for such a column. See the documentation on how to enable this feature and try it out.
  • Support for Hive Metastore schema sync. Adds a post-commit hook for syncing the table schema and properties to HMS (or HMS-compatible services such as AWS Glue) whenever they change. See the documentation on how to enable this feature.
  • Query Delta Sharing tables from Delta-Spark. Delta-Spark now allows querying Delta tables shared using the Delta Sharing protocol. Queries include batch queries, streaming queries, and CDF queries. Delta tables with deletion vectors or column mapping enabled can also be shared and read in Delta-Spark. See the documentation for further details.
  • Auto compaction to address the small-files problem during table writes. Auto compaction, which runs at the end of a write query, combines small files within partitions into larger files to reduce metadata size and improve query performance.
  • Optimized write, an optimization that repartitions and rebalances data before writing it out to a Delta table. Optimized writes improve file sizes as data is written and benefit subsequent reads on the table.
  • Other notable changes include:
    • Support for writing partition columns to data files.
    • Support for phase-out of the v2 checkpoint table feature.
    • Fix an issue with case-sensitive column names in Merge.

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Delta 3.1.0 provides the following improvements:

  • Enhanced Iceberg support called IcebergCompatV2, which supports LIST and MAP types and also improves compatibility by writing timestamps as int64 per the Iceberg spec.
  • A new SQL command REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)) to upgrade existing Delta tables to UniForm.
  • Delta file statistics conversion to Iceberg, including max/min/rowCount/nullCount.

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.

Delta 3.0.0 released the first version of Kernel. In this release, read support is further enhanced and the APIs are solidified, taking into account feedback from connectors that tried out the first version of Kernel in Delta 3.0.0.

  • Support for data skipping for given query predicates. Kernel can now prune the list of files to scan for a given query predicate using the file-level statistics stored in the Delta metadata. This helps connectors read less data than usual.
  • Improved Delta table snapshot reconstruction latency. Kernel can now load the initial protocol and metadata a lot faster due to improved table state reconstruction.
  • Support for column mapping id mode. Tables with column mapping id mode can now be read by Kernel.
  • Misc. API changes and bug fixes.

For more information, refer to:

  • User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default TableClient API Java documentation

Delta Flink

Delta-Flink 3.1.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Flink write job startup latency improvement using Kernel. In this version, Flink has an option to use Kernel to load the Delta table metadata (i.e., the table schema), which helps reduce the startup time by up to 45x.

Delta Standalone

There are no updates to Standalone in this release.

Credits

Ala Luszczak, Allison Portis, Ami Oka, Amogh Akshintala, Andreas Chatzistergiou, Bart Samwel, BjarkeTornager, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Dhruv Arya, EJ Song, Eric Maynard, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Haejoon Lee, Hao Jiang, Jared Wang, Jiaheng Tang, Jing Wang, Johan Lasperas, Kaiqi Jin, Kam Cheung Ting, Lars Kroll, Li Haoyi, Lin Zhou, Lukas Rupprecht, Mark Jarvin, Max Gekk, Ming DAI, Nick Lanham, Ole Sasse, Paddy Xu, Patrick Leahey, Peter Toth, Prakhar Jain, Renan Tomazoni Pinzon, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Thang Long Vu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wei Luo, Wenchen Fan, Xin Zhao, ericm-db, jintao shen, panbingkun

How to use the preview release

Delta-Spark

Download Spark 3.5.0 from https://spark.apache.org/downloads.html

For this preview, we have published the artifacts to a staging repository. Here’s how you can use them:

spark-submit

Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1129 to the command line arguments. Example:

spark-submit --packages io.delta:delta-spark_2.12:3.1.0 \
  --repositories https://oss.sonatype.org/content/repositories/iodelta-1129 \
  examples/examples.py

Currently, Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.1.0 by just providing the --packages io.delta:delta-spark_2.12:3.1.0 argument.

Spark-shell

bin/spark-shell --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1129 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Spark-SQL

bin/spark-sql --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1129 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Maven project

<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1129</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-spark_2.12</artifactId>
  <version>3.1.0</version>
</dependency>

SBT project

libraryDependencies += "io.delta" %% "delta-spark" % "3.1.0"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1129

Delta-spark on PyPI:

  • Download the two artifacts from the pre-release here. Artifacts to download are:
    • delta-spark-3.1.0.tar.gz
    • delta_spark-3.1.0-py3-none-any.whl
  • Keep them in one directory. Let's call it ~/Downloads.
  • pip install ~/Downloads/delta_spark-3.1.0-py3-none-any.whl
  • pip show delta-spark should show output similar to the below:
Name: delta-spark
Version: 3.1.0
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: [email protected]
License: Apache-2.0
Location: <user-home>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark

v3.0.0

6 months ago

We are excited to announce the final release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.

Highlights

Here are the most important aspects of 3.0.0:

Spark 3.5 Support

Unlike the initial preview release, Delta Spark is now built on top of Apache Spark™ 3.5. See the Delta Spark section below for more details.

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) allows you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this release. UniForm takes advantage of the fact that all table storage formats, such as Delta, Iceberg, and Hudi, actually consist of Parquet data files and a metadata layer. In this release, UniForm automatically generates Iceberg metadata and commits it to the Hive Metastore, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:

CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg');

Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details, and the key implementations here and here.

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.

You can use this library to do the following:

  • Read data from Delta tables in a single thread in a single process.
  • Read data from Delta tables using multiple threads in a single process.
  • Build a complex connector for a distributed processing engine and read very large Delta tables.
  • [soon!] Write to Delta tables from multiple threads / processes / distributed engines.

Reading a Delta table with the Kernel APIs looks like this:

TableClient myTableClient = DefaultTableClient.create();           // define a client
Table myTable = Table.forPath(myTableClient, "/delta/table/path"); // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient);    // define which version of table to scan
Predicate scanFilter = ...                                         // define the predicate
Scan myScan = mySnapshot.getScanBuilder(myTableClient)             // specify the scan details
        .withFilters(scanFilter)
        .build();
Scan.readData(...)                                                 // returns the table data 

Full example code can be found here.

For more information, refer to:

  • User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default TableClient API Java documentation

This release of Delta contains the Kernel Table API and the default TableClient API definitions and implementation, which allow:

  • Reading Delta tables with Deletion Vectors or column mapping (name mode only) optionally enabled.
  • Partition pruning optimization to reduce the number of data files to read.

Welcome Delta Connectors to the Delta repository!

All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.

Delta Spark

Delta Spark 3.0.0 is built on top of Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark maven artifact has been renamed from delta-core to delta-spark.

The key features of this release are:

  • Delta Universal Format (UniForm). Write as Delta, read as Iceberg! See the highlighted section above.
  • Up to 2x faster MERGE operation. MERGE now better leverages data skipping, the ability to use the insert-only code path in more cases, and overall improved execution to achieve up to 2x better performance in various scenarios.
  • Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file path canonicalization logic by avoiding expensive Path.toUri.toString calls for each row in a table, resulting in a several-hundred-percent speedup of DELETE operations (only when Deletion Vectors have been enabled on the table).
  • Support for streaming reads from column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. This includes streaming support for Change Data Feed. See the documentation here for more details.
  • Support for specifying the columns for which Delta will collect file-skipping statistics via the table property delta.dataSkippingStatsColumns; see the sketch at the end of this section. Previously, Delta would only collect file-skipping statistics for the first N columns in the table schema (32 by default). Now users can easily customize this.
  • Support for zero-copy convert to Delta from Iceberg tables using CONVERT TO DELTA. This command generates a Delta table in the same location and does not rewrite any parquet files.

Other notable changes include:

  • Fix for a bug in MERGE statements that contain a scalar subquery with non-deterministic results. Such a subquery can return different results during source materialization, while finding matches, and while writing modified rows, which can cause rows to be either dropped or duplicated.
  • Fix for a potential resource leak when the DV file is not found during Parquet read.
  • Support for protocol version downgrade.
  • Fix to the initial preview release to support converting null partition values in UniForm.
  • Fix to the WRITE command to not commit empty transactions, just as the DELETE, UPDATE, and MERGE commands already do.
  • Support for 3-part table name identifiers. Now commands like OPTIMIZE <catalog>.<db>.<tbl> will work.
  • Performance improvement to CDF read queries scanning in batch to reduce the number of cloud requests and to reduce Spark scheduler pressure.
  • Fix for an edge case in CDF read query optimization due to an incorrect statistic value.
  • Fix for an edge case in streaming reads where having the same file with different DVs in the same batch would yield incorrect results, as the wrong file and DV pair would be read.
  • Prevent table corruption by disallowing overwriteSchema when partitionOverwriteMode is set to dynamic.
  • Fix a bug where DELETE with DVs would not work on column mapping-enabled tables.
  • Support automatic schema evolution in structs that are inside maps.
  • Minor fix to Delta table path URI concatenation.
  • Support for writing parquet data files to the data subdirectory via the SQL configuration spark.databricks.delta.write.dataFilesToSubdir. This is used to add UniForm support on BigQuery.
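
As an illustration of the file-skipping statistics feature above, a minimal SQL sketch; the table my_table and its columns event_time and region are hypothetical, and the property name follows our reading of the Delta documentation:

-- collect file-skipping statistics only for the listed columns
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'event_time,region');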

Delta Flink

Delta-Flink 3.0.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Support for Flink SQL and Catalog. You can now use the Flink/Delta connector for Flink SQL jobs. You can CREATE Delta tables, SELECT data from them (uses the Delta Source), and INSERT new data into them (uses the Delta Sink). Note: for correct operation on Delta tables, you must first configure the Delta Catalog using CREATE CATALOG before running a SQL command on Delta tables; see the sketch after this list. For more information, please see the documentation here.
  • Significant performance improvement to Global Committer initialization - The last-successfully-committed delta version by a given Flink application is now loaded lazily significantly reducing the CPU utilization in the most common scenarios.
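
A minimal Flink SQL sketch of the catalog-first workflow described above; the catalog and table names are hypothetical, and the WITH options follow our reading of the Flink/Delta connector documentation:

-- the Delta Catalog must be created and selected first
CREATE CATALOG delta_catalog WITH (
  'type' = 'delta-catalog',
  'catalog-type' = 'in-memory'  -- backing metastore for the Delta Catalog
);
USE CATALOG delta_catalog;

CREATE TABLE events (id BIGINT, data STRING) WITH (
  'connector' = 'delta',
  'table-path' = '/tmp/delta/events'
);

INSERT INTO events VALUES (1, 'a');  -- uses the Delta Sink
SELECT * FROM events;                -- uses the Delta Source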

Other notable changes include:

  • Fix a bug where Flink STRING types were incorrectly truncated to type VARCHAR with length 1

Delta Standalone

The key features in this release are:

  • Support for disabling Delta checkpointing during commits - For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the Hadoop configuration property io.delta.standalone.checkpointing.enabled to false. This is safe and suggested only if another job will periodically perform the checkpointing.
  • Performance improvement to snapshot initialization - When a delta table is loaded at a particular version, the snapshot must contain, at a minimum, the latest protocol and metadata. This PR improves the snapshot load performance for repeated table changes.
  • Support adding absolute paths to the Delta log - This now enables users to manually perform SHALLOW CLONEs and create Delta tables with external files.
  • Fix in schema evolution to prevent adding non-nullable columns to existing Delta tables

Credits

Adam Binford, Ahir Reddy, Ala Luszczak, Alex, Allen Reese, Allison Portis, Ami Oka, Andreas Chatzistergiou, Animesh Kashyap, Anonymous, Antoine Amend, Bart Samwel, Bo Gao, Boyang Jerry Peng, Burak Yavuz, CabbageCollector, Carmen Kwan, ChengJi-db, Christopher Watford, Christos Stavrakakis, Costas Zarifis, Denny Lee, Desmond Cheong, Dhruv Arya, Eric Maynard, Eric Ogren, Felipe Pessoto, Feng Zhu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Gopi Krishna Madabhushi, Grzegorz Kołakowski, Hang Jia, Hao Jiang, Herivelton Andreassa, Herman van Hovell, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jiaheng Tang, Jiawei Bao, Jing Wang, Johan Lasperas, Jonas Irgens Kylling, Jungtaek Lim, Junyong Lee, K.I. (Dennis) Jung, Kam Cheung Ting, Krzysztof Chmielewski, Lars Kroll, Lin Ma, Lin Zhou, Luca Menichetti, Lukas Rupprecht, Martin Grund, Min Yang, Ming DAI, Mohamed Zait, Neil Ramaswamy, Ole Sasse, Olivier NOUGUIER, Pablo Flores, Paddy Xu, Patrick Pichler, Paweł Kubit, Prakhar Jain, Pulkit Singhal, RunyaoChen, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Siying Dong, Son, Tathagata Das, Terry Kim, Tom van Bussel, Venki Korukanti, Wenchen Fan, Xinyi, Yann Byron, Yaohua Zhao, Yijia Cui, Yuhong Chen, Yuming Wang, Yuya Ebihara, Zhen Li, aokolnychyi, gurunath, jintao shen, maryannxue, noelo, panbingkun, windpiger, wwang-talend, sherlockbeard

v3.0.0rc1

9 months ago

We are excited to announce the preview release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.

Highlights

Here are the most important aspects of 3.0.0.

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this preview and Hudi will be coming soon. UniForm takes advantage of the fact that all table storage formats (Delta, Iceberg, and Hudi) actually consist of Parquet data files and a metadata layer. In this release, UniForm automatically generates Iceberg metadata, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:

CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg');

Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details.

Delta Kernel

The Delta Kernel project is a set of Java libraries (Rust will be coming soon) for building Delta connectors that can read (and soon, write to) Delta tables without the need to understand the Delta protocol details.

You can use this library to do the following:

  • Read data from small Delta tables in a single thread in a single process.
  • Read data from large Delta tables using multiple threads in a single process.
  • Build a complex connector for a distributed processing engine and read very large Delta tables.
  • [soon!] Write to Delta tables from multiple threads / processes / distributed engines.

Here is an example of a simple table scan with a filter:

TableClient myTableClient = DefaultTableClient.create();         // define a client (more details below)
Table myTable = Table.forPath("/delta/table/path");              // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient);  // define which version of the table to scan
Predicate scanFilter = ...                                       // define the predicate
Scan myScan = mySnapshot.getScanBuilder(myTableClient)           // specify the scan details
        .withFilters(scanFilter)
        .build();
Scan.readData(...)                                               // returns the table data

For more information, refer to Delta Kernel Github docs.

Delta Connectors: welcome to the Delta repository!

All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.

Delta Spark

Delta Spark 3.0.0 is built on top of Apache Spark™ 3.4. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark maven artifact has been renamed from delta-core to delta-spark.

The key features of this release are:

  • Delta Universal Format. Write as Delta, read as Iceberg! See the highlighted section above.
  • Up to 2x faster MERGE operation. MERGE now better leverages data skipping, the ability to use the insert-only code path in more cases, and overall improved execution to achieve up to 2x better performance in various scenarios.
  • Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file path canonicalization logic by avoiding expensive Path.toUri.toString calls for each row in a table, resulting in a several-hundred-percent speedup of DELETE operations (only when Deletion Vectors have been enabled on the table).
  • Support streaming reads from column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. This includes streaming support for Change Data Feed. See the documentation here for more details.
  • Support specifying the columns for which Delta will collect file-skipping statistics via the table property delta.dataSkippingStatsColumns. Previously, Delta would only collect file-skipping statistics for the first N columns in the table schema (32 by default). Now, users can easily customize this.
  • Support zero-copy convert to Delta from Iceberg tables on Apache Spark 3.4 using CONVERT TO DELTA. This feature was excluded from the Delta Lake 2.4 release since Iceberg did not yet support Apache Spark 3.4. This command generates a Delta table in the same location and does not rewrite any parquet files.

Other notable changes include:

  • Minor fix to Delta table path URI concatenation
  • Support writing parquet data files to the data subdirectory via the SQL configuration spark.databricks.delta.write.dataFilesToSubdir. This is used to add UniForm support on BigQuery.

Delta Flink

Delta-Flink 3.0.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Support for Flink SQL and Catalog. You can now use the Flink/Delta connector for Flink SQL jobs. You can CREATE Delta tables, SELECT data from them (uses the Delta Source), and INSERT new data into them (uses the Delta Sink). Note: for correct operations on the Delta tables, you must first configure the Delta Catalog using CREATE CATALOG before running a SQL command on Delta tables. For more information, please see the documentation here.
  • Significant performance improvement to Global Committer initialization. The last-successfully-committed delta version by a given Flink application is now loaded lazily, significantly reducing the CPU utilization in the most common scenarios.

Delta Standalone

The key features in this release are:

  • Support for disabling Delta checkpointing during commits. For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the Hadoop configuration property io.delta.standalone.checkpointing.enabled to false. This is safe and suggested only if another job will periodically perform the checkpointing.
  • Performance improvement to snapshot initialization. When a delta table is loaded at a particular version, the snapshot must contain, at a minimum, the latest protocol and metadata. This PR improves the snapshot load performance for repeated table changes.
  • Support adding absolute paths to the Delta log. This now enables users to manually perform SHALLOW CLONEs and create Delta tables with external files.
  • Fix in schema evolution to prevent adding non-nullable columns to existing Delta tables
  • Dropped support for Scala 2.11. Due to lack of community demand and a very low number of downloads, we have dropped Scala 2.11 support.

Liquid Clustering

Liquid Clustering is a new effort to revamp how clustering works in Delta, addressing the shortcomings of Hive-style partitioning and the current ZORDER clustering. This feature will be available to preview soon; meanwhile, for more information, please refer to Liquid Clustering #1874.

Credits

Ahir Reddy, Ala Luszczak, Alex, Allen Reese, Allison Portis, Antoine Amend, Bart Samwel, Boyang Jerry Peng, CabbageCollector, Carmen Kwan, Christos Stavrakakis, Denny Lee, Desmond Cheong, Eric Ogren, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gerhard Brueckl, Gopi Krishna Madabhushi, Grzegorz Kołakowski, Herivelton Andreassa, Jackie Zhang, Jiaheng Tang, Johan Lasperas, Junyong Lee, K.I. (Dennis) Jung, Kam Cheung Ting, Krzysztof Chmielewski, Lars Kroll, Lin Ma, Luca Menichetti, Lukas Rupprecht, Ming DAI, Mohamed Zait, Ole Sasse, Olivier Nouguier, Pablo Flores, Paddy Xu, Patrick Pichler, Paweł Kubit, Prakhar Jain, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Siying Dong, Son, Tathagata Das, Terry Kim, Tom van Bussel, Venki Korukanti, Wenchen Fan, Yann Byron, Yaohua Zhao, Yuhong Chen, Yuming Wang, Yuya Ebihara, aokolnychyi, gurunath, jintao shen, maryannxue, noelo, panbingkun, windpiger, wwang-talend

v2.4.0

10 months ago

We are excited to announce the release of Delta Lake 2.4.0 on Apache Spark 3.4. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features in this release are as follows:

  • Support for Apache Spark 3.4.
  • Support writing Deletion Vectors for the DELETE command. Previously, when deleting rows from a Delta table, any file with at least one matching row would be rewritten. With Deletion Vectors these expensive rewrites can be avoided. See What are deletion vectors? for more details.
  • Support for all write operations on tables with Deletion Vectors enabled.
  • Support PURGE to remove Deletion Vectors from the current version of a Delta table by rewriting any data files with deletion vectors. See the documentation for more details, and the sketch after this list.
  • Support reading Change Data Feed for tables with Deletion Vectors enabled.
  • Support REPLACE WHERE expressions in SQL to selectively overwrite data; see the sketch after this list. Previously, “replaceWhere” options were only supported in the DataFrameWriter APIs.
  • Support WHEN NOT MATCHED BY SOURCE clauses in SQL for the Merge command.
  • Support omitting generated columns from the column list for SQL INSERT INTO queries. Delta will automatically generate the values for any unspecified generated columns.
  • Support the TimestampNTZ data type added in Spark 3.3. Using TimestampNTZ requires a Delta protocol upgrade; see the documentation for more information.
  • Other notable changes
    • Increased resiliency for S3 multi-cluster reads and writes.
      • Use a per-JVM lock to minimize the number of concurrent recovery attempts. Concurrent recoveries may cause concurrent readers to see a RemoteFileChangedException.
      • Catch any RemoteFileChangedException in the reader and retry reading.
    • Allow changing the column type of a char or varchar column to a compatible type in the ALTER TABLE command. The new behavior is the same as in Apache Spark and allows upcasting from char or varchar to varchar or string.
    • Block using overwriteSchema with dynamic partition overwrite. This can corrupt the table as not all the data may be removed, and the schema of the newly written partitions may not match the schema of the unchanged partitions.
    • Return an empty DataFrame for Change Data Feed reads when there are no commits within the timestamp range provided. Previously an error would be thrown.
    • Fix a bug in Change Data Feed reads for records created during the ambiguous hour when daylight savings occurs.
    • Fix a bug where querying an external Delta table at the root of an S3 bucket would throw an error.
    • Remove leaked internal Spark metadata from the Delta log to make any affected tables readable again.
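
A minimal SQL sketch of PURGE and REPLACE WHERE; the table names events and staged_updates and the event_date column are hypothetical, and the statements follow our reading of the Delta 2.4 documentation:

-- rewrite data files that carry deletion vectors, removing the vectors from the current version
REORG TABLE events APPLY (PURGE);

-- selectively overwrite only the rows matching the predicate
INSERT INTO events REPLACE WHERE event_date >= '2023-01-01'
SELECT * FROM staged_updates;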

Note: the Delta Lake 2.4.0 release does not include the Iceberg to Delta converter because iceberg-spark-runtime does not support Spark 3.4 yet. The Iceberg to Delta converter is still supported when using Delta 2.3 with Spark 3.3.

Credits

Alkis Evlogimenos, Allison Portis, Andreas Chatzistergiou, Anton Okolnychyi, Bart Samwel, Bo Gao, Carl Fu, Chaoqin Li, Christos Stavrakakis, David Lewis, Desmond Cheong, Dhruv Shah, Eric Maynard, Fred Liu, Fredrik Klauss, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jintian Liang, Johan Lasperas, Lars Kroll, Lukas Rupprecht, Matthew Powers, Ming DAI, Ming Dai, Naga Raju Bhanoori, Paddy Xu, Prakhar Jain, Rahul Shivu Mahadev, Rui Wang, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wenchen Fan, Xi Liang, Yaohua Zhao, Yuming Wang

v2.3.0

1 year ago

We are excited to announce the release of Delta Lake 2.3.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features in this release are as follows:

  • Zero-copy convert to Delta from Iceberg tables using CONVERT TO DELTA. This generates a Delta table in the same location and does not rewrite any parquet files. See the documentation for details.
  • Support SHALLOW CLONE for Delta, Parquet, and Iceberg tables to clone a source table without copying the data files. SHALLOW CLONE creates a copy of the source table’s definition but refers to the source table’s data files.
  • Support idempotent writes for DML operations. This feature adds idempotency to INSERT/DELETE/UPDATE/MERGE etc. operations using SQL configurations spark.databricks.delta.write.txnAppId and spark.databricks.delta.write.txnVersion.
  • Support “when not matched by source” clauses for the Merge command to update or delete rows in the chosen table that don’t have matches in the source table based on the merge condition. This clause is supported in the Python, Scala, and Java DeltaTable APIs. SQL Support will be added in Spark 3.4.
  • Support CREATE TABLE LIKE to create empty Delta tables using the definition and metadata of an existing table or view.
  • Support reading Change Data Feed (CDF) in SQL queries using the table_changes table-valued function.
  • Unblock Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. See the documentation for more details.
  • Improved read and write performance on S3 when writing from a single cluster. Efficient file listing decreases the metadata processing time when calculating a table snapshot. This is most impactful for tables with many commits. Set the Hadoop configuration delta.enableFastS3AListFrom to true to enable it.
  • Record VACUUM operations in the transaction log. With this feature, VACUUM operations and their associated metrics (e.g. numDeletedFiles) will now show up in table history.
  • Support reading Delta tables with deletion vectors.
  • Other notable changes
    • Support schema evolution in MERGE for UPDATE SET <assignments> and INSERT (...) VALUES (...) actions. Previously, schema evolution was only supported for UPDATE SET * and INSERT * actions.
    • Add .show() support for COUNT(*) aggregate pushdown.
    • Enforce idempotent writes for df.saveAsTable for overwrite and append mode.
    • Support Table Features to selectively add individual features when upgrading the table protocol version. This enables users to only add active features and will facilitate connectivity as downstream Delta connectors can selectively implement feature support.
    • Automatically generate partition filters for additional generation expressions.
    • Block protocol downgrades when replacing a Delta table to prevent any incorrect time-travel or CDF queries.
    • Fix replaceWhere with the DataFrame V2 overwrite API to correctly evaluate less than conditions.
    • Fix dynamic partition overwrite for tables with more than one partition data type.
    • Fix schema evolution for INSERT OVERWRITE with complex data types when the source schema is read incompatible.
    • Fix Delta streaming source to correctly detect read-incompatible schema changes during backfill when there is exactly one schema change in the versions read.
    • Fix a bug in VACUUM where sometimes the default retention period was used to remove files instead of the retention period specified in the table properties.
    • Include the table name in the DataFrame returned by the deltaTable.details() Python/Scala/Java API.
    • Improve the log message for VACUUM table_name DRY RUN.
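
A combined sketch of several of the features above, again assuming a Delta-enabled SparkSession named spark; the path /data/iceberg_table and the sales table names are hypothetical placeholders.

# Assumes a Delta-enabled SparkSession `spark`; paths and table names
# are hypothetical placeholders for this sketch.

# Zero-copy convert: writes Delta metadata in place, no parquet rewrite.
spark.sql("CONVERT TO DELTA iceberg.`/data/iceberg_table`")

# Shallow clone: copies the table definition but keeps referring to
# the source table's data files instead of copying them.
spark.sql("CREATE TABLE sales_clone SHALLOW CLONE sales")

# Empty table with the definition and metadata of an existing table.
spark.sql("CREATE TABLE sales_empty LIKE sales")

# Read the Change Data Feed between versions 1 and 10 in SQL.
spark.sql("SELECT * FROM table_changes('sales', 1, 10)").show()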

Credits

Allison Portis, Andreas Chatzistergiou, Andrew Li, Bo Zhang, Brayan Jules, Burak Yavuz, Christos Stavrakakis, Daniel Tenedorio, Dhruv Shah, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gengliang Wang, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jiaheng Tang, Jintian Liang, Johan Lasperas, Jungtaek Lim, Kam Cheung Ting, Koki Otsuka, Lars Kroll, Lin Ma, Lukas Rupprecht, Ming DAI, Mitchell Riley, Ole Sasse, Paddy Xu, Prakhar Jain, Pranav, Rahul Shivu Mahadev, Rajesh Parangi, Ryan Johnson, Scott Sandre, Serge Rielau, Shixiong Zhu, Slim Ouertani, Tobias Fabritz, Tom van Bussel, Tushar Machavolu, Tyson Condie, Venki Korukanti, Vitalii Li, Wenchen Fan, Xinyi Yu, Yaohua Zhao, Yingyi Bu

v2.3.0rc1

1 year ago

We are excited to announce the preview release of Delta Lake 2.3.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features in this release are as follows:

  • Zero-copy convert to Delta from Iceberg tables using CONVERT TO DELTA. This generates a Delta table in the same location and does not rewrite any parquet files.
  • Support SHALLOW CLONE for Delta, Parquet, and Iceberg tables to clone a source table without copying the data files. SHALLOW CLONE creates a copy of the source table’s definition but refers to the source table’s data files.
  • Support idempotent writes for DML operations. This feature adds idempotency to INSERT/DELETE/UPDATE/MERGE etc. operations using SQL configurations spark.databricks.delta.write.txnAppId and spark.databricks.delta.write.txnVersion.
  • Support “when not matched by source” clauses for the Merge command to update or delete rows in the chosen table that don’t have matches in the source table based on the merge condition. This clause is supported in the Python, Scala, and Java DeltaTable APIs. SQL Support will be added in Spark 3.4.
  • Support CREATE TABLE LIKE to create empty Delta tables using the definition and metadata of an existing table or view.
  • Support reading Change Data Feed (CDF) in SQL queries using the table_changes table-valued function.
  • Unblock Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used.
  • Improved read and write performance on S3 when writing from a single cluster. Efficient file listing decreases the metadata processing time when calculating a table snapshot. This is most impactful for tables with many commits. Set the Hadoop configuration delta.enableFastS3AListFrom to true to enable it.
  • Record VACUUM operations in the transaction log. With this feature, VACUUM operations and their associated metrics (e.g. numDeletedFiles) will now show up in table history.
  • Support reading Delta tables with deletion vectors.
  • Other notable changes
    • Support schema evolution in MERGE for UPDATE SET <assignments> and INSERT (...) VALUES (...) actions. Previously, schema evolution was only supported for UPDATE SET * and INSERT * actions.
    • Add .show() support for COUNT(*) aggregate pushdown.
    • Enforce idempotent writes for df.saveAsTable for overwrite and append mode.
    • Support Table Features to selectively add individual features when upgrading the table protocol version. This enables users to only add active features and will facilitate connectivity as downstream Delta connectors can selectively implement feature support.
    • Automatically generate partition filters for additional generation expressions.
    • Block protocol downgrades when replacing a Delta table to prevent any incorrect time-travel or CDF queries.
    • Fix replaceWhere with the DataFrame V2 overwrite API to correctly evaluate less than conditions.
    • Fix dynamic partition overwrite for tables with more than one partition data type.
    • Fix schema evolution for INSERT OVERWRITE with complex data types when the source schema is read incompatible.
    • Fix Delta streaming source to correctly detect read-incompatible schema changes during backfill when there is exactly one schema change in the versions read.
    • Fix a bug in VACUUM where sometimes the default retention period was used to remove files instead of the retention period specified in the table properties.
    • Include the table name in the DataFrame returned by the deltaTable.details() Python/Scala/Java API.
    • Improve the log message for VACUUM table_name DRY RUN.

How to use the preview release

For this preview we have published the artifacts to a staging repository. Here’s how you can use them:

  • spark-submit: Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1066/ to the command line arguments. For example:
    • spark-submit --packages io.delta:delta-core_2.12:2.3.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1066/ examples/examples.py
  • Currently Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 2.3.0rc1 by just providing the --packages io.delta:delta-core_2.12:2.3.0rc1 argument.
  • Maven project:
<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1066/</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>2.3.0rc1</version>
</dependency>
  • SBT project:
libraryDependencies += "io.delta" %% "delta-core" % "2.3.0rc1"
resolvers += "Delta" at  https://oss.sonatype.org/content/repositories/iodelta-1066/
  • Delta-spark:
pip install -i https://test.pypi.org/simple/ delta-spark==2.3.0rc1
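
Once the artifacts are available locally (via pip or the Ivy cache populated by spark-submit), a Delta-enabled session can be created as usual. A minimal PySpark sketch using the standard Delta session settings:

from pyspark.sql import SparkSession

# Standard Delta Lake session configuration; assumes the 2.3.0rc1
# artifacts are already available locally (pip or Ivy cache).
spark = (
    SparkSession.builder
    .appName("delta-2.3.0rc1-test")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)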

Credits

Allison Portis, Andreas Chatzistergiou, Andrew Li, Bo Zhang, Brayan Jules, Burak Yavuz, Christos Stavrakakis, Daniel Tenedorio, Dhruv Shah, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gengliang Wang, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jiaheng Tang, Jintian Liang, Johan Lasperas, Jungtaek Lim, Kam Cheung Ting, Koki Otsuka, Lars Kroll, Lin Ma, Lukas Rupprecht, Ming DAI, Mitchell Riley, Ole Sasse, Paddy Xu, Prakhar Jain, Pranav, Rahul Shivu Mahadev, Rajesh Parangi, Ryan Johnson, Scott Sandre, Serge Rielau, Shixiong Zhu, Slim Ouertani, Tobias Fabritz, Tom van Bussel, Tushar Machavolu, Tyson Condie, Venki Korukanti, Vitalii Li, Wenchen Fan, Xinyi Yu, Yaohua Zhao, Yingyi Bu

v2.0.2

1 year ago

We are excited to announce the release of Delta Lake 2.0.2 on Apache Spark 3.2. This release contains important bug fixes and a few high-demand usability improvements over 2.0.1 and it is recommended that users update to 2.0.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

This release includes the following bug fixes and improvements:

  • Record VACUUM operations in the transaction log. With this feature, VACUUM operations and their associated metrics (e.g. numDeletedFiles) will now show up in table history.
  • Support idempotent writes for DML operations. This feature adds idempotency to INSERT/DELETE/UPDATE/MERGE etc. operations using the SQL configurations spark.databricks.delta.write.txnAppId and spark.databricks.delta.write.txnVersion (see the sketch after this list).
  • Support passing Hadoop configurations via the DeltaTable API:
    from delta.tables import DeltaTable

    # Hadoop configurations applied when accessing this table's files,
    # instead of relying on the cluster-wide session configuration
    hadoop_config = {
      "fs.azure.account.auth.type": "OAuth",
      "fs.azure.account.oauth.provider.type": "...",
      "fs.azure.account.oauth2.client.id": "...",
      "fs.azure.account.oauth2.client.secret": "...",
      "fs.azure.account.oauth2.client.endpoint": "..."
    }
    # <table-path> is a placeholder for the path to the Delta table
    delta_table = DeltaTable.forPath(spark, <table-path>, hadoop_config)
    
  • Minor convenience improvement to the DeltaTableBuilder:executeZOrderBy Java API which allows users to pass in varargs instead of a List.
  • Fail fast on malformed delta log JSON entries. Previously, Delta queries could return inaccurate results whenever JSON commits in the _delta_log were malformed. For example, an add action with a missing } would be skipped. Now, queries will fail fast, preventing inaccurate results.
  • Fix “Could not find active SparkSession” bug by passing in the SparkSession when resolving tables in the DeltaTableBuilder.
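
As a sketch of the idempotent-write configurations above; the application id, version number, and table names are hypothetical. Writes that retry with an already-committed (txnAppId, txnVersion) pair are skipped rather than applied twice.

# Assumes a Delta-enabled SparkSession `spark`; the app id, version,
# and table names are hypothetical. Re-running this insert with the
# same (txnAppId, txnVersion) pair is a no-op instead of a duplicate.
spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-etl")
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")
spark.sql("INSERT INTO target_table SELECT * FROM staging_table")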

Credits: Helge Brügner, Jiaheng Tang, Mitchell Riley, Ryan Johnson, Scott Sandre, Venki Korukanti, Jintao Shen, Yann Byron