An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
We are excited to announce the release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
The key features in this release are as follows:
LIMIT
pushdown into Delta scan. Improve the performance of queries containing LIMIT
clauses by pushing down the LIMIT
into Delta scan during query planning. Delta scan uses the LIMIT
and the file-level row counts to reduce the number of files scanned which helps the queries read far less number of files and could make LIMIT
queries faster by 10-100x depending upon the table size.
Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*)
on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time as the query just needs to read the table metadata and could make full table count queries faster by 10-100x.
Support for collecting file level statistics as part of the CONVERT TO DELTA command. These statistics potentially help speed up queries on the Delta table. By default the statistics are collected now as part of the CONVERT TO DELTA command. In order to disable statistics collection specify NO STATISTICS
clause in the command. Example: CONVERT TO DELTA table_name NO STATISTICS
Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.
Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp which was used by DynamoDB’s TTL feature to cleanup expired items. This timestamp value has been fixed and the table attribute renamed from commitTime
to expireTime
. If you already have TTL enabled, please follow the migration steps here.
Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.
Remove the restrictions for using Delta tables with column mapping in certain Streaming + CDF cases. Earlier we used to block Streaming+CDF if the Delta table has column mapping enabled even though it doesn’t contain any RENAME or DROP columns.
Other notable changes
where()
calls in Optimize scala/python API.
or _
in CONVERT TO DELTA command.MERGE INTO
when there are multiple UPDATE
clauses and one of the UPDATEs is with a schema evolution.SparkSession
object is not found when using Delta APIslast_checkpoint
file fails.AvailableNow
trigger on a Delta table.Credits Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña ,Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)
We are excited to announce the preview release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
The key features in this release are as follows:
LIMIT
pushdown into Delta scan. Improve the performance of queries containing LIMIT
clauses by pushing down the LIMIT
into Delta scan during query planning. Delta scan uses the LIMIT
and the file-level row counts to reduce the number of files scanned which helps the queries read far less number of files and could make LIMIT
queries faster by 10-100x depending upon the table size.
Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*)
on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time as the query just needs to read the table metadata and could make full table count queries faster by 10-100x.
Support for collecting file level statistics as part of the CONVERT TO DELTA command. These statistics potentially help speed up queries on the Delta table. By default the statistics are collected now as part of the CONVERT TO DELTA command. In order to disable statistics collection specify NO STATISTICS
clause in the command. Example: CONVERT TO DELTA table_name NO STATISTICS
Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.
Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp which was used by DynamoDB’s TTL feature to cleanup expired items. This timestamp value has been fixed and the table attribute renamed from commitTime
to expireTime
. If you already have TTL enabled, please follow the migration steps here.
Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.
Remove the restrictions for using Delta tables with column mapping in certain Streaming + CDF cases. Earlier we used to block Streaming+CDF if the Delta table has column mapping enabled even though it doesn’t contain any RENAME or DROP columns.
Other notable changes
where()
calls in Optimize scala/python API.
or _
in CONVERT TO DELTA command.MERGE INTO
when there are multiple UPDATE
clauses and one of the UPDATEs is with a schema evolution.SparkSession
object is not found when using Delta APIslast_checkpoint
file fails.AvailableNow
trigger on a Delta table.How to use the preview release For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
spark-submit --packages io.delta:delta-core_2.12:2.2.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1102/ examples/examples.py
2.2.0rc1
by just providing the --packages io.delta:delta-core_2.12:2.2.0rc1
argument.<repositories>
<repository>
<id>staging-repo</id>
<url> https://oss.sonatype.org/content/repositories/iodelta-1102/</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>2.2.0rc1</version>
</dependency>
libraryDependencies += "io.delta" %% "delta-core" % "2.2.0rc1"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1102/
pip install -i https://test.pypi.org/simple/ delta-spark==2.2.0rc1
Credits Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña ,Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)
We are excited to announce the release of Delta Lake 2.0.1 on Apache Spark 3.2. This release contains important bug fixes to 2.0.0 and it is recommended that users update to 2.0.1. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
This release includes the following bug fixes and improvements:
commitTime
to expireTime
. If you already have TTL enabled, please follow the migration steps here.Credits Adam Binford, Allison Portis, Chen Shuai, Lars Kroll, Scott Sandre, Shixiong Zhu, Venki Korukanti
We are excited to announce the release of Delta Lake 2.1.1 on Apache Spark 3.3. This release contains important bug fixes to 2.1.0 and it is recommended that users update to 2.1.1. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
This release includes the following bug fixes and improvements:
commitTime
to expireTime
. If you already have TTL enabled, please follow the migration steps here.Credits Adam Binford, Allison Portis, Chen Shuai, Felipe Pessoto, Lars Kroll, Scott Sandre, Shixiong Zhu, Venki Korukanti
We are excited to announce the preview release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
repartition(1)
instead of coalesce(1)
in Optimize for better performance when merging many small files.DeltaTableBuilder
to preserve table property case of non-delta properties when setting properties.replaceWhere
option.Improvements to the benchmark framework (initial version added in version 1.2.0) including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
–-repositories https://oss.sonatype.org/content/repositories/iodelta-1087/
to the command line arguments. For example:spark-submit --packages io.delta:delta-core_2.12:2.1.0rc1 –-repositories https://oss.sonatype.org/content/repositories/iodelta-1087/ examples/examples.py
<repositories>
<repository>
<id>staging-repo</id>
<url> https://oss.sonatype.org/content/repositories/iodelta-1087/</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>2.1.0rc1</version>
</dependency>
libraryDependencies += "io.delta" %% "delta-core" % "2.1.0rc1"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1087/
pip install -i https://test.pypi.org/simple/ delta-spark==2.1.0rc1
Adam Binford, Allison Portis, Andreas Chatzistergiou, Andrew Vine, Andy Lam, Chang Yong Lik, Christos Stavrakakis, David Lewis, Denis Krivenko, Denny Lee, EJ Song, Edmondo Porcu, Felipe Pessoto, Fred Liu, Fu Chen, Grzegorz Kołakowski, Hedi Bejaoui, Hussein Nagree, Ionut Boicu, Ivan Sadikov, Jackie Zhang, Jiawei Bao, Jintao Shen, Jintian Liang, Jonas Irgens Kylling, Juliusz Sompolski, Junlin Zeng, KaiFei Yi, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Lin Zhou, Lukas Rupprecht, Max Gekk, Min Yang, Ming DAI, Nick, Ole Sasse, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Terry Kim, Thomas Newton, Tom van Bussel, Tyson Condie, Venki Korukanti, Vini Jaiswal, Will Jones, Xi Liang, Yijia Cui, Yousry Mohamed, Zach Schuermann, sherlockbeard, yikf
We are excited to announce the release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
repartition(1)
instead of coalesce(1)
in Optimize for better performance when compacting many small files.DeltaTableBuilder
to preserve table property case of non-delta properties when setting properties.replaceWhere
option.Improvements to the benchmark framework (initial version added in version 1.2.0) including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.
Adam Binford, Allison Portis, Andreas Chatzistergiou, Andrew Vine, Andy Lam, Carlos Peña, Chang Yong Lik, Christos Stavrakakis, David Lewis, Denis Krivenko, Denny Lee, EJ Song, Edmondo Porcu, Felipe Pessoto, Fred Liu, Fu Chen, Grzegorz Kołakowski, Hedi Bejaoui, Hussein Nagree, Ionut Boicu, Ivan Sadikov, Jackie Zhang, Jiawei Bao, Jintao Shen, Jintian Liang, Jonas Irgens Kylling, Juliusz Sompolski, Junlin Zeng, KaiFei Yi, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Lin Zhou, Lukas Rupprecht, Max Gekk, Min Yang, Ming DAI, Nick, Ole Sasse, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Terry Kim, Thomas Newton, Tom van Bussel, Tyson Condie, Venki Korukanti, Vini Jaiswal, Will Jones, Xi Liang, Yijia Cui, Yousry Mohamed, Zach Schuermann, sherlockbeard, yikf
We are excited to announce the release of Delta Lake 2.0.0 on Apache Spark 3.2.
Support Change Data Feed on Delta tables. Change Data Feed represents the row level changes between different versions of the table. When enabled, additional information is recorded regarding row level changes for every write operation on the table. See the documentation for more details.
Support Z-Order clustering of data to reduce the amount of data read. Z-Ordering is a technique to colocate related information in the same set of files. This data clustering allows column stats (released in Delta 1.2) to be more effective in skipping data based on filters in a query. See the documentation for more details.
Support for idempotent writes to Delta tables to enable fault-tolerant retry of Delta table writing jobs without writing the data multiple times to the table. See the documentation for more details.
Support for dropping columns in a Delta table as a metadata change operation. This command drops the column from metadata and not the column data in underlying files. See documentation for more details.
Support for dynamic partition overwrite. Overwrite only the partitions with data written into them at runtime. See documentation for details.
Experimental support for multi-part checkpoints to split the Delta Lake checkpoint into multiple parts to speed up writing the checkpoints and reading. See documentation for more details.
Python and Scala API support for OPTIMIZE file compaction and Z-order by.
Other notable changes
SimpleAWSCredentialsProvider
or TemporaryAWSCredentialsProvider
in S3 multi-cluster write supported LogStore
.DataFrame
to be written even if the column was nullable.Independent of this release, we have improved the framework for writing large scala performance benchmarks (initial version added in version 1.2.0), we have added support for running benchmarks on Google Compute Platform using Google Dataproc (in addition to the existing support for EMR on AWS)
Adam Binford, Alkis Evlogimenos, Allison Portis, Ankur Dave, Bingkun Pan, Burak Yilmaz, Chang Yong Lik, Chen Qingzhi, Denny Lee, Eric Chang, Felipe Pessoto, Fred Liu, Fu Chen, Gaurav Rupnar, Grzegorz Kołakowski, Hussein Nagree, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jintao Shen, Jintian Liang, John O'Dwyer, Junyong Lee, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Liwen Sun, Lukas Rupprecht, Max Gekk, Michael Mengarelli, Min Yang, Naga Raju Bhanoori, Nick Grigoriev, Nick Karpov, Ole Sasse, Patrick Grandjean, Peng Zhong, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Ruslan Dautkhanov, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Shoumik Palkar, Tathagata Das, Terry Kim, Tyson Condie, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Xinyi, Yijia Cui, Yousry Mohamed
We are excited to announce the preview release of Delta Lake 2.0.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
Support Change Data Feed on Delta tables. Change Data Feed represents the row level changes between different versions of the table. When enabled, additional information is recorded regarding row level changes for every write operation on the table. See the documentation for more details.
Support Z-Order clustering of data to reduce the amount of data read. Z-Ordering is a technique to colocate related information in the same set of files. This data clustering allows column stats (released in Delta 1.2) to be more effective in skipping data based on filters in a query. See the documentation for more details.
Support for idempotent writes to Delta tables to enable fault-tolerant retry of Delta table writing jobs without writing the data multiple times to the table. See the documentation for more details.
Support for dropping columns in a Delta table as a metadata change operation. This command drops the column from metadata and not the column data in underlying files. See documentation for more details.
Support for dynamic partition overwrite. Overwrite only the partitions with data written into them at runtime. See documentation for details.
Experimental support for multi-part checkpoints to split the Delta Lake checkpoint into multiple parts to speed up writing the checkpoints and reading. See documentation for more details.
Python and Scala API support for OPTIMIZE file compaction and Z-order by.
Other notable changes
Independent of this release, we have improved the framework for writing large scala performance benchmarks (initial version added in version 1.2.0), we have added support for running benchmarks on Google Compute Platform using Google Dataproc (in addition to the existing support for EMR on AWS)
Adam Binford, Alkis Evlogimenos, Allison Portis, Ankur Dave, Bingkun Pan, Burak Yilmaz, Chang Yong Lik, Chen Qingzhi, Denny Lee, Eric Chang, Fred Liu, Fu Chen, Gaurav Rupnar, Grzegorz Kołakowski, Hussein Nagree, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jintao Shen, Jintian Liang, John O'Dwyer, Junyong Lee, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Liwen Sun, Lukas Rupprecht, Max Gekk, Michael Mengarelli, Min Yang, Naga Raju Bhanoori, Nick Karpov, Ole Sasse, Patrick Grandjean, Peng Zhong, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Ruslan Dautkhanov, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Shoumik Palkar, Tathagata Das, Terry Kim, Tyson Condie, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Xinyi, Yijia Cui
We are excited to announce the release of Delta Lake 1.2.1 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
--packages
mode. Previous release had a bug that resulted in user getting NullPointerException
instead of proper error message when using Delta Lake with --packages
mode either in pyspark
or spark-shell
(Fix, Test)pyspark
to throw incorrect type of exceptions instead of expected AnalysisException
. This issue is fixed. See issue #1086 for more details.--conf
to not work for certain configuration parameters. This issue is fixed by having these configuration parameters begin with spark
. See the updated documentation.LogStore
implementation class config spark.delta.logStore.gs.impl
from the scheme in the table path. See the updated documentation.Allison Portis, Chang Yong Lik, Kam Cheung Ting, Rahul Mahadev, Scott Sandre, Venki Korukanti
We are excited to announce the release of Delta Lake 1.2.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
Support multi-cluster write in Delta Lake tables stored in S3. Users now have the option of specifying a new and experimental LogStore
implementation that supports concurrent reads and writes to a single Delta Lake table in S3 from multiple Spark drivers. See the documentation for more details.
Support for compacting small files (optimize) into larger files in a Delta Lake table. Reduced number of data files improves read latency due to reduced metadata size and per-file overheads such as file-open overhead and file-close overhead. See the documentation for more details.
Support for data skipping using column statistics. Column statistics are collected for each file as part of the Delta Lake table writes. These statistics can be used during the reading of a Delta Lake table to skip reading files not matching the filters in the query. See the documentation for more details.
Support for restoring a Delta table to an earlier version. Restoring to an earlier version number or a version of a specific timestamp is supported using the SQL command, Scala APIs or Python APIs. See the documentation for more details.
Support for column renaming in a Delta Lake table without the need to rewrite the underlying Parquet data files. See the documentation for more details.
Support for arbitrary characters in column names in Delta tables. Before, the supported list of characters was limited by the support of the same in Parquet data format. Column names containing special characters such space, tab, ,
, {
, (
etc. are supported now. See the documentation for more details.
Support for automatic data skipping using generated columns. For any partition column that is a generated column, partition filters will be automatically generated from any data filters on its generating column(s), when possible.
Support for Google Cloud Storage is now generally available. See the documentation on how to read and write Delta Lake tables in Google Cloud Storage.
Other notable changes
delta-storage
. This extracts out the LogStore
interface and implementations in a separate module which is published as its own jar. This enables new implementations of LogStore
without depending upon the complete Delta jars. See the migration guide here for more details.gettimestamp
expression in generated columns.list
calls to storageNullPointerException
when trying to reference a DeltaLog
created with a SparkContext
that has stopped.Array
.FileNotFoundException
when reading Delta log files to distinguish between the corrupt log files and no files found.Independent of this release, we have also built a framework for writing large scale performance benchmarks on Delta tables using a real cluster. Currently, the framework provides a TPC-DS inspired benchmark to measure the ingestion time (e.g. time taken to create TPC-DS tables) and query times. But we encourage the community to contribute more benchmarks to measure performance of different real-world workloads on Delta tables.
Adam Binford, Alex Liu, Allison Portis, Anton Okolnychyi, Bart Samwel, Carmen Kwan, Chang Yong Lik, Christian Williams, Christos Stavrakakis, David Lewis, Denny Lee, Fabio Badalì, Fred Liu, Gengliang Wang, Hoang Pham, Hussein Nagree, Hyukjin Kwon, Jackie Zhang, Jan Paw, John ODwyer, Junlin Zeng, Jackie Zhang, Junyong Lee, Kam Cheung Ting, Kapil Sreedharan, Lars Kroll, Liwen Sun, Maksym Dovhal, Mariusz Krynski, Meng Tong, Peng Zhong, Prakhar Jain, Pranav, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Sri Tikkireddy, Tathagata Das, Tyson Condie, Vegard Stikbakke, Venkata Sai Akhil Gudesa, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Will Jones, Xinyi Yu, Yann Byron, Yaohua Zhao, Yijia Cui