delta-io/delta Versions

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.

v0.5.0

We are excited to announce the release of Delta Lake 0.5.0, which introduces Presto/Athena support and improved concurrency. The key features in this release are:

  • Support for other processing engines using manifest files (#76) - You can now query Delta tables from Presto and Amazon Athena using manifest files, which you can generate using the Scala, Java, Python, and SQL APIs. See the documentation for details; a sketch appears below this list.

  • Improved concurrency for all Delta Lake operations (#9, #72, #228) - You can now run more Delta Lake operations concurrently. Delta Lake’s optimistic concurrency control has been improved by making conflict detection more fine-grained. This makes it easier to run complex workflows on Delta tables. For example:

    • Running deletes (e.g. for GDPR compliance) concurrently on older partitions while newer partitions are being appended.
    • Running updates and merges concurrently on disjoint sets of partitions.
    • Running file compactions concurrently with appends (see below).

    See the documentation on concurrency control for more details; the disjoint-partition pattern is sketched below this list.

  • Improved support for file compaction (#146) - You can now compact files by rewriting them with the DataFrameWriter option dataChange set to false. This option allows a compaction operation to run concurrently with other batch and streaming operations. See this example in the documentation for details; a sketch appears below this list.

  • Improved performance for insert-only merge (#246) - Delta Lake now provides more optimized performance for merge operations that have only insert clauses and no update clauses. Furthermore, Delta Lake ensures that writes from such insert-only merges only append new data to the table. Hence, you can now use Structured Streaming and insert-only merges to continuously deduplicate data (e.g. logs). See this example in the documentation for details; a sketch appears below this list.

  • SQL Support for Convert-to-Delta (#175) - You can now use SQL to convert a Parquet table to Delta (Scala, Java, and Python were already supported in 0.4.0). See the documentation for details; the SQL form is sketched below this list.

  • Experimental support for Snowflake and Redshift Spectrum - You can now query Delta tables from Snowflake and Redshift Spectrum. This support is considered experimental in this release. See the documentation for details.
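
The sketches below flesh out the items above. All paths, table names, and DataFrames in them are illustrative, and spark is assumed to be an existing SparkSession with the matching io.delta:delta-core package on its classpath.

First, manifest generation for Presto/Athena via the Python API (the Scala, Java, and SQL forms mirror it):

    from delta.tables import DeltaTable

    # Write symlink manifest files under the table directory so that
    # Presto/Athena can enumerate the table's current data files.
    delta_table = DeltaTable.forPath(spark, "/data/events")  # hypothetical path
    delta_table.generate("symlink_format_manifest")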
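
Next, a sketch of the disjoint-partition pattern from the concurrency item: a delete confined to older partitions running alongside appends to newer ones.

    from delta.tables import DeltaTable

    # Job 1: GDPR-style delete restricted to older partitions.
    DeltaTable.forPath(spark, "/data/events").delete("date < '2019-01-01'")

    # Job 2, running concurrently: append into today's partition.
    # The finer-grained conflict detection sees that the two commits touch
    # disjoint partitions and lets both succeed.
    new_events.write.format("delta").mode("append").save("/data/events")  # new_events: incoming DataFrame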
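
For the compaction item, a sketch of the rewrite pattern: fewer, larger files, with the commit marked as a pure rearrangement of existing data.

    path = "/data/events"  # hypothetical table path
    num_files = 16         # illustrative target file count

    # dataChange=false records that this commit only reorganizes existing
    # data, so concurrent appends and streaming jobs are not invalidated.
    (spark.read.format("delta").load(path)
        .repartition(num_files)
        .write
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(path))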
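
For the insert-only merge item, a continuous-deduplication sketch: only records whose key is absent from the table are appended.

    from delta.tables import DeltaTable

    logs = DeltaTable.forPath(spark, "/data/logs")  # hypothetical path
    (logs.alias("t")
        .merge(new_logs.alias("s"), "t.uniqueId = s.uniqueId")  # new_logs: incoming batch
        .whenNotMatchedInsertAll()  # no update clause, so the merge stays a pure append
        .execute())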
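
And the new SQL form of Convert-to-Delta, issued here through spark.sql; the PARTITIONED BY clause is only needed for partitioned tables.

    # In-place conversion of an existing Parquet table.
    spark.sql("CONVERT TO DELTA parquet.`/data/raw-events` PARTITIONED BY (date DATE)")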

Credits

Andreas Neumann, Andrew Fogarty, Burak Yavuz, Denny Lee, Fabio B. Silva, JassAbidi, Matthew Powers, Mukul Murthy, Nicolas Paris, Pranav Anand, Rahul Mahadev, Reynold Xin, Shixiong Zhu, Tathagata Das, Tomas Bartalos, Xiao Li

Thank you for your contributions.

v0.4.0

We are excited to announce the release of Delta Lake 0.4.0, which introduces Python APIs for manipulating and managing data in Delta tables. The key features in this release are:

  • Python APIs for DML and utility operations (#89) - You can now use Python APIs to update/delete/merge data in Delta Lake tables and to run utility operations (i.e., vacuum, history) on them. These are great for building complex workloads in Python, e.g., Slowly Changing Dimension (SCD) operations, merging change data for replication, and upserts from streaming queries. See the documentation for more details; a sketch appears below this list.

  • Convert-to-Delta (#78) - You can now convert a Parquet table in place to a Delta Lake table without rewriting any of the data. This is great for very large Parquet tables that would be costly to rewrite as a Delta table. Furthermore, this process is reversible - you can convert a Parquet table to a Delta Lake table, operate on it (e.g., delete or merge), and easily convert it back to a Parquet table. See the documentation for more details; a sketch appears below this list.

  • SQL for utility operations - You can now use SQL to run the utility operations vacuum and history. See the documentation for more details on how to configure Spark to execute these Delta-specific SQL commands; a sketch appears below this list.
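
A sketch of the new Python APIs; the path and the change-data DataFrame updates_df are illustrative, and spark is an existing SparkSession with the delta-core package attached.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/data/customers")  # hypothetical path

    # Upsert: update rows whose id already exists, insert the rest.
    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Utility operations.
    target.history(10).show()  # the last 10 commits
    target.vacuum(168)         # remove unneeded files older than 168 hours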
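
Convert-to-Delta from Python; partitioned tables need their partition schema spelled out. The path and partition column are assumptions.

    from delta.tables import DeltaTable

    # Writes a Delta transaction log next to the existing Parquet files;
    # no data files are rewritten.
    DeltaTable.convertToDelta(spark, "parquet.`/data/raw-events`")

    # Partitioned variant (would replace the call above for a table
    # partitioned by a date column):
    # DeltaTable.convertToDelta(spark, "parquet.`/data/raw-events`", "date DATE")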
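
And the SQL utility commands, assuming Spark has already been configured for Delta's SQL syntax as described in the documentation:

    spark.sql("VACUUM delta.`/data/customers`")            # garbage-collect stale files
    spark.sql("DESCRIBE HISTORY delta.`/data/customers`")  # inspect the commit log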

To try out Delta Lake 0.4.0, please follow the Delta Lake Quickstart.

v0.3.0

We are excited to announce the availability of Delta Lake 0.3.0, which introduces new programmatic APIs for manipulating and managing data in Delta Lake tables. Here are the main features:

  • Scala/Java APIs for DML commands - You can now modify data in Delta Lake tables using programmatic APIs for Delete (#44), Update (#43) and Merge (#42). These APIs mirror the syntax and semantics of their corresponding SQL commands and are great for many workloads, e.g., Slowly Changing Dimension (SCD) operations, merging change data for replication, and upserts from streaming queries. See the documentation for more details.

  • Scala/Java APIs for querying commit history (#54) - You can now query a table’s commit history to see what operations modified the table. This enables you to audit data changes, run time travel queries on specific versions, and debug or recover data from accidental deletions. See the documentation for more details.

  • Scala/Java APIs for vacuuming old files (#48) - Delta Lake uses MVCC to enable snapshot isolation and time travel. However, keeping all versions of a table forever can be prohibitively expensive. Stale snapshots (as well as other uncommitted files from aborted transactions) can be garbage collected by vacuuming the table. See the documentation for more details; a combined sketch of these APIs appears below this list.
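
The 0.3.0 APIs themselves are Scala/Java; for consistency with the other sketches on this page, here is a Python rendering of the same three operations (the Python bindings added in 0.4.0 mirror the Scala method names). The path and predicate are illustrative.

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/data/events")  # hypothetical path

    dt.delete("eventType = 'test'")  # DML: remove matching rows
    dt.history().show()              # commit history, newest first
    dt.vacuum()                      # garbage-collect files past the default retention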

To try out Delta Lake 0.3.0, please follow the Delta Lake Quickstart.

v0.2.0

We are delighted to announce the availability of Delta Lake 0.2.0!

To try out Delta Lake 0.2.0, please follow the Delta Lake Quickstart.

This release introduces two main features:

  • Cloud storage support - In addition to HDFS, you can now configure Delta Lake to read and write data on cloud storage services such as Amazon S3 (issue #39) and Azure Blob Storage (issue #40). See the documentation for configuration instructions; a sketch appears below this list.

  • Improved concurrency (issue #69) - Delta Lake now allows concurrent append-only writes while still ensuring serializability. To be considered append-only, a writer must only add new data without reading or modifying existing data in any way. See the documentation for more details; an append-only write is sketched below this list.
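
A configuration sketch for S3, using the log-store setting from the 0.2.0-era documentation; the bucket name is hypothetical, and credentials are assumed to come from the environment.

    from pyspark.sql import SparkSession

    # Route Delta log commits through the S3 log-store implementation.
    spark = (SparkSession.builder
        .appName("delta-on-s3")
        .config("spark.delta.logStore.class",
                "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
        .getOrCreate())

    spark.range(5).write.format("delta").save("s3a://my-bucket/events")  # hypothetical bucket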
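
And what qualifies as append-only: a blind append that neither reads nor rewrites existing files.

    # A pure append adds new data files and commits them without reading or
    # modifying existing table data, so concurrent append-only writers
    # serialize cleanly.
    new_rows.write.format("delta").mode("append").save("/data/events")  # new_rows: a DataFrame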

We have also greatly expanded the test coverage as part of this release.