BlazingSQL Versions

BlazingSQL is a lightweight, GPU-accelerated SQL engine for Python, built on RAPIDS cuDF.

v21.08.00

2 years ago

Improvements

Update ucx-py versions to 0.21
Return ok for filesystems
Setting up default value for max_bytes_chunk_read to 256 MB
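
To illustrate what a byte cap on chunked reads means, here is a minimal sketch in plain Python. It is not BlazingSQL's implementation: `read_in_chunks` is a hypothetical helper, and the only detail taken from the notes above is the 256 MB default.

```python
import io

DEFAULT_MAX_BYTES = 256 * 1024 * 1024  # 256 MB, the default named in the notes


def read_in_chunks(stream, max_bytes=DEFAULT_MAX_BYTES):
    """Yield byte chunks of at most max_bytes, each ending on a line boundary.

    Hypothetical helper; a single line longer than max_bytes is yielded whole.
    """
    chunk, size = [], 0
    for line in stream:  # binary streams iterate line by line
        if chunk and size + len(line) > max_bytes:
            yield b"".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield b"".join(chunk)


raw = b"a,b\n1,2\n3,4\n5,6\n"
chunks = list(read_in_chunks(io.BytesIO(raw), max_bytes=8))
```

With an 8-byte cap the 16-byte input is split into two chunks, and no row is cut in half.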

Bug Fixes

Fix build due to changes in rmm device buffer
Fix reading decimal columns from orc file
Fix CC/CXX variables in CI
Fix latest cudf dependencies
Fix concat suite E2E test for nested calls
Fix for GCS credentials from filepath
Fix decimal support using float64
Fix build issue with thrust package

v21.06.00

2 years ago

Note: the versioning system changed from Major.Minor to Year.Month. The previous version was 0.19.
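
Because both schemes appear among the release tags, tooling that sorts them has to compare numerically. A small sketch follows; `parse_version` is a hypothetical helper, and the rule "first component of 21 or more means Year.Month" is an assumption drawn only from the tags listed on this page.

```python
def parse_version(tag):
    """Return (scheme, parts) for tags such as 'v0.19.0' or 'v21.06.00'."""
    parts = tuple(int(p) for p in tag.lstrip("v").split("."))
    # Assumption: a first component of 21 or more marks the Year.Month scheme
    scheme = "Year.Month" if parts[0] >= 21 else "Major.Minor"
    return scheme, parts


tags = ["v0.19.0", "v21.08.00", "v0.12.0", "v21.06.00"]
# Compare the integer tuples, not the strings, so "06" and "6" agree
ordered = sorted(tags, key=lambda t: parse_version(t)[1])
```

Sorting on the integer tuples keeps the old and new tags in release order.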

New Features

Limited support of unbounded partitioned windows
Support for CURRENT_DATE, CURRENT_TIME and CURRENT_TIMESTAMP
Support for right outer join
Support for DURATION type
Support for IS NOT FALSE condition
Support ORDERing by null values
Support for multiple columns inside COUNT() statement
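
The IS NOT FALSE condition differs from a plain boolean test in how it treats NULL. The sketch below shows the standard SQL semantics using Python's built-in `sqlite3` module as a stand-in engine (an assumption for illustration; it needs SQLite 3.23 or newer and does not exercise BlazingSQL itself).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, flag INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1), (2, 0), (3, None)])

# IS NOT FALSE keeps rows where flag is TRUE or NULL
is_not_false = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE flag IS NOT FALSE ORDER BY id")]

# A plain boolean test keeps only rows where flag is TRUE
plain = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE flag ORDER BY id")]
```

Row 3, whose flag is NULL, passes the IS NOT FALSE filter but not the plain one.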

Improvements

Support for concurrency in E2E tests
Better support for unsigned types on the C++ side
Folder refactoring related to caches, kernels, execution_graph, BlazingTable
Improve data loading when the algebra contains only BindableScan/Scan and Limit
Enable support for spdlog 1.8.5
Update RAPIDS version references

Bug Fixes

Fix IS NOT DISTINCT FROM with joins
Fix wrong results from timestampdiff/add
Fixed build issues due to cudf aggregation API change
Comparing param set to true for e2e
Fixed provider unit_tests
Fix orc statistic building
Fix Decimal/Fixed Point issue
Fix for max_bytes_chunk_read param to csv files
Fix ucx-py versioning specs
Reading chunks of max bytes for csv files

v0.19.0

3 years ago

New Features

New API that supports concurrent queries, by starting a query and obtaining a token, and then retrieving the result with that token.
Support for string CONCAT using the CONCAT keyword, instead of '||'.
New API to get the physical execution plan: bc.explain(query, detail = True)
Support for querying PostgreSQL tables
New documentation page
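
The token-based concurrent query API listed above follows a common start/fetch pattern. The sketch below illustrates that pattern generically with Python's standard library; `QueryRunner`, `start`, and `fetch` are hypothetical names, not BlazingSQL calls, and the "engine" is a plain function.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class QueryRunner:
    """Hypothetical stand-in for an engine with a start/fetch API."""

    def __init__(self, execute, max_workers=4):
        self._execute = execute  # callable that runs one query
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}
        self._ids = itertools.count(1)

    def start(self, query):
        """Submit the query and return a token immediately."""
        token = next(self._ids)
        self._futures[token] = self._pool.submit(self._execute, query)
        return token

    def fetch(self, token):
        """Block until the query behind the token finishes, then forget it."""
        return self._futures.pop(token).result()

    def close(self):
        self._pool.shutdown()


runner = QueryRunner(execute=lambda q: q.upper())
t1 = runner.start("select 1")
t2 = runner.start("select 2")
results = (runner.fetch(t2), runner.fetch(t1))  # fetch order is up to the caller
runner.close()
```

Both queries are in flight before either result is requested, which is the point of handing back a token instead of blocking.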

Improvements

Improvements and expansion to the end-to-end testing framework, including adding testing for data with nulls
Improved performance of joins by adding a timeout to the concatenating CacheMachine
Improved kernel row output estimation

Bug Fixes

Fixed bugs in uninitialized variables in orc metadata and improvements to handling the parseMetadata exceptions
Fixed bugs in handling nulls in case conditions with strings
Fixed issue with deleting allocated host memory
Fixed issues in capturing error messages from exceptions
Fixed bug when there are no projects in a BindableTableScan
Fixed issues from cuda when freeing pinned memory
Fixed bug in DistributeAggregationKernel where the wrong columns were being hashed
Fixed bug with empty row group ids for parquet
Fixed issues with int64 literal values
Fixed issue when CAST was applied to a literal
Fixed bug when getting ORC metadata for decimal type
Fixed bug with substrings with nulls
Fixed support for minus unary operator
Fixed bug with calculating number of batches in BindableTableScan
Fixed bug with full outer join when both tables contained nulls
Fixed bug with COUNT DISTINCT
Fixed issue with columns aliases when there was a Join operation
Fixed issue with python side exceptions
Fixed various issues due to changes in cudf or other dependencies

Window Functions (Experimental)

This release now provides limited Window Functions support. Window Functions that have the partition by clause support the following aggregations:

MIN
MAX
COUNT
SUM
AVG
ROW_NUMBER
LEAD
LAG

Window Functions that do not have a partition by clause and have a bounded window frame using ROWS BETWEEN (the window frame does not use the keyword UNBOUNDED) support the following aggregations:

MIN
MAX
COUNT
SUM
AVG

At this moment, window frames using the keywords UNBOUNDED and CURRENT ROW don't fully work.
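
The two supported shapes of window function can be sketched as queries. The example below runs them against Python's built-in `sqlite3` module as a stand-in engine (an assumption for illustration; it needs SQLite 3.25 or newer). The table and column names are made up, and whether BlazingSQL accepts exactly these statements is not confirmed by the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10), ("east", 2, 20), ("east", 3, 30),
     ("west", 1, 5), ("west", 2, 15)],
)

# Shape 1: PARTITION BY with an aggregation and ROW_NUMBER
partitioned = conn.execute(
    """
    SELECT region, day,
           SUM(amount) OVER (PARTITION BY region) AS region_total,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY day) AS rn
    FROM sales ORDER BY region, day
    """
).fetchall()

# Shape 2: no PARTITION BY, bounded frame via ROWS BETWEEN (no UNBOUNDED)
bounded = conn.execute(
    """
    SELECT region, day,
           SUM(amount) OVER (ORDER BY region, day
                             ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_sum
    FROM sales ORDER BY region, day
    """
).fetchall()
```

The first query repeats each region's total on every row of that region; the second computes a three-row moving sum across the whole table.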

Deprecated Features

Disabled support for outer joins with inequalities

v0.18.0

3 years ago

New SQL Functions

The following SQL commands are now supported:

REGEXP_REPLACE
INITCAP

New Features

New centralized task executor for all query execution
New pinned memory buffer pool for improved performance in communication
New host memory buffer pool for improved performance in caching data to system memory
Support for UCX communications, which enables usage of high performance communication hardware such as InfiniBand
Creating table from ORC files now collects metadata from ORC files and can perform predicate pushdown on metadata
Progress bar when executing queries
Added ability to retry tasks when encountering out-of-memory errors
Added ability to get maximum gpu memory used
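
Retrying a task after an out-of-memory error can be sketched in a few lines. This is a generic illustration, not the engine's task executor: `run_with_retry` and its parameters are hypothetical, and Python's `MemoryError` stands in for a GPU out-of-memory error.

```python
def run_with_retry(task, max_attempts=3, on_retry=None):
    """Run task(); on MemoryError call on_retry() and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except MemoryError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            if on_retry is not None:
                on_retry()  # e.g. release cached data before the next attempt


attempts = []


def flaky_task():
    """Simulated task that fails with out-of-memory on its first two runs."""
    attempts.append(1)
    if len(attempts) < 3:
        raise MemoryError("simulated out-of-memory")
    return "done"


result = run_with_retry(flaky_task)
```

The `on_retry` hook is where a real system would free or spill memory so that the next attempt has a better chance of succeeding.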

Improvements

Improved support for concurrent queries
Improvements to query execution logs
Added/improved communication logs
Added ability to disable logs
Improved storage plugin output messages
Improved support for creating tables from JSON files

Bug Fixes

Fixed distribution so that it evenly distributes data loading based on rowgroups
Fixed cython exception handling
Support FileSystems (GS, S3) when the file extensions are not provided
Fixed issue when creating tables from a local dir relative path
Misc bug fixes

Codebase improvements

Code base clean up, improved code organization and refactoring
No longer depending on gtest for runtime
Reduced number of compilation warnings

v0.17.0

3 years ago

New SQL Functions

The following SQL commands are now supported:

TO_DATE / TO_TIMESTAMP
DAYOFWEEK
TRIM / LTRIM / RTRIM
LEFT / RIGHT
UPPER / LOWER
REPLACE
REVERSE

New Features

New communications architecture with support for both TCP and UCX (UCX support is in beta)
Allow creating tables from compressed text delimited files
Allow creating tables from a Hive partitioned folder structure, where BlazingSQL will infer columns and types.
Added powerPC building script and instructions
Added local logging directory option to BlazingContext to help resolve logging file permission issues
Added option to read csv files in chunks
Logs are now configurable to have max size and be rotated

Improvements

Added Apache Calcite rule for window functions. (Window functions not supported yet)
Add validation for the kwargs when BlazingContext.create_table API is called
Added validation for s3 buckets
Added scheduler file support for e2e testing framework
Improved how sampling is done for ORDER BY
Several changes to keep up with cuDF API changes
Remove temp files when an error occurs
Added new end-to-end tests
Added new unit tests
Improved contribution documentation
Code refactoring and removing dead or duplicate code

Improvements in error logging

Improvement to error messaging when validating any GCP bucket
Added error logging in DataSourceSequence
Showing an appropriate error to indicate that we don't support opening directories with wildcards
Showing an appropriate error for invalid or unsupported expressions on the logical plan

Changes or improvements in technology stack or CI

Added output compile json option for cppcheck
Bump junit from 4.12 to 4.13.1 in /algebra
Improved gpuCI scripts
Removed need to specify cuda version via a label for conda packages
Fixed cmake version to be 3.18.4
Fix SSL errors for conda

Bug Fixes

Fixed issue when loading parquet files with local_files=True
Fixed logging directory setup
Fixed issues with config_options
Fixed issue in float columns when parsing parquet metadata
Fixed bug in MergeAggregations when single node has multiple batches
Fix graph thread pool hang when exception is thrown
Fix ignoring headers when multiple CSV files were provided
Fix column_names (table) to always be a list of strings
Fixed literal type inference for integers

Deprecated features

Deprecated bc.partition

v0.16.0

3 years ago

Improvements

Activate End-to-end test result validation for GPU_CI.
Add capacity to set the transport memory
Update conda recipe, remove cxx11 abi from cmake
Use a single initialize() function at the beginning and add allocation-related logs
Make it possible to read the system environment variables to set up config_option for BlazingContext
Update TPCH queries for end to end tests: converting implicit joins into explicit joins
Removing cudf source code dependency as some cudf utilities headers were exposed
Can now manually set BLAZING_CACHE_DIRECTORY

Bug Fixes

Fixed issue due to cudf orc api change
Fixed issue parsing fixed width string literals
Fixed issue with hive string columns
Fixed issue due to an rmm include
Fixed build issues with latest rmm 0.16 and columnBasisTest due to deprecated drop_column() function
Fix metadata mismatch due to parsedMetadata, caused by parquet files that had only nulls in certain columns for only some files
Removed workaround for parquet read schema
Fixed issue caused by creating tables with multiple csv files and having BSQL infer the datatypes and having a dtypes mismatch
Avoid reading _metadata files
Fixed issues with parsers, in particular ORC parser was misbehaving
Fixed issue with logging directories in distributed environments
Pinned google cloud version to 1.16
Partial revert of some changes on parquet rowgroups flow with local_files=True
Fixed issue when loading paths with wildcards
Fixed issue with concat_all in concatenating cache
Fix arrow and spdlog compilation issues
Fixed intra-query memory leak in joins
Fixed crash when loading an empty folder
Fixed issue where parseSchemaPython can throw exceptions

v0.15.0

3 years ago

New Features:

Added a memory monitor for better memory management for out of core processing
Added list_tables() and describe_table() functions
Added support for constant expressions evaluation by Calcite
Added support for cross join
Added rand() and support for running unary operations on literals
Added get_free_memory() function

Improvements

Performance improvements:

Implemented Unordered pull from cache to help performance
Concatenating cache improvement and replacing PartwiseJoin::load_set with a concatenating cache
Adding max kernel num threads pool
Added new separate thresh for concat cache

Stability improvements:

Added checks for concatenation to prevent String overflow
Added nogil statements for pure C functions in Cython
Round-robin dask workers on single GPU queries
Reraising query errors in context.py
Implemented using threadpool for outgoing messages

Documentation improvements:

Added exhale to generate doxygen for sphinx docs
Added Sphinx based code architecture documentation
Added doxygen comments to CacheMachine.h
Added more documentation about memory management
Updated readme
Added doxygen comments to some kernels and the batch processing

Building improvements:

Updated Calcite to the most recent version 1.23
Added check for CUDF_HOME to allow build to use an existing prebuilt cudf source tree
Python/Cython check code style
Make AWS and GCS optional

Logging improvements:

Logging level (flush_on) can be configurable
Set log_level when using LOGGING_LEVEL param

Testing improvements:

Added unit tests on Calcite to check how logical plans are affected when rulesets are updated
Updated set of TPCH queries on the E2E tests
Added initial set of unit tests for WaitingQueue and nullptr checks around spdlog calls
Add unit test for Project kernel

Other improvements:

Removed a lot of dead code from the codebase
Replace random_generator with cudf::sample
Adding extern C for include files
Use default client and network interface from Dask. BlazingSQL should now be able to infer the network interface.
Updated the GPUManager functions
Handle exceptions from pool_threads

Bug Fixes

Fixed various issues due to updates to cudf
Fixed issue with Hive partitions when doing SELECT *
Normalize columns before distribution in JoinPartitionKernel
Fixed issue with hive partitions base folder
Fix interops operators output types
Fix when the algebra plan was provided using one-line as logical plan
Fix issue related to Hive metadata
Remove temp files from data cached to disk
Fix when checking only Limit and Scan Kernels
Loading one file at a time (LimitKernel and ScanKernel)
Fixed small issue with hive types conversion
Fix for literal cast
Fixed issue with start and length of substring being different types
Fixed issue on logical plans when there is an EXISTS clause
Fixed issue with casting string to string
Fixed issue with getting table scan info
Fixed row_groups issue in ParquetParser.cpp
Fixed issue with some constant expressions not evaluated by calcite
Fixed issue with log directory creation in a distributed environment
Fixed issue where we were including testing hpp in our code
Fixed optimization regression on the select count(*) case
Fixed issue caused by using new arrow_io_source
Fixed e2e string comparison
Fixed random segfault issue in parser
Fixed issue with column names on sample function
Introduced config param for max orderby samples and fixed issue with oversampling in ORDER BY

v0.14.0

3 years ago

New Features:

New execution architecture, supporting executing queries on data that does not fit in the GPU. The new architecture features the following:
- The execution model is an acyclic graph of execution nodes with a cache in between execution nodes.
- Each execution node operates independently on batches of data, allowing it to process steps in parallel as much as possible instead of sequentially.
- Each cache between every execution step can hold the data in GPU, in system memory or on disk.
- Has support for multi-partition dask.cudf.DataFrame result set outputs.
Added ability to set configuration options
Added support for using NULL as a literal value
Implemented CHAR_LENGTH function
Added ability to specify region for S3 buckets
Added type normalization for UNION ALL
Added support for MinIO Storage
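
The execution model described above (independent nodes working on batches, with a cache between each pair of nodes) can be sketched with threads and queues. This is a toy illustration under stated assumptions: the `node` helper and the queue names are hypothetical, in-memory queues stand in for caches that could also live in system memory or on disk, and nothing here reflects the engine's actual classes.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a batch stream


def node(func, inbox, outbox):
    """Start a thread that applies func to each batch from inbox."""
    def work():
        while True:
            batch = inbox.get()
            if batch is DONE:
                outbox.put(DONE)  # propagate end-of-stream downstream
                return
            outbox.put(func(batch))
    thread = threading.Thread(target=work)
    thread.start()
    return thread


# Caches between nodes; plain in-memory queues in this sketch
scan_to_filter, filter_to_sum, results = queue.Queue(), queue.Queue(), queue.Queue()

threads = [
    node(lambda b: [x for x in b if x % 2 == 0], scan_to_filter, filter_to_sum),
    node(sum, filter_to_sum, results),
]

for batch in ([1, 2, 3, 4], [5, 6, 7, 8]):  # the "scan" feeds batches
    scan_to_filter.put(batch)
scan_to_filter.put(DONE)

for t in threads:
    t.join()

partial_sums = []
while True:
    item = results.get()
    if item is DONE:
        break
    partial_sums.append(item)
total = sum(partial_sums)
```

Because each node only touches one batch at a time, the filter can work on the second batch while the sum node is still processing the first.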

Improvements:

Improved support for CAST function to include TINYINT and SMALLINT
Handle behavior when the optimized plan contains a LogicalValues
Improvements to exception handling
Support modern compilers (>= g++-7.x)
Improved logging now uses spdlog
Adding event logging
BlazingSQL engine no longer needs to concatenate dask.cudf.DataFrame partitions prior to running a query on a dask.cudf.DataFrame table
Improved expression parser, including support for expression trees of unlimited size.
Optimized data loading for queries of the type: SELECT * FROM table LIMIT N
Added built in end to end testing framework
Added logging to condition variables that are waiting too long

Bug Fixes:

Fixed bug in size estimation for tables before joins
Fixed issue with excessive thread creation in communication
Fixed bug in expression parsing for joins
Fixed bug caused by sharing data loaders when a query has one table more than once
Fixed Hive file format inference

v0.13.0

4 years ago

New Features:

Support for AVG in distributed mode
Added ability to use existing memory allocator
Implemented unify_partitions function for preparing dask_cudf DataFrames prior to creating BlazingSQL tables
Implemented ROUND function
Implemented support for CASE with strings

Improvements:

Local files can be referenced with relative file paths when creating tables.
Automatic casting for joins on similar data types (i.e. joining an int32 with an int64 will cast the int32 to an int64)
Updated AWS SDK version
More changes related to the migration of libcudf to libcudf++
Added docstrings to main python APIs

Bug Fixes:

Fixed bug when joining against an empty DataFrame
Fixed bug with GROUP BY ignoring nulls
Fixed various issues related to creating tables from dask_cudf DataFrames
Fixed various bugs with creating tables from Hive Cursor
Fixed bugs related to new libcudf++ functionality
Fixed bug in LIMIT statement
Fixed bug in timestamp processing
Fixed bug in SUM0 aggregation (which enables COUNT DISTINCT)
Fixed bug when querying single file with multiple workers
Fixed bug with distributed COUNT aggregation without GROUP BY
Fixed bug when creating and querying a table with several Apache Parquet files and one is empty
Fixed bug with joins with nulls in the join key columns

Other:

Temporarily deprecated JSON reader. In the meantime we recommend using: cudf.read_json

v0.12.0

4 years ago

New Features:

Ability to skip reading and processing row groups when querying Apache Parquet files by applying predicates on metadata
Ability to do SELECT COUNT (DISTINCT column)
Ability to use and set Pool memory allocator for increased performance and/or managed (UVM) allocator which provides robustness against running out of GPU memory
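
Skipping row groups by applying predicates on metadata relies on per-group statistics such as minimum and maximum values. The sketch below shows the idea in plain Python; `RowGroup` and `scan_greater_than` are hypothetical names for illustration, and a single integer column stands in for a real Parquet file.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    rows: list      # the stored values (one column, for brevity)
    min_value: int  # statistics kept in the file metadata
    max_value: int


def scan_greater_than(row_groups, threshold):
    """Return values > threshold and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # metadata proves no row can match, so skip the read
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    RowGroup([1, 5, 9], 1, 9),
    RowGroup([10, 15, 20], 10, 20),
    RowGroup([21, 30], 21, 30),
]
matches, groups_read = scan_greater_than(groups, threshold=12)
```

Only two of the three row groups are read, because the first group's maximum already rules it out.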

Improvements:

New building scripts thanks to @dillon-cullinan

Bug Fixes:

Fixed various bugs in the Apache Arrow provider
Fixed bug with incorrect data type in CASE statements
Fixed bug and memory leak in distributed joins
Fixed bug in usage of Google Cloud Storage plugin