BlazingSQL Versions

BlazingSQL is a lightweight, GPU-accelerated SQL engine for Python, built on RAPIDS cuDF.

v21.08.00

2 years ago

Improvements

  • Update ucx-py versions to 0.21
  • Return ok for filesystems
  • Set the default value of max_bytes_chunk_read to 256 MB

Bug Fixes

  • Fix build due to changes in rmm device buffer
  • Fix reading decimal columns from orc file
  • Fix CC/CXX variables in CI
  • Fix latest cudf dependencies
  • Fix concat suite E2E test for nested calls
  • Fix for GCS credentials from filepath
  • Fix decimal support using float64
  • Fix build issue with thrust package

v21.06.00

2 years ago

Note the new versioning system: it changed from Major.Minor to Year.Month. The previous version was 0.19.
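
Because Year.Month tags follow the older Major.Minor tags, comparing the numeric parts still orders releases correctly. The sketch below shows this for the tags listed on this page; the helper name version_key is an assumption made for the example and is not part of BlazingSQL.

```python
def version_key(tag: str) -> tuple:
    """Turn a tag such as 'v0.19.0' or 'v21.06.00' into a tuple of ints."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


# Tags taken from this page, deliberately out of order.
tags = ["v21.06.00", "v0.19.0", "v21.08.00", "v0.18.0"]

# Sorting on the numeric tuple puts the Year.Month tags after the 0.x tags.
ordered = sorted(tags, key=version_key)
print(ordered)
```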

New Features

  • Limited support of unbounded partitioned windows
  • Support for CURRENT_DATE, CURRENT_TIME and CURRENT_TIMESTAMP
  • Support for right outer join
  • Support for DURATION type
  • Support for IS NOT FALSE condition
  • Support ORDERing by null values
  • Support for multiple columns inside COUNT() statement
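
To illustrate what the IS NOT FALSE condition means in standard SQL, the sketch below uses Python's built-in sqlite3 module as a stand-in engine (it needs SQLite 3.23 or newer). It does not run BlazingSQL, and it assumes BlazingSQL follows the same three-valued logic; the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flags (id INTEGER, ok BOOLEAN)")
# ok is true (1), false (0) and NULL respectively.
conn.executemany("INSERT INTO flags VALUES (?, ?)", [(1, 1), (2, 0), (3, None)])

# IS NOT FALSE keeps both true and NULL rows.
kept_ids = [row[0] for row in conn.execute(
    "SELECT id FROM flags WHERE ok IS NOT FALSE ORDER BY id")]

# A plain boolean filter keeps only the true row.
plain_ids = [row[0] for row in conn.execute(
    "SELECT id FROM flags WHERE ok ORDER BY id")]

print(kept_ids, plain_ids)
conn.close()
```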

Improvements

  • Support for concurrency in E2E tests
  • Better support for unsigned types on the C++ side
  • Folder refactoring related to caches, kernels, execution_graph, BlazingTable
  • Improve data loading when the algebra contains only BindableScan/Scan and Limit
  • Enable support for spdlog 1.8.5
  • Update RAPIDS version references

Bug Fixes

  • Fix IS NOT DISTINCT FROM with joins
  • Fix wrong results from timestampdiff/add
  • Fixed build issues due to cudf aggregation API change
  • Comparing param set to true for e2e
  • Fixed provider unit_tests
  • Fix orc statistic building
  • Fix Decimal/Fixed Point issue
  • Fix for max_bytes_chunk_read param to csv files
  • Fix ucx-py versioning specs
  • Reading chunks of max bytes for csv files
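
As a rough illustration of reading a CSV in chunks bounded by a byte budget (the idea behind max_bytes_chunk_read), the sketch below splits a byte stream into chunks that always end on a line boundary. It is not BlazingSQL's reader; the function name iter_line_aligned_chunks and the tiny max_bytes value are assumptions made for the example.

```python
import io


def iter_line_aligned_chunks(stream, max_bytes):
    """Yield byte chunks of roughly max_bytes that always end on a line boundary."""
    leftover = b""
    while True:
        block = stream.read(max_bytes)
        if not block:
            break
        block = leftover + block
        cut = block.rfind(b"\n")
        if cut == -1:
            # No complete line yet: keep accumulating.
            leftover = block
            continue
        yield block[:cut + 1]
        leftover = block[cut + 1:]
    if leftover:
        # Final line without a trailing newline.
        yield leftover


data = b"a,b\n1,2\n3,4\n5,6\n"
chunks = list(iter_line_aligned_chunks(io.BytesIO(data), max_bytes=6))
print(chunks)
```

Each chunk can then be parsed independently, since no row is split across two chunks.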

v0.19.0

3 years ago

New Features

  • New API that supports concurrent queries, by starting a query and obtaining a token, and then retrieving the result with that token.
  • Support for string CONCAT using the CONCAT keyword, instead of '||'.
  • New API to get the physical execution plan: bc.explain(query, detail = True)
  • Support for querying PostgreSQL tables
  • New documentation page
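
The token-based pattern described in the first item above (start a query, receive a token, fetch the result later) can be sketched in plain Python. This is only an illustration of the pattern and not BlazingSQL's API: the class TokenQueryRunner, its methods, and the fake execute function are all hypothetical.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class TokenQueryRunner:
    """Hypothetical stand-in for an engine that hands out tokens for running queries."""

    def __init__(self, execute, max_workers=4):
        self._execute = execute  # callable that runs one query and returns its result
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def start(self, query):
        """Submit the query and return a token immediately."""
        token = uuid.uuid4().hex
        self._futures[token] = self._pool.submit(self._execute, query)
        return token

    def is_done(self, token):
        return self._futures[token].done()

    def fetch(self, token):
        """Block until the query behind the token finishes, then return its result."""
        return self._futures.pop(token).result()

    def shutdown(self):
        self._pool.shutdown()


# The "engine" here just upper-cases the query text.
runner = TokenQueryRunner(execute=lambda q: q.upper())
tokens = [runner.start(q) for q in ("select 1", "select 2")]
results = [runner.fetch(t) for t in tokens]
runner.shutdown()
print(results)
```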

Improvements

  • Improvements and expansion to the end-to-end testing framework, including adding testing for data with nulls
  • Improved performance of joins by adding a timeout to the concatenating CacheMachine
  • Improved kernel row output estimation

Bug Fixes

  • Fixed bugs with uninitialized variables in ORC metadata and improved handling of parseMetadata exceptions
  • Fixed bugs in handling nulls in case conditions with strings
  • Fixed issue with deleting allocated host memory
  • Fixed issues in capturing error messages from exceptions
  • Fixed bug when there are no projects in a BindableTableScan
  • Fixed issues from cuda when freeing pinned memory
  • Fixed bug in DistributeAggregationKernel where the wrong columns were being hashed
  • Fixed bug with empty row group ids for parquet
  • Fixed issues with int64 literal values
  • Fixed issue when CAST was applied to a literal
  • Fixed bug when getting ORC metadata for decimal type
  • Fixed bug with substrings with nulls
  • Fixed support for minus unary operator
  • Fixed bug with calculating number of batches in BindableTableScan
  • Fixed bug with full outer join when both tables contained nulls
  • Fixed bug with COUNT DISTINCT
  • Fixed issue with column aliases when there was a Join operation
  • Fixed issue with python side exceptions
  • Fixed various issues due to changes in cudf or other dependencies

Window Functions (Experimental)

This release now provides limited Window Functions support. Window Functions that have the partition by clause support the following aggregations:

  • MIN
  • MAX
  • COUNT
  • SUM
  • AVG
  • ROW_NUMBER
  • LEAD
  • LAG

Window Functions that do not have a partition by clause and have a bounded window frame using ROWS BETWEEN (the window frame does not use the keyword UNBOUNDED) support the following aggregations:

  • MIN
  • MAX
  • COUNT
  • SUM
  • AVG

At this moment, window frames using the keywords UNBOUNDED and CURRENT ROW don't fully work.
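
The two supported shapes of window function can be illustrated with standard SQL. The sketch below runs the queries on Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in engine, not on BlazingSQL; the sales table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 3), ("b", 4)],
)

# Shape 1: aggregations over a PARTITION BY clause.
partitioned = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount) AS rn
    FROM sales
    ORDER BY region, amount
""").fetchall()

# Shape 2: no partition, bounded frame with ROWS BETWEEN (no UNBOUNDED keyword).
bounded = conn.execute("""
    SELECT amount,
           SUM(amount) OVER (
               ORDER BY amount
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_sum
    FROM sales
    ORDER BY amount
""").fetchall()

print(partitioned)
print(bounded)
conn.close()
```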

Deprecated Features

  • Disabled support for outer joins with inequalities

v0.18.0

3 years ago

New SQL Functions

The following SQL commands are now supported:

  • REGEXP_REPLACE
  • INITCAP

New Features

  • New centralized task executor for all query execution
  • New pinned memory buffer pool for improved performance in communication
  • New host memory buffer pool for improved performance in caching data to system memory
  • Support for UCX communications, which enables the use of high-performance communication hardware such as InfiniBand
  • Creating table from ORC files now collects metadata from ORC files and can perform predicate pushdown on metadata
  • Progress bar when executing queries
  • Added ability to retry tasks after out-of-memory errors
  • Added ability to get maximum gpu memory used

Improvements

  • Improved support for concurrent queries
  • Improvements to query execution logs
  • Added/improved communication logs
  • Added ability to disable logs
  • Improved storage plugin output messages
  • Improved support for creating tables from JSON files

Bug Fixes

  • Fixed distribution so that it evenly distributes data loading based on rowgroups
  • Fixed cython exception handling
  • Support FileSystems (GS, S3) when the file extension is not provided
  • Fixed issue when creating tables from a local dir relative path
  • Misc bug fixes

Codebase improvements

  • Code base clean up, improved code organization and refactoring
  • No longer depending on gtest for runtime
  • Reduced number of compilation warnings

v0.17.0

3 years ago

New SQL Functions

The following SQL commands are now supported:

  • TO_DATE / TO_TIMESTAMP
  • DAYOFWEEK
  • TRIM / LTRIM / RTRIM
  • LEFT / RIGHT
  • UPPER / LOWER
  • REPLACE
  • REVERSE

New Features

  • New communications architecture with support for both TCP and UCX (UCX support is in beta)
  • Allow creating tables from compressed delimited text files
  • Allow creating tables from a Hive-partitioned folder structure, where BlazingSQL will infer columns and types.
  • Added powerPC building script and instructions
  • Added local logging directory option to BlazingContext to help resolve logging file permission issues
  • Added option to read csv files in chunks
  • Logs are now configurable to have max size and be rotated
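
To make the Hive-partitioned folder item above more concrete, the sketch below infers partition columns and simple types from key=value path segments. It is an illustration only and not how BlazingSQL implements inference; the function names, the example paths, and the int/float/str type set are assumptions.

```python
def parse_hive_partitions(path):
    """Extract key=value folder segments from a path, in order."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def infer_type(values):
    """Pick the narrowest of int, float, str that fits every value."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def infer_partition_schema(paths):
    """Collect partition values per column across all paths, then infer each type."""
    columns = {}
    for path in paths:
        for key, value in parse_hive_partitions(path).items():
            columns.setdefault(key, []).append(value)
    return {key: infer_type(values) for key, values in columns.items()}


paths = [
    "/data/sales/year=2019/region=us/part-0.parquet",
    "/data/sales/year=2020/region=eu/part-0.parquet",
]
schema = infer_partition_schema(paths)
print(schema)
```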

Improvements

  • Added Apache Calcite rule for window functions. (Window functions not supported yet)
  • Add validation for the kwargs when BlazingContext.create_table API is called
  • Added validation for s3 buckets
  • Added scheduler file support for e2e testing framework
  • Improved how sampling is done for ORDER BY
  • Several changes to keep up with cuDF API changes
  • Remove temp files when an error occurs
  • Added new end-to-end tests
  • Added new unit tests
  • Improved contribution documentation
  • Code refactoring and removing dead or duplicate code

Improvements in error logging

  • Improvement to error messaging when validating any GCP bucket
  • Added error logging in DataSourceSequence
  • Showing an appropriate error to indicate that we don't support opening directories with wildcards
  • Showing an appropriate error for invalid or unsupported expressions on the logical plan

Changes or improvements in technology stack or CI

  • Added output compile json option for cppcheck
  • Bump junit from 4.12 to 4.13.1 in /algebra
  • Improved gpuCI scripts
  • Removed need to specify cuda version via a label for conda packages
  • Fixed cmake version to be 3.18.4
  • Fix SSL errors for conda

Bug Fixes

  • Fixed issue when loading parquet files with local_files=True
  • Fixed logging directory setup
  • Fixed issues with config_options
  • Fixed issue in float columns when parsing parquet metadata
  • Fixed bug in MergeAggregations when single node has multiple batches
  • Fix graph thread pool hang when exception is thrown
  • Fix ignoring headers when multiple CSV files were provided
  • Fix column_names (table) to always be a list of strings
  • Fixed literal type inference for integers

Deprecated features

  • Deprecated bc.partition

v0.16.0

3 years ago

Improvements

  • Activate End-to-end test result validation for GPU_CI.
  • Add capacity to set the transport memory
  • Update conda recipe, remove cxx11 abi from cmake
  • Use a single initialize() function at startup and add allocation-related logs
  • Make it possible to read system environment variables to set up config_option for BlazingContext
  • Update TPCH queries for end to end tests: converting implicit joins into explicit joins
  • Removing cudf source code dependency as some cudf utilities headers were exposed
  • BLAZING_CACHE_DIRECTORY can now be set manually

Bug Fixes

  • Fixed issue due to cudf orc api change
  • Fixed issue parsing fixed width string literals
  • Fixed issue with hive string columns
  • Fixed issue due to an rmm include
  • Fixed build issues with latest rmm 0.16 and columnBasisTest due to deprecated drop_column() function
  • Fix metadata mismatch due to parsedMetadata, caused by parquet files that had only nulls in certain columns for only some files
  • Removed workaround for parquet read schema
  • Fixed issue caused by creating tables with multiple csv files and having BSQL infer the datatypes and having a dtypes mismatch
  • Avoid reading _metadata files
  • Fixed issues with parsers, in particular ORC parser was misbehaving
  • Fixed issue with logging directories in distributed environments
  • Pinned google cloud version to 1.16
  • Partial revert of some changes on parquet rowgroups flow with local_files=True
  • Fixed issue when loading paths with wildcards
  • Fixed issue with concat_all in concatenating cache
  • Fix arrow and spdlog compilation issues
  • Fixed intra-query memory leak in joins
  • Fixed crash when loading an empty folder
  • Fixed issue where parseSchemaPython could throw exceptions

v0.15.0

3 years ago

New Features:

  • Added a memory monitor for better memory management for out of core processing
  • Added list_tables() and describe_table() functions
  • Added support for constant expressions evaluation by Calcite
  • Added support for cross join
  • Added rand() and support for running unary operations on literals
  • Added get_free_memory() function

Improvements

Performance improvements:

  • Implemented Unordered pull from cache to help performance
  • Concatenating cache improvement and replacing PartwiseJoin::load_set with a concatenating cache
  • Adding max kernel num threads pool
  • Added new separate thresh for concat cache

Stability improvements:

  • Added checks for concatenation to prevent String overflow
  • Added nogil statements for pure C functions in Cython
  • Round-robin dask workers on single-GPU queries
  • Reraising query errors in context.py
  • Implemented using threadpool for outgoing messages
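
The round-robin item above can be pictured as cycling through the available workers so that consecutive single-GPU queries land on different ones. This is a minimal sketch of the general technique, not BlazingSQL's scheduler; the function name and the worker labels are assumptions.

```python
from itertools import cycle


def assign_round_robin(queries, workers):
    """Pair each query with the next worker, wrapping around when workers run out."""
    worker_cycle = cycle(workers)
    return [(query, next(worker_cycle)) for query in queries]


assignments = assign_round_robin(["q1", "q2", "q3"], ["worker-0", "worker-1"])
print(assignments)
```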

Documentation improvements:

  • Added exhale to generate doxygen for sphinx docs
  • Added Sphinx based code architecture documentation
  • Added doxygen comments to CacheMachine.h
  • Added more documentation about memory management
  • Updated readme
  • Added doxygen comments to some kernels and the batch processing

Building improvements:

  • Updated Calcite to the most recent version 1.23
  • Added check for CUDF_HOME to allow build to use an existing prebuilt cudf source tree
  • Python/Cython check code style
  • Make AWS and GCS optional

Logging improvements:

  • Logging level (flush_on) is now configurable
  • Set log_level when using LOGGING_LEVEL param

Testing improvements:

  • Added unit tests on Calcite to check how logical plans are affected when rulesets are updated
  • Updated set of TPCH queries on the E2E tests
  • Added initial set of unit tests for WaitingQueue and nullptr checks around spdlog calls
  • Add unit test for Project kernel

Other improvements:

  • Removed a lot of dead code from the codebase
  • Replace random_generator with cudf::sample
  • Adding extern C for include files
  • Use default client and network interface from Dask. BlazingSQL should now be able to infer the network interface.
  • Updated the GPUManager functions
  • Handle exceptions from pool_threads

Bug Fixes

  • Fixed various issues due to updates to cudf
  • Fixed issue with Hive partitions when doing SELECT *
  • Normalize columns before distribution in JoinPartitionKernel
  • Fixed issue with hive partitions base folder
  • Fix interops operators output types
  • Fixed issue when the algebra plan was provided as a one-line logical plan
  • Fix issue related to Hive metadata
  • Remove temp files from data cached to disk
  • Fix when checking only Limit and Scan Kernels
  • Loading one file at a time (LimitKernel and ScanKernel)
  • Fixed small issue with hive types conversion
  • Fix for literal cast
  • Fixed issue with start and length of substring being different types
  • Fixed issue on logical plans when there is an EXISTS clause
  • Fixed issue with casting string to string
  • Fixed issue with getting table scan info
  • Fixed row_groups issue in ParquetParser.cpp
  • Fixed issue with some constant expressions not evaluated by calcite
  • Fixed issue with log directory creation in a distributed environment
  • Fixed issue where we were including testing hpp in our code
  • Fixed optimization regression on the select count(*) case
  • Fixed issue caused by using new arrow_io_source
  • Fixed e2e string comparison
  • Fixed random segfault issue in parser
  • Fixed issue with column names on sample function
  • Introduced config param for max orderby samples and fixed issue with oversampling in ORDER BY

v0.14.0

3 years ago

New Features:

  • New execution architecture, supporting executing queries on data that does not fit in the GPU. The new architecture features the following:

    • The execution model is an acyclic graph of execution nodes with a cache in between execution nodes.
    • Each execution node operates independently on batches of data, allowing it to process steps in parallel as much as possible instead of sequentially.
    • Each cache between every execution step can hold the data in GPU, in system memory or on disk.
    • Has support for multi-partition dask.cudf.DataFrame result set outputs.
  • Added ability to set configuration options
  • Added support for using NULL as a literal value
  • Implemented CHAR_LENGTH function
  • Added ability to specify region for S3 buckets
  • Added type normalization for UNION ALL
  • Added support for MinIO Storage
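
The execution model described in the first item above (execution nodes connected by caches, each node working on batches independently) can be sketched with threads and queues. This is a simplified illustration and not the engine's code: the graph here is a linear chain, the caches are in-memory queues only (no spilling to system memory or disk), and all names are assumptions.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the batch stream


def run_node(transform, inbox, outbox):
    """Pull batches from the input cache, transform each one, push it to the output cache."""
    while True:
        batch = inbox.get()
        if batch is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(transform(batch))


def run_pipeline(batches, transforms):
    """Run one thread per node, with a queue acting as the cache between nodes."""
    caches = [queue.Queue() for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=run_node, args=(fn, caches[i], caches[i + 1]))
        for i, fn in enumerate(transforms)
    ]
    for thread in threads:
        thread.start()
    for batch in batches:
        caches[0].put(batch)
    caches[0].put(SENTINEL)

    results = []
    while True:
        out = caches[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


def keep_even(batch):
    return [x for x in batch if x % 2 == 0]


def double(batch):
    return [2 * x for x in batch]


output = run_pipeline([[1, 2, 3], [4, 5, 6]], [keep_even, double])
print(output)
```

Because each node runs in its own thread, the second node can start on the first batch while the first node is still filtering the second one.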

Improvements:

  • Improved support for CAST function to include TINYINT and SMALLINT
  • Handle behavior when the optimized plan contains a LogicalValues
  • Improvements to exception handling
  • Support modern compilers (>= g++-7.x)
  • Improved logging now uses spdlog
  • Adding event logging
  • BlazingSQL engine no longer needs to concatenate dask.cudf.DataFrame partitions prior to running a query on a dask.cudf.DataFrame table
  • Improved expression parser, including support for expression trees of unlimited size.
  • Optimized data loading for queries of the type: SELECT * FROM table LIMIT N
  • Added built in end to end testing framework
  • Added logging to condition variables that are waiting too long

Bug Fixes:

  • Fixed bug in size estimation for tables before joins
  • Fixed issue with excessive thread creation in communication
  • Fixed bug in expression parsing for joins
  • Fixed bug caused by sharing data loaders when a query has one table more than once
  • Fixed Hive file format inference

v0.13.0

4 years ago

New Features:

  • Support for AVG in distributed mode
  • Added ability to use existing memory allocator
  • Implemented unify_partitions function for preparing dask_cudf DataFrames prior to creating BlazingSQL tables
  • Implemented ROUND function
  • Implemented support for CASE with strings

Improvements:

  • Local files can be referenced with relative file paths when creating tables.
  • Automatic casting for joins on similar data types (e.g. joining an int32 with an int64 will cast the int32 to an int64)
  • Updated AWS SDK version
  • More changes related to the migration from libcudf to libcudf++
  • Added docstrings to main python APIs
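
The automatic casting item above amounts to choosing a common type for the two join keys. The sketch below shows one simple rule, widening to the larger signed integer; it is an assumption for illustration, covers only the integer types listed, and is not BlazingSQL's actual promotion logic.

```python
# Bit widths of the signed integer types handled by this sketch.
INTEGER_WIDTHS = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def common_join_type(left, right):
    """Return the type both join keys should be cast to before joining."""
    if left == right:
        return left
    if left in INTEGER_WIDTHS and right in INTEGER_WIDTHS:
        # Widen to whichever integer type has more bits.
        return left if INTEGER_WIDTHS[left] >= INTEGER_WIDTHS[right] else right
    raise TypeError(f"no automatic cast between {left} and {right}")


print(common_join_type("int32", "int64"))
```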

Bug Fixes:

  • Fixed bug when joining against an empty DataFrame
  • Fixed bug with GROUP BY ignoring nulls
  • Fixed various issues related to creating tables from dask_cudf DataFrames
  • Fixed various bugs with creating tables from Hive Cursor
  • Fixed bugs related to new libcudf++ functionality
  • Fixed bug in LIMIT statement
  • Fixed bug in timestamp processing
  • Fixed bug in SUM0 aggregation (which enables COUNT DISTINCT)
  • Fixed bug when querying single file with multiple workers
  • Fixed bug with distributed COUNT aggregation without GROUP BY
  • Fixed bug when creating and querying a table with several Apache Parquet files and one is empty
  • Fixed bug with joins with nulls in the join key columns

Other:

  • Temporarily deprecated JSON reader. In the meantime we recommend using: cudf.read_json

v0.12.0

4 years ago

New Features:

  • Ability to skip reading and processing row groups when querying Apache Parquet files by applying predicates on metadata
  • Ability to do SELECT COUNT (DISTINCT column)
  • Ability to use and set Pool memory allocator for increased performance and/or managed (UVM) allocator which provides robustness against running out of GPU memory
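
Skipping row groups by applying predicates on metadata, as in the first item above, usually relies on per-row-group minimum and maximum statistics. The sketch below shows the general idea for a range predicate on one column; the RowGroupStats class and the numbers are made up, and this is not BlazingSQL's implementation.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    index: int
    min_value: int
    max_value: int


def row_groups_to_read(stats, lower=None, upper=None):
    """Keep row groups whose [min, max] range can overlap lower <= value <= upper."""
    selected = []
    for rg in stats:
        if lower is not None and rg.max_value < lower:
            continue  # every value is below the range
        if upper is not None and rg.min_value > upper:
            continue  # every value is above the range
        selected.append(rg.index)
    return selected


stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 201, 300),
]
# Predicate: value BETWEEN 150 AND 250, so row group 0 never needs to be read.
selected = row_groups_to_read(stats, lower=150, upper=250)
print(selected)
```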

Improvements:

  • New building scripts thanks to @dillon-cullinan

Bug Fixes:

  • Fixed various bugs in the Apache Arrow provider
  • Fixed bug with incorrect data type in CASE statements
  • Fixed bug and memory leak in distributed joins
  • Fixed bug in usage of Google Cloud Storage plugin