StarRocks Versions

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

3.2.2

3 months ago

Release date: December 30, 2023

Bug Fixes

Fixed the following issue:

  • When StarRocks is upgraded from v3.1.2 or earlier to v3.2, FEs may fail to restart. #38172

3.2.1

4 months ago

Release date: December 21, 2023

New Features

Data Lake Analytics

  • Supports reading Hive Catalog tables and file external tables in Avro, SequenceFile, and RCFile formats through Java Native Interface (JNI).

Materialized View

  • Added a view object_dependencies to the database sys. It contains the lineage information of asynchronous materialized views. #35060
  • Supports creating synchronous materialized views with the WHERE clause.
  • Supports partition-level incremental refresh for asynchronous materialized views created upon Iceberg catalogs.
  • [Preview] Supports creating asynchronous materialized views based on tables in a Paimon catalog with partition-level refresh.

Query and SQL functions

Monitoring and alerts

  • Added a new metric max_tablet_rowset_num for setting the maximum allowed number of rowsets. This metric helps detect possible compaction issues and thus reduces the occurrences of the error "too many versions". #36539

Parameter changes

  • A new BE configuration item enable_stream_load_verbose_log is added. The default value is false. With this parameter set to true, StarRocks can record the HTTP requests and responses for Stream Load jobs, making troubleshooting easier. #36113

Improvements

  • Upgraded the default GC algorithm in JDK8 to G1. #37268
  • A new value option GROUP_CONCAT_LEGACY is added to the session variable sql_mode to provide compatibility with the implementation logic of the group_concat function in versions earlier than v2.5. #36150
  • The authentication information aws.s3.access_key and aws.s3.access_secret for AWS S3 in Broker Load jobs are hidden in audit logs. #36571
  • The be_tablets view in the information_schema database provides a new field INDEX_DISK, which records the disk usage (measured in bytes) of persistent indexes. #35615
  • The result returned by the SHOW ROUTINE LOAD statement provides a new field OtherMsg, which shows information about the last failed task. #35806

Bug Fixes

Fixed the following issues:

  • The BEs crash if users create persistent indexes in the event of data corruption. #30841
  • The array_distinct function occasionally causes the BEs to crash. #36377
  • After the DISTINCT window operator pushdown feature is enabled, errors are reported if SELECT DISTINCT operations are performed on the complex expressions of the columns computed by window functions. #36357
  • Some S3-compatible object storage returns duplicate files, causing the BEs to crash. #36103

3.2.0

4 months ago

Release date: December 1, 2023

New Features

Shared-data cluster

  • Supports persisting indexes of Primary Key tables to local disks.
  • Supports even distribution of Data Cache among multiple local disks.

Materialized View

Asynchronous materialized view

  • The Query Dump file can include information of asynchronous materialized views.
  • The Spill to Disk feature is enabled by default for the refresh tasks of asynchronous materialized views, reducing memory consumption.

Data Lake Analytics

  • Supports creating and dropping databases and managed tables in Hive catalogs, and supports exporting data to Hive's managed tables using INSERT or INSERT OVERWRITE.
  • Supports Unified Catalog, with which users can access different table formats (Hive, Iceberg, Hudi, and Delta Lake) that share a common metastore like Hive metastore or AWS Glue.
  • Supports collecting statistics of Hive and Iceberg tables using ANALYZE TABLE, and storing the statistics in StarRocks, thus facilitating optimization of query plans and accelerating subsequent queries.
  • Supports Information Schema for external tables, providing additional convenience for interactions between external systems (such as BI tools) and StarRocks.

Storage engine, data ingestion, and export

  • Added the following features of loading with the table function FILES():
    • Loading Parquet and ORC format data from Azure or GCP.
    • Extracting the value of a key/value pair from the file path as the value of a column using the parameter columns_from_path.
    • Loading complex data types including ARRAY, JSON, MAP, and STRUCT.
  • Supports unloading data from StarRocks to Parquet-formatted files stored in AWS S3 or HDFS by using INSERT INTO FILES. For detailed instructions, see Unload data using INSERT INTO FILES.
  • Supports manual optimization of table structure and data distribution strategy used in an existing table to optimize the query and loading performance. You can set a new bucket key, bucket number, or sort key for a table. You can also set a different bucket number for specific partitions.
  • Supports continuous data loading from AWS S3 or HDFS using the PIPE method.
    • When PIPE detects new or modified files in a remote storage directory, it can automatically load the new or modified data into the destination table in StarRocks. While loading data, PIPE automatically splits a large loading task into smaller, serialized tasks, enhancing stability in large-scale data ingestion scenarios and reducing the cost of error retries.
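The task-splitting idea described for PIPE can be pictured with a small Python sketch. This is only an illustration of splitting a large load into smaller tasks that run one after another, not StarRocks' implementation; the names split_into_tasks and run_serialized and the byte limit max_task_bytes are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

RemoteFile = Tuple[str, int]  # (path, size in bytes); hypothetical representation


def split_into_tasks(files: Iterable[RemoteFile],
                     max_task_bytes: int) -> Iterator[List[RemoteFile]]:
    """Group files into batches whose total size does not exceed max_task_bytes.

    A single file larger than the limit ends up in a batch of its own.
    """
    batch: List[RemoteFile] = []
    batch_bytes = 0
    for path, size in files:
        # Close the current batch before it would grow past the limit.
        if batch and batch_bytes + size > max_task_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append((path, size))
        batch_bytes += size
    if batch:
        yield batch


def run_serialized(files: Iterable[RemoteFile],
                   max_task_bytes: int,
                   load_batch: Callable[[List[RemoteFile]], None]) -> int:
    """Run the batches one at a time and return how many completed.

    If load_batch raises, earlier batches are already loaded, so only the
    failing batch (and the ones after it) would need to be retried.
    """
    completed = 0
    for batch in split_into_tasks(files, max_task_bytes):
        load_batch(batch)
        completed += 1
    return completed
```

Because each batch is small and independent, a failure costs one batch of rework rather than the whole load, which is the retry-cost benefit the note above refers to.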

Query

  • Supports HTTP SQL API, enabling users to access StarRocks data via HTTP and execute SELECT, SHOW, EXPLAIN, or KILL operations.
  • Supports Runtime Profile and text-based Profile analysis commands (SHOW PROFILELIST, ANALYZE PROFILE, EXPLAIN ANALYZE) to allow users to directly analyze profiles via MySQL clients, facilitating bottleneck identification and discovery of optimization opportunities.
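A request to the HTTP SQL API might be assembled as in the following Python sketch. The endpoint path, the JSON body shape, the use of HTTP Basic authentication, and the port 8030 in the usage note are assumptions that should be checked against the HTTP SQL API documentation; build_sql_request is a hypothetical helper, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_sql_request(fe_host: str, http_port: int, catalog: str, database: str,
                      sql: str, user: str, password: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one SQL statement."""
    # Assumed endpoint layout; verify against the HTTP SQL API documentation.
    url = (f"http://{fe_host}:{http_port}"
           f"/api/v1/catalogs/{catalog}/databases/{database}/sql")
    # Assumed body shape: a JSON object with a single "query" field.
    body = json.dumps({"query": sql}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

To send it, pass the result to urllib.request.urlopen, for example with build_sql_request("127.0.0.1", 8030, "default_catalog", "test_db", "SHOW TABLES;", "root"). Per the note above, only SELECT, SHOW, EXPLAIN, and KILL statements are supported through this interface.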

SQL reference

Added the following functions:

  • String functions: substring_index, url_extract_parameter, url_encode, url_decode, and translate
  • Date functions: dayofweek_iso, week_iso, quarters_add, quarters_sub, milliseconds_add, milliseconds_sub, date_diff, jodatime_format, str_to_jodatime, to_iso8601, to_tera_date, and to_tera_timestamp
  • Pattern matching function: regexp_extract_all
  • Hash function: xx_hash3_64
  • Aggregate functions: approx_top_k
  • Window functions: cume_dist, percent_rank, and session_number
  • Utility functions: dict_mapping and get_query_profile

Privileges and security

StarRocks supports access control through Apache Ranger, providing a higher level of data security and allowing the reuse of existing services of external data sources. After integrating with Apache Ranger, StarRocks enables the following access control methods:

  • When accessing internal tables, external tables, or other objects in StarRocks, access control can be enforced based on the access policies configured for the StarRocks Service in Ranger.
  • When accessing an external catalog, access control can also leverage the corresponding Ranger service of the original data source (such as Hive Service) to control access (currently, access control for exporting data to Hive is not yet supported).

For more information, see Manage permissions with Apache Ranger.

Improvements

Data Lake Analytics

  • Optimized ORC Reader:
    • Optimized the ORC Column Reader, resulting in nearly a two-fold performance improvement for VARCHAR and CHAR data reading.
    • Optimized the decompression performance of ORC files in Zlib compression format.
  • Optimized Parquet Reader:
    • Supports adaptive I/O merging, allowing adaptive merging of columns with and without predicates based on filtering effects, thus reducing I/O.
    • Optimized Dict Filter for faster predicate rewriting. Supports STRUCT sub-columns, and on-demand dictionary column decoding.
    • Optimized Dict Decode performance.
    • Optimized late materialization performance.
    • Supports caching file footers to avoid repeated computation overhead.
    • Supports decompression of Parquet files in lzo compression format.
  • Optimized CSV Reader:
    • Optimized the Reader performance.
    • Supports decompression of CSV files in Snappy and lzo compression formats.
  • Optimized the performance of the count calculation.
  • Optimized Iceberg Catalog capabilities:
    • Supports collecting column statistics from Manifest files to accelerate queries.
    • Supports collecting NDV (number of distinct values) from Puffin files to accelerate queries.
    • Supports partition pruning.
    • Reduced Iceberg metadata memory consumption to enhance stability in scenarios with large metadata volume or high query concurrency.

Materialized View

Asynchronous materialized view

  • Supports automatic refresh for an asynchronous materialized view created upon views or materialized views when schema changes occur on the views, materialized views, or their base tables.
  • Data consistency:
    • Added the property query_rewrite_consistency for asynchronous materialized view creation. This property defines the query rewrite rules based on the consistency check.
    • Added the property force_external_table_query_rewrite for external catalog-based asynchronous materialized view creation. This property defines whether to allow force query rewrite for asynchronous materialized views created upon external catalogs.
    • For detailed information, see CREATE MATERIALIZED VIEW.
  • Added a consistency check for materialized views' partitioning key.
    • When users create an asynchronous materialized view with window functions that include a PARTITION BY expression, the partitioning column of the window function must match that of the materialized view.

Storage engine, data ingestion, and export

  • Optimized the persistent index for Primary Key tables by improving memory usage logic while reducing I/O read and write amplification. #24875 #27577 #28769
  • Supports data re-distribution across local disks for Primary Key tables.
  • Partitioned tables support automatic cooldown based on the partition time range and cooldown time. Compared to the original cooldown logic, this makes hot and cold data management at the partition level more convenient. For more information, see Specify initial storage medium, automatic storage cooldown time, replica number.
  • The Publish phase of a load job that writes data into a Primary Key table is changed from asynchronous mode to synchronous mode. As such, the data loaded can be queried immediately after the load job finishes. For more information, see enable_sync_publish.
  • Supports Fast Schema Evolution, which is controlled by the table property fast_schema_evolution. After this feature is enabled, the execution efficiency of adding or dropping columns is significantly improved. This feature is disabled by default (the default value is false). You cannot modify this property for existing tables using ALTER TABLE.
  • Supports dynamically adjusting the number of tablets to create according to cluster information and the size of the data for Duplicate Key tables created with the Random Bucketing strategy.

Query

  • Optimized StarRocks' compatibility with Metabase and Superset. Supports integrating them with external catalogs.

SQL Reference

  • array_agg supports the keyword DISTINCT.
  • INSERT, UPDATE, and DELETE operations now support SET_VAR. #35283

Others

  • Added the session variable large_decimal_underlying_type = "panic"|"double"|"decimal" to set the rules to deal with DECIMAL type overflow. panic indicates returning an error immediately, double indicates converting the data to DOUBLE type, and decimal indicates converting the data to DECIMAL(38,s).
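The three values of large_decimal_underlying_type can be summarized as a small decision function. The Python sketch below is only a model of the rule as stated above, not StarRocks code; resolve_result_type is a hypothetical name, and treating "overflow" as "the result would need more than 38 digits of precision" is an assumption based on the DECIMAL(38,s) target mentioned in the note.

```python
MAX_DECIMAL_PRECISION = 38  # precision of the DECIMAL(38,s) fallback type


def resolve_result_type(precision: int, scale: int, mode: str) -> str:
    """Return the result type for a DECIMAL expression needing `precision` digits.

    `mode` mirrors the values of large_decimal_underlying_type.
    """
    if precision <= MAX_DECIMAL_PRECISION:
        # No overflow: the variable has no effect.
        return f"DECIMAL({precision},{scale})"
    if mode == "panic":
        # Fail immediately instead of silently losing precision.
        raise OverflowError(
            f"DECIMAL({precision},{scale}) exceeds precision {MAX_DECIMAL_PRECISION}")
    if mode == "double":
        # Trade exactness for range.
        return "DOUBLE"
    if mode == "decimal":
        # Keep the scale and cap the precision.
        return f"DECIMAL({MAX_DECIMAL_PRECISION},{scale})"
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, resolve_result_type(45, 6, "decimal") yields DECIMAL(38,6), while the same call with "panic" raises an error.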

Developer tools

  • Supports Trace Query Profile for asynchronous materialized views, which can be used to analyze their transparent rewrite.

Compatibility Changes

Upgrade Notes

  • Optimization on Random Bucketing is disabled by default. To enable it, you need to add the property bucket_size when creating tables. This allows the system to dynamically adjust the number of tablets based on cluster information and the size of loaded data. Please note that once this optimization is enabled, if you need to roll back your cluster to v3.1 or earlier, you must delete tables with this optimization enabled and manually execute a metadata checkpoint (by executing ALTER SYSTEM CREATE IMAGE). Otherwise, the rollback will fail.
  • Starting from v3.2.0, StarRocks has disabled non-Pipeline queries. Therefore, before upgrading your cluster to v3.2, you need to globally enable the Pipeline engine (by adding the configuration enable_pipeline_engine=true in the FE configuration file fe.conf). Failure to do so will result in errors for non-Pipeline queries.
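Before upgrading, the Pipeline engine setting mentioned above could be checked with a short script. The Python sketch below assumes fe.conf uses plain key = value lines with # comments; parse_conf and pipeline_engine_enabled are hypothetical helper names, and the script only inspects text, it does not change any configuration.

```python
from typing import Dict


def parse_conf(text: str) -> Dict[str, str]:
    """Parse key = value lines, skipping blank lines and # comments."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def pipeline_engine_enabled(fe_conf_text: str) -> bool:
    """Return True only if enable_pipeline_engine is explicitly set to true."""
    value = parse_conf(fe_conf_text).get("enable_pipeline_engine", "")
    return value.lower() == "true"
```

A typical use would be to read the file with open("fe.conf").read() and stop the upgrade procedure if pipeline_engine_enabled returns False.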

Behavior Changes

To be updated.

Parameters

FE Configuration

  • Added the following FE configuration items:
    • catalog_metadata_cache_size
    • enable_backup_materialized_view
    • enable_colocate_mv_index
    • enable_fast_schema_evolution
    • json_file_size_limit
    • lake_enable_ingest_slowdown
    • lake_ingest_slowdown_threshold
    • lake_ingest_slowdown_ratio
    • lake_compaction_score_upper_bound
    • mv_auto_analyze_async
    • primary_key_disk_schedule_time
    • statistic_auto_collect_small_table_rows
    • stream_load_task_keep_max_num
    • stream_load_task_keep_max_second
  • Removed FE configuration item enable_pipeline_load.
  • Default value modifications:
    • The default value of enable_sync_publish is changed from false to true.
    • The default value of enable_persistent_index_by_default is changed from false to true.

BE Configuration

  • Data Cache-related configuration changes.

    • Added datacache_enable to replace block_cache_enable.
    • Added datacache_mem_size to replace block_cache_mem_size.
    • Added datacache_disk_size to replace block_cache_disk_size.
    • Added datacache_disk_path to replace block_cache_disk_path.
    • Added datacache_meta_path to replace block_cache_meta_path.
    • Added datacache_block_size to replace block_cache_block_size.
    • Added datacache_checksum_enable to replace block_cache_checksum_enable.
    • Added datacache_direct_io_enable to replace block_cache_direct_io_enable.
    • Added datacache_max_concurrent_inserts to replace block_cache_max_concurrent_inserts.
    • Added datacache_max_flying_memory_mb.
    • Added datacache_engine to replace block_cache_engine.
    • Removed block_cache_max_parcel_memory_mb.
    • Removed block_cache_report_stats.
    • Removed block_cache_lru_insertion_point.

    After Block Cache was renamed to Data Cache, StarRocks introduced a new set of BE parameters prefixed with datacache to replace the original parameters prefixed with block_cache. After you upgrade to v3.2, the original parameters remain effective. Once enabled, the new parameters override the original ones. Mixing new and original parameters is not supported, because it may cause some configurations not to take effect. StarRocks plans to deprecate the original parameters with the block_cache prefix in the future, so we recommend that you use the new parameters with the datacache prefix.

  • Added the following BE configuration items:

    • spill_max_dir_bytes_ratio
    • streaming_agg_limited_memory_size
    • streaming_agg_chunk_buffer_size
  • Removed the following BE configuration items:

    • Dynamic parameter tc_use_memory_min
    • Dynamic parameter tc_free_memory_rate
    • Dynamic parameter tc_gc_period
    • Static parameter tc_max_total_thread_cache_byte
  • Default value modifications:

    • The default value of disable_column_pool is changed from false to true.
    • The default value of txn_commit_rpc_timeout_ms is changed from 20000 to 60000.
    • The default value of thrift_port is changed from 9060 to 0.
    • The default value of enable_load_colocate_mv is changed from false to true.
    • The default value of enable_pindex_minor_compaction is changed from false to true.
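The Data Cache renames listed under BE Configuration can be applied mechanically when preparing a configuration for v3.2. The Python sketch below encodes only the renames and removals stated in that list; migrate_cache_params is a hypothetical helper, and representing the BE configuration as a dictionary of strings is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

# Old name -> new name, as listed in the Data Cache-related changes.
RENAMED = {
    "block_cache_enable": "datacache_enable",
    "block_cache_mem_size": "datacache_mem_size",
    "block_cache_disk_size": "datacache_disk_size",
    "block_cache_disk_path": "datacache_disk_path",
    "block_cache_meta_path": "datacache_meta_path",
    "block_cache_block_size": "datacache_block_size",
    "block_cache_checksum_enable": "datacache_checksum_enable",
    "block_cache_direct_io_enable": "datacache_direct_io_enable",
    "block_cache_max_concurrent_inserts": "datacache_max_concurrent_inserts",
    "block_cache_engine": "datacache_engine",
}

# Parameters removed without a replacement.
REMOVED = {
    "block_cache_max_parcel_memory_mb",
    "block_cache_report_stats",
    "block_cache_lru_insertion_point",
}


def migrate_cache_params(conf: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (migrated config, names of removed parameters that were dropped)."""
    migrated: Dict[str, str] = {}
    dropped: List[str] = []
    for key, value in conf.items():
        if key in REMOVED:
            dropped.append(key)
        elif key in RENAMED:
            # Keep an explicitly set datacache_* value if one is present.
            migrated.setdefault(RENAMED[key], value)
        else:
            migrated[key] = value
    return migrated, dropped
```

When both an old and a new name are present, the new datacache_ value is kept regardless of ordering, which mirrors the override behavior described above and leaves the result free of mixed prefixes.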

System Variables

  • Added the following session variables:
    • enable_per_bucket_optmize
    • enable_write_hive_external_table
    • hive_temp_staging_dir
    • spill_revocable_max_bytes
    • thrift_plan_protocol
  • Removed the following session variables:
    • enable_pipeline_query_statistic
    • enable_deliver_batch_fragments
  • Renamed the following session variables:
    • enable_scan_block_cache is renamed as enable_scan_datacache.
    • enable_populate_block_cache is renamed as enable_populate_datacache.

Reserved Keywords

Added reserved keywords OPTIMIZE and PREPARE.

Bug Fixes

Fixed the following issues:

  • BEs crash when libcurl is invoked. #31667
  • Schema Change may fail if it takes an excessively long period of time, because the specified tablet version is handled by garbage collection. #31376
  • Failed to access the Parquet files in MinIO via file external tables. #29873
  • The ARRAY, MAP, and STRUCT type columns are not correctly displayed in information_schema.columns. #33431
  • An error is reported if specific path formats are used during data loading via Broker Load: msg:Fail to parse columnsFromPath, expected: [rec_dt]. #32720
  • DATA_TYPE and COLUMN_TYPE for BINARY or VARBINARY data types are displayed as unknown in the information_schema.columns view. #32678
  • Complex queries that involve many unions, expressions, and SELECT columns can result in a sudden surge in the bandwidth or CPU usage within an FE node.
  • The refresh of asynchronous materialized view may occasionally encounter deadlock. #35736

2.5.17

4 months ago

Release date: December 19, 2023

New Features

  • Added a new metric max_tablet_rowset_num for setting the maximum allowed number of rowsets. This metric helps detect possible compaction issues and thus reduces the occurrences of the error "too many versions". #36539
  • Added the subdivide_bitmap function. #35817

Improvements

  • The result returned by the SHOW ROUTINE LOAD statement provides a new field OtherMsg, which shows information about the last failed task. #35806
  • The default retention period of trash files is changed to 1 day from the original 3 days. #37113
  • Optimized the performance of persistent index update when compaction is performed on all rowsets of a Primary Key table, which reduces disk read I/O. #36819
  • Optimized the logic used to compute compaction scores for Primary Key tables, so that these scores fall within a range more consistent with that of the other three table types. #36534
  • Queries on MySQL external tables and the external tables within JDBC catalogs support including keywords in the WHERE clause. #35917
  • Added the bitmap_from_binary function to Spark Load to support loading Binary data. #36050
  • The bRPC expiration time is shortened from 1 hour to the duration specified by the session variable query_timeout. This prevents query failures caused by RPC request expiration. #36778

Compatibility Changes

Parameters

  • A new BE configuration item enable_stream_load_verbose_log is added. The default value is false. With this parameter set to true, StarRocks can record the HTTP requests and responses for Stream Load jobs, making troubleshooting easier. #36113
  • The BE static parameter update_compaction_per_tablet_min_interval_seconds becomes mutable. #36819

Bug Fixes

Fixed the following issues:

  • Queries fail during hash joins, causing BEs to crash. #32219
  • The FE performance plunges after the FE configuration item enable_collect_query_detail_info is set to true. #35945
  • Errors may be thrown if large amounts of data are loaded into a Primary Key table with persistent index enabled. #34352
  • The starrocks_be process may exit unexpectedly when ./agentctl.sh stop be is used to stop a BE. #35108
  • The array_distinct function occasionally causes the BEs to crash. #36377
  • Deadlocks may occur when users refresh materialized views. #35736
  • In some scenarios, dynamic partitioning may encounter an error, which causes FE start failures. #36846

3.1.6

4 months ago

Release date: December 18, 2023

New Features

Parameter Changes

Improvements

Bug Fixes

Fixed the following issues:

2.5.16

4 months ago

Release date: December 1, 2023

Bug Fixes

Fixed the following issues:

2.5.15

4 months ago

Release date: November 29, 2023

Improvements

Compatibility Changes

Parameters

Bug Fixes

3.1.5

4 months ago

Release date: November 28, 2023

New Features

Bug Fixes

Fixed the following issues:

Compatibility Changes

Parameters

System Variables

  • Added a session variable cbo_decimal_cast_string_strict, which controls how the CBO converts data from the DECIMAL type to the STRING type. If this variable is set to true, the logic built in v2.5.x and later versions prevails and the system implements strict conversion (namely, the system truncates the generated string and fills 0s based on the scale length). If this variable is set to false, the logic built in versions earlier than v2.5.x prevails and the system processes all valid digits to generate a string. The default value is true. https://github.com/StarRocks/starrocks/pull/34208
  • Added a session variable cbo_eq_base_type, which specifies the data type used for data comparison between DECIMAL-type data and STRING-type data. The default value is VARCHAR, and DECIMAL is also a valid value. https://github.com/StarRocks/starrocks/pull/34208
  • Added a session variable big_query_profile_second_threshold. When the session variable enable_profile is set to false and the amount of time taken by a query exceeds the threshold specified by the big_query_profile_second_threshold variable, a profile is generated for that query. https://github.com/StarRocks/starrocks/pull/33825
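The two conversion behaviors selected by cbo_decimal_cast_string_strict can be illustrated with Python's decimal module. This is a sketch of one reading of the description above, not the CBO's actual code: decimal_to_string is a hypothetical name, and interpreting "processes all valid digits" as dropping trailing zeros is an assumption.

```python
from decimal import Decimal, ROUND_DOWN


def decimal_to_string(value: Decimal, scale: int, strict: bool = True) -> str:
    """Render a DECIMAL value with the given column scale as a string."""
    if strict:
        # Strict: truncate extra digits and pad with zeros to exactly `scale`
        # fractional digits.
        quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
        return format(value.quantize(quantum, rounding=ROUND_DOWN), "f")
    # Non-strict (assumed reading): keep only the significant digits.
    return format(value.normalize(), "f")
```

Under this reading, a value of 1.5 in a column of scale 2 becomes "1.50" in strict mode and "1.5" otherwise, which affects comparisons between DECIMAL-type and STRING-type data.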

3.2.0-rc01

5 months ago

Release date: November 15, 2023

New Features

Shared-data cluster

Data Lake Analytics

  • Supports creating and dropping databases and managed tables in Hive catalogs, and supports exporting data to Hive's managed tables using INSERT or INSERT OVERWRITE.
  • Supports Unified Catalog, with which users can access different table formats (Hive, Iceberg, Hudi, and Delta Lake) that share a common metastore like Hive metastore or AWS Glue.

Storage engine, data ingestion, and export

  • Added the following features of loading with the table function FILES():
    • Loading Parquet and ORC format data from Azure or GCP.
    • Extracting the value of a key/value pair from the file path as the value of a column using the parameter columns_from_path.
    • Loading complex data types including ARRAY, JSON, MAP, and STRUCT.
  • Supports the dict_mapping column property, which can significantly facilitate the loading process during the construction of a global dictionary, accelerating the exact COUNT DISTINCT calculation.
  • Supports unloading data from StarRocks to Parquet-formatted files stored in AWS S3 or HDFS by using INSERT INTO FILES. For detailed instructions, see Unload data using INSERT INTO FILES.
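The columns_from_path extraction mentioned in the FILES() items above can be pictured with a few lines of Python. The sketch assumes Hive-style key=value directory segments in the file path; extract_columns_from_path is a hypothetical helper for illustration and is not how StarRocks implements the parameter.

```python
from typing import Dict, Iterable


def extract_columns_from_path(path: str, wanted: Iterable[str]) -> Dict[str, str]:
    """Pull key=value pairs out of a file path and return the wanted keys."""
    wanted = list(wanted)
    found: Dict[str, str] = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            found[key] = value
    missing = [name for name in wanted if name not in found]
    if missing:
        # Fail loudly rather than load rows with an undefined column value.
        raise ValueError(f"path {path!r} has no value for: {missing}")
    return {name: found[name] for name in wanted}
```

For a path such as s3://bucket/sales/dt=2023-12-01/region=eu/part-0.parquet, requesting dt and region gives the values 2023-12-01 and eu, which would then populate those columns for every row loaded from that file.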

SQL reference

Added the following functions:

  • String functions: substring_index, url_extract_parameter, url_encode, url_decode, and translate
  • Date functions: dayofweek_iso, week_iso, quarters_add, quarters_sub, milliseconds_add, milliseconds_sub, date_diff, jodatime_format, str_to_jodatime, to_iso8601, to_tera_date, and to_tera_timestamp
  • Pattern matching function: regexp_extract_all
  • Hash function: xx_hash3_64
  • Aggregate functions: approx_top_k
  • Window functions: cume_dist, percent_rank, and session_number
  • Utility functions: dict_mapping and get_query_profile

Privileges and security

StarRocks supports access control through Apache Ranger, providing a higher level of data security and allowing the reuse of existing Ranger Service of external data sources. After integrating with Apache Ranger, StarRocks enables the following access control methods:

  • When accessing internal tables, external tables, or other objects in StarRocks, access control can be enforced based on the access policies configured for the StarRocks Service in Ranger.
  • When accessing an external catalog, access control can also leverage the corresponding Ranger service of the original data source (such as Hive Service) to control access (currently, access control for exporting data to Hive is not yet supported).

For more information, see Manage permissions with Apache Ranger.

Improvements

Materialized View

Asynchronous materialized view

  • Creation: Supports automatic refresh for an asynchronous materialized view created upon views or materialized views when schema changes occur on the views, materialized views, or their base tables.
  • Observability: Supports Query Dump for asynchronous materialized views.
  • The Spill to Disk feature is enabled by default for the refresh tasks of asynchronous materialized views, reducing memory consumption.
  • Data consistency:
    • Added the property query_rewrite_consistency for asynchronous materialized view creation. This property defines the query rewrite rules based on the consistency check.
    • Added the property force_external_table_query_rewrite for external catalog-based asynchronous materialized view creation. This property defines whether to allow force query rewrite for asynchronous materialized views created upon external catalogs. For detailed information, see CREATE MATERIALIZED VIEW.
  • Added a consistency check for materialized views' partitioning key. When users create an asynchronous materialized view with window functions that include a PARTITION BY expression, the partitioning column of the window function must match that of the materialized view.

Storage engine, data ingestion, and export

  • Optimized the persistent index for Primary Key tables by improving memory usage logic while reducing I/O read and write amplification. #24875 #27577 #28769
  • Supports data re-distribution across local disks for Primary Key tables.
  • Partitioned tables support automatic cooldown based on the partition time range and cooldown time. For detailed information, see Set initial storage medium and automatic storage cooldown time.
  • The Publish phase of a load job that writes data into a Primary Key table is changed from asynchronous mode to synchronous mode. As such, the data loaded can be queried immediately after the load job finishes. For detailed information, see enable_sync_publish.

Query

  • Optimized StarRocks' compatibility with Metabase and Superset. Supports integrating them with external catalogs.

SQL Reference

  • array_agg supports the keyword DISTINCT.

Developer tools

  • Supports Trace Query Profile for asynchronous materialized views, which can be used to analyze their transparent rewrite.

Compatibility Changes

Parameters

  • Added new parameters for Data Cache.

Bug Fixes

Fixed the following issues:

  • BEs crash when libcurl is invoked. #31667
  • Schema Change may fail if it takes an excessive period of time, because the specified tablet version is handled by garbage collection. #31376
  • Failed to access the Parquet files in MinIO or AWS S3 via file external tables. #29873
  • The ARRAY, MAP, and STRUCT type columns are not correctly displayed in information_schema.columns. #33431
  • DATA_TYPE and COLUMN_TYPE for BINARY or VARBINARY data types are displayed as unknown in the information_schema.columns view. #32678

2.5.14

5 months ago

Release date: November 14, 2023

Improvements

Bug Fixes

Fixed the following issues: