cuDF - GPU DataFrame Library

v24.04.01

1 week ago

🚨 Breaking Changes

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

  • Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
  • Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
  • [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
  • Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
  • Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
  • Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
  • Fix OOB read in inflate_kernel (#15309) @vuule
  • Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
  • Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
  • Fix Doxygen check (#15289) @KyleFromNVIDIA
  • Reintroduce PANDAS_GE_220 import (#15287) @wence-
  • Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
  • Fix Parquet decimal64 stats (#15281) @etseidl
  • Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
  • Work around compute-sanitizer memcheck bug (#15259) @davidwendt
  • Clean up hostdevice_vector and add more APIs (#15252) @ttnghia
  • Fix number of rows in randomly generated lists columns (#15248) @vuule
  • Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
  • Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
  • Fix accessing .columns by an external API (#15212) @galipremsagar
  • [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
  • Update labeler and codeowner configs for CMake files (#15208) @PointKernel
  • Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
  • Fix memcheck error in distinct inner join (#15164) @PointKernel
  • Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
  • Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
  • Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
  • Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
  • Remove const from range_window_bounds::_extent. (#15138) @mythrocks
  • DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
  • Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
  • Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
  • Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
  • Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
  • Add support for arrow large_string in cudf (#15093) @galipremsagar
  • Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
  • Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
  • Fix bugs in handling of delta encodings (#15075) @etseidl
  • Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
  • Eliminate duplicate allocation of nested string columns (#15061) @vuule
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axes (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
  • Unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix null literals being parsed as strings when mixed_types_as_string is enabled in the JSON reader (#14939) @karthikeyann
  • Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
  • Fix reading offset for data stream in ORC reader (#14911) @ttnghia
  • Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass
  • Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

  • Ignore DLManagedTensor in the docs build (#15392) @davidwendt
  • Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
  • Temporarily disable docs errors. (#15265) @bdice
  • Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
  • Fix broken link for developer guide (#15025) @sanjana098
  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Document how cuDF is pronounced (#14753) @pentschev
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
  • Use JNI pinned pool resource with cuIO (#15255) @abellina
  • Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
  • Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
  • [JNI] rmm based pinned pool (#15219) @abellina
  • Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
  • Enable creation of columns from scalar (#15181) @vyasr
  • Use NVTX from GitHub. (#15178) @bdice
  • Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
  • Implement search using pylibcudf (#15166) @vyasr
  • Add distinct left join (#15149) @PointKernel
  • Add cardinality control for groupby benchmarks with flat types (#15134) @PointKernel
  • Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
  • Automate include grouping order in .clang-format (#15063) @harrism
  • Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
  • API for JSON unquoted whitespace normalization (#15033) @shrshi
  • Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
  • Implement replace in pylibcudf (#15005) @vyasr
  • Add distinct key inner join (#14990) @PointKernel
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • Support casting of Map type to string in JSON reader (#14936) @karthikeyann
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Support for LZ4 compression in ORC and Parquet (#14906) @vuule
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

  • Backport: Relax protobuf lower bound to 3.20. (#15506) (#15610) @bdice
  • Use conda env create --yes instead of --force (#15403) @bdice
  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Enable branch testing for cudf.pandas (#15316) @galipremsagar
  • Replace black with ruff-format (#15312) @mroeschke
  • Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
  • Address poor performance of Parquet string decoding (#15304) @etseidl
  • Update script input name (#15301) @AyodeAwe
  • Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
  • Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
  • Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
  • Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
  • Implement grouped product scan (#15254) @wence-
  • Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
  • Implement DataFrame|Series.squeeze (#15244) @mroeschke
  • Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
  • Remove create_chars_child_column utility (#15241) @davidwendt
  • Update dlpack to version 0.8 (#15237) @dantegd
  • Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
  • Remove row conversion code from libcudf (#15234) @ttnghia
  • Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
  • Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
  • Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
  • Clean up usage of CUDA_ARCH and other macros. (#15218) @bdice
  • DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
  • Rewrite conversion in terms of column (#15213) @vyasr
  • Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
  • Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
  • Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
  • Tune up row size estimation in the data generator (#15202) @vuule
  • Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Fix includes for row_operators.cuh (#15194) @davidwendt
  • Generalize GHA selectors for pure Python testing (#15191) @bdice
  • Improvements for __cuda_array_interface__ tests (#15188) @bdice
  • Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
  • Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
  • Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
  • [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
  • Change make_strings_children to return uvector (#15171) @davidwendt
  • Don't override to_pandas for Datelike columns (#15167) @mroeschke
  • Drop python-snappy from dependencies. (#15161) @bdice
  • Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
  • Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
  • Java bindings for left outer distinct join (#15154) @jlowe
  • Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
  • Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
  • Add Java option to keep quotes for JSON reads (#15146) @revans2
  • Change cross-pandas-version testing in cudf (#15145) @galipremsagar
  • Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
  • Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
  • Simplify some to_pandas implementations (#15123) @mroeschke
  • Java: Add leak tracking for Scalar instances (#15121) @jlowe
  • Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
  • Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
  • Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
  • Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
  • Validate types in pylibcudf Column/Table constructors (#15088) @wence-
  • xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
  • Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
  • Adjust test_binops for pandas 2.2 (#15078) @mroeschke
  • Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
  • Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
  • Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
  • xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
  • target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
  • Implement stable version of cudf::sort (#15066) @wence-
  • Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
  • Adjust test_joining for pandas 2.2 (#15060) @mroeschke
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
  • Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
  • Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
  • Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
  • Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
  • Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
  • Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
  • Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
  • Clean up nvtx macros (#15038) @PointKernel
  • Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
  • Expose libcudf filter expression in read_parquet (#15028) @wence-
  • Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
  • Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
  • Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
  • JNI bindings for distinct_hash_join (#15019) @jlowe
  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Improve performance of copy_if_else for long strings (#15017) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
  • Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
  • Align integral types in ORC to specs (#15008) @vuule
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
  • Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
  • Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
  • Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
  • Filter all DeprecationWarnings by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Update ops-bot.yaml (#14974) @AyodeAwe
  • Use page statistics in Parquet reader (#14973) @etseidl
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Update cudf for compatibility with the latest cuco (#14849) @PointKernel
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
  • Remove build_struct|list_column (#14786) @mroeschke
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt
  • Use as_column instead of full (#14698) @mroeschke
  • List all notable breaking changes (#13535) @galipremsagar

v24.04.00

4 weeks ago

🚨 Breaking Changes

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

  • Fix an issue with creating a series from scalar when dtype='category' (#15476) @galipremsagar
  • Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
  • [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
  • Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
  • Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
  • Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
  • Fix OOB read in inflate_kernel (#15309) @vuule
  • Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
  • Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
  • Fix Doxygen check (#15289) @KyleFromNVIDIA
  • Reintroduce PANDAS_GE_220 import (#15287) @wence-
  • Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
  • Fix Parquet decimal64 stats (#15281) @etseidl
  • Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
  • Work around compute-sanitizer memcheck bug (#15259) @davidwendt
  • Clean up hostdevice_vector and add more APIs (#15252) @ttnghia
  • Fix number of rows in randomly generated lists columns (#15248) @vuule
  • Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
  • Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
  • Fix accessing .columns by an external API (#15212) @galipremsagar
  • [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
  • Update labeler and codeowner configs for CMake files (#15208) @PointKernel
  • Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
  • Fix memcheck error in distinct inner join (#15164) @PointKernel
  • Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
  • Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
  • Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
  • Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
  • Remove const from range_window_bounds::_extent. (#15138) @mythrocks
  • DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
  • Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
  • Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
  • Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
  • Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
  • Add support for arrow large_string in cudf (#15093) @galipremsagar
  • Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
  • Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
  • Fix bugs in handling of delta encodings (#15075) @etseidl
  • Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
  • Eliminate duplicate allocation of nested string columns (#15061) @vuule
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axes (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
  • Unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix null literals being parsed as strings when mixed_types_as_string is enabled in the JSON reader (#14939) @karthikeyann
  • Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
  • Fix reading offset for data stream in ORC reader (#14911) @ttnghia
  • Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass
  • Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

  • Ignore DLManagedTensor in the docs build (#15392) @davidwendt
  • Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
  • Temporarily disable docs errors. (#15265) @bdice
  • Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
  • Fix broken link for developer guide (#15025) @sanjana098
  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Document how cuDF is pronounced (#14753) @pentschev
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
  • Use JNI pinned pool resource with cuIO (#15255) @abellina
  • Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
  • Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
  • [JNI] rmm based pinned pool (#15219) @abellina
  • Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
  • Enable creation of columns from scalar (#15181) @vyasr
  • Use NVTX from GitHub. (#15178) @bdice
  • Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
  • Implement search using pylibcudf (#15166) @vyasr
  • Add distinct left join (#15149) @PointKernel
  • Add cardinality control for groupby benchmarks with flat types (#15134) @PointKernel
  • Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
  • Automate include grouping order in .clang-format (#15063) @harrism
  • Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
  • API for JSON unquoted whitespace normalization (#15033) @shrshi
  • Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
  • Implement replace in pylibcudf (#15005) @vyasr
  • Add distinct key inner join (#14990) @PointKernel
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • Support casting of Map type to string in JSON reader (#14936) @karthikeyann
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Support for LZ4 compression in ORC and Parquet (#14906) @vuule
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

  • Use conda env create --yes instead of --force (#15403) @bdice
  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Enable branch testing for cudf.pandas (#15316) @galipremsagar
  • Replace black with ruff-format (#15312) @mroeschke
  • Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
  • Address poor performance of Parquet string decoding (#15304) @etseidl
  • Update script input name (#15301) @AyodeAwe
  • Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
  • Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
  • Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
  • Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
  • Implement grouped product scan (#15254) @wence-
  • Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
  • Implement DataFrame|Series.squeeze (#15244) @mroeschke
  • Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
  • Remove create_chars_child_column utility (#15241) @davidwendt
  • Update dlpack to version 0.8 (#15237) @dantegd
  • Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
  • Remove row conversion code from libcudf (#15234) @ttnghia
  • Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
  • Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
  • Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
  • Clean up usage of CUDA_ARCH and other macros. (#15218) @bdice
  • DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
  • Rewrite conversion in terms of column (#15213) @vyasr
  • Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
  • Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
  • Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
  • Tune up row size estimation in the data generator (#15202) @vuule
  • Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Fix includes for row_operators.cuh (#15194) @davidwendt
  • Generalize GHA selectors for pure Python testing (#15191) @bdice
  • Improvements for __cuda_array_interface__ tests (#15188) @bdice
  • Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
  • Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
  • Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
  • [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
  • Change make_strings_children to return uvector (#15171) @davidwendt
  • Don't override to_pandas for Datelike columns (#15167) @mroeschke
  • Drop python-snappy from dependencies. (#15161) @bdice
  • Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
  • Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
  • Java bindings for left outer distinct join (#15154) @jlowe
  • Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
  • Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
  • Add java option to keep quotes for JSON reads (#15146) @revans2
  • Change cross-pandas-version testing in cudf (#15145) @galipremsagar
  • Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
  • Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
  • Simplify some to_pandas implementations (#15123) @mroeschke
  • Java: Add leak tracking for Scalar instances (#15121) @jlowe
  • Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
  • Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
  • Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
  • Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
  • Validate types in pylibcudf Column/Table constructors (#15088) @wence-
  • xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
  • Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
  • Adjust test_binops for pandas 2.2 (#15078) @mroeschke
  • Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
  • Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
  • Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
  • xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
  • target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
  • Implement stable version of cudf::sort (#15066) @wence-
  • Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
  • Adjust test_joining for pandas 2.2 (#15060) @mroeschke
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
  • Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
  • Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
  • Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
  • Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
  • Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
  • Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
  • Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
  • Clean up nvtx macros (#15038) @PointKernel
  • Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
  • Expose libcudf filter expression in read_parquet (#15028) @wence-
  • Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
  • Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
  • Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
  • JNI bindings for distinct_hash_join (#15019) @jlowe
  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Improve performance of copy_if_else for long strings (#15017) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
  • Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
  • Align integral types in ORC to specs (#15008) @vuule
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
  • Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
  • Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
  • Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
  • Filter all DeprecationWarning's by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Update ops-bot.yaml (#14974) @AyodeAwe
  • Use page statistics in Parquet reader (#14973) @etseidl
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Update cudf for compatibility with the latest cuco (#14849) @PointKernel
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
  • Remove build_struct|list_column (#14786) @mroeschke
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt
  • Use as_column instead of full (#14698) @mroeschke
  • List all notable breaking changes (#13535) @galipremsagar

v24.06.00a

1 month ago

🚨 Breaking Changes

  • Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
  • Remove legacy JSON reader from Python (#15538) @bdice
  • Removing all batching code from parquet writer (#15528) @mhaseeb123
  • Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

  • Fix operator precedence problem in Parquet reader (#15638) @etseidl
  • Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
  • Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
  • Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
  • Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
  • Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
  • Ignore new cupy warning (#15574) @vyasr
  • Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
  • Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
  • Fix deprecation warnings for json legacy reader (#15563) @davidwendt
  • Fix millisecond resampling in cudf Python (#15560) @mroeschke
  • Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
  • Fix a JNI bug in JSON parsing fixup (#15550) @revans2
  • Remove conda channel setup from wheel CI image script. (#15539) @bdice
  • cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
  • Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
  • Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
  • nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
  • Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
  • Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
  • Make improvements in pandas-test reporting (#15485) @galipremsagar
  • Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
  • Only use data_type constructor with scale for decimal types (#15472) @wence-
  • Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
  • Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
  • Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
  • Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
  • Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
  • Handle case of scan aggregation in groupby-transform (#15450) @wence-
  • Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
  • Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
  • Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
  • Support implicit array conversion with query-planning enabled (#15378) @rjzamora
  • Fix arrow-based round trip of empty dataframes (#15373) @wence-
  • Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
  • Remove boundscheck=False setting in cython files (#15362) @wence-
  • Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
  • Fix for logical and syntactical errors in libcudf C++ examples (#15346) @mhaseeb123
  • Disable dask-expr in docs builds. (#15343) @bdice
  • Apply the cuFile error work around to data_sink as well (#15335) @vuule
  • Check column type equality, handling nested types correctly. (#14531) @bdice

📖 Documentation

  • Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
  • DOC: add pandas intersphinx mapping (#15531) @raybellwaves
  • rm-dup-doc in frame.py (#15530) @raybellwaves
  • Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
  • Doc: interleave columns pandas compat (#15383) @raybellwaves
  • Add debug tips section to libcudf developer guide (#15329) @davidwendt
  • Fix and clarify notes on result ordering (#13255) @shwina

🚀 New Features

  • Concatenate dictionary of objects along axis=1 (#15623) @er-eis
  • Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
  • Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
  • Fea/move to latest nanoarrow (#15526) @robertmaynard
  • Migrate string case operations to pylibcudf (#15489) @brandon-b-miller
  • Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
  • Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
  • Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
  • Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
  • Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
  • Introduce benchmark suite for JSON reader options (#15124) @shrshi
  • Implement ORC chunked reader (#15094) @ttnghia
  • Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
  • Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade
  • Add JSON option to prune columns (#14996) @karthikeyann

🛠️ Improvements

  • Enable warnings as errors in custreamz (#15642) @mroeschke
  • Fix -Werror=type-limits. (#15635) @bdice
  • Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
  • Remove NVBench SHA override. (#15633) @alliepiper
  • Add support for large string columns to Parquet reader and writer (#15632) @etseidl
  • Large strings support in MD5 and SHA hashers (#15631) @davidwendt
  • Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
  • Use experimental make_strings_children for strings convert (#15629) @davidwendt
  • Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
  • Make ColumnBase.__cuda_array_interface__ opt out instead of opt in (#15622) @mroeschke
  • Large strings support for cudf::gather (#15621) @davidwendt
  • Remove jni-docker-build workflow (#15619) @bdice
  • Drop CentOS 7 support (#15608) @NvTimLiu
  • Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
  • Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
  • Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
  • Deprecate to/from_dask_dataframe APIs in dask-cudf (#15592) @rjzamora
  • Minor fixups for future NumPy 2 compatibility (#15590) @seberg
  • Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
  • Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
  • Don't materialize column during RangeIndex methods (#15582) @mroeschke
  • Improve performance for cudf::strings::count_re (#15578) @davidwendt
  • Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
  • Rename experimental JSON tests. (#15568) @bdice
  • Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
  • Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
  • Deprecate legacy JSON reader options. (#15558) @bdice
  • Use same .clang-format in cuDF JNI (#15557) @bdice
  • Large strings support for cudf::fill (#15555) @davidwendt
  • Upgrade upper bound pinning to pandas-2.2.2 (#15554) @galipremsagar
  • Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
  • Move timezone conversion logic to DatetimeColumn (#15545) @mroeschke
  • Large strings support for cudf::interleave_columns (#15544) @davidwendt
  • [skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
  • Remove checks dependency from static-configure test job. (#15542) @bdice
  • Remove legacy JSON reader from Python (#15538) @bdice
  • Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
  • Large strings support for cudf::clamp (#15533) @davidwendt
  • Remove version hard-coding (#15529) @galipremsagar
  • Removing all batching code from parquet writer (#15528) @mhaseeb123
  • Make some private class properties not settable (#15527) @mroeschke
  • Large strings support in regex replace APIs (#15524) @davidwendt
  • Skip pandas unit tests that crash pytest workers in cudf.pandas (#15521) @mroeschke
  • Preserve column metadata during more DataFrame operations (#15519) @mroeschke
  • Move pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
  • Large strings gtest fixture and utilities (#15513) @davidwendt
  • Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
  • Relax protobuf lower bound to 3.20. (#15506) @bdice
  • Clean up index methods (#15496) @mroeschke
  • Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
  • Enable tests/scalar and test/series in cudf.pandas tests (#15486) @mroeschke
  • Clean up __cuda_array_interface__ handling in as_column (#15477) @mroeschke
  • Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
  • Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
  • Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
  • Add to_arrow_device() functions that accept views (#15465) @davidwendt
  • Add custom status check workflow (#15464) @galipremsagar
  • Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
  • Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
  • Add from_arrow_device function to cudf interop using nanoarrow (#15458) @zeroshade
  • Remove deprecated strings offsets_begin (#15454) @davidwendt
  • Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
  • Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
  • Enable tests/io/test_user_agent.py in cudf pandas tests (#15442) @mroeschke
  • Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
  • Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
  • Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
  • Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
  • Unify Copy-On-Write and Spilling (#15436) @madsbk
  • Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
  • Bump ruff and codespell pre-commit checks (#15407) @mroeschke
  • Enable all tests for arm arch (#15402) @galipremsagar
  • Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
  • Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
  • add correct labels to pandas_function_request.md (#15381) @raybellwaves
  • Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
  • Large strings support in cudf::merge (#15374) @davidwendt
  • Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
  • Use logical types in Parquet reader (#15365) @etseidl
  • Add experimental make_strings_children utility (#15363) @davidwendt
  • Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
  • Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
  • Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
  • Refactor stream mode setup for gtests (#15337) @davidwendt
  • Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
  • Avoid duplicate dask-cudf testing (#15333) @rjzamora
  • Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
  • Forward-merge branch-24.04 into branch-24.06 [skip ci] (#15330) @rapids-bot[bot]
  • Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
  • Drop CentOS 7 support. (#15323) @bdice
  • Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
  • First pass at adding testing for pylibcudf (#15300) @vyasr
  • [FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
  • Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
  • Clean up special casing in as_column for non-typed input (#15276) @mroeschke
  • Large strings support in cudf::concatenate (#15195) @davidwendt
  • Use less _is_categorical_dtype (#15148) @mroeschke
  • Align date_range defaults with pandas, support tz (#15139) @mroeschke
  • ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
  • Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
  • Cleanup some timedelta/datetime column logic (#14715) @mroeschke
  • Refactor numpy array input in as_column (#14651) @mroeschke
  • Refactor joins for conditional semis and antis (#14646) @DanialJavady96

v24.02.02

2 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Bump to nvcomp 3.0.6. (#15128) @bdice
  • [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use isinstance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

v24.02.01

2 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • [HOTFIX] Unpin numba<0.58 (#15031) @raydouglass
  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use isinstance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

v24.02.00

2 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Exclude tests from builds (#14981) @vyasr
  • Fix the bounce buffer size in ORC writer (#14947) @vuule
  • Revert sum/product aggregation to always produce int64_t type (#14907) @SurajAralihalli
  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller

🛠️ Improvements

  • Pin pytest<8 (#14920) @galipremsagar
  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use isinstance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr

v24.04.00a

3 months ago

🚨 Breaking Changes

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Deprecate groupby fillna (#15000) @mroeschke
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel

🐛 Bug Fixes

  • Update pre-commit-hooks to v0.0.3 (#15355) @KyleFromNVIDIA
  • [BUG][JNI] Trigger MemoryBuffer.onClosed after memory is freed (#15351) @abellina
  • Fix an issue with multiple short list rowgroups using the Parquet chunked reader. (#15342) @nvdbaranec
  • Avoid importing dask-expr if "query-planning" config is False (#15340) @rjzamora
  • Fix gtests/ERROR_TEST errors when run in Debug (#15317) @davidwendt
  • Fix OOB read in inflate_kernel (#15309) @vuule
  • Work around a cuFile error when running CSV tests with memcheck (#15293) @vuule
  • Fix Doxygen upload directory (#15291) @KyleFromNVIDIA
  • Fix Doxygen check (#15289) @KyleFromNVIDIA
  • Reintroduce PANDAS_GE_220 import (#15287) @wence-
  • Fix mean computation for the geometric distribution in the data generator (#15282) @vuule
  • Fix Parquet decimal64 stats (#15281) @etseidl
  • Make linking of nvtx3-cpp BUILD_LOCAL_INTERFACE (#15271) @KyleFromNVIDIA
  • Workaround compute-sanitizer memcheck bug (#15259) @davidwendt
  • Cleanup hostdevice_vector and add more APIs (#15252) @ttnghia
  • Fix number of rows in randomly generated lists columns (#15248) @vuule
  • Fix wrong output for collect_list/collect_set of lists column (#15243) @ttnghia
  • Fix testChunkedPackTwoPasses to copy from the bounce buffer (#15220) @abellina
  • Fix accessing .columns by an external API (#15212) @galipremsagar
  • [JNI] Disable testChunkedPackTwoPasses for now (#15210) @abellina
  • Update labeler and codeowner configs for CMake files (#15208) @PointKernel
  • Avoid dict normalization in __dask_tokenize__ (#15187) @rjzamora
  • Fix memcheck error in distinct inner join (#15164) @PointKernel
  • Remove unneeded script parameters in test_cpp_memcheck.sh (#15158) @davidwendt
  • Fix ListColumn.to_pandas() to retain list type (#15155) @galipremsagar
  • Avoid factorization in MultiIndex.to_pandas (#15150) @mroeschke
  • Fix GroupBy.get_group and GroupBy.indices (#15143) @wence-
  • Remove const from range_window_bounds::_extent. (#15138) @mythrocks
  • DataFrame.columns = ... retains RangeIndex & set dtype (#15129) @mroeschke
  • Correctly handle output for GroupBy.apply when chunk results are reindexed series (#15109) @brandon-b-miller
  • Fix Series.groupby.shift with a MultiIndex (#15098) @mroeschke
  • Fix reductions when DataFrame has MultiIndex columns (#15097) @mroeschke
  • Fix deprecation warnings for deprecated hash() calls (#15095) @davidwendt
  • Add support for arrow large_string in cudf (#15093) @galipremsagar
  • Fix sort_values pytest failure with pandas-2.x regression (#15092) @galipremsagar
  • Resolve path parsing issues in get_json_object (#15082) @SurajAralihalli
  • Fix bugs in handling of delta encodings (#15075) @etseidl
  • Fix is_device_write_preferred in void_sink and user_sink_wrapper (#15064) @vuule
  • Eliminate duplicate allocation of nested string columns (#15061) @vuule
  • Raise an error on import for unsupported GPUs. (#15053) @bdice
  • Align concat Series name behavior in pandas 2.2 (#15032) @mroeschke
  • Fix Index.difference to handle duplicate values when one of the inputs is empty (#15016) @galipremsagar
  • Add future_stack to DataFrame.stack (#15015) @galipremsagar
  • Fix handling of values=None in pylibcudf GroupBy.get_groups (#14998) @shwina
  • Fix DataFrame.sort_index to respect ignore_index on all axis (#14995) @galipremsagar
  • Raise for pyarrow array that is tz-aware (#14980) @mroeschke
  • Direct SeriesGroupBy.aggregate to SeriesGroupBy.agg (#14971) @rjzamora
  • Respect IntervalDtype and CategoricalDtype objects passed by users (#14961) @mroeschke
  • Unset CUDF_SPILL after a pytest (#14958) @galipremsagar
  • Fix null literals being parsed as strings when mixed_types_as_string is enabled in the JSON reader (#14939) @karthikeyann
  • Fix chunked reads of Parquet delta encoded pages (#14921) @etseidl
  • Fix reading offset for data stream in ORC reader (#14911) @ttnghia
  • Enable sanitizer check for a test case testORCReadAndWriteForDecimal128 (#14897) @res-life
  • Fix dask token normalization (#14829) @rjzamora
  • Fix 24.04 versions (#14825) @raydouglass
  • Ensure slow private attrs are maybe proxies (#14380) @mroeschke

📖 Documentation

  • Revert "Temporarily disable docs errors. (#15265)" (#15269) @bdice
  • Temporarily disable docs errors. (#15265) @bdice
  • Update developer_guide.md with new guidance on quoted internal includes (#15238) @harrism
  • Fix broken link for developer guide (#15025) @sanjana098
  • [DOC] Update typo in docs example of structs_column_wrapper (#14949) @karthikeyann
  • Update cudf.pandas FAQ. (#14940) @bdice
  • Optimize doc builds (#14856) @vyasr
  • Add developer guideline to use east const. (#14836) @bdice
  • Document how cuDF is pronounced (#14753) @pentschev
  • Notes convert to Pandas-compat (#12641) @Touutae-lab

🚀 New Features

  • Address inconsistency in single quote normalization in JSON reader (#15324) @shrshi
  • Use JNI pinned pool resource with cuIO (#15255) @abellina
  • Add DELTA_BYTE_ARRAY encoder for Parquet (#15239) @etseidl
  • Migrate filling operations to pylibcudf (#15225) @brandon-b-miller
  • [JNI] rmm based pinned pool (#15219) @abellina
  • Implement zero-copy host buffer source instead of using an arrow implementation (#15189) @vuule
  • Enable creation of columns from scalar (#15181) @vyasr
  • Use NVTX from GitHub. (#15178) @bdice
  • Implement segmented_row_bit_count for computing row sizes by segments of rows (#15169) @ttnghia
  • Implement search using pylibcudf (#15166) @vyasr
  • Add distinct left join (#15149) @PointKernel
  • Add cardinality control for groupby benchs with flat types (#15134) @PointKernel
  • Add ability to request Parquet encodings on a per-column basis (#15081) @etseidl
  • Automate include grouping order in .clang-format (#15063) @harrism
  • Requesting a clean build directory also clears Jitify cache (#15052) @robertmaynard
  • API for JSON unquoted whitespace normalization (#15033) @shrshi
  • Implement concatenate, lists.explode, merge, sorting, and stream compaction in pylibcudf (#15011) @vyasr
  • Implement replace in pylibcudf (#15005) @vyasr
  • Add distinct key inner join (#14990) @PointKernel
  • Implement rolling in pylibcudf (#14982) @vyasr
  • Implement joins in pylibcudf (#14972) @vyasr
  • Implement scans and reductions in pylibcudf (#14970) @vyasr
  • Rewrite cudf internals using pylibcudf groupby (#14946) @vyasr
  • Implement groupby in pylibcudf (#14945) @vyasr
  • Support casting of Map type to string in JSON reader (#14936) @karthikeyann
  • POC for whitespace removal in input JSON data using FST (#14931) @shrshi
  • Support for LZ4 compression in ORC and Parquet (#14906) @vuule
  • Remove supports_streams from cuDF custom memory resources. (#14857) @harrism
  • Migrate unary operations to pylibcudf (#14850) @vyasr
  • Migrate binary operations to pylibcudf (#14821) @vyasr
  • Add row index and stripe size options to Python ORC chunked writer (#14785) @vuule
  • Support CUDA 12.2 (#14712) @jameslamb

🛠️ Improvements

  • Restructure pylibcudf/arrow interop facilities (#15325) @vyasr
  • Change exceptions thrown by copying APIs (#15319) @vyasr
  • Enable branch testing for cudf.pandas (#15316) @galipremsagar
  • Replace black with ruff-format (#15312) @mroeschke
  • Fix an NPE when reading empty JSON data by adding a new API for missing information (#15307) @revans2
  • Address poor performance of Parquet string decoding (#15304) @etseidl
  • Update script input name (#15301) @AyodeAwe
  • Make test_read_parquet_partitioned_filtered data deterministic (#15296) @mroeschke
  • Add timeout for cudf.pandas pandas tests (#15284) @galipremsagar
  • Add upper bound to prevent usage of NumPy 2 (#15283) @bdice
  • Fix cudf::test::to_host return of host_vector (#15263) @davidwendt
  • Implement grouped product scan (#15254) @wence-
  • Add CUDA 12.4 to supported PTX versions (#15247) @brandon-b-miller
  • Implement DataFrame|Series.squeeze (#15244) @mroeschke
  • Roll back ipow changes due to register pressure. (#15242) @pmattione-nvidia
  • Remove create_chars_child_column utility (#15241) @davidwendt
  • Update dlpack to version 0.8 (#15237) @dantegd
  • Improve performance in JSON reader when mixed_types_as_string option is enabled (#15236) @shrshi
  • Remove row conversion code from libcudf (#15234) @ttnghia
  • Use variable substitution for RAPIDS version in Doxyfile (#15231) @KyleFromNVIDIA
  • Add ListColumns.to_pandas(arrow_type=) (#15228) @mroeschke
  • Treat dask-cudf CI artifacts as pure wheels (#15223) @bdice
  • Clean up usage of __CUDA_ARCH__ and other macros. (#15218) @bdice
  • DOC: use constants in performance-comparisons.ipynb (#15215) @raybellwaves
  • Rewrite conversion in terms of column (#15213) @vyasr
  • Switch pytest-xdist algo to worksteal (#15207) @galipremsagar
  • Deprecate strings_column_view::offsets_begin() (#15205) @davidwendt
  • Add get_upstream_resource method to stream_checking_resource_adaptor (#15203) @miscco
  • Tune up row size estimation in the data generator (#15202) @vuule
  • Fix offset value for generating test data in parquet_chunked_reader_test.cu (#15200) @ttnghia
  • Change strings_column_view::char_size to return int64 (#15197) @davidwendt
  • Fix includes for row_operators.cuh (#15194) @davidwendt
  • Generalize GHA selectors for pure Python testing (#15191) @bdice
  • Improvements for __cuda_array_interface__ tests (#15188) @bdice
  • Allow to_pandas to return pandas.ArrowDtype (#15182) @mroeschke
  • Ignore byte_range in read_json when the size is not smaller than the input data (#15180) @vuule
  • Expose new stable_sort and finish stream_compaction in pylibcudf (#15175) @wence-
  • [ci] update matrix filters for dask-cudf builds (#15174) @jameslamb
  • Change make_strings_children to return uvector (#15171) @davidwendt
  • Don't override to_pandas for Datelike columns (#15167) @mroeschke
  • Drop python-snappy from dependencies. (#15161) @bdice
  • Add microkernels for fixed-width and fixed-width dictionary in Parquet decode (#15159) @abellina
  • Make HostColumnVector.DataType accessor methods public (#15157) @jbrennan333
  • Java bindings for left outer distinct join (#15154) @jlowe
  • Forward-merge branch-24.02 to branch-24.04 (#15153) @bdice
  • Enable pandas pytests for cudf.pandas (#15147) @galipremsagar
  • Add java option to keep quotes for JSON reads (#15146) @revans2
  • Change cross-pandas-version testing in cudf (#15145) @galipremsagar
  • Use hostdevice_vector in kernel_error to avoid the pageable copy (#15140) @vuule
  • Clean up Columns.astype & cudf.dtype (#15125) @mroeschke
  • Simplify some to_pandas implementations (#15123) @mroeschke
  • Java: Add leak tracking for Scalar instances (#15121) @jlowe
  • Remove calls to strings_column_view::offsets_begin() (#15112) @davidwendt
  • Add support for Python 3.11, require NumPy 1.23+ (#15111) @jameslamb
  • Compile-time ipow computation with array lookup (#15110) @pmattione-nvidia
  • Upgrade to arrow-14.0.2 (#15108) @galipremsagar
  • Dynamically set version in RAPIDS doc builds (#15101) @jakirkham
  • Add support for pandas-2.2 in cudf (#15100) @galipremsagar
  • Update devcontainers to CUDA Toolkit 12.2 (#15099) @trxcllnt
  • Fix datetime binop pytest failures in pandas-2.2 (#15090) @galipremsagar
  • Validate types in pylibcudf Column/Table constructors (#15088) @wence-
  • xfail test_join_ordering_pandas_compat for pandas 2.2 (#15080) @mroeschke
  • Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation. (#15079) @nvdbaranec
  • Adjust test_binops for pandas 2.2 (#15078) @mroeschke
  • Remove offsets_begin() call from nvtext::generate_ngrams (#15077) @davidwendt
  • Use offsetalator in cudf::detail::has_nonempty_null_rows (#15076) @davidwendt
  • Deprecate cudf::hashing::spark_murmurhash3_x86_32 (#15074) @davidwendt
  • Fix cudf::test::to_host to handle both offset types for strings columns (#15073) @davidwendt
  • Add condition for test_groupby_nulls_basic in pandas 2.2 (#15072) @mroeschke
  • xfail tests in test_udf_masked_ops due to pandas 2.2 bug (#15071) @mroeschke
  • target branch-24.04 for GitHub Actions workflows (#15069) @jameslamb
  • Implement stable version of cudf::sort (#15066) @wence-
  • Fix ORC and JSON tests failures for pandas 2.2 (#15062) @mroeschke
  • Adjust test_joining for pandas 2.2 (#15060) @mroeschke
  • Align MultiIndex.get_indexer with pandas 2.2 change (#15059) @mroeschke
  • Fix test_resample index dtype checking for pandas 2.2 (#15058) @mroeschke
  • Split out strings/replace.cu and rework its gtests (#15054) @davidwendt
  • Avoid incompatible value type setting in test_rolling for pandas 2.2 (#15050) @mroeschke
  • Change chained replace inplace test to COW test for pandas 2.2 (#15049) @mroeschke
  • Deprecate datelike isin casting strings to dates to match pandas 2.2 (#15046) @mroeschke
  • Avoid chained indexing in test_indexing for pandas 2.2 (#15045) @mroeschke
  • Avoid pandas 2.2 DeprecationWarning in test_hdf (#15044) @mroeschke
  • Use appropriate make_offsets_child_column for building lists columns (#15043) @davidwendt
  • Factor out position-offsets logic from strings split_helper utility (#15040) @davidwendt
  • Forward-merge branch-24.02 to branch-24.04 (#15039) @bdice
  • Clean up nvtx macros (#15038) @PointKernel
  • Add xfailures for test_applymap for pandas 2.2 (#15034) @mroeschke
  • Expose libcudf filter expression in read_parquet (#15028) @wence-
  • Adjust tests in test_dataframe.py for pandas 2.2 (#15023) @mroeschke
  • Adjust test_datetime_infer_format for pandas 2.2 (#15021) @mroeschke
  • Performance optimizations for parquet sub-rowgroup reader. (#15020) @nvdbaranec
  • JNI bindings for distinct_hash_join (#15019) @jlowe
  • Change copy_if_safe to call thrust instead of the overload function (#15018) @davidwendt
  • Improve performance of copy_if_else for long strings (#15017) @davidwendt
  • Fix is_string_dtype test for pandas 2.2 (#15012) @mroeschke
  • Rework cudf::strings::detail::copy_range for offsetalator (#15010) @davidwendt
  • Use offsetalator in cudf::get_json_object() (#15009) @davidwendt
  • Align integral types in ORC to specs (#15008) @vuule
  • Clean up detail sequence header inclusion (#15007) @PointKernel
  • Add groupby.apply(include_groups=) to match pandas 2.2 deprecation (#15006) @mroeschke
  • Use offsetalator in cudf::interleave_columns() (#15004) @davidwendt
  • Use offsetalator in cudf::row_bit_count() (#15003) @davidwendt
  • Use offsetalator in cudf::strings::wrap() (#15002) @davidwendt
  • Use offsetalator in cudf::strings::reverse (#15001) @davidwendt
  • Deprecate groupby fillna (#15000) @mroeschke
  • Ensure to_* IO methods respect pandas 2.2 keyword only deprecation (#14999) @mroeschke
  • Remove unneeded calls to create_chars_child_column utility (#14997) @davidwendt
  • Add environment-agnostic scripts for running ctests and pytests (#14992) @trxcllnt
  • Filter all DeprecationWarnings raised by ArrowTable.to_pandas() (#14989) @galipremsagar
  • Deprecate replace with categorical columns (#14988) @mroeschke
  • Deprecate delim_whitespace in read_csv for pandas 2.2 (#14986) @mroeschke
  • Deprecate parameters similar to pandas 2.2 (#14984) @mroeschke
  • Ensure that ctest is called with --no-tests=error. (#14983) @bdice
  • Deprecate non-integer periods in date_range and interval_range (#14976) @galipremsagar
  • Update ops-bot.yaml (#14974) @AyodeAwe
  • Use page statistics in Parquet reader (#14973) @etseidl
  • Use fused types for overloaded function signatures (#14969) @vyasr
  • Deprecate certain frequency strings (#14967) @galipremsagar
  • Update copyrights for 24.04. (#14964) @bdice
  • Add missing atomic operators, refactor atomic operators, move atomic operators to detail namespace. (#14962) @bdice
  • Introduce GetJsonObjectOptions in getJSONObject Java API (#14956) @SurajAralihalli
  • JNI JSON read with DataSource and inferred schema, along with basic Java nested Schema JSON reads (#14954) @revans2
  • Make codecov only informational (always pass). (#14952) @bdice
  • Replace legacy cudf and dask_cudf imports as (d)gd (#14944) @mroeschke
  • Replace _is_datetime64tz/interval_dtype with isinstance (#14943) @mroeschke
  • Update tests for pandas 2. (#14941) @bdice
  • Use more public pandas APIs (#14929) @mroeschke
  • Replace local copyright check with pre-commit-hooks verify-copyright (#14917) @KyleFromNVIDIA
  • Add pandas-2.x support in cudf (#14916) @galipremsagar
  • Use offsetalator in nvtext::byte_pair_encoding (#14888) @davidwendt
  • De-DOS line-endings (#14880) @wence-
  • Add detail cuco_allocator (#14877) @PointKernel
  • Move all core types to using enum class in Cython (#14876) @vyasr
  • Read cudf.__version__ in Sphinx build (#14872) @KyleFromNVIDIA
  • Use int64 offset types for accessing code-points in nvtext::normalize (#14868) @davidwendt
  • Read version from VERSION file in CMake (#14867) @KyleFromNVIDIA
  • Update conda-cpp-post-build-checks to branch-24.04. (#14854) @bdice
  • Update cudf for compatibility with the latest cuco (#14849) @PointKernel
  • Remove deprecated strings functions (#14848) @davidwendt
  • Fix CI workflows for pandas-tests and add test summary. (#14847) @bdice
  • Use offsetalator in cudf::strings::copy_slice (#14844) @davidwendt
  • Fix V2 Parquet page alignment for use with zStandard compression (#14841) @etseidl
  • Fix calls to deprecated strings factory API in examples. (#14838) @bdice
  • Update pre-commit hooks (#14837) @bdice
  • Use rapids_cuda_set_runtime to determine cuda runtime usage by target (#14833) @vyasr
  • Remove get_mem_info functions from custom memory resources (#14832) @harrism
  • Fix debug build by splitting row_operator_tests_utilities.cu (#14826) @davidwendt
  • Remove -DNVBench_ENABLE_CUPTI=OFF. (#14820) @bdice
  • Use cuco::static_set in the hash-based groupby (#14813) @PointKernel
  • Branch 24.04 merge branch 24.02 (#14809) @vyasr
  • Branch 24.04 merge branch 24.02 (#14806) @vyasr
  • Introduce basic "cudf" backend for Dask Expressions (#14805) @rjzamora
  • Remove build_struct|list_column (#14786) @mroeschke
  • Use offsetalator in nvtext tokenize functions (#14783) @davidwendt
  • Reduce execution time of Python ORC tests (#14776) @vuule
  • Use offsetalator in cudf::strings::split functions (#14757) @davidwendt
  • Use offsetalator in cudf::strings::findall (#14745) @davidwendt
  • Use offsetalator in cudf::strings::url_decode (#14744) @davidwendt
  • Use get_offset_value utility in strings shift function (#14743) @davidwendt
  • Use as_column instead of full (#14698) @mroeschke
  • List all notable breaking changes (#13535) @galipremsagar

v23.12.01

4 months ago

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

  • Fix synchronization issue when writing string columns with dictionary to ORC (#14595) @vuule
  • Update actions/labeler to v4 (#14562) @raydouglass
  • Fix data corruption when skipping rows (#14557) @etseidl
  • Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
  • Fix intermediate type checking in expression parsing (#14445) @vyasr
  • Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
  • Remove needs: wheel-build-cudf. (#14427) @bdice
  • Fix dask dependency in custreamz (#14420) @vyasr
  • Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
  • Support java AST String literal with desired encoding (#14402) @winningsix
  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
  • Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
  • Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
  • cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
  • Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
  • Add the new manylinux builds to the build job (#14351) @vyasr
  • cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
  • Fix overflow check in cudf::merge (#14345) @divyegala
  • Add cramjam (#14344) @vyasr
  • Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
  • Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
  • Fix host buffer access from device function in the Parquet reader (#14328) @vuule
  • Run IO tests for Dask-cuDF (#14327) @rjzamora
  • Fix logical type issues in the Parquet writer (#14322) @vuule
  • Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
  • Test is_valid before reading column data (#14318) @etseidl
  • Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
  • Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • Fix thread index overflow issue (#14290) @hyperbolic2346
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
  • Fix Parquet list-of-struct interpretation (#13715) @hyperbolic2346

📖 Documentation

  • Fix io reference in docs. (#14452) @bdice
  • Update README (#14374) @shwina
  • Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

  • Expose streams in public unary APIs (#14342) @vyasr
  • Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
  • Add BytePairEncoder class to cuDF (#13891) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule
  • Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

  • Build concurrency for nightly and merge triggers (#14441) @bdice
  • Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
  • Update to Arrow 14.0.1. (#14387) @bdice
  • Remove Cython libcpp wrappers (#14382) @vyasr
  • Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
  • Upgrade to arrow 14 (#14371) @galipremsagar
  • Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
  • Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
  • Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
  • Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
  • Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
  • Added streams to CSV reader and writer api (#14340) @shrshi
  • Upgrade wheels to use arrow 13 (#14339) @vyasr
  • Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
  • Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
  • Upgrade arrow to 13 (#14330) @galipremsagar
  • Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
  • Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
  • Avoid pyarrow.fs import for local storage (#14321) @rjzamora
  • Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
  • Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
  • Added streams to JSON reader and writer api (#14313) @shrshi
  • Minor improvements in source_info (#14308) @vuule
  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
  • Expose stream parameter in public strings filter APIs (#14293) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Register partd encode dispatch in dask_cudf (#14287) @rjzamora
  • Update versioning strategy (#14285) @vyasr
  • Move and rename byte-pair-encoding source files (#14284) @davidwendt
  • Expose stream parameter in public strings combine APIs (#14281) @davidwendt
  • Expose stream parameter in public strings contains APIs (#14280) @davidwendt
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add in java bindings for DataSource (#14254) @revans2
  • Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Normalizing offsets iterator (#14234) @davidwendt
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Optimize ORC writer for decimal columns (#14190) @vuule
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck
  • Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

v23.12.00

5 months ago

🚨 Breaking Changes

  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

  • Update actions/labeler to v4 (#14562) @raydouglass
  • Fix data corruption when skipping rows (#14557) @etseidl
  • Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
  • Fix intermediate type checking in expression parsing (#14445) @vyasr
  • Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
  • Remove needs: wheel-build-cudf. (#14427) @bdice
  • Fix dask dependency in custreamz (#14420) @vyasr
  • Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
  • Support java AST String literal with desired encoding (#14402) @winningsix
  • Raise error in reindex when index is not unique (#14400) @galipremsagar
  • Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
  • Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
  • Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
  • cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
  • Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
  • Add the new manylinux builds to the build job (#14351) @vyasr
  • cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
  • Fix overflow check in cudf::merge (#14345) @divyegala
  • Add cramjam (#14344) @vyasr
  • Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
  • Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
  • Fix host buffer access from device function in the Parquet reader (#14328) @vuule
  • Run IO tests for Dask-cuDF (#14327) @rjzamora
  • Fix logical type issues in the Parquet writer (#14322) @vuule
  • Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
  • test is_valid before reading column data (#14318) @etseidl
  • Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
  • Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
  • Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
  • fixing thread index overflow issue (#14290) @hyperbolic2346
  • Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
  • Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
  • Handle empty string correctly in Parquet statistics (#14257) @etseidl
  • Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
  • cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
  • Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
  • Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
  • Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

  • Fix io reference in docs. (#14452) @bdice
  • Update README (#14374) @shwina
  • Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

  • Expose streams in public unary APIs (#14342) @vyasr
  • Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
  • Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
  • Expose streams in public null mask APIs (#14263) @vyasr
  • Expose streams in binaryop APIs (#14187) @vyasr
  • Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
  • Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
  • Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
  • Add BytePairEncoder class to cuDF (#13891) @davidwendt
  • Upgrade to nvCOMP 3.0.4 (#13815) @vuule
  • Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller
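
For readers unfamiliar with the DELTA_BINARY_PACKED items above, the following is a minimal sketch of the idea behind delta encoding: store the first value, then the differences between neighbours shifted by their minimum so they are non-negative. It is a simplified illustration in plain Python, not the Parquet wire format (which also groups deltas into blocks and miniblocks and bit-packs them) and not cuDF's implementation. The function names `delta_encode` and `delta_decode` are hypothetical.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, int, List[int]]:
    """Return (first_value, min_delta, adjusted_deltas) for a list of ints.

    Simplified on purpose: no blocks, miniblocks, or bit-packing.
    """
    if not values:
        raise ValueError("need at least one value")
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas) if deltas else 0
    # Subtracting the minimum makes every stored delta non-negative.
    adjusted = [d - min_delta for d in deltas]
    return values[0], min_delta, adjusted


def delta_decode(first: int, min_delta: int, adjusted: List[int]) -> List[int]:
    """Invert delta_encode by re-adding the minimum and accumulating."""
    out = [first]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out


if __name__ == "__main__":
    first, min_delta, adjusted = delta_encode([100, 103, 105, 110, 112])
    print(first, min_delta, adjusted)
    print(delta_decode(first, min_delta, adjusted))
```

Slowly changing sequences such as timestamps or row ids produce small adjusted deltas, which is why an encoding of this kind can need fewer bits per value than the raw integers.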

🛠️ Improvements

  • Build concurrency for nightly and merge triggers (#14441) @bdice
  • Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
  • Update to Arrow 14.0.1. (#14387) @bdice
  • Remove Cython libcpp wrappers (#14382) @vyasr
  • Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
  • Upgrade to arrow 14 (#14371) @galipremsagar
  • Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
  • Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
  • Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
  • Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
  • Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
  • Added streams to CSV reader and writer api (#14340) @shrshi
  • Upgrade wheels to use arrow 13 (#14339) @vyasr
  • Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
  • Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
  • Upgrade arrow to 13 (#14330) @galipremsagar
  • Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
  • Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
  • Avoid pyarrow.fs import for local storage (#14321) @rjzamora
  • Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
  • Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
  • Added streams to JSON reader and writer api (#14313) @shrshi
  • Minor improvements in source_info (#14308) @vuule
  • Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
  • Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
  • Expose stream parameter to get_json_object API (#14297) @davidwendt
  • Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
  • Expose stream parameter in public strings filter APIs (#14293) @davidwendt
  • Refactor cudf_kafka to use skbuild (#14292) @jdye64
  • Update shared-action-workflows references (#14289) @AyodeAwe
  • Register partd encode dispatch in dask_cudf (#14287) @rjzamora
  • Update versioning strategy (#14285) @vyasr
  • Move and rename byte-pair-encoding source files (#14284) @davidwendt
  • Expose stream parameter in public strings combine APIs (#14281) @davidwendt
  • Expose stream parameter in public strings contains APIs (#14280) @davidwendt
  • Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
  • Use branch-23.12 workflows. (#14271) @bdice
  • Refactor LogicalType for Parquet (#14264) @etseidl
  • Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
  • Expose stream parameter in public strings replace APIs (#14261) @davidwendt
  • Expose stream parameter in public strings APIs (#14260) @davidwendt
  • Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
  • Make parquet schema index type consistent (#14256) @hyperbolic2346
  • Expose stream parameter in public strings convert APIs (#14255) @davidwendt
  • Add in java bindings for DataSource (#14254) @revans2
  • Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
  • Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
  • Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
  • Improve contains_column by invoking contains_table (#14238) @PointKernel
  • Detect and report errors in Parquet header parsing (#14237) @etseidl
  • Normalizing offsets iterator (#14234) @davidwendt
  • Forward merge 23.10 into 23.12 (#14231) @galipremsagar
  • Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
  • Enable indexalator for device code (#14206) @davidwendt
  • Marginally reduce memory footprint of joins (#14197) @wence-
  • Add nvtx annotations to spilling-based data movement (#14196) @wence-
  • Optimize ORC writer for decimal columns (#14190) @vuule
  • Remove the use of volatile in ORC (#14175) @vuule
  • Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
  • Add bytes_per_second to transpose benchmark (#14170) @Blonck
  • cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
  • Add bytes_per_second to shift benchmark (#13950) @Blonck
  • Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

v24.02.00a

5 months ago

🚨 Breaking Changes

  • Remove **kwargs from astype (#14765) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Drop Pascal GPU support. (#14630) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • Switch to scikit-build-core (#13531) @vyasr

🐛 Bug Fixes

  • Fixed an issue with output chunking computation stemming from input chunking. (#14889) @nvdbaranec
  • Fix total_byte_size in Parquet row group metadata (#14802) @etseidl
  • Fix index difference to follow the pandas format (#14789) @amiralimi
  • Fix shared-workflows repo name (#14784) @raydouglass
  • Remove unparseable attributes from all nodes (#14780) @vyasr
  • Refactor and add validation to IntervalIndex.__init__ (#14778) @mroeschke
  • Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer (#14772) @etseidl
  • Fix calls to deprecated strings factory API (#14771) @davidwendt
  • Fix ptx file discovery in editable installs (#14767) @vyasr
  • Revise shuffle deprecation to align with dask/dask (#14762) @rjzamora
  • Enable intermediate proxies to be picklable (#14752) @shwina
  • Add CUDF_TEST_PROGRAM_MAIN macro to tests lacking it (#14751) @etseidl
  • Fix CMake args (#14746) @vyasr
  • Fix logic bug introduced in #14730 (#14742) @wence-
  • [Java] Choose The Correct RoundingMode For Checking Decimal OutOfBounds (#14731) @razajafri
  • Fix Groupby.get_group (#14728) @rjzamora
  • Ensure that all CUDA kernels in cudf have hidden visibility. (#14726) @robertmaynard
  • Split cuda versions for notebook testing (#14722) @raydouglass
  • Fix to_numeric not preserving Series index and name (#14718) @mroeschke
  • Update dask-cudf wheel name (#14713) @raydouglass
  • Fix strings::contains matching end of string target (#14711) @davidwendt
  • Update to Dask's shuffle_method kwarg (#14708) @pentschev
  • Write file-level statistics when writing ORC files with zero rows (#14707) @vuule
  • Potential fix for performance regression in #14415 (#14706) @etseidl
  • Ensure DataFrame column types are preserved during serialization (#14705) @mroeschke
  • Skip numba test that fails on ARM (#14702) @brandon-b-miller
  • Allow Z in datetime string parsing in non pandas compat mode (#14701) @mroeschke
  • Fix nan_as_null not being respected when passing arrow object (#14688) @mroeschke
  • Fix constructing Series/Index from arrow array and dtype (#14686) @mroeschke
  • Fix Aggregation Type Promotion: Ensure Unsigned Input Types Result in Unsigned Output for Sum and Multiply (#14679) @SurajAralihalli
  • Add BaseOffset as a final proxy type to pass instancechecks for offsets against BaseOffset (#14678) @shwina
  • Add row conversion code from spark-rapids-jni (#14664) @ttnghia
  • Unconditionally export the CCCL path (#14656) @vyasr
  • Ensure libcudf searches for our patched version of CCCL first (#14655) @robertmaynard
  • Constrain CUDA in notebook testing to prevent CUDA 12.1 usage until we have pynvjitlink (#14648) @vyasr
  • Fix invalid memory access in Parquet reader (#14637) @etseidl
  • Use column_empty over as_column([]) (#14632) @mroeschke
  • Add (implicit) handling for torch tensors in is_scalar (#14623) @wence-
  • Fix astype/fillna not maintaining column subclass and types (#14615) @mroeschke
  • Remove non-empty nulls in cudf::get_json_object (#14609) @davidwendt
  • Remove cuda::proclaim_return_type from nested lambda (#14607) @ttnghia
  • Fix DataFrame.reindex when column reindexing to MultiIndex/RangeIndex (#14605) @mroeschke
  • Address potential race conditions in Parquet reader (#14602) @etseidl
  • Fix DataFrame.reindex removing column name (#14601) @mroeschke
  • Remove unsanitized input test data from copy gtests (#14600) @davidwendt
  • Fix race detected in Parquet writer (#14598) @etseidl
  • Correct invalid or missing return types (#14587) @robertmaynard
  • Fix unsanitized nulls from strings segmented-reduce (#14586) @davidwendt
  • Upgrade to nvCOMP 3.0.5 (#14581) @davidwendt
  • Fix unsanitized nulls produced by cudf::clamp APIs (#14580) @davidwendt
  • Fix unsanitized nulls produced by libcudf dictionary decode (#14578) @davidwendt
  • Fixes a symbol group lookup table issue (#14561) @elstehle
  • Drop llvm16 from cuda118-conda devcontainer image (#14526) @charlesbluca
  • REF: Make DataFrame.from_pandas process by column (#14483) @mroeschke
  • Improve memory footprint of isin by using contains (#14478) @wence-
  • Move creation of env.yaml outside the current directory (#14476) @davidwendt
  • Enable pd.Timestamp objects to be picklable when cudf.pandas is active (#14474) @shwina
  • Correct dtype of count aggregations on empty dataframes (#14473) @wence-
  • Avoid DataFrame conversion in MultiIndex.from_pandas (#14470) @mroeschke
  • JSON writer: avoid default stream use in string_scalar constructors (#14444) @vuule
  • Fix default stream use in the CSV reader (#14443) @vuule
  • Preserve DataFrame(columns=).columns dtype during empty-like construction (#14381) @mroeschke
  • Defer PTX file load to runtime (#13690) @brandon-b-miller

📖 Documentation

  • Disable parallel build (#14796) @vyasr
  • Add pylibcudf to the docs (#14791) @vyasr
  • Describe unpickling expectations when cudf.pandas is enabled (#14693) @shwina
  • Update CONTRIBUTING for pyproject-only builds (#14653) @vyasr
  • More doxygen fixes (#14639) @vyasr
  • Enable doxygen XML generation and fix issues (#14477) @vyasr
  • Some doxygen improvements (#14469) @vyasr
  • Remove warning in dask-cudf docs (#14454) @wence-
  • Update README links with redirects. (#14378) @bdice
  • Add pip install instructions to README (#13677) @shwina

🚀 New Features

  • Add ci check for external kernels (#14768) @robertmaynard
  • JSON single quote normalization API (#14729) @shrshi
  • Write cuDF version in Parquet "created_by" metadata field (#14721) @etseidl
  • Implement remaining copying APIs in pylibcudf along with required helper functions (#14640) @vyasr
  • Don't constrain numba<0.58 (#14616) @brandon-b-miller
  • Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet (#14590) @etseidl
  • JSON - Parse mixed types as string in JSON reader (#14572) @karthikeyann
  • JSON quote normalization (#14545) @shrshi
  • Make DefaultHostMemoryAllocator settable (#14523) @gerashegalov
  • Implement more copying APIs in pylibcudf (#14508) @vyasr
  • Include writer code and writerVersion in ORC files (#14458) @vuule
  • Parquet sub-rowgroup reading. (#14360) @nvdbaranec
  • Move chars column to parent data buffer in strings column (#14202) @karthikeyann
  • PARQUET-2261 Size Statistics (#14000) @etseidl
  • Improve GroupBy JIT error handling (#13854) @brandon-b-miller
  • Generate unified Python/C++ docs (#13846) @vyasr
  • Expand JIT groupby test suite (#13813) @brandon-b-miller
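
The two JSON quote normalization items above do not describe their behaviour in these notes. As a rough illustration of what such normalization generally means, the sketch below rewrites single-quoted strings as double-quoted ones so that a strict JSON parser accepts the input. It is plain Python and an assumption about the general technique, not cuDF's API. The function name `normalize_single_quotes` is hypothetical.

```python
import json


def normalize_single_quotes(text: str) -> str:
    """Rewrite single-quoted strings in JSON-like text as double-quoted."""
    out = []
    quote = None  # quote character that opened the current string, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if quote is None:
            if ch in ("'", '"'):
                quote = ch
                out.append('"')
            else:
                out.append(ch)
        elif ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # \' is not a valid JSON escape, so emit a bare apostrophe;
            # every other escape sequence is copied through unchanged.
            out.append("'" if nxt == "'" else ch + nxt)
            i += 1
        elif ch == quote:
            quote = None
            out.append('"')
        elif ch == '"':
            # A bare double quote inside a single-quoted string needs escaping.
            out.append('\\"')
        else:
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    raw = "{'a': 'it\\'s', 'b': 'say \"hi\"'}"
    fixed = normalize_single_quotes(raw)
    print(fixed)
    print(json.loads(fixed))
```

Tracking which quote character opened the current string is what lets apostrophes inside double-quoted strings pass through untouched.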

🛠️ Improvements

  • Move cudf::char_utf8 definition from detail to public header (#14779) @davidwendt
  • Clean up TimedeltaIndex.__init__ constructor (#14775) @mroeschke
  • Clean up DatetimeIndex.__init__ constructor (#14774) @mroeschke
  • Some frame.py typing, move seldom used methods in frame.py (#14766) @mroeschke
  • Remove **kwargs from astype (#14765) @mroeschke
  • fix benchmarks compatibility with newer pytest-cases (#14764) @jameslamb
  • Add pynvjitlink as a dependency (#14763) @brandon-b-miller
  • Resolve degenerate performance in create_structs_data (#14761) @SurajAralihalli
  • Simplify ColumnAccessor methods; avoid unnecessary validations (#14758) @mroeschke
  • Pin pytest-cases<3.8.2 (#14756) @mroeschke
  • Use _from_data instead of _from_columns for initializing Frame (#14755) @mroeschke
  • Consolidate cudf object handling in as_column (#14754) @mroeschke
  • Reduce execution time of Parquet C++ tests (#14750) @vuule
  • Implement to_datetime(..., utc=True) (#14749) @mroeschke
  • Remove usages of rapids-env-update (#14748) @KyleFromNVIDIA
  • Provide explicit pool size and avoid RMM detail APIs (#14741) @harrism
  • Implement cudf.MultiIndex.from_arrays (#14740) @mroeschke
  • Remove unused/single use methods (#14739) @mroeschke
  • refactor CUDA versions in dependencies.yaml (#14733) @jameslamb
  • Remove unneeded methods in Column (#14730) @mroeschke
  • Clean up base column methods (#14725) @mroeschke
  • Ensure column.fillna signatures are consistent (#14724) @mroeschke
  • Remove mimesis as a testing dependency (#14723) @mroeschke
  • Replace as_numerical with as_numerical_column/codes (#14719) @mroeschke
  • Use offsetalator in gather_chars (#14700) @davidwendt
  • Use make_strings_children for fill() specialization logic (#14697) @davidwendt
  • Change io::detail::orc namespace into io::orc::detail (#14696) @ttnghia
  • Fix call to deprecated factory function (#14695) @davidwendt
  • Use as_column instead of arange for range like inputs (#14689) @mroeschke
  • Reorganize ORC reader into multiple files and perform some small fixes to cuIO code (#14665) @ttnghia
  • Split parquet test into multiple files (#14663) @etseidl
  • Custom error messages for IO with nonexistent files (#14662) @vuule
  • Explicitly pass .dtype into is_foo_dtype functions (#14657) @mroeschke
  • Basic validation in reader benchmarks (#14647) @vuule
  • Update dependencies.yaml to support CUDA 12.*. (#14644) @bdice
  • Consolidate memoryview handling in as_column (#14643) @mroeschke
  • Convert FieldType to scoped enum (#14642) @vuule
  • Use isinstance over is_foo_dtype (#14641) @mroeschke
  • Use isinstance over is_foo_dtype internally (#14638) @mroeschke
  • Remove unnecessary **kwargs in function signatures (#14635) @mroeschke
  • Drop nvbench patch for nvml. (#14631) @bdice
  • Drop Pascal GPU support. (#14630) @bdice
  • Add cpp/doxygen/xml to .gitignore (#14613) @davidwendt
  • Create strings-specific make_offsets_child_column for multiple offset types (#14612) @davidwendt
  • Use the offsetalator in cudf::concatenate for strings (#14611) @davidwendt
  • Make Parquet ColumnIndex null_counts optional (#14596) @etseidl
  • Support freq in DatetimeIndex (#14593) @shwina
  • Remove legacy benchmarks for cuDF-python (#14591) @osidekyle
  • Remove WORKSPACE env var from cudf_test temp_directory class (#14588) @davidwendt
  • Use exceptions instead of return values to handle errors in CompactProtocolReader (#14582) @vuule
  • Use cuda::proclaim_return_type on device lambdas. (#14577) @bdice
  • Update to CCCL 2.2.0. (#14576) @bdice
  • Update dependencies.yaml to new pip index (#14575) @vyasr
  • Simplify Python CMake (#14565) @vyasr
  • Java expose parquet pass_read_limit (#14564) @revans2
  • Add column sanitization checks in CUDF_TEST_EXPECT_COLUMN_* macros (#14559) @SurajAralihalli
  • Use cudf_test temp_directory class for nvtext::subword_tokenize gbenchmark (#14558) @davidwendt
  • Fix return type of prefix increment overloads (#14544) @vuule
  • Make bpe_merge_pairs_impl member private (#14543) @davidwendt
  • Small clean up in io::statistics (#14542) @vuule
  • Change json gtest environment variable to compile-time definition (#14541) @davidwendt
  • Remove extra total chars size calculation from cudf::concatenate (#14540) @davidwendt
  • Refactor IndexedFrame.hash_values to use cudf::hashing functions, add xxhash64 to cudf Python. (#14538) @bdice
  • Move non-templated inline function definitions from table_view.hpp to table_view.cpp (#14535) @davidwendt
  • Add JNI for strings::code_points (#14533) @thirtiseven
  • Add a test for issue 12773 (#14529) @vyasr
  • Split libarrow build dependencies. (#14506) @bdice
  • Implement IndexedFrame.duplicated with distinct_indices + scatter (#14493) @wence-
  • Expunge as_frame conversions in Column algorithms (#14491) @wence-
  • Remove unsanitized null from input strings column in rank_tests.cpp (#14475) @davidwendt
  • Refactor Parquet kernel_error (#14464) @etseidl
  • Deprecate cudf::make_strings_column accepting typed offsets (#14461) @davidwendt
  • Remove deprecated nvtext::load_merge_pairs_file (#14460) @davidwendt
  • Introduce Comprehensive Pathological Unit Tests for Issue #14409 (#14459) @aocsa
  • Expose stream parameter in public nvtext APIs (#14456) @davidwendt
  • Include encode type in the error message when unsupported Parquet encoding is detected (#14453) @ZelboK
  • Remove null mask for zero nulls in json readers (#14451) @karthikeyann
  • Refactor cudf.Series.__init__ (#14450) @mroeschke
  • Remove the use of volatile in Parquet (#14448) @vuule
  • REF: Remove **kwargs from to_pandas, raise if nullable is not implemented (#14438) @mroeschke
  • Testing stream pool implementation (#14437) @shrshi
  • Match pandas join ordering obligations in pandas-compatible mode (#14428) @wence-
  • Forward-merge branch-23.12 to branch-24.02 (#14426) @bdice
  • Use isinstance(..., cudf.IntervalDtype) instead of is_interval_dtype (#14424) @mroeschke
  • Use isinstance(..., cudf.CategoricalDtype) instead of is_categorical_dtype (#14423) @mroeschke
  • Forward-merge branch-23.12 to branch-24.02 (#14422) @bdice
  • REF: Remove instances of pd.core (#14421) @mroeschke
  • Expose streams in public filling APIs for label_bins (#14401) @ZelboK
  • Consolidate 1D pandas object handling in as_column (#14394) @mroeschke
  • Limit DELTA_BINARY_PACKED encoder to the same number of bits as the physical type being encoded (#14392) @etseidl
  • Add SHA-1 and SHA-2 hash functions. (#14391) @bdice
  • Expose streams in Parquet reader and writer APIs (#14359) @shrshi
  • Update to fmt 10.1.1 and spdlog 1.12.0. (#14355) @bdice
  • Replace default stream for scalars and column factories usages (because of defaulted arguments) (#14354) @karthikeyann
  • Expose streams in ORC reader and writer APIs (#14350) @shrshi
  • Convert compression and io to string axis type in IO benchmarks (#14347) @SurajAralihalli
  • Add cuDF devcontainers (#14015) @trxcllnt
  • Refactoring of Buffers (last step towards unifying COW and Spilling) (#13801) @madsbk
  • Switch to scikit-build-core (#13531) @vyasr
  • Simplify null count checking in column equality comparator (#13312) @vyasr
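
When trying out the SHA-1 and SHA-2 hash functions listed above (#14391), reference digests computed on the CPU are a convenient baseline for comparison. The sketch below uses only Python's standard `hashlib`. Which SHA-2 variants cuDF exposes, and under what names, is not stated in these notes, so the list of algorithms here is an assumption. The helper name `hex_digests` is hypothetical.

```python
import hashlib
from typing import Dict


def hex_digests(data: bytes) -> Dict[str, str]:
    """Return hex digests of `data` for SHA-1 and the common SHA-2 variants."""
    names = ["sha1", "sha224", "sha256", "sha384", "sha512"]
    return {name: hashlib.new(name, data).hexdigest() for name in names}


if __name__ == "__main__":
    for name, digest in hex_digests(b"abc").items():
        print(f"{name}: {digest}")
```

Digests are computed over raw bytes, so string data must be encoded the same way on both sides before results are compared.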