cuDF - GPU DataFrame Library

v23.10.02

6 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Raise error in reindex when index is not unique (#14429) @galipremsagar
Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-
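
Several fixes in this list move kernels to cudf::thread_index_type to avoid index overflow. The sketch below illustrates the failure mode in plain Python. It assumes the global thread index is computed as block index times block size plus thread index, that the narrow index is a signed 32-bit integer, and that the wider type is 64-bit. The helper names are hypothetical and are not part of libcudf.

```python
import ctypes


def global_index_int32(block_idx, block_dim, thread_idx):
    """Hypothetical helper: global thread index stored in a signed 32-bit integer.

    ctypes.c_int32 does no overflow checking, so the value wraps the same way
    a 32-bit register would.
    """
    return ctypes.c_int32(block_idx * block_dim + thread_idx).value


def global_index_int64(block_idx, block_dim, thread_idx):
    """Hypothetical helper: the same computation in a signed 64-bit integer."""
    return ctypes.c_int64(block_idx * block_dim + thread_idx).value


if __name__ == "__main__":
    block_dim = 256
    # First block whose starting index no longer fits in 31 bits.
    block_idx = 2**31 // block_dim
    print(global_index_int32(block_idx, block_dim, 0))  # -2147483648
    print(global_index_int64(block_idx, block_dim, 0))  # 2147483648
```

With the 32-bit index, a bounds check such as "index < number of rows" passes for the wrapped negative value, which is how an out-of-bounds access can arise on very large inputs.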

📖 Documentation

Fix benchmark image. (#14376) @bdice
Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel
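
For readers unfamiliar with DELTA_BINARY_PACKED, the sketch below shows the arithmetic behind the encoding in plain Python. It is a simplification and not the cuDF reader: it assumes the bit-packed deltas have already been unpacked into a list and that a single minimum delta applies to all of them, whereas the Parquet format stores one minimum delta per block. All function names are hypothetical.

```python
def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 varint from buf starting at pos.

    Returns (value, position of the next unread byte).
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def zigzag_decode(n):
    """Map a zigzag-encoded unsigned integer back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def reconstruct(first_value, min_delta, adjusted_deltas):
    """Rebuild the original values with a running sum.

    Each stored delta has had min_delta subtracted so that it is non-negative,
    so min_delta is added back before accumulating.
    """
    values = [first_value]
    for delta in adjusted_deltas:
        values.append(values[-1] + min_delta + delta)
    return values


if __name__ == "__main__":
    # Original sequence 7, 5, 5, 8 has deltas -2, 0, 3, so min_delta is -2
    # and the stored (adjusted) deltas are 0, 2, 5.
    print(reconstruct(7, -2, [0, 2, 5]))  # [7, 5, 5, 8]
```

The varint and zigzag helpers correspond to how the header fields, such as the first value, are stored. The running sum is the part that makes the encoding compact for slowly changing integer columns.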

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
Update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

v23.04.01

6 months ago

🚨 Breaking Changes

Pin dask and distributed for release (#13070) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate Index.is_* methods (#12820) @galipremsagar
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Make string methods return a Series with a useful Index (#12814) @shwina
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Replace message parsing with throwing more specific exceptions (#12426) @vyasr

🐛 Bug Fixes

Pin curand version (#13127) @vyasr
Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
Fix DataFrame constructor to broadcast scalar inputs properly (#12997) @galipremsagar
Drop force_nullable_schema from chunked parquet writer (#12996) @galipremsagar
Fix gtest column utility comparator diff reporting (#12995) @davidwendt
Handle index names while performing groupby (#12992) @galipremsagar
Fix __setitem__ on string columns when the scalar value ends in a null byte (#12991) @wence-
Fix sort_values when column is all empty strings (#12988) @eriknw
Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983) @rjzamora
Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
cudftestutil supports static gtest dependencies (#12957) @robertmaynard
Include gtest in build environment. (#12956) @vyasr
Correctly handle scalar indices in Index.__getitem__ (#12955) @wence-
Avoid building cython twice (#12945) @galipremsagar
Fix set index error for Series rolling window operations (#12942) @galipremsagar
Fix calculation of null counts for Parquet statistics (#12938) @etseidl
Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
Fix conda recipe post-link.sh typo (#12916) @pentschev
min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
Use python -m pytest for nightly wheel tests (#12871) @bdice
Parquet writer column_size() should return a size_t (#12870) @etseidl
Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
Remove tokenizers pre-install pinning. (#12854) @vyasr
Fix parquet RangeIndex bug (#12838) @rjzamora
Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
Make string methods return a Series with a useful Index (#12814) @shwina
Tell cudf_kafka to use header-only fmt (#12796) @vyasr
Add GroupBy.dtypes (#12783) @galipremsagar
Fix a leak in a test and clarify some test names (#12781) @revans2
Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
Fix a bug with num_keys in _scatter_by_slice (#12749) @thomcom
Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
Add always_nullable flag to Dremel encoding (#12727) @divyegala
Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
Fix faulty conditional logic in JIT GroupBy.apply (#12706) @brandon-b-miller
Produce useful guidance on overflow error in to_csv (#12705) @wence-
Handle parquet list data corner case (#12698) @nvdbaranec
Fix missing trailing comma in json writer (#12688) @karthikeyann
Remove child from newCudaAsyncMemoryResource (#12681) @abellina
Handle bool types in round API (#12670) @galipremsagar
Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
Fix from_arrow to load a sliced arrow table (#12665) @galipremsagar
Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
Fix find_common_dtype and values to handle complex dtypes (#12537) @galipremsagar
Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
Fix Series comparison vs scalars (#12519) @brandon-b-miller
Allow casting from UDFString back to StringView to call methods in strings_udf (#12363) @brandon-b-miller

📖 Documentation

Fix GroupBy.apply doc examples rendering (#12994) @brandon-b-miller
Add Sphinx building and S3 uploading for dask-cudf docs (#12982) @quasiben
Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
Add README symlink for dask-cudf. (#12946) @bdice
Remove return type from @return doxygen tags (#12908) @davidwendt
Fix docs build to be pydata-sphinx-theme=0.13.0 compatible (#12874) @galipremsagar
Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
Enable doctests for GroupBy methods (#12658) @brandon-b-miller
Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt

🚀 New Features

Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
Refactor orc chunked writer (#12949) @ttnghia
Make Parquet writer nullable option applicable to single table writes (#12933) @vuule
Refactor io::orc::ProtobufWriter (#12877) @ttnghia
Make timezone table independent from ORC (#12805) @vuule
Cache JIT GroupBy.apply functions (#12802) @brandon-b-miller
Implement initial support for avro logical types (#6482) (#12788) @tpn
Update tests/column_utilities to use experimental::equality row comparator (#12777) @divyegala
Update distinct/unique_count to experimental::row hasher/comparator (#12776) @divyegala
Update hash_partition to use experimental::row::row_hasher (#12761) @divyegala
Update is_sorted to use experimental::row::lexicographic (#12752) @divyegala
Update default data source in cuio reader benchmarks (#12740) @PointKernel
Reenable stream identification library in CI (#12714) @vyasr
Add regex_program strings splitting java APIs and tests (#12713) @cindyyuanjiang
Add regex_program strings replacing java APIs and tests (#12701) @cindyyuanjiang
Add regex_program strings extract java APIs and tests (#12699) @cindyyuanjiang
Variable fragment sizes for Parquet writer (#12685) @etseidl
Add segmented reduction support for fixed-point types (#12680) @davidwendt
Move strings_udf code into cuDF (#12669) @brandon-b-miller
Add regex_program searching APIs and related java classes (#12666) @cindyyuanjiang
Add logging to libcudf (#12637) @vuule
Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
Convert rank to use experimental row comparators (#12481) @divyegala
Use rapids-cmake parallel testing feature (#12451) @robertmaynard
Enable detection of undesired stream usage (#12089) @vyasr

🛠️ Improvements

Pin dask and distributed for release (#13070) @galipremsagar
Pin cupy in wheel tests to supported versions (#13041) @vyasr
Pin numba version (#13001) @vyasr
Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
Stop setting package version attribute in wheels (#12977) @vyasr
Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
Remove default detail mrs: part7 (#12970) @vyasr
Remove default detail mrs: part6 (#12969) @vyasr
Remove default detail mrs: part5 (#12968) @vyasr
Remove default detail mrs: part4 (#12967) @vyasr
Remove default detail mrs: part3 (#12966) @vyasr
Remove default detail mrs: part2 (#12965) @vyasr
Remove default detail mrs: part1 (#12964) @vyasr
Add force_nullable_schema parameter to Parquet writer. (#12952) @galipremsagar
Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
Remove remaining default stream parameters (#12943) @vyasr
Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
Implement groupby.head and groupby.tail (#12939) @wence-
Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
Pass SCCACHE_S3_USE_SSL to conda builds (#12910) @ajschmidt8
Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
Generate pyproject dependencies using dfg (#12906) @vyasr
Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
Fix moto env vars & pass AWS_SESSION_TOKEN to conda builds (#12902) @ajschmidt8
Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
Deprecate line_terminator in favor of lineterminator in to_csv (#12896) @wence-
Add stream and mr parameters for structs::detail::flatten_nested_columns (#12892) @ttnghia
Deprecate libcudf regex APIs accepting pattern strings directly (#12891) @davidwendt
Remove default parameters from detail headers in include (#12888) @vyasr
Update minimum pandas and numpy pinnings (#12887) @galipremsagar
Implement groupby.sample (#12882) @wence-
Update JNI build ENV default to gcc 11 (#12881) @pxLi
Change return type of cudf::structs::detail::flatten_nested_columns to smart pointer (#12878) @ttnghia
Fix passing seed parameter to MurmurHash3_32 in cudf::hash() function (#12875) @davidwendt
Remove manual artifact upload step in CI (#12869) @ajschmidt8
Update to GCC 11 (#12868) @bdice
Fix null hive-partition behavior in dask-cudf parquet (#12866) @rjzamora
Update to protobuf>=4.21.6,<4.22. (#12864) @bdice
Update RMM allocators (#12861) @pentschev
Improve performance for replace-multi for long strings (#12858) @davidwendt
Drop Python 3.7 handling for pickle protocol 4 (#12857) @jakirkham
Migrate as much as possible to pyproject.toml (#12850) @vyasr
Enable nbqa pre-commit hooks for isort and black. (#12848) @bdice
Setting a threshold for KvikIO IO (#12841) @madsbk
Update datasets download URL (#12840) @jjacobelli
Make docs builds less verbose (#12836) @AyodeAwe
Consolidate linter configs into pyproject.toml (#12834) @vyasr
Deprecate names & dtype in Index.copy (#12825) @galipremsagar
Deprecate inplace parameters in categorical methods (#12824) @galipremsagar
Add optional text file support to ninja-log utility (#12823) @davidwendt
Deprecate Index.is_* methods (#12820) @galipremsagar
Add dfg as a pre-commit hook (#12819) @vyasr
Deprecate datetime_is_numeric from describe (#12818) @galipremsagar
Deprecate na_sentinel in factorize (#12817) @galipremsagar
Shuffling read into a sub function in parquet read (#12809) @hyperbolic2346
Fixing parquet coalescing of reads (#12808) @hyperbolic2346
CI: Remove specification of manual stage for check_style.sh script. (#12803) @csadorf
Add compute-sanitizer github workflow action to nightly tests (#12800) @davidwendt
Enable groupby std and variance aggregation types in libcudf Debug build (#12799) @davidwendt
Expose seed argument to hash_values (#12795) @ayushdg
Fix groupby gtests coded in namespace cudf::test (#12784) @davidwendt
Improve performance for cudf::strings::count_characters for long strings (#12779) @davidwendt
Deallocate encoded data in ORC writer immediately after compression (#12770) @vuule
Stop force pulling fmt in nvbench. (#12768) @vyasr
Remove now redundant cuda initialization (#12758) @vyasr
Adds JSON reader, writer io benchmark (#12753) @karthikeyann
Use test paths relative to package directory. (#12751) @bdice
Add build metrics report as artifact to cpp-build workflow (#12750) @davidwendt
Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT (#12742) @razajafri
Stop using versioneer to manage versions (#12741) @vyasr
Reduce error handling verbosity in CI tests scripts (#12738) @AjayThorve
Reduce the number of test cases in multibyte_split benchmark (#12737) @PointKernel
Update shared workflow branches (#12733) @ajschmidt8
JNI switches to nested JSON reader (#12732) @res-life
Changing cudf::io::source_info to use cudf::host_span<std::byte> in a non-breaking form (#12730) @hyperbolic2346
Add nvbench environment class for initializing RMM in benchmarks (#12728) @davidwendt
Split C++ and Python build dependencies into separate lists. (#12724) @bdice
Add build dependencies to Java tests. (#12723) @bdice
Allow setting the seed argument for hash partition (#12715) @firestarman
Remove gpuCI scripts. (#12712) @bdice
Unpin dask and distributed for development (#12710) @galipremsagar
partition_by_hash(): use _split() (#12704) @madsbk
Remove DataFrame.quantiles from docs. (#12684) @bdice
Fast path for experimental::row::equality (#12676) @divyegala
Move date to build string in conda recipe (#12661) @ajschmidt8
Refactor reduction logic for fixed-point types (#12652) @davidwendt
Pay off some JNI RMM API tech debt (#12632) @revans2
Merge copy-on-write feature branch into branch-23.04 (#12619) @galipremsagar
Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
Pin cuda-nvrtc. (#12606) @bdice
Remove cudf::test::print calls from libcudf gtests (#12604) @davidwendt
Init JNI version 23.04.0-SNAPSHOT (#12599) @pxLi
Add performance benchmarks to user facing docs (#12595) @galipremsagar
Add docs build job (#12592) @AyodeAwe
Replace message parsing with throwing more specific exceptions (#12426) @vyasr
Support conversion to/from cudf in dask.dataframe.core.to_backend (#12380) @rjzamora

v23.10.00a

7 months ago

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-
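
Note on #13776 (Parquet INT96 write): the sketch below illustrates the general idea behind the entry title, not libcudf's actual code. It assumes the usual INT96 timestamp layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, little-endian) and a timestamp stored as microseconds since the Unix epoch. The function name `micros_to_int96` and the constants are hypothetical. Converting the whole timestamp to nanoseconds can exceed the 64-bit range for far-future dates, while converting only the time-of-day part cannot.

```python
import struct
from datetime import datetime, timedelta

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588      # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000
INT64_MAX = 2**63 - 1


def micros_to_int96(ts_micros: int) -> bytes:
    """Encode microseconds-since-epoch as a 12-byte INT96 value (hypothetical helper)."""
    # Split first, so only the time-of-day part is scaled to nanoseconds.
    days, micros_of_day = divmod(ts_micros, MICROS_PER_DAY)
    nanos_of_day = micros_of_day * 1_000   # at most 8.64e13, far below INT64_MAX
    julian_day = days + JULIAN_DAY_OF_UNIX_EPOCH
    return struct.pack("<qI", nanos_of_day, julian_day)


# A timestamp in the year 3000, expressed in microseconds since the epoch.
far_future = (datetime(3000, 1, 1) - datetime(1970, 1, 1)) // timedelta(microseconds=1)

# Scaling the full timestamp to nanoseconds would not fit in a signed 64-bit integer.
print(far_future * 1_000 > INT64_MAX)

# Scaling only the time of day does fit, so the encoding succeeds.
encoded = micros_to_int96(far_future)
print(len(encoded), struct.unpack("<qI", encoded))
```

Python integers do not overflow, so the comparison against `INT64_MAX` stands in for what a fixed-width 64-bit cast would hit.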

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Update shared-action-workflows references (backport from 23.12 to 23.10) (#14300) @AyodeAwe
Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule
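
Note on the thread_index_type entries (#13895, #13971, #13973 and related): several entries in this release switch kernel index arithmetic to cudf::thread_index_type, which to the best of my knowledge is a 64-bit integer alias. The sketch below is a Python simulation of why that matters in a grid-stride loop, not CUDA and not libcudf code. It assumes a 32-bit index wraps in two's-complement fashion on overflow (in C++ signed overflow is formally undefined, so this is only the typical outcome). The helper names, the row count, and the stride are illustrative choices.

```python
import ctypes


def as_int32(value: int) -> int:
    """Store value the way a 32-bit signed int typically would (wraps on overflow)."""
    return ctypes.c_int32(value).value


def visited_indices(first, stride, num_rows, use_32bit, max_steps=4):
    """Indices one thread touches in a grid-stride loop, capped at max_steps."""
    visited = []
    idx = first
    while idx < num_rows and len(visited) < max_steps:
        visited.append(idx)
        idx += stride
        if use_32bit:
            idx = as_int32(idx)  # simulate a 32-bit loop variable
    return visited


NUM_ROWS = 2**31 - 1   # largest row count a 32-bit signed size type can hold
STRIDE = 2**30         # illustrative total thread count (blockDim * gridDim)

narrow = visited_indices(0, STRIDE, NUM_ROWS, use_32bit=True)
wide = visited_indices(0, STRIDE, NUM_ROWS, use_32bit=False)

print(narrow)  # the third index has wrapped to a negative value
print(wide)    # the loop stops once the index passes NUM_ROWS
```

With the 32-bit index, the loop condition `idx < num_rows` stays true after the wrap, so the thread would keep going with out-of-range indices; the `max_steps` cap exists only to stop the simulation. With a wider index the addition does not wrap and the loop ends normally.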

v23.10.00

7 months ago

🚨 Breaking Changes

Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Create table_input_metadata from a table_metadata (#13920) @etseidl
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Update to Cython 3.0.0 (#13777) @vyasr
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

🐛 Bug Fixes

Fix inaccurate ceil/floor and inaccurate rescaling casts of fixed-point values. (#14242) @bdice
Fix inaccuracy in decimal128 rounding. (#14233) @bdice
Workaround for illegal instruction error in sm90 for warp intrinsics with mask (#14201) @karthikeyann
Fix pytorch related pytest (#14198) @galipremsagar
Pin to aws-sdk-cpp<1.11 (#14173) @pentschev
Fix assert failure for range window functions (#14168) @mythrocks
Fix Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings (#14164) @karthikeyann
Fix calls to copy_bitmask to pass stream parameter (#14158) @davidwendt
Fix DataFrame from Series with different CategoricalIndexes (#14157) @mroeschke
Pin to numpy<1.25 and numba<0.58 to avoid errors and deprecation warnings-as-errors. (#14156) @bdice
Fix kernel launch error for cudf::io::orc::gpu::rowgroup_char_counts_kernel (#14139) @davidwendt
Don't sort columns for DataFrame init from list of Series (#14136) @mroeschke
Fix DataFrame.values with no columns but index (#14134) @mroeschke
Avoid circular cimports in _lib/cpp/reduce.pxd (#14125) @vyasr
Add support for nested dict in DataFrame constructor (#14119) @galipremsagar
Restrict iterables of DataFrame's as input to DataFrame constructor (#14118) @galipremsagar
Allow numeric_only=True for reduction operations on numeric types (#14111) @galipremsagar
Preserve name of the column while initializing a DataFrame (#14110) @galipremsagar
Correct numerous 20054-D: dynamic initialization errors found on arm+12.2 (#14108) @robertmaynard
Drop kwargs from Series.count (#14106) @galipremsagar
Fix naming issues with Index.to_frame and MultiIndex.to_frame APIs (#14105) @galipremsagar
Only use memory resources that haven't been freed (#14103) @robertmaynard
Add support for __round__ in Series and DataFrame (#14099) @galipremsagar
Validate ignore_index type in drop_duplicates (#14098) @mroeschke
Fix renaming Series and Index (#14080) @galipremsagar
Raise NotImplementedError in to_datetime if Z (or tz component) in string (#14074) @mroeschke
Raise NotImplementedError for datetime strings with UTC offset (#14070) @mroeschke
Update pyarrow-related dispatch logic in dask_cudf (#14069) @rjzamora
Use conda mambabuild rather than mamba mambabuild (#14067) @wence-
Raise NotImplementedError in to_datetime with dayfirst without infer_format (#14058) @mroeschke
Fix various issues in Index.intersection (#14054) @galipremsagar
Fix Index.difference to match with pandas (#14053) @galipremsagar
Fix empty string column construction (#14052) @galipremsagar
Fix IntervalIndex.union to preserve type-metadata (#14051) @galipremsagar
Raise MixedTypeError when a column of mixed-dtype is being constructed (#14050) @galipremsagar
Raise NotImplementedError for MultiIndex.to_series (#14049) @galipremsagar
Ignore compile_commands.json (#14048) @harrism
Raise TypeError for any non-parseable argument in to_datetime (#14044) @mroeschke
Raise NotImplementedError for to_datetime with z format (#14037) @mroeschke
Implement sort_remaining for sort_index (#14033) @wence-
Raise NotImplementedError for Categoricals with timezones (#14032) @mroeschke
Temporary fix Parquet metadata with empty value string being ignored from writing (#14026) @ttnghia
Preserve types of scalar being returned when possible in quantile (#14014) @galipremsagar
Fix return type of MultiIndex.difference (#14009) @galipremsagar
Raise an error when timezone subtypes are encountered in pd.IntervalDtype (#14006) @galipremsagar
Fix map column can not be non-nullable for java (#14003) @res-life
Fix name selection in Index.difference and Index.intersection (#13986) @galipremsagar
Restore column type metadata with dropna to fix factorize API (#13980) @galipremsagar
Use thread_index_type to avoid out of bounds accesses in conditional joins (#13971) @vyasr
Fix MultiIndex.to_numpy to return numpy array with tuples (#13966) @galipremsagar
Use cudf::thread_index_type in get_json_object and tdigest kernels (#13962) @nvdbaranec
Fix an issue with IntervalIndex.repr when null values are present (#13958) @galipremsagar
Fix type metadata issue preservation with Column.unique (#13957) @galipremsagar
Handle Interval scalars when passed in list-like inputs to cudf.Index (#13956) @galipremsagar
Fix setting of categories order when dtype is passed to a CategoricalColumn (#13955) @galipremsagar
Handle as_index in GroupBy.apply (#13951) @brandon-b-miller
Raise error for string types in nsmallest and nlargest (#13946) @galipremsagar
Fix index of Groupby.apply results when it is performed on empty objects (#13944) @galipremsagar
Fix integer overflow in shim device_sum functions (#13943) @brandon-b-miller
Fix type mismatch in groupby reduction for empty objects (#13942) @galipremsagar
Fixed processed bytes calculation in APPLY_BOOLEAN_MASK benchmark. (#13937) @Blonck
Fix construction of Grouping objects (#13932) @galipremsagar
Fix an issue with loc when column names is MultiIndex (#13929) @galipremsagar
Fix handling of typecasting in searchsorted (#13925) @galipremsagar
Preserve index name in reindex (#13917) @galipremsagar
Use cudf::thread_index_type in cuIO to prevent overflow in row indexing (#13910) @vuule
Fix for encodings listed in the Parquet column chunk metadata (#13907) @etseidl
Use cudf::thread_index_type in concatenate.cu. (#13906) @bdice
Use cudf::thread_index_type in replace.cu. (#13905) @bdice
Add noSanitizer tag to Java reduction tests failing with sanitizer in CUDA 12 (#13904) @jlowe
Remove the internal use of the cudf's default stream in cuIO (#13903) @vuule
Use cuda-nvtx-dev CUDA 12 package. (#13901) @bdice
Use thread_index_type to avoid index overflow in grid-stride loops (#13895) @PointKernel
Fix memory access error in cudf::shift for sliced strings (#13894) @davidwendt
Raise error when trying to construct a DataFrame with mixed types (#13889) @galipremsagar
Return nan when one variable to be correlated has zero variance in JIT GroupBy Apply (#13884) @brandon-b-miller
Correctly detect the BOM mark in read_csv with compressed input (#13881) @vuule
Check for the presence of all values in MultiIndex.isin (#13879) @galipremsagar
Fix nvtext::generate_character_ngrams performance regression for longer strings (#13874) @davidwendt
Fix return type of MultiIndex.levels (#13870) @galipremsagar
Fix List's missing children metadata in JSON writer (#13869) @karthikeyann
Disable construction of Index when freq is set in pandas-compatibility mode (#13857) @galipremsagar
Fix an issue with fetching NA from a TimedeltaColumn (#13853) @galipremsagar
Simplify implementation of interval_range() and fix behaviour for floating freq (#13844) @shwina
Fix binary operations between Series and Index (#13842) @galipremsagar
Update make_lists_column_from_scalar to use make_offsets_child_column utility (#13841) @davidwendt
Fix read out of bounds in string concatenate (#13838) @pentschev
Raise error for more cases when timezone-aware data is passed to as_column (#13835) @galipremsagar
Fix any, all reduction behavior for axis=None and warn for other reductions (#13831) @galipremsagar
Raise error when trying to construct time-zone aware timestamps (#13830) @galipremsagar
Fix cuFile I/O factories (#13829) @vuule
DataFrame with namedtuples uses ._field as column names (#13824) @mroeschke
Branch 23.10 merge 23.08 (#13822) @vyasr
Return a Series from JIT GroupBy apply, rather than a DataFrame (#13820) @brandon-b-miller
No need to dlsym EnsureS3Finalized we can call it directly (#13819) @robertmaynard
Raise error when mixed types are being constructed (#13816) @galipremsagar
Fix unbounded sequence issue in DataFrame constructor (#13811) @galipremsagar
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs (#13807) @davidwendt
Fix for Parquet writer when requested pages per row is smaller than fragment size (#13806) @etseidl
Remove hangs from trying to construct un-bounded sequences (#13799) @galipremsagar
Bug/update libcudf to handle arrow12 changes (#13794) @robertmaynard
Update get_arrow to Arrow 12's CMake target name of arrow::xsimd (#13790) @robertmaynard
Raise error when trying to join datetime and timedelta types with other types (#13786) @galipremsagar
Fix negative unary operation for boolean type (#13780) @galipremsagar
Fix contains(in) method for Series (#13779) @galipremsagar
Fix binary operation column ordering and missing column issues (#13778) @galipremsagar
Cast only time of day to nanos to avoid an overflow in Parquet INT96 write (#13776) @gerashegalov
Preserve names of column object in various APIs (#13772) @galipremsagar
Raise error on constructing an array from mixed type inputs (#13768) @galipremsagar
Fix construction of DataFrames from dict when columns are provided (#13766) @wence-
Provide our own Cython declaration for make_unique (#13746) @wence-

📖 Documentation

Fix typo in docstring: metadata. (#14025) @bdice
Fix typo in parquet/page_decode.cuh (#13849) @XinyuZeng
Simplify Python doc configuration (#13826) @vyasr
Update documentation to reflect recent changes in JSON reader and writer (#13791) @vuule
Fix all warnings in Python docs (#13789) @vyasr

🚀 New Features

[Java] Add JNI bindings for integers_to_hex (#14205) @razajafri
Propagate errors from Parquet reader kernels back to host (#14167) @vuule
JNI for HISTOGRAM and MERGE_HISTOGRAM aggregations (#14154) @ttnghia
Expose streams in all public sorting APIs (#14146) @vyasr
Enable direct ingestion and production of Arrow scalars (#14121) @vyasr
Implement GroupBy.value_counts to match pandas API (#14114) @stmio
Refactor parquet thrift reader (#14097) @etseidl
Refactor hash_reduce_by_row (#14095) @ttnghia
Support negative preceding/following for ROW window functions (#14093) @mythrocks
Support for progressive parquet chunked reading. (#14079) @nvdbaranec
Implement HISTOGRAM and MERGE_HISTOGRAM aggregations (#14045) @ttnghia
Expose streams in public search APIs (#14034) @vyasr
Expose streams in public replace APIs (#14010) @vyasr
Add stream parameter to public cudf::strings::split APIs (#13997) @davidwendt
Expose streams in public filling APIs (#13990) @vyasr
Expose streams in public concatenate APIs (#13987) @vyasr
Use HostMemoryAllocator in jni::allocate_host_buffer (#13975) @gerashegalov
Enable fractional null probability for hashing benchmark (#13967) @Blonck
Switch pylibcudf-enabled types to use enum class in Cython (#13931) @vyasr
Add nvtext::tokenize_with_vocabulary API (#13930) @davidwendt
Rewrite DataFrame.stack to support multi level column names (#13927) @isVoid
Add HostMemoryAllocator interface (#13924) @gerashegalov
Global stream pool (#13922) @etseidl
Create table_input_metadata from a table_metadata (#13920) @etseidl
Translate column size overflow exception to JNI (#13911) @mythrocks
Enable RLE boolean encoding for v2 Parquet files (#13886) @etseidl
Exclude some tests from running with the compute sanitizer (#13872) @firestarman
Expand statistics support in ORC writer (#13848) @vuule
Register the memory mapped buffer in datasource to improve H2D throughput (#13814) @vuule
Add cudf::strings::find function with target per row (#13808) @davidwendt
Add minhash support for MurmurHash3_x64_128 (#13796) @davidwendt
Remove unnecessary pointer copying in JIT GroupBy Apply (#13792) @brandon-b-miller
Add 'poll' function to custreamz kafka consumer (#13782) @jdye64
Support corr in GroupBy.apply through the jit engine (#13767) @shwina
Optionally write version 2 page headers in Parquet writer (#13751) @etseidl
Support more numeric types in Groupby.apply with engine='jit' (#13729) @brandon-b-miller
[FEA] Add DELTA_BINARY_PACKED decoding support to Parquet reader (#13637) @etseidl
Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader (#13437) @PointKernel

🛠️ Improvements

Pin dask and distributed for 23.10 release (#14225) @galipremsagar
update rmm tag path (#14195) @AyodeAwe
Disable Recently Updated Check (#14193) @ajschmidt8
Move cpp/src/hash/hash_allocator.cuh to include/cudf/hashing/detail (#14163) @davidwendt
Add Parquet reader benchmarks for row selection (#14147) @vuule
Update image names (#14145) @AyodeAwe
Support callables in DataFrame.assign (#14142) @wence-
Reduce memory usage of as_categorical_column (#14138) @wence-
Replace Python scalar conversions with libcudf (#14124) @vyasr
Update to clang 16.0.6. (#14120) @bdice
Fix type of empty Index and raise warning in Series constructor (#14116) @galipremsagar
Add stream parameter to external dict APIs (#14115) @SurajAralihalli
Add fallback matrix for nvcomp. (#14082) @bdice
[Java] Add recoverWithNull to JSONOptions and pass to Table.readJSON (#14078) @andygrove
Remove header tests (#14072) @ajschmidt8
Refactor contains_table with cuco::static_set (#14064) @PointKernel
Remove debug print in a Parquet test (#14063) @vuule
Expose stream parameter in public nvtext ngram APIs (#14061) @davidwendt
Expose stream parameter in public strings find APIs (#14060) @davidwendt
Update doxygen to 1.9.1 (#14059) @vyasr
Remove the mr from the base fixture (#14057) @vyasr
Expose streams in public strings case APIs (#14056) @davidwendt
Refactor libcudf indexalator to typed normalator (#14043) @davidwendt
Use cudf::make_empty_column instead of column_view constructor (#14030) @davidwendt
Remove quadratic runtime due to accessing Frame._dtypes in loop (#14028) @wence-
Explicitly depend on zlib in conda recipes (#14018) @wence-
Use grid_stride for stride computations. (#13996) @bdice
Fix an issue where casting null-array to object dtype will result in a failure (#13994) @galipremsagar
Add tab as literal to cudf::test::to_string output (#13993) @davidwendt
Enable codes dtype parity in pandas-compatibility mode for factorize API (#13982) @galipremsagar
Fix CategoricalIndex ordering in Groupby.agg when pandas-compatibility mode is enabled (#13978) @galipremsagar
Produce a fatal error if cudf is unable to find pyarrow include directory (#13976) @cwharris
Use thread_index_type in partitioning.cu (#13973) @divyegala
Use cudf::thread_index_type in merge.cu (#13972) @divyegala
Use copy-pr-bot (#13970) @ajschmidt8
Use cudf::thread_index_type in strings custom kernels (#13968) @davidwendt
Add bytes_per_second to hash_partition benchmark (#13965) @Blonck
Added pinned pool reservation API for Java (#13964) @revans2
Simplify wheel build scripts and allow alphas of RAPIDS dependencies (#13963) @vyasr
Add bytes_per_second to copy_if_else benchmark (#13960) @Blonck
Add pandas compatible output to Series.unique (#13959) @galipremsagar
Add bytes_per_second to compiled binaryop benchmark (#13938) @Blonck
Unpin dask and distributed for 23.10 development (#13935) @galipremsagar
Make HostColumnVector.getRefCount public (#13934) @abellina
Use cuco::static_set in JSON tree algorithm (#13928) @karthikeyann
Add Java API to get size of host memory needed to copy column view (#13919) @revans2
Use cudf::size_type instead of int32 where appropriate in nvtext functions (#13915) @davidwendt
Enable hugepage for arrow host allocations (#13914) @madsbk
Improve performance of nvtext::edit_distance (#13912) @davidwendt
Ensure cudf internals use pylibcudf in pure Python mode (#13909) @vyasr
Use empty() instead of size() where possible (#13908) @vuule
[JNI] Adds HostColumnVector.EventHandler for spillability checks (#13898) @abellina
Return Timestamp & Timedelta for fetching scalars in DatetimeIndex & TimedeltaIndex (#13896) @galipremsagar
Allow explicit shuffle="p2p" within dask-cudf API (#13893) @rjzamora
Disable creation of DatetimeIndex when freq is passed to cudf.date_range (#13890) @galipremsagar
Bring parity with pandas for datetime & timedelta comparison operations (#13877) @galipremsagar
Change NA to NaT for datetime and timedelta types (#13868) @galipremsagar
Raise error when astype(object) is called in pandas compatibility mode (#13862) @galipremsagar
Fixes a performance regression in FST (#13850) @elstehle
Set native handles to null on close in Java wrapper classes (#13818) @jlowe
Avoid use of CUDF_EXPECTS in libcudf unit tests outside of helper functions with return values (#13812) @vuule
Update lists::contains to experimental row comparator (#13810) @divyegala
Reduce lists::contains dispatches for scalars (#13805) @divyegala
Long string optimization for string column parsing in JSON reader (#13803) @karthikeyann
Raise NotImplementedError for pd.SparseDtype (#13798) @mroeschke
Remove the libcudf cudf::offset_type type (#13788) @davidwendt
Move Spark-independent Table debug to cudf Java (#13783) @gerashegalov
Update to Cython 3.0.0 (#13777) @vyasr
Refactor Parquet reader handling of V2 page header info (#13775) @etseidl
Branch 23.10 merge 23.08 (#13773) @vyasr
Restructure JSON code to correctly reflect legacy/experimental status (#13757) @vuule
Branch 23.10 merge 23.08 (#13753) @vyasr
Enforce deprecations in 23.10 (#13732) @galipremsagar
Upgrade to arrow 12 (#13728) @galipremsagar
Refactors JSON reader's pushdown automaton (#13716) @elstehle
Remove Arrow dependency from the datasource.hpp public header (#13698) @vuule

v23.12.00a

7 months ago

🚨 Breaking Changes

Raise error in reindex when index is not unique (#14400) @galipremsagar
Expose stream parameter to get_json_object API (#14297) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule

🐛 Bug Fixes

Fix function name typo in cudf.pandas profiler (#14514) @galipremsagar
Fix intermediate type checking in expression parsing (#14445) @vyasr
Forward merge branch-23.10 into branch-23.12 (#14435) @raydouglass
Remove needs: wheel-build-cudf. (#14427) @bdice
Fix dask dependency in custreamz (#14420) @vyasr
Ensure nvbench initializes nvml context when built statically (#14411) @robertmaynard
Support Java AST String literal with desired encoding (#14402) @winningsix
Raise error in reindex when index is not unique (#14400) @galipremsagar
Always build nvbench statically so we don't need to package it (#14399) @robertmaynard
Fix token-count logic in nvtext::tokenize_with_vocabulary (#14393) @davidwendt
Fix as_column(pd.Timestamp/Timedelta, length=) not respecting length (#14390) @mroeschke
cudf.pandas: cuDF subpath checking in module __getattr__ (#14388) @shwina
Fix and disable encoding for nanosecond statistics in ORC writer (#14367) @vuule
Add the new manylinux builds to the build job (#14351) @vyasr
cudf jit parser now supports .pragma instructions with quotes (#14348) @robertmaynard
Fix overflow check in cudf::merge (#14345) @divyegala
Add cramjam (#14344) @vyasr
Enable dask_cudf/io pytests in CI (#14338) @galipremsagar
Temporarily avoid the current build of pydata-sphinx-theme (#14332) @vyasr
Fix host buffer access from device function in the Parquet reader (#14328) @vuule
Run IO tests for Dask-cuDF (#14327) @rjzamora
Fix logical type issues in the Parquet writer (#14322) @vuule
Remove aws-sdk-pinning and revert to arrow 12.0.1 (#14319) @vyasr
Test is_valid before reading column data (#14318) @etseidl
Fix gtest validity setting for TextTokenizeTest.Vocabulary (#14312) @davidwendt
Fixes stack context for json lines format that recovers from invalid JSON lines (#14309) @elstehle
Downgrade to Arrow 12.0.0 for aws-sdk-cpp and fix cudf_kafka builds for new CI containers (#14296) @vyasr
Fix thread index overflow issue (#14290) @hyperbolic2346
Fix memset error in nvtext::edit_distance_matrix (#14283) @davidwendt
Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record (#14279) @elstehle
Handle empty string correctly in Parquet statistics (#14257) @etseidl
Fixes behaviour for incomplete lines when recover_with_nulls is enabled (#14252) @elstehle
cudf::detail::pinned_allocator doesn't throw from deallocate (#14251) @robertmaynard
Fix strings replace for adjacent, identical multi-byte UTF-8 character targets (#14235) @davidwendt
Fix the precision when converting a decimal128 column to an arrow array (#14230) @jihoonson
Fixing parquet list of struct interpretation (#13715) @hyperbolic2346

📖 Documentation

Fix io reference in docs. (#14452) @bdice
Update README (#14374) @shwina
Example code for blog on new row comparators (#13795) @divyegala

🚀 New Features

Expose streams in public unary APIs (#14342) @vyasr
Add python tests for Parquet DELTA_BINARY_PACKED encoder (#14316) @etseidl
Update rapids-cmake functions to non-deprecated signatures (#14265) @robertmaynard
Expose streams in public null mask APIs (#14263) @vyasr
Expose streams in binaryop APIs (#14187) @vyasr
Add pylibcudf.Scalar that interoperates with Arrow scalars (#14133) @vyasr
Add decoder for DELTA_BYTE_ARRAY to Parquet reader (#14101) @etseidl
Add DELTA_BINARY_PACKED encoder for Parquet writer (#14100) @etseidl
Add BytePairEncoder class to cuDF (#13891) @davidwendt
Upgrade to nvCOMP 3.0.4 (#13815) @vuule
Use pynvjitlink for CUDA 12+ MVC (#13650) @brandon-b-miller

🛠️ Improvements

Build concurrency for nightly and merge triggers (#14441) @bdice
Cleanup remaining usages of dask dependencies (#14407) @galipremsagar
Update to Arrow 14.0.1. (#14387) @bdice
Remove Cython libcpp wrappers (#14382) @vyasr
Forward-merge branch-23.10 to branch-23.12 (#14372) @bdice
Upgrade to arrow 14 (#14371) @galipremsagar
Fix a pytest typo in test_kurt_skew_error (#14368) @galipremsagar
Use new rapids-dask-dependency metapackage for managing dask versions (#14364) @vyasr
Change nullable() to has_nulls() in cudf::detail::gather (#14363) @divyegala
Split up scan_inclusive.cu to improve its compile time (#14358) @davidwendt
Implement user_datasource_wrapper is_empty() and is_device_read_preferred(). (#14357) @tpn
Added streams to CSV reader and writer API (#14340) @shrshi
Upgrade wheels to use arrow 13 (#14339) @vyasr
Rework nvtext::byte_pair_encoding API (#14337) @davidwendt
Improve performance of nvtext::tokenize_with_vocabulary for long strings (#14336) @davidwendt
Upgrade arrow to 13 (#14330) @galipremsagar
Expose stream parameter in public nvtext replace APIs (#14329) @davidwendt
Drop pyorc dependency and use pandas/pyarrow instead (#14323) @galipremsagar
Avoid pyarrow.fs import for local storage (#14321) @rjzamora
Unpin dask and distributed for 23.12 development (#14320) @galipremsagar
Expose stream parameter in public nvtext tokenize APIs (#14317) @davidwendt
Added streams to JSON reader and writer API (#14313) @shrshi
Minor improvements in source_info (#14308) @vuule
Forward-merge branch-23.10 to branch-23.12 (#14307) @bdice
Add stream parameter to Set Operations (Public List APIs) (#14305) @SurajAralihalli
Expose stream parameter to get_json_object API (#14297) @davidwendt
Sort dictionary data alphabetically in the ORC writer (#14295) @vuule
Expose stream parameter in public strings filter APIs (#14293) @davidwendt
Refactor cudf_kafka to use skbuild (#14292) @jdye64
Update shared-action-workflows references (#14289) @AyodeAwe
Register partd encode dispatch in dask_cudf (#14287) @rjzamora
Update versioning strategy (#14285) @vyasr
Move and rename byte-pair-encoding source files (#14284) @davidwendt
Expose stream parameter in public strings combine APIs (#14281) @davidwendt
Expose stream parameter in public strings contains APIs (#14280) @davidwendt
Add stream parameter to List Sort and Filter APIs (#14272) @SurajAralihalli
Use branch-23.12 workflows. (#14271) @bdice
Refactor LogicalType for Parquet (#14264) @etseidl
Centralize chunked reading code in the parquet reader to reader_impl_chunking.cu (#14262) @nvdbaranec
Expose stream parameter in public strings replace APIs (#14261) @davidwendt
Expose stream parameter in public strings APIs (#14260) @davidwendt
Cleanup of namespaces in parquet code. (#14259) @nvdbaranec
Make parquet schema index type consistent (#14256) @hyperbolic2346
Expose stream parameter in public strings convert APIs (#14255) @davidwendt
Add Java bindings for DataSource (#14254) @revans2
Reimplement cudf::merge for nested types without using comparators (#14250) @divyegala
Add stream parameter to List Manipulation and Operations APIs (#14248) @SurajAralihalli
Expose stream parameter in public strings split/partition APIs (#14247) @davidwendt
Improve contains_column by invoking contains_table (#14238) @PointKernel
Detect and report errors in Parquet header parsing (#14237) @etseidl
Normalizing offsets iterator (#14234) @davidwendt
Forward merge 23.10 into 23.12 (#14231) @galipremsagar
Return error if BOOL8 column-type is used with integers-to-hex (#14208) @davidwendt
Enable indexalator for device code (#14206) @davidwendt
Marginally reduce memory footprint of joins (#14197) @wence-
Add nvtx annotations to spilling-based data movement (#14196) @wence-
Optimize ORC writer for decimal columns (#14190) @vuule
Remove the use of volatile in ORC (#14175) @vuule
Add bytes_per_second to distinct_count of stream_compaction nvbench. (#14172) @Blonck
Add bytes_per_second to transpose benchmark (#14170) @Blonck
cuDF: Build CUDA 12.0 ARM conda packages. (#14112) @bdice
Add bytes_per_second to shift benchmark (#13950) @Blonck
Extract debug_utilities.hpp/cu from column_utilities.hpp/cu (#13720) @ttnghia

v23.08.00

9 months ago

🚨 Breaking Changes

Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Expose streams in all public copying APIs (#13629) @vyasr
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
Fix typo in wheels-test.yaml. (#13763) @bdice
Don't test strings shorter than the requested ngram size (#13758) @vyasr
Add CUDA version to custreamz build string. (#13754) @bdice
Fix writing of ORC files with empty child string columns (#13745) @vuule
Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
Fix character counting when writing sliced tables into ORC (#13721) @vuule
Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
Fix a corner case of list lexicographic comparator (#13701) @ttnghia
Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
Revert fetch-rapids changes (#13696) @vyasr
Data generator - include offsets in the size estimate of list elements (#13688) @vuule
Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
[REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
[Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
Refactor Index search to simplify code and increase correctness (#13625) @wence-
Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
Fix tz_localize for dask_cudf Series (#13610) @shwina
Fix issue with no decompressed data in ORC reader (#13609) @vuule
Fix floating point window range extents. (#13606) @mythrocks
Fix localize(None) for timezone-naive columns (#13603) @shwina
Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
Bring parity with pandas in Index.join (#13589) @galipremsagar
Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
Fix Parquet multi-file reading (#13584) @etseidl
Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
Fix an issue with dask_cudf.read_csv when lines need to be skipped (#13555) @galipremsagar
Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
Fix the null mask size in json reader (#13537) @karthikeyann
Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
Make sure to build without isolation or installing dependencies (#13524) @vyasr
Remove preload lib from CMake for now (#13519) @vyasr
Fix missing separator after null values in JSON writer (#13503) @karthikeyann
Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
Update all versions in pyproject.toml files. (#13486) @bdice
Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
Fix chunked Parquet reader benchmark (#13482) @vuule
Update JNI JSON reader column compatibility for Spark (#13477) @revans2
Fix unsanitized output of scan with strings (#13455) @davidwendt
Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
Add pylibcudf to developer guide (#13639) @vyasr
Fix repeated words in doxygen text (#13598) @karthikeyann
Update docs for top-level API. (#13592) @bdice
Fix the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
Document stream validation approach used in testing (#13556) @vyasr
Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
Add read_parquet_metadata libcudf API (#13663) @karthikeyann
Expose streams in all public copying APIs (#13629) @vyasr
Add XXHash_64 hash function to cudf (#13612) @davidwendt
Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
Add pylibcudf subpackage with gather implementation (#13562) @vyasr
Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
Floating point order-by columns for RANGE window functions (#13512) @mythrocks
Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
Add abs function to apply (#13408) @brandon-b-miller
[FEA] AST filtering in parquet reader (#13348) @karthikeyann
[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
Update struct_minmax_util to experimental row comparator (#13069) @divyegala
Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

Pin dask and distributed for 23.08 release (#13802) @galipremsagar
Relax protobuf pinnings. (#13770) @bdice
Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
Switch to new wheel building pipeline (#13723) @vyasr
Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
Adding identify minimum version requirement (#13713) @hyperbolic2346
Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Optimize ORC reader performance for list data (#13708) @vyasr
Fix limit overflow message in a docstring (#13703) @ahmet-uyar
Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
Update cython-lint and replace flake8 with ruff (#13699) @vyasr
Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
Add nvtext hash_character_ngrams function (#13654) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Acquire spill lock in to/from_arrow (#13646) @shwina
Expose stable versions of libcudf sort routines (#13634) @wence-
Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
Add convert_dtypes API (#13623) @shwina
Clean up cupy in dependencies.yaml. (#13617) @bdice
Use cuda-version to constrain cudatoolkit. (#13615) @bdice
Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
Performance improvement for cudf::strings::like (#13594) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
Add Java bindings for distinct count (#13573) @revans2
Use nvcomp conda package. (#13566) @bdice
Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
Add dispatch for cudf.DataFrame to/from pyarrow.Table conversion (#13558) @rjzamora
Get rid of cuco::pair_type aliases (#13553) @PointKernel
Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
Clarify source of error message in stream testing. (#13541) @bdice
Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
Update to CMake 3.26.4 (#13538) @vyasr
S3 folder naming fix (#13536) @AyodeAwe
Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
Add libcufile to dependencies.yaml. (#13523) @bdice
Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
Use rapids-upload-docs script (#13518) @AyodeAwe
Support UTF-8 BOM in CSV reader (#13516) @davidwendt
Move stream-related test configuration to CMake (#13513) @vyasr
Implement cudf.option_context (#13511) @galipremsagar
Unpin dask and distributed for development (#13508) @galipremsagar
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Use test default stream (#13506) @vyasr
Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
Use east const in include files (#13494) @karthikeyann
Use east const in src files (#13493) @karthikeyann
Use east const in tests files (#13492) @karthikeyann
Use east const in benchmarks files (#13491) @karthikeyann
Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
Use pandas public APIs where available (#13467) @mroeschke
Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Separate io-text and nvtext pytests into different files (#13435) @davidwendt
Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
Allow newer scikit-build (#13424) @vyasr
Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
Inline Cython exception handler (#13411) @vyasr
Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
Refactor ORC reader (#13396) @ttnghia
JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
Add tests of currently unsupported indexing (#13338) @wence-
Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
Add stacktrace into cudf exception types (#13298) @ttnghia
cuDF: Build CUDA 12 packages (#12922) @bdice

v23.06.00a

10 months ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64-bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
Run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

v23.06.01

10 months ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
Run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

v23.06.00

11 months ago

🚨 Breaking Changes

Fix batch processing for parquet writer (#13438) @ttnghia
Use <NA> instead of null to match pandas. (#13415) @bdice
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Remove null mask and null count from column_view constructors (#13311) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Cleanup Parquet chunked writer (#13094) @ttnghia
Cleanup ORC chunked writer (#13091) @ttnghia
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Remove deprecated regex functions from libcudf (#13067) @davidwendt
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🐛 Bug Fixes

Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
Fix writing of ORC files with empty rowgroups (#13466) @vuule
Fix cudf::repeat logic when count is zero (#13459) @davidwendt
Fix batch processing for parquet writer (#13438) @ttnghia
Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
Use <NA> instead of null to match pandas. (#13415) @bdice
Fix tokenize with non-space delimiter (#13403) @shwina
Fix groupby head/tail for empty dataframe (#13398) @shwina
Default to closed="right" in IntervalIndex constructor (#13394) @shwina
Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
Fix unused argument errors in nvcc 11.5 (#13387) @abellina
Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
Fix page size estimation in Parquet writer (#13364) @etseidl
Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
Support gcc 12 as the C++ compiler (#13316) @robertmaynard
Correctly set bitmask size in from_column_view (#13315) @wence-
Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
Fix parquet schema interpretation issue (#13277) @hyperbolic2346
Fix 64bit shift bug in avro reader (#13276) @karthikeyann
Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
Clean up buffers in case of AssertionError (#13262) @razajafri
Allow empty input table in ast compute_column (#13245) @wence-
Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
Fix the row index stream order in ORC reader (#13242) @vuule
Make is_decompression_disabled and is_compression_disabled thread-safe (#13240) @vuule
Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
Fix race in ORC string dictionary creation (#13214) @revans2
Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
Fix hostdevice_vector::subspan (#13187) @ttnghia
Use custom nvbench entry point to ensure cudf::nvbench_base_fixture usage (#13183) @robertmaynard
Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
Fix a few clang-format style check errors (#13146) @davidwendt
[REVIEW] Fix Series and DataFrame constructors to validate index lengths (#13122) @galipremsagar
Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
Adds checks to make sure json reader won't overflow (#13115) @elstehle
Fix null_count of columns returned by chunked_parquet_reader (#13111) @vuule
Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
[REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
Raise NotImplementedError when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
Fix column selection read_parquet benchmarks (#13082) @vuule
Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
Add algorithm include in data_sink.hpp (#13068) @ahendriksen
Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
Prevent overflow with skip_rows in ORC and Parquet readers (#13063) @vuule
Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
[REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
Fix read_avro() skip_rows and num_rows. (#12912) @tpn
Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina

🚀 New Features

Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
cuDF numba cuda 12 updates (#13337) @brandon-b-miller
Add tz_convert method to convert between timestamps (#13328) @shwina
Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
Support the case=False argument to str.contains (#13290) @shwina
Add an event handler for ColumnVector.close (#13279) @abellina
JNI api for cudf::chunked_pack (#13278) @abellina
Implement a chunked_pack API (#13260) @abellina
Update cudf recipes to use GTest version >=1.13 (#13207) @robertmaynard
JNI changes for range-extents in window functions. (#13199) @mythrocks
Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
Add IS_NULL operator to AST (#13145) @karthikeyann
STRING order-by column for RANGE window functions (#13143) @mythrocks
Update contains_table to experimental row hasher and equality comparator (#13119) @divyegala
Automatically select GroupBy.apply algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
Refactor Parquet chunked writer (#13076) @ttnghia
Add Python bindings for string literal support in AST (#13073) @karthikeyann
Add Java bindings for string literal support in AST (#13072) @karthikeyann
Add string scalar support in AST (#13061) @karthikeyann
Log cuIO warnings using the libcudf logger (#13043) @vuule
Update mixed_join to use experimental row hasher and comparator (#13028) @divyegala
Support structs of lists in row lexicographic comparator (#13005) @ttnghia
Adding hostdevice_span that is a span creatable from hostdevice_vector (#12981) @hyperbolic2346
Add nvtext::minhash function (#12961) @davidwendt
Support lists of structs in row lexicographic comparator (#12953) @ttnghia
Update join to use experimental row hasher and comparator (#12787) @divyegala
Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller

🛠️ Improvements

Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
Handle some corner-cases in indexing with boolean masks (#13402) @wence-
Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
[JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
Fix JNI method with mismatched parameter list (#13384) @ttnghia
Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
Move some nvtext benchmarks to nvbench (#13368) @davidwendt
Run docs nightly too (#13366) @AyodeAwe
Add warning for default dtype parameter in get_dummies (#13365) @galipremsagar
Add log messages about kvikIO compatibility mode (#13363) @vuule
Switch back to using primary shared-action-workflows branch (#13362) @vyasr
Deprecate StringIndex and use Index instead (#13361) @galipremsagar
Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
Expunge most uses of TypeVar(bound="Foo") (#13346) @wence-
Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
Improve distinct_count with cuco::static_set (#13343) @PointKernel
Fix contiguous_split performance (#13342) @ttnghia
Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
Update mypy to 1.3 (#13340) @wence-
[Java] Purge non-empty nulls when setting validity (#13335) @razajafri
Add row-wise filtering step to read_parquet (#13334) @rjzamora
Performance improvement for nvtext::minhash (#13333) @davidwendt
Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
Move meta calculation in dask_cudf.read_parquet (#13327) @rjzamora
Changes to support Numpy >= 1.24 (#13325) @shwina
Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
Clean up distinct_count benchmark (#13321) @PointKernel
Fix gtest pinning to 1.13.0. (#13319) @bdice
Remove null mask and null count from column_view constructors (#13311) @vyasr
Address feedback from 13289 (#13306) @vyasr
Change default value of the observed= argument in groupby to True to reflect the actual behaviour (#13296) @shwina
First check for BaseDtype when inferring the data type of an arbitrary object (#13295) @shwina
Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
Support CUDA 12.0 for pip wheels (#13289) @divyegala
Refactor transform_lists_of_structs in row_operators.cu (#13288) @ttnghia
Branch 23.06 merge 23.04 (#13286) @vyasr
Update cupy dependency (#13284) @vyasr
Performance improvement in cudf::strings::join_strings for long strings (#13283) @davidwendt
Fix unused variables and functions (#13275) @karthikeyann
Fix integer overflow in partition scatter_map construction (#13272) @wence-
Numba 0.57 compatibility fixes (#13271) @gmarkall
Performance improvement in cudf::strings::all_characters_of_type (#13259) @davidwendt
Remove default null-count parameter from some libcudf factory functions (#13258) @davidwendt
Roll our own generate_string() because mimesis' has gone away (#13257) @shwina
Build wheels using new single image workflow (#13249) @vyasr
Enable sccache hits from local builds (#13248) @AyodeAwe
Revert to branch-23.06 for shared-action-workflows (#13247) @shwina
Introduce pandas_compatible option in cudf (#13241) @galipremsagar
Add metadata_builder helper class (#13232) @abellina
Use libkvikio conda packages in libcudf, add explicit libcufile dependency. (#13231) @bdice
Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
Performance improvement in cudf::strings::find/rfind for long strings (#13226) @davidwendt
Add chunked reader benchmark (#13223) @SrikarVanavasam
Set the null count in output columns in the CSV reader (#13221) @vuule
Skip Non-Empty nulls tests for the nightly build just like we skip CuFileTest and CudaFatalTest (#13213) @razajafri
Fix string_scalar stream usage in write_json.cu (#13212) @davidwendt
Use canonicalized name for dlopen'd libraries (libcufile) (#13210) @shwina
Refactor pinned memory vector and ORC+Parquet writers (#13206) @ttnghia
Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
Optimization to decoding of parquet level streams (#13203) @nvdbaranec
Clean up and simplify gpuDecideCompression (#13202) @vuule
Use std::array for a statically sized vector in create_serialized_trie (#13201) @vuule
Update minimum Python version to Python 3.9 (#13196) @shwina
Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
Remove usage of rapids-get-rapids-version-from-git (#13184) @jjacobelli
Enable mixed-dtype decimal/scalar binary operations (#13171) @shwina
Split up unique_count.cu to improve build time (#13169) @davidwendt
Use nvtx3 includes in string examples. (#13165) @bdice
Change some .cu gtest files to .cpp (#13155) @davidwendt
Remove wheel pytest verbosity (#13151) @sevagh
Fix libcudf to always pass null-count to set_null_mask (#13149) @davidwendt
Fix gtests to always pass null-count to set_null_mask calls (#13148) @davidwendt
Optimize JSON writer (#13144) @karthikeyann
Performance improvement for libcudf upper/lower conversion for long strings (#13142) @davidwendt
[REVIEW] Deprecate pad and backfill methods (#13140) @galipremsagar
Use CTAD instead of functions in ProtobufReader (#13135) @vuule
Remove more instances of UNKNOWN_NULL_COUNT (#13134) @vyasr
Update clang-format to 16.0.1. (#13133) @bdice
Add log messages about cuIO's nvCOMP and cuFile use (#13132) @vuule
Branch 23.06 merge 23.04 (#13131) @vyasr
Compute null-count in cudf::detail::slice (#13124) @davidwendt
Use ARC V2 self-hosted runners for GPU jobs (#13123) @jjacobelli
Set null-count in linked_column_view conversion operator (#13121) @davidwendt
Adding ifdefs around nvcc-specific pragmas (#13110) @hyperbolic2346
Add null-count parameter to json experimental parse_data utility (#13107) @davidwendt
Remove uses-setup-env-vars (#13105) @vyasr
Explicitly compute null count in concatenate APIs (#13104) @vyasr
Replace unnecessary uses of UNKNOWN_NULL_COUNT (#13102) @vyasr
Performance improvement for cudf::string_view::find functions (#13100) @davidwendt
Use .element() instead of .data() for window range calculations (#13095) @mythrocks
Cleanup Parquet chunked writer (#13094) @ttnghia
Fix unused variable error/warning in page_data.cu (#13093) @davidwendt
Cleanup ORC chunked writer (#13091) @ttnghia
Remove using namespace cudf; from libcudf gtests source (#13089) @davidwendt
Change cudf::test::make_null_mask to also return null-count (#13081) @davidwendt
Resolved automerger from branch-23.04 to branch-23.06 (#13080) @galipremsagar
Assert for non-empty nulls (#13071) @razajafri
Remove deprecated regex functions from libcudf (#13067) @davidwendt
Refactor cudf::detail::sorted_order (#13062) @ttnghia
Improve performance of slice_strings for long strings (#13057) @davidwendt
Reduce shared memory usage in gpuComputePageSizes by 50% (#13047) @nvdbaranec
[REVIEW] Add notes to performance comparisons notebook (#13044) @galipremsagar
Enable binary operations between scalars and columns of differing decimal types (#13034) @shwina
Remove console output from some libcudf gtests (#13027) @davidwendt
Remove underscore in build string. (#13025) @bdice
Bump up JNI version 23.06.0-SNAPSHOT (#13021) @pxLi
Fix auto merger from branch-23.04 to branch-23.06 (#13009) @galipremsagar
Reduce peak memory use when writing compressed ORC files. (#12963) @vuule
Add nvtx annotations to groupby methods (#12941) @wence-
Compute column sizes in Parquet preprocess with single kernel (#12931) @SrikarVanavasam
Add Python bindings for time zone data (TZiF) reader (#12826) @shwina
Optimize set-like operations (#12769) @ttnghia
[REVIEW] Upgrade to arrow-11 (#12757) @galipremsagar
Add empty test files for test reorganization (#12288) @shwina

v23.08.00a

11 months ago

🚨 Breaking Changes

Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Expose streams in all public copying APIs (#13629) @vyasr
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

🐛 Bug Fixes

Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
Fix character counting when writing sliced tables into ORC (#13721) @vuule
Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
Fix a corner case of list lexicographic comparator (#13701) @ttnghia
Fix combined filtering and column projection in dask_cudf.read_parquet (#13697) @rjzamora
Revert fetch-rapids changes (#13696) @vyasr
Data generator - include offsets in the size estimate of list elements (#13688) @vuule
Add cuda-nvcc-impl to cudf for numba CUDA 12 (#13673) @jakirkham
Fix combined filtering and column projection in read_parquet (#13666) @rjzamora
Use thrust::identity as hash functions for byte pair encoding (#13665) @PointKernel
Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
[REVIEW] Introduce parity with pandas for MultiIndex.loc ordering & fix a bug in Groupby with as_index (#13657) @galipremsagar
Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
Fix has_nonempty_nulls ignoring column offset (#13647) @ttnghia
[Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
Refactor Index search to simplify code and increase correctness (#13625) @wence-
Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
Fix tz_localize for dask_cudf Series (#13610) @shwina
Fix issue with no decompressed data in ORC reader (#13609) @vuule
Fix floating point window range extents. (#13606) @mythrocks
Fix localize(None) for timezone-naive columns (#13603) @shwina
Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
Bring parity with pandas in Index.join (#13589) @galipremsagar
Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
Fix Parquet multi-file reading (#13584) @etseidl
Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
Fix an issue with dask_cudf.read_csv when lines need to be skipped (#13555) @galipremsagar
Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
Fix the null mask size in json reader (#13537) @karthikeyann
Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
Make sure to build without isolation or installing dependencies (#13524) @vyasr
Remove preload lib from CMake for now (#13519) @vyasr
Fix missing separator after null values in JSON writer (#13503) @karthikeyann
Ensure single_lane_block_sum_reduce is safe to call in a loop (#13488) @wence-
Update all versions in pyproject.toml files. (#13486) @bdice
Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
Fix chunked Parquet reader benchmark (#13482) @vuule
Update JNI JSON reader column compatibility for Spark (#13477) @revans2
Fix unsanitized output of scan with strings (#13455) @davidwendt
Reject functions without bytecode from _can_be_jitted in GroupBy Apply (#13429) @brandon-b-miller
Fix decimal scale reductions in _get_decimal_type (#13224) @charlesbluca

📖 Documentation

Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
Add pylibcudf to developer guide (#13639) @vyasr
Fix repeated words in doxygen text (#13598) @karthikeyann
Update docs for top-level API. (#13592) @bdice
Fix the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
Document stream validation approach used in testing (#13556) @vyasr
Cleanup doc repetitions in libcudf (#13470) @karthikeyann

🚀 New Features

Support min and max aggregations for list type in groupby and reduction (#13676) @ttnghia
Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
Expose streams in all public copying APIs (#13629) @vyasr
Add XXHash_64 hash function to cudf (#13612) @davidwendt
Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
Use cuco::static_map to build string dictionaries in ORC writer (#13580) @vuule
Add pylibcudf subpackage with gather implementation (#13562) @vyasr
Add JNI for lists::concatenate_list_elements (#13547) @ttnghia
Enable nested types for lists::concatenate_list_elements (#13545) @ttnghia
Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
Remove numba kernels from find_index_of_val (#13517) @brandon-b-miller
Floating point order-by columns for RANGE window functions (#13512) @mythrocks
Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
Add abs function to apply (#13408) @brandon-b-miller
[FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
Update struct_minmax_util to experimental row comparator (#13069) @divyegala
Add stream parameter to hashing APIs (#12090) @vyasr

🛠️ Improvements

Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
Adding identify minimum version requirement (#13713) @hyperbolic2346
Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
Fix limit overflow message in a docstring (#13703) @ahmet-uyar
Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
Update cython-lint and replace flake8 with ruff (#13699) @vyasr
Add __dask_tokenize__ definitions to cudf classes (#13695) @rjzamora
Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
Add nvtext hash_character_ngrams function (#13654) @davidwendt
Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
Acquire spill lock in to/from_arrow (#13646) @shwina
Expose stable versions of libcudf sort routines (#13634) @wence-
Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
Add convert_dtypes API (#13623) @shwina
Clean up cupy in dependencies.yaml. (#13617) @bdice
Use cuda-version to constrain cudatoolkit. (#13615) @bdice
Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
Performance improvement for cudf::strings::like (#13594) @davidwendt
Remove deprecated cudf.set_allocator. (#13591) @bdice
Clean up cudf device atomic with cuda::atomic_ref (#13583) @PointKernel
Add java bindings for distinct count (#13573) @revans2
Use nvcomp conda package. (#13566) @bdice
Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion (#13558) @rjzamora
Get rid of cuco::pair_type aliases (#13553) @PointKernel
Introduce parity with pandas when sort=False in Groupby (#13551) @galipremsagar
Update CMake in docker to 3.26.4 (#13550) @NvTimLiu
Clarify source of error message in stream testing. (#13541) @bdice
Deprecate strings_to_categorical in cudf.read_parquet (#13540) @galipremsagar
Update to CMake 3.26.4 (#13538) @vyasr
S3 folder naming fix (#13536) @AyodeAwe
Implement iloc-getitem using parse-don't-validate approach (#13534) @wence-
Make synchronization explicit in the names of hostdevice_* copying APIs (#13530) @ttnghia
Add benchmark (Google Benchmark) dependency to conda packages. (#13528) @bdice
Add libcufile to dependencies.yaml. (#13523) @bdice
Fix some memoization logic in groupby/sort/sort_helper.cu (#13521) @davidwendt
Use sizes_to_offsets_iterator in cudf::gather for strings (#13520) @davidwendt
Use rapids-upload-docs script (#13518) @AyodeAwe
Support UTF-8 BOM in CSV reader (#13516) @davidwendt
Move stream-related test configuration to CMake (#13513) @vyasr
Implement cudf.option_context (#13511) @galipremsagar
Unpin dask and distributed for development (#13508) @galipremsagar
Change build.sh to use pip install instead of setup.py (#13507) @vyasr
Use test default stream (#13506) @vyasr
Remove documentation build scripts for Jenkins (#13495) @ajschmidt8
Use east const in include files (#13494) @karthikeyann
Use east const in src files (#13493) @karthikeyann
Use east const in tests files (#13492) @karthikeyann
Use east const in benchmarks files (#13491) @karthikeyann
Performance improvement for nvtext tokenize/token functions (#13480) @davidwendt
Add pd.Float*Dtype to Avro and ORC mappings (#13475) @mroeschke
Use pandas public APIs where available (#13467) @mroeschke
Allow pd.ArrowDtype in cudf.from_pandas (#13465) @mroeschke
Rework libcudf regex benchmarks with nvbench (#13464) @davidwendt
Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
Separate io-text and nvtext pytests into different files (#13435) @davidwendt
Add a move_to function to cudf::string_view::const_iterator (#13428) @davidwendt
Allow newer scikit-build (#13424) @vyasr
Refactor sort_by_values to sort_values, drop indices from return values. (#13419) @bdice
Inline Cython exception handler (#13411) @vyasr
Init JNI version 23.08.0-SNAPSHOT (#13401) @pxLi
Refactor ORC reader (#13396) @ttnghia
JNI: Remove cleaned objects in memory cleaner (#13378) @res-life
Add tests of currently unsupported indexing (#13338) @wence-
Performance improvement for some libcudf regex functions for long strings (#13322) @davidwendt
Exposure Tracked Buffer (first step towards unifying copy-on-write and spilling) (#13307) @madsbk
Write string data directly to column_buffer in Parquet reader (#13302) @etseidl
Add stacktrace into cudf exception types (#13298) @ttnghia
cuDF: Build CUDA 12 packages (#12922) @bdice