Catboost Versions

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

v1.2.3

1 month ago

Python package

  • Support Python 3.12. #2510
  • [Performance]: Fix inefficient loops in Cython. Significant speedups (up to 3x) on dataset construction from data in C-order can be expected.
  • [Performance]: Make features data initialization from C-order numpy.ndarrays with float32 data type multithreaded. Significant speedups of 5x up to 10x (on CPUs with many cores) can be expected. #385, #2542
  • Save training metrics into the model metadata. So best_score_, evals_result_, best_iteration_ model attributes now work after model saving and loading. Can be removed by model metadata manipulation if needed. #1166
  • [Breaking change]: Support a separate boolean target type. Class predictions for models trained with boolean targets are now boolean as well, instead of the True, False strings returned before. Such models are incompatible with previous versions of CatBoost appliers. To keep the old behavior, convert your target to False, True strings before training. #1954
  • Restrict jupyterlab version for setup to 3.x for now. Fixes #2530
  • utils.read_cd: Support CD files with non-increasing column indices.
  • Make log_cout, log_cerr specification consistent, avoid reset in recursive calls.
  • Late-initialize default values for log_cout, log_cerr. #2195
  • Add missing generated metrics: Cox, PairLogitPairwise, UserPerObjMetric, SurvivalAft.
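
The breaking change for boolean targets above suggests converting the target to strings to keep the old behavior. A minimal sketch of that conversion in plain Python follows; the helper name `bool_targets_to_strings` is hypothetical and not part of CatBoost.

```python
def bool_targets_to_strings(targets):
    """Map boolean labels to the strings 'True' / 'False'.

    Hypothetical helper (not a CatBoost API). bool() is applied first so that
    both Python bools and numpy.bool_ values are handled the same way.
    """
    return ["True" if bool(value) else "False" for value in targets]


y = [True, False, True]
y_str = bool_targets_to_strings(y)
print(y_str)
```

The resulting list can then be used as the label when building the training data, so that the model is trained on string classes rather than on booleans.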

New features

  • Support boolean target/labels type during training in Python and Spark (in the latter case only when using fit with Pool arguments) and Class prediction in Python. #1954
  • [Spark]: Support Spark 3.5.x.
  • [C/C++ applier]. Add functions for getting indices of features of different types to C and C++ API. #2568. Thanks to @nimusp.
  • [C/C++ applier]. Add staged prediction functions to C API. #2584. Thanks to @Mb-NextTime.
  • [JVM applier]. Add loading CatBoostModel from a byte array to API. #2539
  • [Linux] Support CgroupsV2 when computing default number of threads used in parallel computations. #2519. Thanks to @elukey.
  • [CLI] Support printing Auxiliary columns by name in evaluation result output. #1659
  • Save training metrics into the model metadata. Can be removed by model metadata manipulation if needed. #1166
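
The CgroupsV2 item above matters in containers whose CPU quota is lower than the number of host cores. The following is a rough sketch of how such a limit can be derived from the cgroup v2 `cpu.max` file. It is not CatBoost's implementation; the function names, the file path default and the rounding choice are assumptions made for illustration.

```python
import math
import os


def cpu_limit_from_cpu_max(cpu_max_text, host_cpus):
    """Derive a thread count from the contents of a cgroup v2 `cpu.max` file.

    The file holds "<quota> <period>" in microseconds, or "max <period>" when
    no quota is set. Rounding up and clamping to [1, host_cpus] is an
    assumption of this sketch, not necessarily what CatBoost does.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return host_cpus
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(host_cpus, limit))


def default_thread_count(path="/sys/fs/cgroup/cpu.max"):
    """Fall back to the host CPU count when the cgroup file is unavailable."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(path) as f:
            return cpu_limit_from_cpu_max(f.read(), host_cpus)
    except (OSError, ValueError):
        return host_cpus


print(cpu_limit_from_cpu_max("200000 100000", host_cpus=16))  # quota worth 2 CPUs
print(cpu_limit_from_cpu_max("max 100000", host_cpus=16))     # no quota set
```

On CatBoost versions without this support, a value computed along these lines could be passed explicitly through the thread_count training parameter.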

Build & testing

  • [Windows]: Use clang-cl compiler and tools from Visual Studio 2022 for the build without CUDA (build with CUDA still uses standard Microsoft toolchain from Visual Studio 2019).
  • [macOS]: Pass os.version to conan host settings to ensure version consistency.
  • [Linux aarch64]: Set -mno-outline-atomics for modern versions of CLang and GCC to avoid unresolved symbols linking errors. #2527
  • Added missing CMakeLists for unit tests for util. #2525

Bugfixes

  • [Performance]: Fix performance regression that could slow down training on GPU by 50% on some datasets that had been introduced in release 1.2. Thanks to @JeanPaulShapo.
  • [Python-package]: Fix segfault on Pool(data=None). #2522
  • [Python-package]: Fix Python exception in Pool() when pairs_weight is a numpy array. #1913
  • [Python-package]: Fix segfault and other strange errors when specifying custom logger with __call__ method. #2277
  • [Python-package]: Fix returning complex params in hyperparameter search. #1741, #1833
  • [Python-package]: Fix ignored exceptions for missed metrics descriptions on startup. This has not been visible to users but has been making debugging more difficult.
  • [Python-package]: Fix misleading "Targets are required for YetiRank loss function." error in Cross validation. #2083
  • [Python-package]: Fix Pool.get_label() returning constant True for boolean labels. #2133
  • [Python-package]: Copying models does not lose best_score_, evals_result_, best_iteration_ attributes values anymore. #1793
  • [Spark]: Fix hangs at the end of the training. #2151
  • Precision metric default value in the absence of positive samples is changed to 0 and a warning is added (similar to the behavior of scikit-learn implementation). #2422
  • Fix ignoring embedding features
  • Try to avoid hash collisions when computing group ids for datasets with a lot of groups (may occur in datasets with around 10^9 samples).
  • Fix Multiclass models export to C++ and Python code. #2549
  • Fix dataset_statistics mode when no Target data is available.
  • Fix "Error: can't proceed some features" error on GPU. #1024
  • Fix allow_const_label=True for classification. #1933
  • Add checking of approx and target dimensions for SurvivalAft objective/metric.
  • Fix Focal loss derivatives sign. #2563
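
The Precision default mentioned in the list above can be illustrated in plain Python. This sketch assumes that the "absence of positive samples" case means the denominator TP + FP is zero, which is the situation scikit-learn warns about. The function is illustrative only and is not CatBoost code.

```python
import warnings


def precision(y_true, y_pred):
    """Binary precision TP / (TP + FP) with a zero default.

    Assumption of this sketch: when nothing is predicted positive, return 0.0
    and emit a warning instead of raising or returning another default.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        warnings.warn("Precision is ill-defined with no positive predictions; using 0.0")
        return 0.0
    return tp / (tp + fp)


# One true positive and one false positive.
print(precision([1, 0, 1, 0], [1, 1, 0, 0]))

# No positive predictions: the warning is silenced here only to keep output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(precision([1, 0], [0, 0]))
```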

v1.2.2

6 months ago

Bugfixes

  • Fix Segmentation fault when using custom eval_metric in binary python packages of version 1.2.1 on PyPI. #2486
  • Fix LossFunctionChange fstr with embedding features.
  • Fix a segmentation fault in JVM applier when using embedding features on JVM 11+.
  • Fix CTR data handling in model summation (especially for models with CTRs with multiple target quantizations).

v1.2.1

7 months ago

New features

  • Allow optimizing specific ranking loss functions with YetiRank and YetiRankPairwise by specifying the mode parameter. See the "Which Tricks are Important for Learning to Rank?" paper for details (this family of losses is called YetiLoss there). CPU-only for now.
  • Add Kernel Gradient Boosting support (use the catboost.sample_gaussian_process function). #2408, thanks to @TakeOver. See the "Gradient Boosting Performs Gaussian Process Inference" paper for details.
  • LambdaMart loss: support new target metrics MRR, ERR and MAP.
  • StochasticRank loss: support new target metrics ERR and MRR.
  • Support MultiRMSE on GPU. #2264, #2390
  • Load JSON model format in Java Client. #1627, thanks to @timotta
  • Implement exporting of Multiclass models to C++ and Python. #2283, thanks to @antoninkriz

Improvements

  • Speedup BM25 feature calcers 3x
  • Use int instead of deprecated numpy.int. #2378
  • Add ModelCalcerWrapper::CalcFlatTransposed, #2413 thanks to @faucct
  • Update dependencies to avoid known vulnerabilities

Bugfixes

  • Fix __shfl_up_sync mask. #2339
  • TFocalMetric negative values fix. #2386, thanks to @diditforlulz273
  • Focal loss: Use user-defined alpha and gamma
  • Fix exception propagation: Rethrow exceptions caused by user's python code as C++ exceptions
  • Fix incompatibility of models trained with a user-defined objective with ShapValues calculation
  • Avoid nan's in Newton step calculation for RMSEWithUncertainty
  • Fix score method for y with shape (N, 1). #2405
  • Fix scalePosWeight support for Spark. #2470

v1.2

10 months ago

Release 1.2

Major changes

CatBoost's build system has been switched from Ya Make (Yandex's build system) to CMake. This means more transparency in the build process and more familiar tools for Open Source developers. For now it is possible to build CatBoost for:

  • Linux on x86-64 with or without CUDA
  • Linux on aarch64 with or without CUDA
  • macOS on x86-64 and arm64, including creating universal binaries
  • Windows on x86-64 with or without CUDA
  • Android (only model applier) on all supported ABIs.

This allowed us to prepare the Python package in the source distribution form (also known as sdist). #830

  • msvs subdirectory with the Microsoft Visual Studio solution has been removed. Visual Studio solutions can be generated using CMake instead.
  • make subdirectory with Makefiles has been removed. Use CMake + ninja (recommended) or CMake + make instead.

Python package

  • Switch to the standard Python build and installation method that uses setup.py instead of the custom mk_wheel.py script. All common scenarios (sdist, build, install, editable install, bdist_wheel) are supported.
  • Switch wheel platform tag on Linux from obsolete manylinux1 to manylinux2014.
  • The source distribution is now available on PyPI. #830
  • Wheels for Linux aarch64 are now available on PyPI. #2091
  • Support Python 3.11. #2213
  • Drop support for obsolete Python 3.6.
  • Make wheels PEP427-compliant. #2165
  • Fix wrong checksums in wheels that caused problems with poetry. #2331
  • Improved performance due to caching TBB local executors. #2203
  • Add fixed_binary_splits to the regressor, classifier, and ranker.
  • Compatibility with pandas 2.0. #2320
  • CatBoost widget is now compatible with ipywidgets 8.x. #2266

Rust package

  • Support CUDA applier. #1925, thanks to @getumen.
  • Properly forward debug/release setting to native library build.
  • Passing features: switch from String and Vec types for features to AsRef of slices to make code more generic
  • Support text and embedding features.
  • Support multidimensional output in predictions.

New features

  • [JVM applier]: Support CUDA.
  • [Spark]: Support Spark 3.4.x (if you want to use Spark with python 3.11 use this version).
  • Static model applier library now works on Windows.
  • Add binary-classification-threshold parameter to the CLI model applier.
  • Support Multi-target regression with text features (but only Bag-of-Words features are generated for now). #2229
  • Support RMSEWithUncertainty loss function on GPU.
  • Support MultiLogloss and MultiCrossEntropy loss functions with numerical features on GPU.
  • Support MultiLogloss loss function with text features on CPU and GPU. #1885
  • Enable univariate metrics for models with uncertainty
  • Add Focal loss (CPU-only for now). #1807, thanks to @diditforlulz273.
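
For reference, the Focal loss mentioned above down-weights well-classified examples relative to plain log loss. The sketch below follows the common formulation FL = -alpha_t * (1 - p_t)^gamma * log(p_t). How CatBoost parameterizes alpha and gamma internally is not stated in these notes, so the function and its default values are assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary example.

    p: predicted probability of the positive class, 0 < p < 1.
    y: true label, 0 or 1.
    The alpha and gamma defaults are illustrative, not CatBoost's.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is penalized far less than an uncertain one.
print(focal_loss(0.9, 1))
print(focal_loss(0.6, 1))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted log loss.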

Improvements

  • Removed legacy dependency on Python 2 interpreter in the build process. #2297
  • Calc metrics: Throw catboost exception if column index exceeds column count.
  • Speedup MultiLogloss on CPU by 8% per tree (110K samples, 20 targets, 480 float features, 3 cat features, 16 cores CPU).
  • Update .NET projects from obsolete .NET Core 2.1 to .NET Core 3.1.
  • Code generation for new CUDA Compute Architectures 8.6, 8.9 and 9.0 is enabled by default (requires CUDA 11.8 to build from source).
  • Check that evaluator implementation is available in TFullModel::SetEvaluatorType (it was possible to get a Segmentation fault when calling it for a non-available implementation). Add TFullModel::GetSupportedEvaluatorTypes.
  • Cross Validation on GPU no longer requires allow_write_files=True.

Bugfixes

  • [Python-package]: Clear model params before load_model. Fixes #2225.
  • [Python-package]: Fix CatBoostRanker score computation. #2231
  • [Python-package]: Fix _get_embedding_feature_indices. #2273
  • [Python-package]: Fix set_feature_names with text or embedding features. #2090
  • [Python-package]: pandas.Categorical.categories is not necessarily a numpy.ndarray. #1965
  • [Spark]: Pass classpath in a file to avoid hitting cmdline length limits. #1842
  • [CUDA Applier]: Apply scale and bias.
  • [CUDA Applier]: Fix that libs/model_interface applier always produced an error in CUDA mode.
  • Fix CUDA error 700 in pairwise ranking.
  • Fix kernel registration for distributed training on GPU.
  • Fix 'floating point exception' on CPU for small datasets on GPU.
  • Fix wrong log message 'There are invalid params and some of them will be ignored'. #2253
  • Fix incorrect results and crashes for GPU applier on Nvidia Ampere-based GPUs.
  • Fix 'CUDA error 9' in Multi-GPU training.
  • Fix serialization of embedding features structures in the model.
  • Fix GPU buffer overrun in distributed multi-classification training.
  • Fix catboost/cuda/cuda_util/sort.cpp:166: CUDA error 9 on Nvidia Ampere-based GPUs.
  • Fix inf/nan parsing in dataset input files.
  • Fix floating point exception for very small datasets on GPU.
  • Fix: built static applier library lacked the part with 'global' objects. #2187
  • Fix sum of models with categorical features with CTRs.
  • Fix: model_interface/cmake_example failed build "‘runtime_error’ is not a member of ‘std’". #2324, thanks to @Mandelag.
  • Fix Segmentation fault in Cross Validation and hyperparameter search functions that use it on GPU.
  • Fix Segmentation fault in utils.eval_metrics for groupwise metrics when group data has not been specified. #2343
  • Fix errors when running Cross Validation repeatedly on GPU. #2221

P.S. There's an issue with somewhat unexpected binary size increases. We're investigating it in #2369

v1.1.1

1 year ago

Release 1.1.1

New features

  • Support building for Linux on aarch64 from sources using CMake (no prebuilt binaries or PyPI packages yet). #1981
  • [C/C++ applier] Support embedding features. #2172
  • [C/C++ applier] Add GetModelUsedFeaturesNames. #2204
  • [Python] Add text features to utils.create_cd. #2193
  • [Spark] Full support for Apache Spark 3.3
  • [Spark] Read/write PySpark's DataFrame-like API for Pool. #2030
  • [Spark] Allow specifying trainingDriver and worker listening ports. #2181

Bugfixes

  • Fix prediction dimension check for RMSEWithUncertainty and MultiQuantile. #2155
  • [C/C++ applier] Fix segmentation fault in prediction for multiple objects for multiple dimension models.
  • [JVM applier] Fix catboost-common dependency version in catboost-prediction (Fixes JVM applier on macOS). #2121
  • [Python] Update for pandas 1.5.0: iteritems -> items (Fixes annoying deprecation warning). #2179
  • [Python] Fix segmentation fault when target is np.ndarray with dtype=object. #2201
  • [Python] Fix specifying feature_names in utils.create_cd. #2211

v1.1

1 year ago

Release 1.1

New features

  • Multiquantile regression

    Models can now be trained with a shared tree structure and multiple predicted quantile values in each leaf. This approach does not yet strictly guarantee consistency of the predicted quantile values, but it is more consistent than training an independent model for each quantile. A short description is available in the documentation. Short Python example: loss_function='MultiQuantile:alpha=0.2,0.4'. Supported only on CPU for now.

  • Support text and embedding features for regression and ranking.

  • Spark: Read/write Spark's Dataset-like API for Pool. #2030

  • Support HashedCateg column type. This allows using externally prehashed categorical features both in training and prediction.

  • New option plot_file in Python functions with plot parameter allows saving plots to a file. #758

  • Add eval_fraction parameter. #1500

  • Non-symmetric trees model summation.

  • init_model parameter now works with non-symmetric trees.

  • Partial support for Apache Spark 3.3 (only for Scala 2.12 and without PySpark).
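
The MultiQuantile loss string shown above takes a comma-separated list of quantile levels. Below is a small sketch for building that string, together with the standard pinball (quantile) loss that a single quantile level is usually trained with. Both helper names are hypothetical, and the pinball function is the textbook definition rather than CatBoost's internal code.

```python
def multiquantile_loss_name(alphas):
    """Build a loss_function string such as 'MultiQuantile:alpha=0.2,0.4'."""
    if not alphas or any(not 0.0 < a < 1.0 for a in alphas):
        raise ValueError("quantile levels must lie strictly between 0 and 1")
    return "MultiQuantile:alpha=" + ",".join(str(a) for a in alphas)


def pinball_loss(y_true, y_pred, alpha):
    """Quantile (pinball) loss for a single observation and quantile level."""
    diff = y_true - y_pred
    # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
    return alpha * diff if diff >= 0 else (alpha - 1.0) * diff


loss_name = multiquantile_loss_name([0.2, 0.4])
print(loss_name)
```

The resulting string is what would be passed as loss_function when constructing a regressor, on CPU only for now according to the note above.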

Speedups

  • 2x speedup DCG, nDCG and FilteredDCG metrics calculation for groups with >= 50 objects and with top=-1 (all objects from each group, default value)
  • Fixed 2x slowdown of PairLogit and other ranking losses on CPU introduced in release 0.23

Bugfixes

  • Fix for pandas integer array. #2096
  • Save feature names to json format. #2102
  • Fix feature weights on CPU
  • Use feature weights on GPU
  • Fix gradient calculation for QueryRMSE on GPU
  • Fix ranking metrics with group weights in calc_metrics
  • Fix JVM applier on data with text features. #2132

v1.0.6

1 year ago

Release 1.0.6

New features

  • Fixed splits for binary features on GPU for non-symmetric trees: specify the set of splits to start each tree in the model with --fixed-binary-splits or fixed_binary_splits in Python package (by default, there are no fixed splits)

Bugfixes

  • Fix warning about resetting logger when logging to sys.stdout & sys.stderr from different threads #1855
  • Fix model summation in CatBoost for Apache Spark
  • Fix performance and scalability of query auc for ranking (1m samples, query size 2, 8 cpu cores 0.55s -> 0.04s)
  • Fix support for text features and embeddings in Java applier #2043
  • Fix nan/inf split scores with yeti rank pairwise loss
  • Fix nan/inf feature strengths in pair logit on cpu

v1.0.5

1 year ago

Release 1.0.5

New features

  • Support Apple Darwin arm64 architecture. #1526.
  • Support feature tags in feature selection.
  • Support for Apache Spark 3.2.
  • Model sum in Apache Spark.

Python package

  • Accommodate multiple target-platform arguments used to build universal binaries.
  • Add grid creation function to utils.py
  • Custom multilabel eval metrics by @ELitvinova
  • Metrics plotter by @evgenabramov
  • Fbeta score by @ELitvinova

Bugfixes

  • Fix group weights in metrics calculation.
  • Fix fit for PySpark estimators. #1976.
  • Fix predict on GPU. #1901, #1923.
  • Disable exact leafs calculation for MAE, MAPE, Quantile on GPU.
  • Fix counter description for plotting. #1973.
  • Allow weights in BrierScore. #1967.
  • Disable AUC calculation for learn by default on GPU as well.
  • Fix plot_tree example in documentation.
  • Fix plots in cv.
  • Fix ui32 overflows in pairwise losses on GPU.
  • Fix for multiclass in nodejs evaluator. #1903.
  • Fix CatBoost R package installation on Monterey. #1912.
  • Fix CUDA error 700 caused by data race in mimalloc and CUDA driver.
  • Fix slow compilation with CUDA 11.2+.
  • Fix 2nd derivative in RMSEWithUncertainty.

v1.0.4

2 years ago

New features

  • Add sort param to FilteredDCG metric.
  • Add StochasticRank for FilteredDCG.

Python package

  • Add is_max/minimizable methods. #1915
  • Support custom metric in select_features #1920

R package

  • Register functions from libcatboostr natively in R, removing one of CRAN notes.

Bugfixes

  • Fix apply for models without main loss_function.
  • Fix text calcer options specification. #1916
  • Fix calc_feature_statistics
  • Fix Multi-approx support in CLI calc_metrics mode.
  • Fix processing for text options. #1930
  • Fix snapshot saving in feature selection.
  • Fix CatBoost models serialization inside pipeline models in PySpark. #1936

v1.0.3

2 years ago

CatBoost for Apache Spark

  • Fix incorrect Linux so files in deployed Maven artifacts for release 1.0.2 (no code changes)