CatBoost Versions

A fast, scalable, high-performance gradient boosting on decision trees library for Python, R, Java, and C++, used for ranking, classification, regression, and other machine learning tasks. Supports computation on CPU and GPU.

v1.0.3

2 years ago

CatBoost for Apache Spark

  • Fix incorrect Linux so files in deployed Maven artifacts for release 1.0.2 (no code changes)

v1.0.2

2 years ago

CatBoost for Apache Spark

  • PySpark: Fix python -> JVM datetime.timedelta conversion.
  • Fix: proper handling of constant categorical features. #1867
  • Fix SIGSEGV for multiclassification with CTRs. #1886

New features

  • Add is_min_optimal, is_max_optimal for BuiltinMetrics. #1890

R package

  • Use libcatboostr-darwin.dylib instead of libcatboostr-darwin.so on macOS. #1834

Bugfixes

  • Fix CatBoostError "(No such file or directory) bad new file name" when using grid_search. #1893

v1.0.1

2 years ago

:warning: PySpark support is broken in this release. Please use release 1.0.3 instead.

CatBoost for Apache Spark

  • More robust handling of CatBoost master and worker failures; avoids freezes.
  • Fix for empty partitions. #1687
  • Fix use-after-free (#1759) and other intermittent errors.
  • Support Spark 3.1.

Python package

  • Support Python 3.10. #1575

Breaking changes

  • Use group weight for generated pairs in pairwise losses

Bugfixes

  • Switch to mimalloc allocator on Linux and macOS to avoid problems with static TLS.
  • Fix SEGFAULTs on macOS. #1877
  • Fix distributed training: do not fail if a worker contains only learn or test data
  • Fix SEGFAULT on CPU with Depthwise training and rsm < 1.
  • Fix calc_feature_statistics for cat features. #1882
  • Fix result of cross-validation if metric_period has been specified
  • Fix eval_metric for multi-target training

v1.0.0

2 years ago

In this release, we decided to increment the major version, as we think that CatBoost is stable and production-ready. We know that CatBoost is used widely in many companies and individual projects, and we think that all the features we added over the last year are worth a major version bump. And of course, like many programmers, we love the magic of binary numbers, and we want to celebrate the 100₂ anniversary of CatBoost's first release on GitHub 🥳

New losses

  • We've implemented a multi-label multiclass loss function that allows predicting multiple labels for each object. #1420 See the sketch after this list.
  • Added LogCosh loss implementation #844
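
A minimal sketch of the new multi-label mode, assuming the loss is exposed as MultiLogloss and targets are passed as a binary indicator matrix (the data and shapes here are illustrative):

```python
# Multi-label training sketch: each object may carry several labels at once.
import numpy as np
from catboost import CatBoostClassifier

X = np.random.rand(100, 5)
Y = np.random.randint(0, 2, size=(100, 3))  # 3 labels, 0/1 indicators

model = CatBoostClassifier(loss_function='MultiLogloss',
                           iterations=50, verbose=False)
model.fit(X, Y)
preds = model.predict(X)  # one 0/1 prediction per label per object
```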

Fully distributed CatBoost for Apache Spark

  • In this release the Apache Spark package became truly distributed: previous versions stored test datasets in the controller process memory, while now test datasets are split evenly across the workers.

Major speedup on CPU

We've improved training speed on numeric datasets:

  • 28% speedup on the Higgs dataset (1000 trees, binclass) on a 16-core Intel CPU: 405 seconds -> 315 seconds
  • 20% speedup on a small numeric dataset (480K rows, 60 features, 100 trees, binclass) on a 16-core Intel CPU: 3.7 seconds -> 2.9 seconds
  • 53% speedup on the sparse one-hot encoded airlines dataset (1000 trees): training time 381 seconds -> 249 seconds

R package

  • Update C++ handles by reference to avoid redundant copies by @david-cortes
  • Do not calculate feature importance for groupwise metrics by default
  • R tests now clear the environment after runs so they won't pick up temporary data from previous runs
  • Fixed a failure with ignored features in R when a single feature was ignored
  • Fix the feature_count attribute with ignored_features

CV improvements

  • Added support for text features and embeddings in cross-validation mode
  • We've changed the way cross-validation works. Previously, CatBoost trained a small batch of trees on each fold and then switched to the next fold or the next batch of trees. In 1.0.0, CatBoost trains the full model on each fold. This reduces the memory and time overhead of starting a new batch: only one CPU-to-GPU memory copy is needed per fold rather than per batch of trees. The interactive mean-metric plot is now unavailable until training finishes on all folds. See the sketch after this list.
  • Important change: use_best_model and early stopping now work independently on each fold, as we try to make single-fold training as close to regular training as possible. If one model stops at iteration i, we use its last metric value in the mean score plot for points in [i+1; last iteration).
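
A minimal sketch of the new fold-wise behavior, assuming the standard catboost.cv interface; with early stopping, each fold's model may now stop at its own iteration:

```python
# Cross-validation sketch: each fold now trains a full model on its own.
import numpy as np
from catboost import Pool, cv

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)

results = cv(
    Pool(X, y),
    params={'loss_function': 'Logloss', 'iterations': 300},
    fold_count=5,
    early_stopping_rounds=30,  # applied independently on each fold
    verbose=False,
)
print(results[['iterations', 'test-Logloss-mean']].tail())
```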

GPU improvements

  • Fixed distributed training performance on Ethernet networks: ~2x training speedup. For 2 hosts, 8 V100s per host, 10-gigabit Ethernet, 300 factors, 150M samples, 200 trees: 3300s -> 1700s
  • Found and fixed a bug in the GPU model-size-reg implementation that led to worse quality of the resulting model, especially in comparison to a model trained on CPU with equal parameters

Rust

  • Enabled loading a model from a buffer in Rust, by @manavsah

Bugfixes

  • Fix for model predictions with text and embedding features
  • Switch to TBB local executor to limit TLS size and avoid memory leakage #1835
  • Switch to tcmalloc under Linux x86_64 to avoid memory fragmentation bug in LFAlloc
  • Fix for the case of an ignored text feature
  • Fixed application of baseline in C++ code: the baseline is now added before activation functions are applied and object labels are determined.
  • Fixes for scikit-learn compatibility validation #1783 and #1785
  • Fix for thread_count = -1 in set_params(). Issue #1800
  • Fix potential sigsegv in the model evaluator. Fixes #1809
  • Fix slow (u)int8 & (u)int16 parsing as cat features. Fixes #718
  • Adjust boost from average option before auto-learning rate
  • Fix embeddings with CrossEntropy mode #1654
  • Fix object importance #1820
  • Fix data provider without target #1827

v0.26.1

2 years ago

R package

  • Supported text features in R package, thanks to @glemhel!
  • Supported virtual ensembles in R, thanks to @glemhel!

New features

  • Thanks to @gmrandazzo for adding multi-target regression with missing values in targets: the MultiRMSEWithMissingValues loss function. See the sketch after this list.
  • Supported multiclass prediction in C++ wrapper for model inference C API
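
A minimal sketch of the new loss, under the assumption that missing target entries are marked with NaN (an assumption, not confirmed by these notes):

```python
# Multi-target regression where some target values are unobserved.
import numpy as np
from catboost import CatBoostRegressor

X = np.random.rand(200, 6)
Y = np.random.rand(200, 3)
Y[np.random.rand(*Y.shape) < 0.2] = np.nan  # assumed marker for missing targets

model = CatBoostRegressor(loss_function='MultiRMSEWithMissingValues',
                          iterations=50, verbose=False)
model.fit(X, Y)
```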

Bugfixes

  • Renamed keyword parameter in predict_proba function from X to data, fixes #1785
  • R feature importances: remove pool argument, fix #1438 and #1772
  • Fix CUDA training on Windows (multiple issues; main issue with details: #1735)
  • Issue #1728: don't dereference pointers when there are no features
  • Fixed empty tree processing in feature strength calculation
  • Fixed missing loss graph points in select_features, #1775
  • Sort csr matrix indices, fixes #1749
  • Fix error "active CatBoost worker is already present in the current process" after previous training interruption or failure. #1795.
  • Fixed erroneous warnings from model validation after training with a custom loss or custom error function. Fixes #873, #1169

v0.26

2 years ago

New features

  • #972. Add model evaluation on GPU. Thanks to @rakalexandra.
  • Support Langevin on GPU
  • Save class labels to models in cross validation
  • #1524. Return models after CV. Thanks to @vklyukin
  • [Python] #766. Add CatBoostRanker & pool.get_group_id_hash() for ranking (see the sketch after this list). Thanks to @AnnaAraslanova
  • #262. Make the CatBoost widget work in JupyterLab. Thanks to @Dm17r1y
  • [GPU only] Allow adding an exponent to the score aggregation function
  • Allow specifying a threshold parameter for binary classification models. Thanks to @Keksozavr.
  • [C Model API] #503. Allow specifying the prediction type.
  • [C Model API] #1201. Get predictions for a specific class.
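
A minimal ranking sketch using the new CatBoostRanker; the synthetic data and the YetiRank loss choice are illustrative:

```python
# Ranking sketch: group_id marks which query each object belongs to.
import numpy as np
from catboost import CatBoostRanker, Pool

X = np.random.rand(100, 5)
y = np.random.rand(100)                            # relevance labels
group_id = np.sort(np.random.randint(0, 10, 100))  # groups must be contiguous

train = Pool(X, y, group_id=group_id)
ranker = CatBoostRanker(loss_function='YetiRank', iterations=50, verbose=False)
ranker.fit(train)
scores = ranker.predict(train)  # per-object ranking scores
```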

Breaking changes

  • #1628. Use CUDA 11 by default. CatBoost GPU now requires driver version >= 450.51.06 on Linux x86_64 and >= 451.82 on Windows x86_64.

Losses and metrics

  • Add MRR and ERR metrics on CPU.
  • Add LambdaMart loss.
  • #1557. Add survivalAFT base logic. Thanks to @blatr.
  • #1286. Add Cox Proportional Hazards Loss. Thanks to @fibersel.
  • #1595. Provide object-oriented interface for setting up metric parameters. Thanks to @ks-korovina.
  • Change default YetiRank decay to 0.85 for better quality.

Python package

  • #1372. Custom logging stream in python package. Thanks to @DianaArapova.
  • #1304. Callback after iteration functionality. Thanks to @qoter.

R package

  • #251. Train parameter synonyms. Thanks to @ebalukova.
  • #252. Add eval_metrics. Thanks to @ebalukova.

Speedups

  • [Python] Speed up custom metrics and objectives with numba (if available)
  • [Python] #1710. Large speedup for cv dataset splitting by sklearn splitter

Other

  • Use the Exact leaf estimation method as the default on GPU
  • [Spark] #1632. Update version of Scala 2.11 for security reasons.
  • [Python] #1695. Explicitly specify WHEEL 'Root-Is-Purelib' value

Bugfixes

  • Fix default projection dimension for embeddings
  • Fix use_weights for some eval_metrics on GPU - use_weights=False is always respected now
  • [Spark] #1649. The earlyStoppingRounds parameter is not recognized
  • [Spark] #1650. Error when using the autoClassWeights parameter
  • [Spark] #1651. Error about "Auto-stop PValue" when using odType "Iter" and odWait
  • Fix usage of pairlogit weights for CPU fallback metrics when training on GPU

v0.25.1

3 years ago

Speedup

  • CatBoost now uses non-owning NumPy arrays to pass C++ data to user-defined metric and loss functions in Python. This opens up many speedup possibilities: using those arrays in numba.jit-compiled code, in Cython code, or just with NumPy vector functions. Thanks @micyril! A sketch follows below.
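
A sketch of a custom objective that benefits from the zero-copy arrays: since approxes and targets now arrive as NumPy views, vectorized code can replace per-element Python loops. The vectorized form here is illustrative; calc_ders_range is the documented hook for custom objectives:

```python
# Custom logloss objective: calc_ders_range returns (first, second) derivatives.
import numpy as np
from catboost import CatBoostClassifier

class LoglossObjective:
    def calc_ders_range(self, approxes, targets, weights):
        approxes = np.asarray(approxes)  # non-owning views, cheap to wrap
        targets = np.asarray(targets)
        p = 1.0 / (1.0 + np.exp(-approxes))
        der1 = targets - p               # first derivative of -logloss
        der2 = -p * (1.0 - p)            # second derivative
        if weights is not None:
            w = np.asarray(weights)
            der1, der2 = der1 * w, der2 * w
        return list(zip(der1, der2))

model = CatBoostClassifier(loss_function=LoglossObjective(),
                           eval_metric='Logloss', iterations=10)
```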

Bugfixes

  • Fix #1620 - retrieval of R pointers by @david-cortes
  • Fix EvalMetricsResult.get_metric() by @Roffild
  • Fix multiclass AUC calculation #1615

v0.25

3 years ago

CatBoost for Apache Spark

This release includes the CatBoost for Apache Spark package, which supports training, model application, and feature evaluation on the Apache Spark platform. We've prepared two introductory videos: CatBoost for Apache Spark Introduction and CatBoost for Apache Spark Architecture. More details are available on the CatBoost for Apache Spark home page. A sketch follows below.
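A minimal PySpark sketch, assuming the catboost_spark package and Maven coordinates matching your Spark and Scala versions (both are assumptions; check the home page for the exact artifact names):

```python
# PySpark training sketch: data lives in a DataFrame with vector features.
from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructField, StructType, StringType
import catboost_spark

spark = (SparkSession.builder
         .master('local[*]')
         # Assumed coordinates; pick the build matching your Spark/Scala version.
         .config('spark.jars.packages', 'ai.catboost:catboost-spark_3.0_2.12:0.25')
         .appName('CatBoostSparkExample')
         .getOrCreate())

schema = StructType([StructField('features', VectorUDT()),
                     StructField('label', StringType())])
rows = [Row(Vectors.dense(0.1, 0.2, 0.3), '0'),
        Row(Vectors.dense(0.9, 0.8, 0.7), '1')]
df = spark.createDataFrame(rows, schema)

pool = catboost_spark.Pool(df)
model = catboost_spark.CatBoostClassifier().fit(pool)
predictions = model.transform(pool.data)
```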

Feature selection

CatBoost now supports a recursive feature elimination procedure: when you have many candidate features and want to keep only the most influential ones, it repeatedly trains models and eliminates the weakest features by feature importance. See our tutorial for details; a sketch follows below.
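A minimal sketch under the assumption that the procedure is exposed through the Python select_features method (the parameter names follow the Python API as we understand it):

```python
# Recursive feature elimination sketch: keep the 10 strongest of 50 candidates.
import numpy as np
from catboost import CatBoostRegressor, Pool

X = np.random.rand(1000, 50)
y = X[:, 0] + 0.1 * np.random.rand(1000)
train, test = Pool(X[:800], y[:800]), Pool(X[800:], y[800:])

model = CatBoostRegressor(iterations=200, verbose=False)
summary = model.select_features(
    train,
    eval_set=test,
    features_for_select=list(range(50)),  # candidate feature indices
    num_features_to_select=10,
    steps=5,                              # elimination rounds
    train_final_model=True,               # refit on the selected subset
)
print(summary['selected_features'])
```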

New features

  • Supported the Exact leaf estimation method for Quantile, MAE and MAPE losses on GPU. You can enable it by setting leaf_estimation_method=Exact explicitly (see the sketch after this list); in upcoming releases we plan to make it the default.
  • Supported uncertainty prediction for multiclassification models
  • #1568 Added support for SHAP values calculation for MultiRMSE models
  • #1520 Added support for pathlib.Path in python package
  • #1456 Added prehashed categorical features and text features to C API for model inference.
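
A sketch of opting in to the Exact method (first bullet above); the quantile parameters are illustrative:

```python
# Enable Exact leaf estimation explicitly for a GPU quantile regression.
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    loss_function='Quantile:alpha=0.9',
    leaf_estimation_method='Exact',  # explicit opt-in in this release
    task_type='GPU',
    iterations=100,
)
```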

Losses and metrics

  • Supported Huber and Tweedie losses in GPU training
  • QueryAUC metric implemented by @fibersel

Breaking changes

  • We changed the NDCG calculation for groups without relevant documents to make our NDCG score fully compatible with the XGBoost and LightGBM implementations. The score for such a group (where the ideal DCG equals zero) is now 1; previously we used a score of 0 in that case. A sketch of the new convention follows below.
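
A sketch of the new convention using one common exponential-gain DCG variant (CatBoost supports several DCG types; this one is illustrative):

```python
import numpy as np

def dcg(relevances):
    """Exponential-gain DCG of documents in the given order."""
    rel = np.asarray(relevances, dtype=float)
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, len(rel) + 2)))

def ndcg(relevances_in_predicted_order):
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:   # no relevant documents in the group
        return 1.0   # new convention (previously the score was 0.0)
    return dcg(relevances_in_predicted_order) / ideal
```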

Speedups

  • With the help of the Intel developer team, we switched our threading implementation to Intel Threading Building Blocks. This gives up to a 20% speedup on 28 threads, around a 2x speedup when training on 120 threads, and greatly improves scalability.
  • Speed up rendering of fstr plots.
  • Slightly speed up string casting in the Python package during pool creation.

R package

  • Added path expansion when saving/loading files in R by @david-cortes
  • Added functionality to restore R handle after deserializing model by @david-cortes
  • Retrieve R pointers outside loops to speed up scalar access by @david-cortes
  • Multiple R documentation edits from @david-cortes and @jameslamb
  • #1588 Added precision for converting params to json

Bugfixes

  • #1525 Problem with missing exported functions in Windows R package dll
  • #1315 Low CPU utilization in CPU cross-validation
  • #785 Predict on single item with iloc fixed by @feeeper
  • Segfaults due to null pointer in pool in R package fixed by @david-cortes
  • #1553 Added check for baseline dimensions count in apply
  • #1606 Allow to use CatBoost in AWS Lambda environment: fix bug with setting thread names
  • #1609 and #1309 Print proper error message if all params in grid were invalid
  • Ability to use docstrings in estimators added by @pawelopiela
  • Allow extra space at the end of line for libsvm format

Thanks!

  • We would like to recognize the Intel software engineering team's contributions to the CatBoost project.
  • Many thanks to our individual contributors: @david-cortes @jameslamb @pawelopiela @feeeper @fibersel

v0.24.4

3 years ago

Speedup

  • Major speedup of asymmetric tree training time on CPU (2x speedup on Epsilon with 16 threads). We would like to recognize the Intel software engineering team's contributions to the CatBoost project.

New features

  • From now on we are releasing Python 3.9 wheels. Related issues: #1491, #1509, #1510
  • Allow boost_from_average for MultiRMSE loss. Issue #1515
  • Add tag pairwise=False for sklearn compatibility. Fixes issue #1518

Bugfixes

  • Allow fstr calculation for datasets with embeddings
  • Fix feature_importances_ for fstr with texts
  • Virtual ensembles fix: use proper unshrinkage coefficients
  • Fixed constants in the RMSEWithUncertainty loss function calculation to correspond to the values from the original paper
  • Allow SHAP values calculation for models with zero weights and non-zero leaf values. We now use the sum of leaf weights on the train and current datasets to guarantee non-zero weights for leaves reachable on the current dataset. Fixes issues #1512, #1284

v0.24.3

3 years ago

New functionality

  • Support fstr for text features and embeddings. Issue #1293

Bugfixes

  • Fix model apply speed regression introduced in 0.24.1
  • Different fixes in embeddings support: fixed apply and model serialization, fixed apply on texts and embeddings
  • Fixed virtual ensembles prediction: use proper scaling, fix apply (issue #1462)
  • Fix the score() method for RMSEWithUncertainty. Issue #1482
  • Automatically use correct prediction_type in score()