A fast, scalable, high-performance library for gradient boosting on decision trees, used for ranking, classification, regression, and other machine learning tasks in Python, R, Java, and C++. Supports computation on CPU and GPU.
- Loading data from `numpy.ndarray`s with `float32` data type is now multithreaded. Significant speedups of 5x up to 10x (on CPUs with many cores) can be expected. #385, #2542
- `best_score_`, `evals_result_`, and `best_iteration_` model attributes now work after model saving and loading. They can be removed by model metadata manipulation if needed. #1166
- Class predictions for models that have been trained with boolean targets will now also be boolean, instead of `True`/`False` strings as before. Such models will be incompatible with previous versions of CatBoost appliers. If you want the old behavior, convert your target to `False`/`True` strings before training. #1954
- Limit the `jupyterlab` version for setup to 3.x for now. Fixes #2530
- `utils.read_cd`: support CD files with non-increasing column indices.
- Make the `log_cout`/`log_cerr` specification consistent; avoid resetting them in recursive calls.
- Fixes for `log_cout`/`log_cerr`. #2195
- Fixes for the `Cox`, `PairLogitPairwise`, `UserPerObjMetric`, and `SurvivalAft` losses.
- `fit` (with `Pool` arguments) and class prediction in Python. #1954
- Auxiliary columns by name in evaluation result output. #1659
- Use the `clang-cl` compiler and tools from Visual Studio 2022 for the build without CUDA (the build with CUDA still uses the standard Microsoft toolchain from Visual Studio 2019).
- Add `os.version` to `conan` host settings to ensure version consistency.
- Use `-mno-outline-atomics` for modern versions of Clang and GCC to avoid unresolved-symbol linking errors. #2527
- Fix `CMakeLists` for the `util` unit tests. #2525
- Fix `Pool()` creation when `pairs_weight` is a numpy array. #1913
- Fix the `__call__` method. #2277
- Fix the `Targets are required for YetiRank loss function.` error in cross-validation. #2083
- Fix `Pool.get_label()` returning constant `True` for boolean labels. #2133
- Do not lose `best_score_`, `evals_result_`, and `best_iteration_` attribute values anymore. #1793
- The `Precision` metric default value in the absence of positive samples is changed to 0, and a warning is added (similar to the behavior of the `scikit-learn` implementation). #2422
- Fix behavior when `Target` data is available.
- Fix the `Error: can't proceed some features` error on GPU. #1024
- Fix `allow_const_label=True` for classification. #1933
- Fix the `SurvivalAft` objective/metric.
- Fix `eval_metric` in binary Python packages of version 1.2.1 on PyPI. #2486
- New `mode` parameter. See the "Which Tricks are Important for Learning to Rank?" paper for details (this family of losses is called `YetiLoss` there). CPU-only for now.
- New `catboost.sample_gaussian_process` function. #2408, thanks to @TakeOver. See the "Gradient Boosting Performs Gaussian Process Inference" paper for details.
- Use `int` instead of the deprecated `numpy.int`. #2378
- Add `ModelCalcerWrapper::CalcFlatTransposed`. #2413, thanks to @faucct

CatBoost's build system has been switched from Ya Make (Yandex's build system) to CMake. This means more transparency in the build process and more familiar tools for open-source developers. For now it is possible to build CatBoost for:

This allowed us to prepare the Python package in source distribution form (also known as `sdist`). #830
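The boolean-target change above (#1954) breaks compatibility with older appliers. If you need the old string labels, the conversion can be done in NumPy before training. A minimal sketch (variable names are illustrative, not CatBoost API):

```python
import numpy as np

# Hypothetical boolean target vector for a binary classification task.
y_bool = np.array([True, False, True])

# Reproduce the pre-change string labels: map booleans to "False"/"True"
# strings before training, so older CatBoost appliers stay compatible.
y_str = np.where(y_bool, "True", "False")

print(list(y_str))  # ['True', 'False', 'True']
```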
- The `msvs` subdirectory with the Microsoft Visual Studio solution has been removed. Visual Studio solutions can be generated using CMake instead.
- The `make` subdirectory with Makefiles has been removed. Use CMake + `ninja` (recommended) or CMake + `make` instead.
- The Python package build uses the standard `setup.py` instead of the custom `mk_wheel.py` script. All common scenarios (`sdist`, `build`, `install`, editable `install`, `bdist_wheel`) are supported.
- Switched from `manylinux1` to `manylinux2014`.
- Add `fixed_binary_splits` to the regressor, classifier, and ranker.
- Generalize the `String` and `Vec` types for features to `AsRef` of slices to make the code more generic.
- Add the `binary-classification-threshold` parameter to the CLI model applier.
- Support the `RMSEWithUncertainty` loss function on GPU.
- Support the `MultiLogloss` and `MultiCrossEntropy` loss functions with numerical features on GPU.
- Support the `MultiLogloss` loss function with text features on CPU and GPU. #1885
- Add the `Focal` loss (CPU-only for now). #1807, thanks to @diditforlulz273.
- Speed up `MultiLogloss` on CPU by 8% per tree (110K samples, 20 targets, 480 float features, 3 cat features, 16-core CPU).
- Fix `TFullModel::SetEvaluatorType` (it was possible to get a segmentation fault when calling it for a non-available implementation). Add `TFullModel::GetSupportedEvaluatorTypes`.
- Fix handling of `allow_write_files=True`.
- Fix `_get_embedding_feature_indices`. #2273
- Fix `set_feature_names` with text or embedding features. #2090
- Fix: the `libs/model_interface` applier always produced an error in CUDA mode.
- Fix `catboost/cuda/cuda_util/sort.cpp:166: CUDA error 9` on Nvidia Ampere-based GPUs.
- Fix `utils.eval_metrics` for groupwise metrics when group data has not been specified. #2343

P.S. There's an issue with somewhat unexpected binary size increases. We're investigating in #2369.
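Regarding the groupwise-metrics fix above: groupwise (ranking) metrics are computed per query group and then averaged, which is why group data is required. A toy pure-Python sketch of that aggregation (not CatBoost internals; the data is illustrative):

```python
from collections import defaultdict

# Hypothetical per-document rows: (group_id, relevance) pairs,
# where group_id identifies the query a document belongs to.
rows = [(0, 1.0), (0, 0.0), (1, 1.0), (1, 1.0), (1, 0.0)]

# Collect documents by query group.
groups = defaultdict(list)
for gid, rel in rows:
    groups[gid].append(rel)

# Compute a per-group statistic, then average across groups —
# without group ids this aggregation is not well defined.
per_group_mean = {gid: sum(r) / len(r) for gid, r in groups.items()}
overall = sum(per_group_mean.values()) / len(per_group_mean)
print(per_group_mean, overall)
```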
- Fix `GetModelUsedFeaturesNames`. #2204
- Fix `utils.create_cd`. #2193
- Fix `np.ndarray` with `dtype=object`. #2201
- Fix `feature_names` in `utils.create_cd`. #2211

Multiquantile regression: it is now possible to train models with a shared tree structure and multiple predicted quantile values in each leaf. Currently this approach doesn't give a strong guarantee of consistency between the predicted quantile values, but it still provides more consistency than training multiple independent models, one per quantile. A short description is available in the documentation. Short example for Python: `loss_function='MultiQuantile:alpha=0.2,0.4'`. Supported only on CPU for now.
Support text and embedding features for regression and ranking.
Spark: Read/write Spark's Dataset-like API for Pool. #2030
Support the `HashedCateg` column type. This allows using externally prehashed categorical features both in training and prediction.
New option `plot_file` in Python functions with the `plot` parameter allows saving plots to a file. #758
Add eval_fraction parameter. #1500
Non-symmetric trees model summation.
The `init_model` parameter now works with non-symmetric trees.
Partial support for Apache Spark 3.3 (only for Scala 2.12 and without PySpark).
- `--fixed-binary-splits` or `fixed_binary_splits` in the Python package (by default, there are no fixed splits).
- Fix `fit` for PySpark estimators. #1976
- Fixes for `MAE`, `MAPE`, and `Quantile` on GPU.
- Fixes for `BrierScore`. #1967
- Fix the `plot_tree` example in the documentation.
- Fixes for `cv`.
- Add the `sort` param to the `FilteredDCG` metric.
- `StochasticRank` for `FilteredDCG`.
- Fixes for `loss_function`.
- Fixes for `calc_feature_statistics`.
- Fixes for `calc_metrics` mode.
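For context on the DCG family of ranking metrics referenced above, here is a toy implementation of plain DCG in pure Python. This illustrates the standard formula only, not CatBoost's exact `FilteredDCG` implementation; the relevance values are made up:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over 1-based ranks."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

# Documents in the order given by predicted scores; putting highly
# relevant documents earlier yields a larger DCG.
print(dcg([3, 2, 0, 1]))
```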