A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks, for Python, R, Java and C++. Supports computation on CPU and GPU.
- Supported the `MultiRMSE` loss function.
- Added a `group_weight` parameter to the `catboost.utils.eval_metric` method to allow passing weights for object groups. This makes it possible to correctly match weighted ranking metrics computation when group weights are present.
- Embedding features can now be passed to the `Pool` constructor or the `fit` function via the `embedding_features=['EmbeddingFeaturesColumnName1', ...]` parameter. Another way to add your embedding vectors is the new `NumVector` column type in the Column Description file, together with a semicolon-separated embeddings column in your XSV file: `ClassLabel\t0.1;0.2;0.3\t...` (see the sketch after this list).
- Supported `use_weights` for metrics when the `auto_class_weights` parameter is set.
- Added the `plot_predictions` function.
- The `average` parameter is now passed to the `TotalF1` metric when training on GPU.
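For illustration, a minimal sketch of the Python route. The data, the `title_embedding` column name, and the use of a pandas frame are made up for the example; it assumes cells of the embedding column hold fixed-length vectors, as described above:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Toy frame: one ordinary numeric feature plus an embedding column
# whose cells hold fixed-length vectors.
df = pd.DataFrame({
    "price": [0.5, 1.5, 0.7, 2.1],
    "title_embedding": [np.random.rand(3) for _ in range(4)],
})
labels = [0, 1, 0, 1]

# Mark the embedding column by name via embedding_features.
pool = Pool(df, label=labels, embedding_features=["title_embedding"])
CatBoostClassifier(iterations=20, verbose=False).fit(pool)
```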
The main feature of this release is support for total uncertainty prediction via virtual ensembles. You can read the theoretical background in the preprint Uncertainty in Gradient Boosting via Ensembles from our research team.
We introduced a new training parameter, `posterior_sampling`, that allows estimating total uncertainty. Setting `posterior_sampling=True` implies enabling Langevin boosting, setting `model_shrink_rate` to `1/(2*N)`, and setting `diffusion_temperature` to `N`, where `N` is the dataset size.
The CatBoost object method `virtual_ensembles_predict` splits the model into `virtual_ensembles_count` submodels. Calling `model.virtual_ensembles_predict(..., prediction_type='TotalUncertainty')` returns the mean prediction and variance (and knowledge uncertainty for models trained with the `RMSEWithUncertainty` loss function). Calling `model.virtual_ensembles_predict(..., prediction_type='VirtEnsembles')` returns `virtual_ensembles_count` predictions of the virtual submodels for each object.
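A minimal sketch of this workflow; the toy data and the choice of `virtual_ensembles_count=5` are arbitrary:

```python
from catboost import CatBoostRegressor

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

# posterior_sampling=True enables Langevin boosting with the
# model_shrink_rate and diffusion_temperature settings given above.
model = CatBoostRegressor(iterations=200, posterior_sampling=True, verbose=False)
model.fit(X, y)

# Mean prediction and variance per object from 5 virtual submodels.
total = model.virtual_ensembles_predict(
    X, prediction_type='TotalUncertainty', virtual_ensembles_count=5)

# Raw predictions of every virtual submodel for each object.
per_model = model.virtual_ensembles_predict(
    X, prediction_type='VirtEnsembles', virtual_ensembles_count=5)
```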
- Added the `n_features_in_` attribute required for using CatBoost in sklearn pipelines. Issue #1363.
- Models can now be deserialized from a binary blob with `load_model(blob=b'....')`; to deserialize from a file-like stream, use `load_model(stream=gzip.open('model.cbm.gz', 'rb'))`.
- New `RMSEWithUncertainty` loss function: it allows estimating data uncertainty for trained regression models. The trained model gives a two-element vector for each object, with the first element the regression prediction and the second element an estimate of data uncertainty for that prediction (see the sketch after this list).
- Added the `model.feature_names_` attribute. Issue #1314.
- Models can now be passed to `model_sum()` or used as the base model in `init_model=`. Issue #1271.
- New `plot_partial_dependence` method in the python-package (currently it works only for models with symmetric trees trained on datasets with numerical features only). Implemented by @felixandrer.
- Supported the `boost_from_average` option together with the `model_shrink_rate` option; in this case shrinkage is applied to the starting value.
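A minimal sketch of `RMSEWithUncertainty` with toy data; per the note above, each row of the returned predictions holds the regression prediction and a data-uncertainty estimate:

```python
from catboost import CatBoostRegressor

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1.2, 1.8, 3.1, 4.2, 4.9]

model = CatBoostRegressor(loss_function='RMSEWithUncertainty',
                          iterations=100, verbose=False)
model.fit(X, y)
pred = model.predict(X)  # array of shape (n_objects, 2)
```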
- New `auto_class_weights` option in the python-package, R-package and CLI, with possible values `Balanced` and `SqrtBalanced`. With `Balanced`, every class is weighted `maxSumWeightInClass / sumWeightInClass`, where `sumWeightInClass` is the sum of the weights of all samples in this class (if no weights are present, each sample has weight 1) and `maxSumWeightInClass` is the maximum such sum among all classes. With `SqrtBalanced` the formula is `sqrt(maxSumWeightInClass / sumWeightInClass)`. This option is supported for binclass and multiclass tasks. Implemented by @egiby.
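To make the formula concrete, a small sketch that reproduces the `Balanced` and `SqrtBalanced` weights for a toy, unweighted label set (so each sample contributes weight 1):

```python
import numpy as np

labels = np.array([0, 0, 0, 0, 1])  # 4 samples of class 0, 1 of class 1
sum_weight = {c: float(np.sum(labels == c)) for c in np.unique(labels)}
max_sum = max(sum_weight.values())

balanced = {c: max_sum / s for c, s in sum_weight.items()}    # {0: 1.0, 1: 4.0}
sqrt_balanced = {c: np.sqrt(w) for c, w in balanced.items()}  # {0: 1.0, 1: 2.0}

# During training the same effect comes from, e.g.:
#   CatBoostClassifier(auto_class_weights='Balanced')
```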
- New `model_size_reg` option on GPU, set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU, the cost of a combination equals the number of different values of this combination that are present in the training dataset. On GPU, the cost of a combination equals the number of all possible different values of this combination. For example, if the combination contains two categorical features c1 and c2, the cost will be #categories in c1 * #categories in c2, even though many of the values from this combination might not be present in the dataset.
- A CatBoost model can now be converted into an ONNX object in Python with the `catboost.utils.convert_to_onnx_object` method. Implemented by @monkey0head.
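For illustration, a minimal sketch. It assumes the `onnx` package is available and that the returned object is an ONNX `ModelProto`, so standard protobuf serialization applies; the file name is made up:

```python
from catboost import CatBoostClassifier
from catboost.utils import convert_to_onnx_object

model = CatBoostClassifier(iterations=10, verbose=False)
model.fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])

onnx_model = convert_to_onnx_object(model)  # in-memory ONNX object
with open('model.onnx', 'wb') as f:         # serialize it yourself if needed
    f.write(onnx_model.SerializeToString())
```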
metric CatBoost will print TotalF1:average=Weighted
as corresponding metric column header in error logs. Implemented by @ivanychevclass_weights
parameter accepts dictionary with class name to class weight mapping_get_tags()
method for compatibility with sklearn (issue #1282). Implemented by @crazylegloss_function
param in python cv
method.catboost.utils.quantize
- New `catboost.utils.quantize` function to create a quantized `Pool`. See the usage example in issue #1116. Implemented by @noxwell.
- New `save_quantization_borders` method that allows saving the resulting borders into a file and using them for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU; doing quantization once for several trainings can significantly reduce running time. For large datasets it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training. Use saved borders when quantizing other Pools by specifying the `input_borders` parameter of the `quantize` method (see the sketch below). Implemented by @noxwell.
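The recommended workflow as a minimal sketch, using random in-memory data; the borders file name is illustrative:

```python
import numpy as np
from catboost import CatBoost, Pool

rng = np.random.RandomState(0)
train_pool = Pool(rng.rand(1000, 10), label=rng.rand(1000))
valid_pool = Pool(rng.rand(200, 10), label=rng.rand(200))

# Quantize the training data once and save the borders...
train_pool.quantize()
train_pool.save_quantization_borders('borders.tsv')
# ...then reuse the same borders for the validation data.
valid_pool.quantize(input_borders='borders.tsv')

model = CatBoost({'iterations': 50})
model.fit(train_pool, eval_set=valid_pool, verbose=False)
```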
- Supported `border_count` > 255 for GPU training. This might be useful if you have a "golden feature", see the docs.
- New `feature_weights` parameter, e.g. `feature_weights="FeatureName1:1.5,FeatureName2:0.5"`. Scores for splits with these features will be multiplied by the corresponding weights. Implemented by @Taube03.
- New `first_use_feature_penalties` parameter. It penalizes the first usage of a feature and should be used when calculating the feature is costly. The penalty value (the cost of using the feature) is subtracted from the scores of this feature's splits as long as the feature has not been used in the model. Once the feature has been used, further use of it is considered free, so no subtraction is done. There is also a common multiplier for all `first_use_feature_penalties`; it can be specified via the `penalties_coefficient` parameter (both options appear in the sketch below). Implemented by @Taube03 (issue #1155).
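A minimal sketch combining both options; the feature names and penalty values are made up:

```python
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({'cheap': [0, 1, 0, 1], 'costly': [1, 0, 1, 1]})
y = [0, 1, 0, 1]

model = CatBoostClassifier(
    iterations=50,
    feature_weights='cheap:1.5,costly:0.5',  # multiplies split scores
    first_use_feature_penalties='costly:3',  # one-time cost until first use
    penalties_coefficient=1.0,               # common multiplier for penalties
    verbose=False,
)
model.fit(df, y)
```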
- The `recordCount` attribute is added to PMML models (issue #1026).
- The `Tweedie` loss is supported now. It can be a good solution for a right-skewed target with many zero values, see the tutorial. When using the `CatBoostRegressor.predict` function, the default `prediction_type` for this loss is `Exponent` (see the sketch below). Implemented by @ilya-pchelintsev (issue #577).
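A minimal sketch; the `variance_power` value is an arbitrary choice within the Tweedie family:

```python
from catboost import CatBoostRegressor

# Right-skewed target with many zeros (e.g. claim amounts).
X = [[1], [2], [3], [4], [5], [6]]
y = [0.0, 0.0, 0.0, 1.2, 0.0, 7.5]

model = CatBoostRegressor(loss_function='Tweedie:variance_power=1.5',
                          iterations=100, verbose=False)
model.fit(X, y)
pred = model.predict(X)  # default prediction_type is Exponent for this loss
```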
- New `proba_border` parameter. With this parameter you can set the decision boundary for treating a prediction as negative or positive. Implemented by @ivanychev.
- `TotalF1` supports a new parameter `average` with possible values `weighted`, `micro`, `macro`. Implemented by @ilya-pchelintsev.
- Multi-label custom metrics can now be used as `eval_metric`. It is not possible to use them as an optimization objective. To write a multi-label metric, you need to define a python class which inherits from the `MultiLabelCustomMetric` class, as in the sketch below. Implemented by @azikmsu.
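A rough sketch of the shape such a class might take. Caution: the method names and argument layout below mirror CatBoost's single-label custom metric interface (`is_max_optimal` / `evaluate` / `get_final_error`) and are assumptions here, as is the import location; consult the docs for the exact contract:

```python
import numpy as np
from catboost import MultiLabelCustomMetric  # import location is an assumption


class MultiLabelAccuracy(MultiLabelCustomMetric):
    """Fraction of correctly predicted (object, label) pairs."""

    def is_max_optimal(self):
        return True  # larger is better

    def evaluate(self, approxes, target, weight):
        # Assumed layout: approxes/target are per-label matrices of raw
        # scores and 0/1 ground truth respectively.
        pred = (np.array(approxes) > 0).astype(int)
        return float(np.mean(pred == np.array(target))), 1.0

    def get_final_error(self, error, weight):
        return error
```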
- The `class_weights` parameter is now supported in grid/randomized search. Implemented by @vazgenk.
- `get_best_score` returns the train/validation best score after grid/randomized search (in case of `refit=False`). Implemented by @rednevaler.
- SHAP interaction values: use `CatBoost.get_feature_importance` to get a matrix of SHAP values for every prediction. By default, SHAP interaction values are calculated for all features; you may specify features of interest using the `interaction_indices` argument. Implemented by @IvanKozlov98.
- SHAP values calculation can be sped up by setting the `shap_calc_type` parameter of the `CatBoost.get_feature_importance` function to `"Approximate"` (see the sketch below). Implemented by @LordProtoss (issue #1146).
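A minimal sketch of both options on random data; the `'ShapInteractionValues'` importance type name is taken from the CatBoost docs and should be double-checked there:

```python
import numpy as np
from catboost import CatBoostClassifier, Pool

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
pool = Pool(X, y)

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(pool)

# Per-object SHAP values; "Approximate" trades exactness for speed.
shap_values = model.get_feature_importance(
    pool, type='ShapValues', shap_calc_type='Approximate')

# SHAP interaction values restricted to the first two features.
interactions = model.get_feature_importance(
    pool, type='ShapInteractionValues', interaction_indices=[0, 1])
```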
- The `PredictionDiff` model analysis method can now be used with models that contain non-symmetric trees. Implemented by @felixandrer.
- When using the `CatBoostRegressor.predict` function for models trained with the `Poisson` loss, the default `prediction_type` is now `Exponent` (issue #1184). Implemented by @garkavem.

This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.
- Non-symmetric trees can now be grown with the `grow_policy` parameter. Starting from this release, non-symmetric trees are supported for both CPU and GPU training (a configuration sketch follows below).
- Added the `to_regressor` and `to_classifier` methods.
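For example (values per the docs: `SymmetricTree` is the default, while `Depthwise` and `Lossguide` grow non-symmetric trees; `max_leaves` applies to `Lossguide`):

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=50, grow_policy='Lossguide',
                           max_leaves=31, verbose=False)
model.fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])
```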
The release also contains a list of bug fixes.

- To use Langevin boosting, enable the `langevin` option and tune `diffusion_temperature` and `model_shrink_rate`; see the corresponding paper for details. This works not only for the `Logloss` objective, but also for `RMSE` (on CPU and GPU) and `MultiClass` (on GPU).
- Proper class labels are now used in the `classes_` attribute and by prediction functions with `prediction_type=Class`. #305, #999, #1017. Note: class labels loaded from datasets in CatBoost dsv format always have string type now.
- Fixed a bug with `boost_from_average=True`. #1125
- Fixed: `catboost.get_feature_importance` did not work after the model was loaded. #1064
- Fixed: `catboost.train` did not work when called with a single dataset parameter. #1162
- The `classes_count` and `class_weight` params can now be used with user-defined loss functions. #1119
- `use_weights` gets its value by default. #1106
- Fixed the `model.classes_` attribute for binary classification (proper labels instead of always `0` and `1`). #984
- Fixed the `model.classes_` attribute when the `classes_count` parameter was specified.
- Made `leaf_estimation_method=Exact` the default for the `MAPE` loss.
- Added `CatBoostClassifier.predict_log_proba()`, PR #1095.
- Fixed `get_feature_importance`, PR #1090.
- Fixed `boost_from_average` mode.