Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
The 2.0.3 patch release makes the following bug fixes:
Full Changelog: https://github.com/dmlc/xgboost/compare/v2.0.2...v2.0.3
You can verify the downloaded packages by running the following command on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
7c4bd1cf6162d335fd20a8168a54dd11508342f82fbf381a80c02ac57be0bce4 xgboost-2.0.3.tar.gz
d0c3499504133a8ea0043da2974c51cc71aae792f0719080bc227d7add8fb881 xgboost_r_gpu_win64_2.0.3.tar.gz
ee47da5b21231965b1f054d191a5418543377f4ba0d0615a593a6f99d1832ca1 xgboost_r_gpu_linux_2.0.3.tar.gz
Experimental binary packages for R with CUDA enabled
The 2.0.2 patch release makes the following bug fixes:
This is a patch release for bug fixes.
In addition, this is the first release where the JVM package is distributed with native support for Apple Silicon.
You can verify the downloaded packages by running the following command on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
529e9d0f88c2a7abae833f05b7d1e7e7ce01de20481ea60f6ebb6eb7fc96ba69 xgboost.tar.gz
25342c91e7cda98b1362b70282b286c2e4f3e996b518fb590c1303f53f39f188 xgboost_r_gpu_win64_2.0.1.tar.gz
3d8cde1160ab135c393b8092ce0475709dff318024022b735a253d968f9711b3 xgboost_r_gpu_linux_2.0.1.tar.gz
Experimental binary packages for R with CUDA enabled
Source tarball
We are excited to announce the release of XGBoost 2.0. This note will begin by covering some overall changes and then highlight specific updates to the package.
We have been working on vector-leaf tree models for multi-target regression, multi-label classification, and multi-class classification in version 2.0. Previously, XGBoost would build a separate model for each target. However, with this new feature that's still being developed, XGBoost can build one tree for all targets. The feature has multiple benefits and trade-offs compared to the existing approach. It can help prevent overfitting, produce smaller models, and build trees that consider the correlation between targets. In addition, users can combine vector-leaf and scalar-leaf trees during a training session using a callback. Please note that the feature is still a work in progress, and many parts are not yet available. See #9043 for the current status. Related PRs: (#8538, #8697, #8902, #8884, #8895, #8898, #8612, #8652, #8698, #8908, #8928, #8968, #8616, #8922, #8890, #8872, #8889, #9509) Please note that only the hist (default) tree method on CPU can be used for building vector-leaf trees at the moment.
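As an illustration, here is a minimal sketch of training vector-leaf trees through the multi_strategy parameter, assuming the scikit-learn interface and synthetic data:

```python
import numpy as np
import xgboost as xgb

# Synthetic data with three correlated targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
y = np.stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 2], 2.0 * X[:, 3]], axis=1)

# multi_strategy="multi_output_tree" builds one (vector-leaf) tree for all
# targets; the default "one_output_per_tree" keeps the per-target behavior.
# Only the CPU hist tree method supports vector leaves at the moment.
reg = xgb.XGBRegressor(
    tree_method="hist",
    multi_strategy="multi_output_tree",
    n_estimators=32,
)
reg.fit(X, y)
print(reg.predict(X[:2]).shape)  # (2, 3): one value per target
```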
The device parameter
A new device parameter replaces the existing gpu_id, gpu_hist, gpu_predictor, cpu_predictor, gpu_coord_descent, and the PySpark-specific parameter use_gpu. From now on, users need only the device parameter to select which device to run on, along with the ordinal of the device. For more information, please see our documentation page (https://xgboost.readthedocs.io/en/stable/parameter.html#general-parameters). For example, with device="cuda", tree_method="hist", XGBoost will run the hist tree method on GPU. (#9363, #8528, #8604, #9354, #9274, #9243, #8896, #9129, #9362, #9402, #9385, #9398, #9390, #9386, #9412, #9507, #9536) The old behavior of gpu_hist is preserved but deprecated. In addition, the predictor parameter is removed.
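A minimal sketch of the new parameter, assuming a CUDA-capable machine and synthetic data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1024, 10)
y = np.random.rand(1024)
dtrain = xgb.DMatrix(X, label=y)

# Before 2.0: tree_method="gpu_hist", gpu_id=0 (now deprecated).
# From 2.0: pick the algorithm and the device independently; "cuda:0"
# selects a specific ordinal, while plain "cuda" uses the default device.
params = {"tree_method": "hist", "device": "cuda:0"}
booster = xgb.train(params, dtrain, num_boost_round=10)
```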
hist is now the default tree method
Starting from 2.0, the hist tree method is the default. In previous versions, XGBoost chose approx or exact depending on the input data and training environment. The new default helps XGBoost train models more efficiently and consistently. (#9320, #9353)
There's initial support for using the approx tree method on GPU. The approx method is not yet well optimized but is feature complete, except for the JVM packages. It can be accessed through the parameter combination device="cuda", tree_method="approx". (#9414, #9399, #9478) Please note that the Scala-based Spark interface is not yet supported.
XGBoost has a new parameter, max_cached_hist_node, for users to limit the CPU cache size for histograms. It can help prevent XGBoost from caching histograms too aggressively: without the cache, performance is likely to decrease, but the size of the cache grows exponentially with the depth of the tree, so the limit can be crucial when growing deep trees. In most cases, users need not configure this parameter, as it does not affect the model's accuracy. (#9455, #9441, #9440, #9427, #9400)
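For example, a minimal sketch of capping the histogram cache when growing deep trees; the cap value here is arbitrary and the data synthetic:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(4096, 16)
y = np.random.rand(4096)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",
    "max_depth": 12,               # deep trees are where the cache grows fast
    "max_cached_hist_node": 1024,  # arbitrary cap on cached histogram nodes
}
booster = xgb.train(params, dtrain, num_boost_round=10)
```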
Along with the cache limit, XGBoost also reduces the memory usage of the hist and approx tree methods on distributed systems by cutting the size of the cache by half. (#9433)
There is some exciting development around external memory support in XGBoost. It's still an experimental feature, but its performance has been significantly improved with the default hist tree method. We replaced the old file IO logic with memory-mapped files. In addition to the better performance, we have reduced CPU memory usage and added extensive documentation. Beginning from 2.0.0, we encourage users to try it with the hist tree method when the memory saving by QuantileDMatrix is not sufficient. (#9361, #9317, #9282, #9315, #8457)
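A minimal sketch of the iterator-based external memory flow: the in-memory batches below stand in for files read from disk, and the cache path is an arbitrary choice.

```python
import numpy as np
import xgboost

class BatchIter(xgboost.DataIter):
    """Yields pre-made batches; a real iterator would load files on demand."""

    def __init__(self, batches):
        self._batches = batches
        self._pos = 0
        # cache_prefix points XGBoost at the on-disk cache location.
        super().__init__(cache_prefix="./xgb-cache")

    def next(self, input_data):
        if self._pos == len(self._batches):
            return 0                 # 0 signals the end of one pass over the data
        X, y = self._batches[self._pos]
        input_data(data=X, label=y)  # hand the current batch to XGBoost
        self._pos += 1
        return 1                     # 1 means a batch was produced

    def reset(self):
        self._pos = 0                # rewind for the next pass

rng = np.random.default_rng(0)
batches = [(rng.normal(size=(1000, 8)), rng.normal(size=1000)) for _ in range(4)]
dtrain = xgboost.DMatrix(BatchIter(batches))  # spills to ./xgb-cache*
booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=8)
```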
We created a brand-new implementation for the learning-to-rank task. With the latest version, XGBoost gained a set of new features for the ranking task, including:
- A lambdarank_pair_method parameter for choosing the pair construction strategy.
- A lambdarank_num_pair_per_sample parameter for controlling the number of samples for each group.
- Unbiased learning-to-rank via the lambdarank_unbiased parameter.
- Configurable gain for NDCG using the ndcg_exp_gain parameter.
- NDCG is now the default objective function.
- Improvements to XGBRanker.
For more information, please see the tutorial. Related PRs: (#8771, #8692, #8783, #8789, #8790, #8859, #8887, #8893, #8906, #8931, #9075, #9015, #9381, #9336, #8822, #9222, #8984, #8785, #8786, #8768)
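A minimal sketch of the new ranking parameters through XGBRanker, using synthetic relevance labels and query groups:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)    # graded relevance labels
qid = np.repeat(np.arange(20), 10)  # 20 query groups, 10 documents each

ranker = xgb.XGBRanker(
    objective="rank:ndcg",          # NDCG, now the default objective
    lambdarank_pair_method="topk",  # pair construction strategy ("topk" or "mean")
    lambdarank_num_pair_per_sample=8,
    tree_method="hist",
)
ranker.fit(X, y, qid=qid)           # qid marks the query group of each row
scores = ranker.predict(X)          # higher score = ranked earlier
```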
In the previous version, base_score was a constant that could be set as a training parameter. In the new version, XGBoost can automatically estimate this parameter based on input labels for optimal accuracy. (#8539, #8498, #8272, #8793, #8607)
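For example (a small sketch with synthetic data): leaving base_score unset lets 2.0 estimate the intercept from the labels, while passing an explicit value keeps the old constant behavior.

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(256, 4)
y = np.random.rand(256) + 3.0  # labels with a noticeable offset
dtrain = xgb.DMatrix(X, label=y)

# base_score not set: 2.0 estimates the intercept from the labels.
auto = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=8)

# An explicit base_score keeps the pre-2.0 fixed-constant behavior.
fixed = xgb.train({"tree_method": "hist", "base_score": 0.5}, dtrain,
                  num_boost_round=8)
```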
The XGBoost algorithm now supports quantile regression, which involves minimizing the quantile loss (also called "pinball loss"). Furthermore, XGBoost allows for training with multiple target quantiles simultaneously with one tree per quantile. (#8775, #8761, #8760, #8758, #8750)
Both objectives use adaptive trees due to the lack of proper Hessian values. In the new version, XGBoost can scale the leaf value with the learning rate accordingly. (#8866)
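A minimal sketch of quantile regression with several target quantiles at once, using synthetic data and the parameter names from the 2.0 documentation:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(512, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=512)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:quantileerror",   # the quantile ("pinball") loss
    "quantile_alpha": [0.1, 0.5, 0.9],  # one tree per target quantile
    "tree_method": "hist",
}
booster = xgb.train(params, dtrain, num_boost_round=32)
pred = booster.inplace_predict(X)       # shape (512, 3): one column per quantile
```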
Using the Python or the C package, users can export the quantile values (not to be confused with quantile regression) used for the hist tree method. (#9356)
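A sketch of what the export can look like from Python; the get_quantile_cut accessor and its CSC-style return value are assumptions based on the 2.0 API, so verify against the documentation:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(256, 4)
y = np.random.rand(256)
Xy = xgb.QuantileDMatrix(X, label=y, max_bin=16)

# Assumption: get_quantile_cut() returns (indptr, values), where
# values[indptr[i]:indptr[i+1]] are the hist bin boundaries of feature i.
indptr, values = Xy.get_quantile_cut()
print(values[indptr[0]:indptr[1]])  # cut points for the first feature
```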
We made progress on column-based split for federated learning. In 2.0, approx, hist, and hist with vector leaf can all work with column-based data split, along with support for vertical federated learning. Work on GPU support is still ongoing; stay tuned. (#8576, #8468, #8442, #8847, #8811, #8985, #8623, #8568, #8828, #8932, #9081, #9102, #9103, #9124, #9120, #9367, #9370, #9343, #9171, #9346, #9270, #9244, #8494, #8434, #8742, #8804, #8710, #8676, #9020, #9002, #9058, #9037, #9018, #9295, #9006, #9300, #8765, #9365, #9060)
After the initial introduction of the PySpark interface, it has gained some new features and optimizations in 2.0. The use_gpu parameter is deprecated; the device parameter is preferred, as sketched below.
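A minimal sketch of the replacement, assuming an active Spark session and a training DataFrame train_df with "features" and "label" columns:

```python
from xgboost.spark import SparkXGBClassifier

# use_gpu=True is deprecated; the device parameter takes its place.
clf = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    device="cuda",  # replaces use_gpu=True
    num_workers=4,
)
model = clf.fit(train_df)  # train_df: assumed Spark DataFrame
```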
Here's a list of new features that don't have their own section and yet are general to all language bindings.
These optimizations are general to all language bindings. For language-specific optimizations, please visit the corresponding sections.
- Input handling via the array_interface on CPU (like numpy) is significantly improved. (#9090)

Other than the aforementioned change with the device parameter, here's a list of breaking changes affecting all packages.
- Use dedicated data structures such as numpy.ndarray instead of relying on text inputs. See https://github.com/dmlc/xgboost/issues/9472 for more info.

Some noteworthy bug fixes that are not related to specific language bindings are listed in this section.
- inf is checked during data construction. (#8911)
- Fix the case where the updater parameter is used instead of the tree_method parameter. (#9355)
- Handle \t\n in feature names for JSON model dump. (#9474)
- Support ~ in file paths on Unix. (#9463) In addition, all path inputs are required to be encoded in UTF-8. (#9448, #9443)

Aside from documents for new features, we have many smaller updates to improve user experience, from troubleshooting guides to typo fixes.
- Improvements to the plot_importance plot. (#8540)
- Support the __half type, and no data copy is made. (#8487, #9207, #8481)
- Support Series and Python primitive types in inplace_predict and QuantileDMatrix. (#8547, #8542)
- Improved handling of sample_weight. (#8706)
- Updates to xgboost.dask.train. (#9421)
- Use QuantileDMatrix for efficiency. (#8666, #9445)
- setup.py is now replaced with the new configuration file pyproject.toml. Along with this, XGBoost now supports Python 3.11. (#9021, #9112, #9114, #9115) Consult the latest documentation for the updated instructions to build and install XGBoost.
- DataIter now accepts only keyword arguments. (#9431)
- Change DaskXGBClassifier.classes_ to an array. (#8452)
- best_iteration is set only if early stopping is used, to be consistent with documented behavior. (#9403)
- As noted in the device parameter section, the predictor parameter is now removed. (#9129)
- Fix the save_model call for the scikit-learn interface. (#8963)
- Remove ntree_limit in the Python package; this has been deprecated in previous versions. (#8345)
- Use black and isort for code formatting. (#8420, #8748, #8867)
- Set enable_categorical to True in predict. (#8592)
- Improved handling of NA. (#9522)

Following are changes specific to various JVM-based packages.
- Change the training stage from ResultStage to ShuffleMapStage. (#9423)
- Revised support for flink. (#9046)
Breaking changes
- DeviceQuantileDmatrix is merged into QuantileDMatrix. (#8461)

Maintenance (#9253, #9166, #9395, #9389, #9224, #9233, #9351, #9479)
CI bot PRs
We employed the GitHub Dependabot to help us keep the dependencies up to date for the JVM packages. With the bot's help, we have cleared up all the dependencies that were lagging behind. (#8501, #8507)
Here's a list of dependency update PRs, including those made by the bot (#8456, #8560, #8571, #8561, #8562, #8600, #8594, #8524, #8509, #8548, #8549, #8533, #8521, #8534, #8532, #8516, #8503, #8531, #8530, #8518, #8512, #8515, #8517, #8506, #8504, #8502, #8629, #8815, #8813, #8814, #8877, #8876, #8875, #8874, #8873, #9049, #9070, #9073, #9039, #9083, #8917, #8952, #8980, #8973, #8962, #9252, #9208, #9131, #9136, #9219, #9160, #9158, #9163, #9184, #9192, #9265, #9268, #8882, #8837, #8662, #8661, #8390, #9056, #8508, #8925, #8920, #9149, #9230, #9097, #8648, #9203, #8593).
Maintenance work includes refactoring and fixing small issues that don't affect end users. (#9256, #8627, #8756, #8735, #8966, #8864, #8747, #8892, #9057, #8921, #8949, #8941, #8942, #9108, #9125, #9155, #9153, #9176, #9447, #9444, #9436, #9438, #9430, #9200, #9210, #9055, #9014, #9004, #8999, #9154, #9148, #9283, #9246, #8888, #8900, #8871, #8861, #8858, #8791, #8807, #8751, #8703, #8696, #8693, #8677, #8686, #8665, #8660, #8386, #8371, #8410, #8578, #8574, #8483, #8443, #8454, #8733)
You can verify the downloaded packages by running the following command on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
de3a56c3d08a818bc1ea90c0476e28b937e10e0736b3ed4e27e22b43e8072ec1 xgboost-2.0.0.tar.gz
a23d965005e494ad9147cfaed1153e52ae238a8ad03ae9aa9aed83526ce7e150 xgboost_r_gpu_win64_2.0.0.tar.gz
c1a633a02cd7de14701b7814e9d81220716592d1891a33e265e76e54ce0e8e11 xgboost_r_gpu_linux_2.0.0.tar.gz
Experimental binary packages for R with CUDA enabled
Source tarball
Roadmap: https://github.com/dmlc/xgboost/projects/2
Release note: https://github.com/dmlc/xgboost/pull/9484
Release status: https://github.com/dmlc/xgboost/issues/9497
This is a patch release for bug fixes. The CRAN package for the R binding is kept at 1.7.5.
- A fix for QuantileDMatrix. (#9096)

You can verify the downloaded packages by running the following command on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
0a54300dd274b98b7f039acffa006bec4875dace041fd9288422306fe7c379ca xgboost.tar.gz
990fb3c54be7ce53365389f2eb82ce3c1f2e78735b4605ddd2ddb0d47a15d3c3 xgboost_r_gpu_linux_1.7.6.tar.gz
a48fc64bce774bb76eddade6dc6df1d4fc25199a0c17dc66cdfa50cedd3282ad xgboost_r_gpu_win64_1.7.6.tar.gz
Experimental binary packages for R with CUDA enabled
Source tarball Link in GitHub release assets
This is a patch release for bug fixes.
You can verify the downloaded packages by running the following command on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
69a8cf4958e2cea5d492948968d765b856f60d336fbd4367d8176de95898ad7a xgboost.tar.gz
0098f8d1cf5646d75c7d9dafa7e11b8d57441384f86a004b181cd679ef9677d1 xgboost_r_gpu_linux_1.7.5.tar.gz
a23b9744fcff8b53325604935b239c4cfef8a047ca5f4e57ea2b1011382314ee xgboost_r_gpu_win64_1.7.5.tar.gz
Experimental binary packages for R with CUDA enabled
Source tarball Link in GitHub release assets
This is a patch release for bug fixes.
xgboost_r_gpu_win64_1.7.4.tar.gz: Download
This is a patch release for bug fixes.
- get_params no longer returns internally configured values. (#8634)

You can verify the downloaded packages by running the following command on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
0b6aa86b93aec2b3e7ec6f53a696f8bbb23e21a03b369dc5a332c55ca57bc0c4 xgboost.tar.gz
This is a patch release for bug fixes.
Work with newer thrust and libcudacxx (#8432)
Support null value in CUDA array interface namespace. (#8486)
Use getsockname instead of SO_DOMAIN on AIX. (#8437)
[pyspark] Make QDM optional based on a cuDF check (#8471)
[pyspark] sort qid for SparkRanker. (#8497)
[dask] Properly await async method client.wait_for_workers. (#8558)
[R] Fix CRAN test notes. (#8428)
[doc] Fix outdated document [skip ci]. (#8527)
[CI] Fix github action mismatched glibcxx. (#8551)
You can verify the downloaded packages by running this on your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
15be5a96e86c3c539112a2052a5be585ab9831119cd6bc3db7048f7e3d356bac xgboost_r_gpu_linux_1.7.2.tar.gz
0dd38b08f04ab15298ec21c4c43b17c667d313eada09b5a4ac0d35f8d9ba15d7 xgboost_r_gpu_win64_1.7.2.tar.gz