HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training.
Lock-free Inference Cache in HPS: The new lock-free embedding cache can be enabled by setting the embedding cache type to dynamic and "use_hctr_cache_implementation" to false.
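As a hedged sketch of where those two knobs sit, the relevant part of an HPS configuration could be assembled like this (only the two keys named above come from this note; the surrounding structure and the model name are illustrative, not the authoritative HPS schema):

```python
import json

# Hedged sketch: only "embedding_cache_type" and "use_hctr_cache_implementation"
# reflect this release note; everything else here is a placeholder.
hps_config = {
    "models": [
        {
            "model": "demo_model",                    # placeholder model name
            "embedding_cache_type": "dynamic",        # select the dynamic cache
            "use_hctr_cache_implementation": False,   # opt into the new lock-free cache
        }
    ]
}
print(json.dumps(hps_config, indent=2))
```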
Official SOK Release: SOK is no longer an experiment package; it is now officially supported by HugeCTR. Use import sparse_operation_kit as sok instead of from sparse_operation_kit import experiment as sok. sok.DynamicVariable now supports Merlin-HKV as its backend.
Code Cleaning and Deprecation:
The Model::export_predictions function has been deprecated. Use the Model::check_out_tensor function instead. The Norm and legacy Raw DataReaders have been deprecated. Use hugectr.DataReaderType_t.RawAsync or hugectr.DataReaderType_t.Parquet as their alternatives.
Issues Fixed:
Known Issues:
If we set max_eval_batches and batchsize_eval to large values, such as 5000 and 12000 respectively, the training process leads to an illegal memory access error. The issue comes from CUB and is fixed in its latest version. However, that version is only included in CUDA 12.3, which our NGC container does not use yet. Until we update our NGC container to that CUDA version, please rebuild HugeCTR with the newest CUB as a workaround, or avoid such large max_eval_batches and batchsize_eval values.
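A minimal, purely illustrative guard (not a HugeCTR API) that encodes the reported limit from the note above:

```python
# Hedged helper, not part of HugeCTR: reject evaluation settings at or
# beyond the combination reported to trigger the CUB bug.
PROBLEMATIC_EVAL_ELEMENTS = 5000 * 12000  # max_eval_batches * batchsize_eval from the note

def eval_settings_ok(max_eval_batches: int, batchsize_eval: int) -> bool:
    """Return True when the product stays below the reported threshold."""
    return max_eval_batches * batchsize_eval < PROBLEMATIC_EVAL_ELEMENTS

print(eval_settings_ok(5000, 12000))  # the reported failing combination
```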
HugeCTR can lead to a runtime error if client code calls RMM's rmm::mr::set_current_device_resource(), because HugeCTR's Parquet data reader also calls rmm::mr::set_current_device_resource() and the resulting setting becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, a user who knows that rmm::mr::set_current_device_resource() is called outside HugeCTR can set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. But be cautious: this could affect the performance of Parquet reading.
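One way to apply the HCTR_RMM_SETTABLE workaround from a launcher script (a sketch; the variable must be set before HugeCTR initializes):

```python
import os

# Set HCTR_RMM_SETTABLE to "0" so HugeCTR does not install a custom RMM
# device resource; this must happen before HugeCTR is imported/initialized.
os.environ["HCTR_RMM_SETTABLE"] = "0"

# import hugectr  # the import must come after the variable is set
```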
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
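For example, a container launch could pass the two flags like this (the image tag is a placeholder, not taken from this document):

```shell
# Enlarge shared memory and unlock pinned memory for NCCL inside the container.
NCCL_DOCKER_FLAGS="--shm-size=1g --ulimit memlock=-1"
echo docker run --gpus=all $NCCL_DOCKER_FLAGS nvcr.io/nvidia/merlin/merlin-hugectr:latest
```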
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
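The file-list constraint above can be captured in a small helper (illustrative only, not a HugeCTR API):

```python
# Hedged helper, not part of HugeCTR: check that every data reader worker
# can be mapped to a distinct data file.
def file_list_ok(num_files: int, num_data_reader_workers: int) -> bool:
    """True when the file list is large enough for the worker count."""
    return num_files >= num_data_reader_workers

print(file_list_ok(16, 8))  # enough files for 8 workers
print(file_list_ok(4, 8))   # too few files: loading would not progress
```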
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
Code Cleaning and Deprecation
General Updates:
The CUDA runtime can now be linked statically by specifying -DUSE_CUDART_STATIC=ON when configuring the code with CMake; the dynamic CUDA runtime is still used by default.
Issues Fixed:
Fixed an issue involving num_slots.
Known Issues:
If we set max_eval_batches and batchsize_eval to large values, such as 5000 and 12000 respectively, the training process leads to an illegal memory access error. The issue comes from CUB and is fixed in its latest version. However, that version is only included in CUDA 12.3, which our NGC container does not use yet. Until we update our NGC container to that CUDA version, please rebuild HugeCTR with the newest CUB as a workaround, or avoid such large max_eval_batches and batchsize_eval values.
HugeCTR can lead to a runtime error if client code calls RMM's rmm::mr::set_current_device_resource(), because HugeCTR's Parquet data reader also calls rmm::mr::set_current_device_resource() and the resulting setting becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, a user who knows that rmm::mr::set_current_device_resource() is called outside HugeCTR can set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. But be cautious: this could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
Hierarchical Parameter Server:
HugeCTR Training & SOK:
concat is now supported as the combiner. For more information, please refer to dense_embedding.py. DeviceSegmentedSort has been replaced with DeviceSegmentedRadixSort.
General Updates:
The CUDA runtime can now be linked statically by specifying -DUSE_CUDART_STATIC=ON when configuring the code with CMake.
Known Issues:
HugeCTR can lead to a runtime error if client code calls RMM's rmm::mr::set_current_device_resource(), because HugeCTR's Parquet data reader also calls rmm::mr::set_current_device_resource() and the resulting setting becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, a user who knows that rmm::mr::set_current_device_resource() is called outside HugeCTR can set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. But be cautious: this could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
In this release, we have fixed issues and enhanced the code.
3G Embedding Updates:
Refactored the DataDistributor related code.
The SOK load() and dump() APIs are now usable in TensorFlow 2. To use the APIs, specify sok_vars in addition to path. sok_vars is a list of sok.variable and/or sok.dynamic_variable. If you want to store optimizer states such as m and v of Adam, the optimizer must be specified as well. The optimizer must be a tf.keras.optimizers.Optimizer or sok.OptimizerWrapper, while its underlying type must be SGD, Adamax, Adadelta, Adagrad, or Ftrl.

import sparse_operation_kit as sok
sok.load(path, sok_vars, optimizer=None)
sok.dump(path, sok_vars, optimizer=None)
These APIs are independent from the number of GPUs in use and the sharding strategy. For instance, a distributed embedding table trained and dumped with 8 GPUs can be loaded to train on a 4-GPU machine.
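The GPU-count independence can be illustrated with a toy resharding sketch in plain Python (this is not SOK's internal format, only the idea that a dump keyed by embedding key can be reloaded under any shard count):

```python
# Toy sketch of sharding-independent dump/load: keys are hashed to shards,
# so a table dumped from 8 shards can be reloaded onto 4.
def shard(table: dict, n_gpus: int) -> list:
    shards = [{} for _ in range(n_gpus)]
    for key, vec in table.items():
        shards[key % n_gpus][key] = vec
    return shards

def reload(shards: list, n_gpus: int) -> list:
    merged = {k: v for s in shards for k, v in s.items()}
    return shard(merged, n_gpus)

table = {k: [float(k)] * 4 for k in range(32)}
reloaded = reload(shard(table, 8), 4)        # 8-GPU dump -> 4-GPU load
assert sum(len(s) for s in reloaded) == 32   # no rows lost in resharding
```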
Issues Fixed:
cudaDeviceSynchronize() is removed when building HugeCTR in debug mode, so you can enable CUDA Graph even in debug mode.
Fixed the EmbeddingTableCollection utest to run correctly with multiple GPUs.
Known Issues:
HugeCTR can lead to a runtime error if client code calls RMM's rmm::mr::set_current_device_resource(), because HugeCTR's Parquet data reader also calls rmm::mr::set_current_device_resource() and the resulting setting becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, a user who knows that rmm::mr::set_current_device_resource() is called outside HugeCTR can set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. But be cautious: this could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also this NCCL known issue and this GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
Hierarchical Parameter Server Enhancements:
HPS Table Fusion: From this release, you can fuse tables of the same embedding vector size in HPS. We support this feature in the HPS plugin for TensorFlow and the Triton backend for HPS. To turn on table fusion, set fuse_embedding_table to true in the HPS JSON file. This feature requires that the key values in different tables do not overlap and that the embedding lookup layers are not dependent on each other in the model graph. For more information, refer to HPS configuration and the HPS table fusion demo notebook. This feature can significantly reduce the embedding lookup latency when there are multiple tables and the GPU embedding cache is employed. About 3x speedup is achieved on V100 for the fused case demonstrated in the notebook compared to the unfused one.
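A hedged sketch of the fuse_embedding_table knob in an HPS JSON file (only fuse_embedding_table comes from this note; the placement, model name, and surrounding keys are illustrative, not the authoritative schema):

```python
import json

# Hedged sketch of an HPS configuration enabling table fusion; apart from
# "fuse_embedding_table", the structure here is illustrative only.
hps_fusion_config = {
    "fuse_embedding_table": True,  # fuse tables with the same embedding vector size
    "models": [
        {"model": "demo_model", "sparse_files": ["emb_table_0", "emb_table_1"]}
    ],
}
print(json.dumps(hps_fusion_config, indent=2))
```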
UVM Support: We have upgraded the static embedding solution. For embedding tables whose size exceeds the device memory, we will save high-frequency embeddings in the HBM as an embedding cache and offload the remaining embeddings to the UVM. Compared with the dynamic cache solution that offloads the remaining embeddings to the Volatile DB, the UVM solution has higher CPU lookup throughput. We will support online updating of the UVM solution in a future release. Users can switch between different embedding cache solutions through the embedding_cache_type configuration parameter.
Triton Perf Analyzer's Request Generator: We have added an inference request generator to produce the JSON request format required by the Triton Perf Analyzer. By using this request generator together with the model generator, you can use the Triton Perf Analyzer to profile HPS performance and do stress testing. For API documentation and demo usage, please refer to the README.
General Updates:
We have added hugectr.DenseLayerComputeConfig to hugectr.DenseLayer for configuring the computing behavior. The knob for enabling asynchronous weight gradient computations has been moved from hugectr.CreateSolver to hugectr.DenseLayerComputeConfig.async_wgrad. The knob for controlling the fusion mode of weight gradients and bias gradients has been moved from hugectr.DenseLayerSwitchs to hugectr.DenseLayerComputeConfig.fuse_wb.
HugeCTR can now be built with compute capability 9.0 (-DSM=90), so that it can run on Hopper architectures. Note that our NGC container does not support this compute capability yet. Users who are unfamiliar with how to build HugeCTR can refer to the HugeCTR Contribution Guide.
When CommunicationType.IB_NVLink_Hier is used in HybridEmbeddingParams, RoCE is supported. We have also added two environment variables, HUGECTR_ROCE_GID and HUGECTR_ROCE_TC, so that a user can control the RoCE NIC's GID and traffic class.
https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#hybridembeddingparam-class
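For example, the two RoCE environment variables could be set from a launcher script before training starts (the values below are placeholders; valid GID indices and traffic classes depend on the NIC setup):

```python
import os

# Hedged sketch: steer HugeCTR's RoCE traffic before launching training.
os.environ["HUGECTR_ROCE_GID"] = "3"   # placeholder GID index
os.environ["HUGECTR_ROCE_TC"] = "106"  # placeholder traffic class
```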
Documentation Updates:
We have added the knob is_exclusive_keys to enable potential acceleration if a user has already preprocessed the input of the embedding collection to make the resulting tables exclusive with one another. We have also added the knob comm_strategy in the embedding collection for users to configure an optimized communication strategy in multi-node training.
Issues Fixed:
Fixed an issue to prevent SparseParam from being misused.
Known Issues:
HugeCTR can lead to a runtime error if client code calls RMM's rmm::mr::set_current_device_resource(), because HugeCTR's Parquet data reader also calls rmm::mr::set_current_device_resource() and the resulting setting becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, a user who knows that rmm::mr::set_current_device_resource() is called outside HugeCTR can set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. But be cautious: this could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
HPS Enhancements:
Google Cloud Storage (GCS) Support:
Issues Fixed:
Fixed an issue in the wdl_prediction notebook.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and are reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.