HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training.
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.3`.
Afterward, the library will use calendar versioning only, such as `v23.01`.
Support for BERT and Variants: This release includes support for BERT in HugeCTR. The documentation includes updates to the MultiHeadAttention layer and adds documentation for the SequenceMask layer. For more information, refer to the samples/bst directory of the repository in GitHub.
HPS Plugin for TensorFlow integration with TensorFlow-TensorRT (TF-TRT): This release includes plugin support for integration with TensorFlow-TensorRT. For sample code, refer to the Deploy SavedModel using HPS with Triton TensorFlow Backend notebook.
Deep & Cross Network Layer version 2 Support: This release includes support for Deep & Cross Network version 2. For conceptual information, refer to https://arxiv.org/abs/2008.13535. The documentation for the MultiCross Layer is updated.
Enhancements to Hierarchical Parameter Server: This release adds support for TLS/SSL-secured connections, which you configure with the enable_tls, tls_ca_certificate, tls_client_certificate, tls_client_key, and tls_server_name_identification parameters.
Support for New Optimizers: This release adds support for new optimizers for embedding training. For sample usage, refer to the test_embedding_table_optimizer.cpp file in the test/utest/embedding_collection/ directory of the repository on GitHub.
Data Reading from S3 for Offline Inference: In addition to reading during training, HugeCTR now supports reading data from remote file systems such as HDFS and S3 during offline inference by using the DataSourceParams API. The HugeCTR Training and Inference with Remote File System Example is updated to demonstrate the new functionality.
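A minimal sketch of what such a configuration might look like with the Python interface. Only the DataSourceParams API is named in the note above; the import path, enum value, and constructor fields below are assumptions for illustration, so refer to the updated example notebook for authoritative usage:

```python
import hugectr
from hugectr.data import DataSourceParams  # assumed import path

# Illustrative: describe the remote file system once, then reuse the object
# for offline inference so that the reader pulls data from S3.
data_source_params = DataSourceParams(
    source=hugectr.DataSourceType_t.S3,  # assumed enum value for Amazon S3
    server="us-east-1",                  # S3 region (or NameNode host for HDFS)
    port=9000,                           # port number (used by HDFS)
)
```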
Documentation Enhancements:
Issues Fixed:
Known Issues:
HugeCTR can lead to a runtime error if client code calls the RMM rmm::mr::set_current_device_resource() method. The error occurs because the Parquet data reader in HugeCTR also calls rmm::mr::set_current_device_resource(), and the resource becomes visible to other libraries in the same process. Refer to GitHub issue #356 for more information. As a workaround, if you know that rmm::mr::set_current_device_resource() is called by client code other than HugeCTR, you can set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this setting can reduce the performance of Parquet reading.
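A minimal sketch of applying the workaround from Python, assuming the variable must be visible when HugeCTR initializes:

```python
import os

# Prevent HugeCTR from installing a custom RMM device resource. Use this only
# when client code manages RMM itself; Parquet reading performance may drop
# (see the caution above).
os.environ["HCTR_RMM_SETTABLE"] = "0"

import hugectr  # imported after setting the variable so the setting takes effect
```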
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue #243.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.2`.
Afterward, the library will use calendar versioning only, such as `v23.01`.
Change to HPS with Redis or Kafka: This release includes a change to Hierarchical Parameter Server that affects deployments that use RedisClusterBackend or model parameter streaming with Kafka. A third-party library that was used for the HPS partition selection algorithm is replaced to improve performance. The new algorithm can produce different partition assignments for volatile databases. As a result, volatile database backends that retain data across application restarts, such as the RedisClusterBackend, must be reinitialized. Model streaming with Kafka is equally affected. To avoid issues with updates, reset all respective queue offsets to the end_offset before you reinitialize the RedisClusterBackend.
Enhancements to the Sparse Operation Kit in DeepRec:
This release includes updates to the Sparse Operation Kit to improve the performance of the embedding variable lookup operation in DeepRec.
The API for the lookup_sparse() function is changed to remove the hotness argument. The lookup_sparse() function is enhanced to calculate the number of non-zero elements dynamically.
For more information, refer to the sparse_operation_kit directory of the DeepRec repository in GitHub.
Enhancements to 3G Embedding: This release includes the following enhancements to 3G embedding:
The EmbeddingPlanner class is replaced with the EmbeddingCollectionConfig class. For examples of the API, see the tests in the test/embedding_collection_test directory of the repository in GitHub.
The Model.embedding_dump(path: str, table_names: list[str]) and Model.embedding_load(path: str, table_names: list[str]) methods are new (a usage sketch appears below). The path argument is a directory in the file system that you can dump weights to or load weights from. The table_names argument is a list of embedding table names as strings.
New Volatile Database Type for HPS: This release adds a db_type value of multi_process_hash_map to the Hierarchical Parameter Server. This database type supports sharing embeddings across process boundaries by using shared memory and the /dev/shm device file. Multiple processes running HPS can read and write to the same hash map. For an example, refer to the Hierarchical Parameter Server Demo notebook.
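The following is a minimal sketch of the embedding dump/load round trip based on the signatures above. It assumes a trained hugectr.Model instance named model; the directory and table names are placeholders:

```python
# Dump the weights of two embedding tables to a directory, then load them back.
# "sparse_embedding1" and "sparse_embedding2" are placeholder table names; use
# the names that are defined in your own model.
dump_dir = "/tmp/embedding_weights"
model.embedding_dump(dump_dir, ["sparse_embedding1", "sparse_embedding2"])
model.embedding_load(dump_dir, ["sparse_embedding1", "sparse_embedding2"])
```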
Enhancements to the HPS Redis Backend: In this release, the Hierarchical Parameter Server can open multiple connections in parallel to each Redis node. This enhancement enables HPS to take advantage of overlapped processing optimizations in the I/O module of Redis servers. In addition, HPS can now take advantage of Redis hash tags to co-locate embedding values and metadata. This enhancement can reduce the number of accesses to Redis nodes and the number of per-node round trip communications that are needed to complete transactions. As a result, the enhancement increases the insertion performance.
MLPLayer is New: This release adds an MLP layer with the hugectr.Layer_t.MLP class. This layer is very flexible and makes it easier to use a group of fused fully-connected layers and to enable the related optimizations. For each fused fully-connected layer in MLPLayer, the output dimension, bias, and activation function are all adjustable. MLPLayer supports the FP32, FP16, and TF32 data types. For an example, refer to the dgx_a100_mlp.py file in the samples/dlrm directory of the GitHub repository to learn how to use the layer.
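A sketch of what adding the layer might look like through the Python interface. Only hugectr.Layer_t.MLP is confirmed by the note above; the remaining argument names (num_outputs, act_type) and the tensor names are assumptions for illustration, so consult dgx_a100_mlp.py for authoritative usage:

```python
import hugectr

# Illustrative only: a group of three fused fully-connected layers with
# adjustable output dimensions and a shared activation function.
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MLP,      # the new MLP layer type
        bottom_names=["concat1"],            # input tensor name (placeholder)
        top_names=["mlp1"],                  # output tensor name (placeholder)
        num_outputs=[1024, 512, 256],        # one output dim per fused FC layer
        act_type=hugectr.Activation_t.Relu,  # activation for the fused layers
    )
)
```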
Sparse Operation Kit installable from PyPI: Version 1.1.4 of the Sparse Operation Kit is installable from PyPI in the merlin-sok package (for example, pip install merlin-sok).
Multi-task Model Support added to the ONNX Model Converter: This release adds support for multi-task models to the ONNX converter. This release also includes an enhancement to the preprocess_census.py script in the samples/mmoe directory of the GitHub repository.
Issues Fixed:
Using MirroredStrategy and running the Hierarchical Parameter Server Demo notebook triggered an issue with ReplicaContext and caused a crash. The issue is fixed and resolves GitHub issue #362.
A utility in the samples/din/utils directory of the GitHub repository is updated to use the latest NVTabular API. This update resolves GitHub issue #364.
An issue with a sample in the samples/dlrm directory of the GitHub repository is fixed.
An error in the lookup_fromdlpack() method is fixed. The error was related to calculating the number of keys and vectors from the corresponding DLPack tensors.
Specifying an io_alignment value that is smaller than the block device sector size is fixed. Now, if the specified io_alignment value is smaller than the block device sector size, io_alignment is automatically set to the block device sector size.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
Simplified Interface for 3G Embedding Table Placement Strategy:
3G embedding now provides an easier way for you to configure an embedding table placement strategy.
Instead of using JSON, you can configure the embedding table placement strategy by using function arguments.
You only need to provide the shard_matrix, table_group_strategy, and table_placement_strategy arguments. With these arguments, 3G embedding can group different tables together and place them according to the shard_matrix argument. For an example, refer to the dlrm_train.py file in the test/embedding_collection_test directory of the repository on GitHub.
For comparison, refer to the same file from the v4.0 branch of the repository.
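A hypothetical sketch of how the three arguments fit together. The receiving function and the concrete values here are illustrative only, not a confirmed API; dlrm_train.py in the directory named above shows the actual usage. The sketch assumes two GPUs and three tables, where shard_matrix[gpu][table] marks whether a table is placed on that GPU:

```python
# Hypothetical configuration call for illustration only.
shard_matrix = [
    [1, 1, 0],  # GPU 0 holds table 0 and table 1
    [0, 1, 1],  # GPU 1 holds table 1 and table 2 (table 1 spans both GPUs)
]
embedding_collection_config.shard(
    shard_matrix=shard_matrix,
    table_group_strategy="...",      # how tables are grouped (placeholder value)
    table_placement_strategy="...",  # how grouped tables are placed (placeholder)
)
```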
New MMoE and Shared-Bottom Samples:
This release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation.
For more information, refer to the README.md, mmoe_parquet.py, and other files in the samples/mmoe directory of the repository on GitHub.
This release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.
Support for AWS S3 File System:
The Parquet DataReader can now read datasets from the Amazon Web Services S3 file system.
You can also load and dump models from and to S3 during training.
The documentation for the DataSourceParams class is updated. To view sample code, refer to the HugeCTR Training with Remote File System Example.
Simplification for File System Usage: You no longer need to pass DataSourceParams for model loading and dumping. The FileSystem class automatically infers the correct file system type (local, HDFS, or S3) based on the path URI that you specified when you built the model. For example, the path hdfs://localhost:9000/ is inferred as an HDFS file system and the path https://mybucket.s3.us-east-1.amazonaws.com/ is inferred as an S3 file system.
Support for Loading Models from Remote File Systems to HPS: This release enables you to load models from HDFS and S3 remote file systems to HPS during inference. To use the new feature, specify an HDFS or S3 path URI in InferenceParams.
Support for Exporting Intermediate Tensor Values into a Numpy Array: This release adds the check_out_tensor function to Model and InferenceModel. You can use this function to check out the intermediate tensor values using the Python interface. This function is especially helpful for debugging. For more information, refer to Model.check_out_tensor and InferenceModel.check_out_tensor.
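A brief sketch of the debugging flow this enables, assuming a trained hugectr.Model instance named model. The tensor name and the tensor-type argument are assumptions for illustration; refer to the Model.check_out_tensor documentation for the exact signature:

```python
import hugectr

# Illustrative: export an intermediate tensor as a NumPy array after an
# iteration and inspect it. "fc1" is a placeholder tensor name from the model
# graph; hugectr.Tensor_t.Train is an assumed way to select the training-pass
# tensor rather than the evaluation-pass tensor.
tensor_values = model.check_out_tensor("fc1", hugectr.Tensor_t.Train)
print(tensor_values.shape, tensor_values.dtype)  # behaves like a NumPy array
```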
On-Device Input Keys for HPS Lookup:
The HPS lookup supports input embedding keys that are on GPU memory during inference.
This enhancement removes a host-to-device copy by using the DLPack lookup_fromdlpack() interface. By using the interface, the input DLPack capsule of the embedding key can be a GPU tensor.
Documentation Enhancements:
Issues Fixed:
The InteractionLayer class is fixed so that it works correctly with num_feas > 30.
An issue with the async data reader is fixed, and the AsyncParam class is changed to implement the fix. The io_block_size argument is replaced by the max_nr_request argument, and the actual I/O block size that the async reader uses is computed accordingly. For more information, refer to the AsyncParam class documentation.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
3G Embedding Stabilization: Since the introduction of the next generation of HugeCTR embedding in v3.7, several updates and enhancements were made, including code refactoring to improve usability. Among the enhancements for this release are updates to the GlobalEmbeddingData and LocalEmbeddingData classes.
Embedding Cache Initialization with Configurable Ratio: In previous releases, the default value for the cache_refresh_percentage_per_iteration parameter of InferenceParams was 0.1. In this release, the default value is 0.0 and the parameter serves an additional purpose. If you set the parameter to a value greater than 0.0 and also set use_gpu_embedding_cache to True for a model, when the Hierarchical Parameter Server (HPS) starts, HPS initializes the embedding cache for the model on the GPU by loading a subset of the embedding vectors from the sparse files for the model. When embedding cache initialization is used, HPS creates log records at the INFO level when it starts. The log records are similar to EC initialization for model: "<model-name>", num_tables: <int> and EC initialization on device: <int>. This enhancement reduces the duration of the warm-up phase.
Lazy Initialization of HPS Plugin for TensorFlow: In this release, when you deploy a SavedModel of TensorFlow with Triton Inference Server, HPS is implicitly initialized when the loaded model is executed for the first time. In previous releases, you needed to run hps.Init(ps_config_file, global_batch_size) explicitly. For more information, see the API documentation for hierarchical_parameter_server.Init.
Enhancements to the HDFS Backend: This release includes enhancements to the HDFS backend. For more information, refer to hadoop_filesystem.hpp in the include/io directory of the repository on GitHub.
Dependency Clarification for Protobuf and Hadoop: Hadoop and Protobuf are true third_party modules now. Developers can now avoid unnecessary and frequent cloning and deletion.
Finer Granularity Control for Overlap Behavior: This release deprecates the old overlapped_pipeline knob and introduces four new knobs, train_intra_iteration_overlap, train_inter_iteration_overlap, eval_intra_iteration_overlap, and eval_inter_iteration_overlap, to help users better control the overlap behavior. For more information, see the API documentation for Solver.CreateSolver.
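A sketch of how the new knobs might be passed to the solver. The four knob names come from the note above; the surrounding CreateSolver arguments are the usual ones, and all values are placeholders:

```python
import hugectr

# Illustrative: enable overlap within and across training iterations, and
# within evaluation iterations, while keeping inter-iteration eval overlap off.
solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=65536,
    batchsize=65536,
    vvgpu=[[0, 1, 2, 3]],
    lr=0.001,
    train_intra_iteration_overlap=True,
    train_inter_iteration_overlap=True,
    eval_intra_iteration_overlap=True,
    eval_inter_iteration_overlap=False,
)
```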
Documentation Improvements:
The documentation is updated, including content for triton_tf_deploy and dump_to_tf.
Issues Fixed:
If the metadata.json file does not exist, HugeCTR no longer crashes. The new behavior is to skip the missing file and display a warning message. This change relates to GitHub issue 321.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Updates to 3G Embedding:
Enhancements to the HPS Plugin for TensorFlow: This release includes improvements to the interoperability of SOK and HPS. The plugin now supports the sparse lookup layer. The documentation for the HPS plugin is also enhanced.
Enhancements to the HPS Backend for Triton Inference Server: This release adds support for integrating the HPS Backend and the TensorFlow Backend through the ensemble mode with Triton Inference Server. The enhancement enables deploying a TensorFlow model with large embedding tables with Triton by leveraging HPS. For more information, refer to the sample programs in the hps-triton-ensemble directory of the HugeCTR Backend repository in GitHub.
New Multi-Node Tutorial: The multi-node training tutorial is new. The additions show how to use HugeCTR to train a model with multiple nodes and are based on our most recent Docker container. The tutorial should be useful to users who do not have a cluster with a job scheduler installed, such as Slurm Workload Manager. The update addresses an issue that was first reported in GitHub issue 305.
Support Offline Inference for MMoE: This release includes MMoE offline inference where both per-class AUC and average AUC are provided. When the number of class AUCs is greater than one, the output includes a line like the following example:
[HCTR][08:52:59.254][INFO][RK0][main]: Evaluation, AUC: {0.482141, 0.440781}, macro-averaging AUC: 0.46146124601364136
Enhancements to the API for the HPS Database Backend: This release includes several enhancements to the API for the DatabaseBackend class. For more information, see database_backend.hpp and the header files for other database backends in the HugeCTR/include/hps directory of the repository. The enhancements are as follows:
The dump and load_dump methods are new. These methods support saving and loading embedding tables from disk. The methods support a custom binary format and the RocksDB SST table file format. These methods enable you to import and export embedding table data between your custom tools and HugeCTR.
The find_tables method is new. The method enables you to discover all table data that is currently stored for a model in a DatabaseBackend instance. A new overloaded method for evict is added that can process the results from find_tables to quickly and simply drop all the stored information that is related to a model.
Documentation Enhancements:
The max_all_to_all_bandwidth parameter of the HybridEmbeddingParam class is clarified to indicate that the bandwidth unit is per-GPU. Previously, the unit was not specified.
Issues Fixed:
Hybrid embedding with IB_NVLINK as the communication_type of the HybridEmbeddingParam class is fixed in this release.
For workloads that stay within workspace_size_per_gpu_in_mb, we have a workaround to disable the overflow-check routine by setting the environment variable HUGECTR_DISABLE_OVERFLOW_CHECK=1. The workaround restores the training performance.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Sample Notebook to Demonstrate 3G Embedding: This release includes a sample notebook that introduces the Python API of the embedding collection and the key concepts for using 3G embedding. You can view HugeCTR Embedding Collection from the documentation or access the embedding_collection.ipynb file from the notebooks directory of the repository.
DLPack Python API for Hierarchical Parameter Server Lookup: This release introduces support for embedding lookup from the Hierarchical Parameter Server (HPS) using the DLPack Python API. The new method is lookup_fromdlpack(). For sample usage, see the Lookup the Embedding Vector from DLPack heading in the "Hierarchical Parameter Server Demo" notebook.
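A sketch of the idea under stated assumptions: the object that owns lookup_fromdlpack() and its argument order are assumptions here (the notebook heading named above shows the authoritative usage), and PyTorch is used only as a convenient DLPack producer:

```python
import torch
from torch.utils.dlpack import to_dlpack

# Illustrative: wrap a key tensor and a preallocated output tensor as DLPack
# capsules and ask HPS to fill the output with the embedding vectors.
# `hps_session`, the model name, and the table index are placeholders.
keys = torch.tensor([1, 3, 7], dtype=torch.int64)  # embedding keys
out = torch.empty((3, 16), dtype=torch.float32)    # 16 = embedding vector size
hps_session.lookup_fromdlpack(to_dlpack(keys), to_dlpack(out), "demo_model", 0)
print(out)  # the looked-up embedding vectors
```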
Read Parquet Datasets from HDFS with the Python API: This release enhances the DataReaderParams class with a data_source_params argument. You can use the argument to specify the data source configuration, such as the host name of the Hadoop NameNode and the NameNode port number, to read from HDFS.
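A sketch of how the argument might be wired up. DataReaderParams and data_source_params are named in the note; the DataSourceParams constructor fields and the remaining reader arguments are assumptions or placeholders:

```python
import hugectr
from hugectr.data import DataSourceParams  # assumed import path

# Illustrative: describe the HDFS NameNode, then hand the configuration to
# the Parquet data reader through the new data_source_params argument.
data_source_params = DataSourceParams(
    source=hugectr.DataSourceType_t.HDFS,  # assumed enum value
    server="localhost",                    # Hadoop NameNode host name
    port=9000,                             # NameNode port number
)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["hdfs://localhost:9000/train/_file_list.txt"],  # placeholder paths
    eval_source="hdfs://localhost:9000/val/_file_list.txt",
    check_type=hugectr.Check_t.Non,
    slot_size_array=[10000] * 26,          # placeholder vocabulary sizes
    data_source_params=data_source_params,
)
```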
Logging Performance Improvements: This release includes a performance enhancement that reduces the performance impact of logging.
Enhancements to Layer Classes:
The FullyConnected layer now supports 3D inputs. The MatrixMultiply layer now supports 4D inputs.
Documentation Enhancements:
Issues Fixed:
The data generator for the Parquet file type no longer produces inconsistent file names between the _metadata.json file and the actual dataset files. Previously, running the data generator to create synthetic data resulted in a core dump. This issue was first reported in GitHub issue 321.
Known Issues:
Hybrid embedding with IB_NVLINK as the communication_type of the HybridEmbeddingParam class does not work currently. We are working on fixing it. The other communication types have no issues.
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
3G Embedding Developer Preview: The 3.7 version introduces the next generation of embedding as a developer preview feature. We call it 3G embedding because it is the third generation of the HugeCTR embedding interface and implementation, following the original embedding and the unified embedding in v3.1, which was the second. Compared with the previous embedding, there are three main changes in the embedding collection.
Among the changes, the embedding collection supports the concat combiner and supports different slot lookups on the same embedding table. Refer to the dlrm_train.py file in the embedding_collection_test directory of the repository for a more detailed usage example.
HPS Performance Improvements: Performance-sensitive log messages now use the TRACE level rather than INFO or DEBUG to reduce logging verbosity.
Offline Inference Usability Enhancements: The default thread count is now derived from std::thread::hardware_concurrency(). For more information, refer to Hierarchical Parameter Server Configuration.
DataGenerator Performance Improvements: You can specify the num_threads parameter to parallelize Norm dataset generation.
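A sketch of parallel Norm dataset generation. The num_threads parameter is named in the note above; the import path and the remaining arguments are assumptions or placeholders:

```python
import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator  # assumed import path

# Illustrative: generate a synthetic Norm-format dataset with 16 worker threads.
params = DataGeneratorParams(
    format=hugectr.DataGenerator_t.Norm,
    label_dim=1,
    dense_dim=13,
    num_slot=26,
    num_files=16,
    source="./train/file_list.txt",  # placeholder output file lists
    eval_source="./val/file_list.txt",
    slot_size_array=[10000] * 26,    # placeholder vocabulary sizes
    num_threads=16,                  # the new parameter: parallelizes generation
)
DataGenerator(params).generate()
```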
Evaluation Metric Improvements:
Embedding Training Cache Parquet Demo: Created a keyset extractor script to generate keyset files for Parquet datasets. Provided users with an end-to-end demo of how to train with a Parquet dataset using the embedding cache mode. See the Embedding Training Cache Example notebook.
Documentation Enhancements: The documentation details for HugeCTR Hierarchical Parameter Server Database Backend are updated for consistency and clarity.
Issues Fixed:
When slot_size_array is specified, workspace_size_per_gpu_in_mb is no longer required.
You can use the CMAKE_INSTALL_PREFIX CMake variable to identify the installation directory for HugeCTR.
Fixed a hang that occurred when calling sok.Init() with a large number of GPUs. See GitHub issues 261 and 302 for more details.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
KafkaProducers startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable. Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator. For more information, see the Getting Started section of the repository README file.
The data generator for the Parquet file type produces inconsistent file names between the _metadata.json file and the actual dataset files. This mismatch results in a core dump when you use the synthetic dataset.