Alibaba DeepRec Versions Save

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.

r1.15.5-deeprec2402

2 months ago

Major Features and Improvements

Embedding

  • Refine KVInterface::GetShardedSnapshot API.
  • Undefine EV GPU interface in CPU compile.
  • Make Embedding backward compatible with previous saved_model.
  • Log error when EV has been initialized in EV Import OP.

Op Implement

  • Implement of SliceSend/SliceRecv Op.
  • Implement FileSliceSend/FileSliceRecvOp.

SDK

  • Add build SDK package.

BugFix

  • Fix shared embedding frequency counting problem.
  • Fix Graph contains EmbeddingVariable compiling issue.
  • Fix a scheduling issue.
  • Fix tensor shape meta-data bug for DataFrame Value.

ModelZoo

  • Set Saver's parameter sharded=True in distributed training.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2402-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2402-gpu-py38-cu116-ubuntu20.04

r1.15.5-deeprec2310

5 months ago

Major Features and Improvements

Embedding

  • Refactor the data structure of EmbeddingVariable.
  • Add interface of EmbeddingVar for Elastic Training.
  • Add GetSnapshot and Create API for EmbeddingVariable.
  • Remove the dependency on private header file in EmbeddingVariable.

Runtime Optimization

  • Canonicalize SaveV2 Op device spec in distributed training.
  • Update log level in direct_session.

Distributed

  • Add elastic-grpc server.

BugFix

  • Fix missing return value of RestoreSSD of DramSSDHashStorage.
  • Fix incorrect frequency in shared-embedding.
  • Fix set initialized flag too early in restore subgraph.
  • Fix wgrad bug in Sparse Operation Kit.
  • Fix hang bug for async embedding lookup.
  • Fix ps address list sort by index.
  • Fix SharedEmbeddingColumn with PartitionedEmbedingVariable shape validation error.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2310-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2310-gpu-py38-cu116-ubuntu20.04

r1.15-deeprec2306

9 months ago

Major Features and Improvements

Embedding

  • Support StaticGPUHashMap to optimize EmbeddingVariable in inference.
  • Update logic of GroupEmbedding in feature_column API.
  • Refine APIs for foward-backward optimization.
  • Move insertions of new features into the backward process when lti-tier storage.
  • Move insertion of new features into the backward ops.
  • Modify calculation logic of embedding lookup sparse combiner.
  • Add memory and performance tests of EmbeddingVariable.

Graph & Grappler Optimization

  • Support IteratorGetNext for SmartStage as a starting node for searching.
  • Reimplement PrefetchRunner in C++.

Runtime Optimization

  • Dispatch expensive ops via multiple threads in theadpool.
  • Enable multi-stream in session_group by default.
  • Support for loading saved_model with device information when use p and multi_stream.
  • Make ARENA_ARRAY_SIZE to be configurable.
  • Optimize EV allocator performance.
  • Integrate HybridBackend in collective training mode.

Ops & Hardware Acceleration

  • Disable MatMul fused with LeakyRule when MKL is disabled.

Serving

  • Clear virtual_device configurations before load new checkpoint.

Environment & Build

  • Update docker images in user documents.
  • Update DEFAULT_CUDA_VERSION and DEFAULT_CUDNN_VERSION in configure.py.
  • Move thirdparties from WORKSPACE to workspace.bzl.
  • Update urls corresponding to colm, ragel, aliyun-oss-sdk and uuid.
  • Update default TF_CUDA_COMPUTE_CAPABILITIES to 7.0,7.5,8.0,8.6.
  • Update SparseOperationKit to v23.5.01 and docker file.

BugFix

  • Fix issue of missing params while constructing the ngScope.
  • Fix memory leak to avoid OOM.
  • Fix shape validation in API shared_embedding_columns.
  • Fix the device placement bug of stage_subgraph_on_cpu in distributed.
  • Fix hung issue when using both SOK and SmartStaged simultaneously.
  • Fix bug: init global_step before saving variables
  • Fix bug: reserve input nodes, clear saver devices on demand.
  • Fix memory leak when a graph node is invalid.

ModelZoo

  • Add examples and docs to demonstrate Collective Training.
  • Update documents and config files for modelzoo benchmark.
  • Update modelzoo README.

Tool & Documents

  • Update cases of configure TF_CUDA_COMPUTE_CAPABILITIES for H100.
  • Update COMMITTERS.md.
  • Update device placement documents.
  • Update document for SmartStage.
  • Update session_group documents.
  • Update the download link of the library that Processor depends on.
  • Update sok to 1.20.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2306-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2306-gpu-py38-cu116-ubuntu20.04

r1.15.5-deeprec2304

11 months ago

Major Features and Improvements

Embedding

  • Suport tf.int32 dtype using feature_column API tf.feature_column.categorical_column_with_embedding.
  • Make the rules of export frequencies and versions the same as the rule of export keys.
  • Optimize cuda kernel implementation in GroupEmbedding.
  • Support to read embedding files with mmap and madvise, and direct IO.
  • Add double check in find_wait_free of lockless dense hashmap.
  • Change Embedding init value of version in EV from 0 to -1.
  • Interface 'GetSnapshot()' backward compatibility.
  • Implement CPU GroupEmbedding lookup sparse Op.
  • Make GroupEmbedding compatible with sequence feature_column interface.
  • Fix sp_weights indices calculation error in GroupEmbedding.
  • Add group_strategy to control parallelism of group_embedding.

Graph & Grappler Optimization

  • Support SparseTensor as placeholder in Sample-awared Graph Compression.
  • Add Dice fusion grappler and ops.
  • Enable MKL Matmul + Bias + LeakyRelu fusion.

Runtime Optimization

  • Avoid unnecessary polling in EventMgr.
  • Reduce lock cost and memory usage in EventMgr when use multi-stream.

Ops & Hardware Acceleration

  • Register GPU implementation of int64 type for Prod.
  • Register GPU implementation of string type for Shape, ShapeN and ExpandDims.
  • Optimize list of GPU SegmentReductionOps.
  • Optimize zeros_like_impl by reducing calls to convert_to_tensor.
  • Implement GPU version of SparseSlice Op.
  • Delay Reshape when rank > 2 in keras.layers.Dense so that post op can be fused with MatMul.
  • Implement setting max_num_threads hint to oneDNN at compile time.
  • Implement TensorPackTransH2DOp to improve SmartStage performance on GPU.

IO

  • Add tensor shape meta-data support for ParquetDataset.
  • Add arrow BINARY type support for ParquetDataset.

Serving

  • Add Dice fusion to inference mode.
  • Enable INFERENCE_MODE in processor.
  • Support TensorRT 8.x in Inference.
  • Add configure filed to control enable TensorRT or not.
  • Add flag for device_placement_optimization.
  • Avoid to clustering feature column related nodes when enable TensorRT.
  • Optimize inference latency when load increment checkpoint.
  • Optimize performance via only place TensorRT ops to gpu device.

Environment & Build

  • Support CUDA 12.
  • Update DEFAULT_CUDA_VERSION and DEFAULT_CUDNN_VERSION in configure.py.
  • Move thirdparties from WORKSPACE to workspace.bzl.
  • Update urls corresponding to colm, ragel, aliyun-oss-sdk and uuid.

BugFix

  • Fix constant op placing bug for device placement optimization.
  • Fix Nan issue occurred in group_embedding API.
  • Fix SOK not compatible with variable issue.
  • Fix memory leak when update full model in serving.
  • Fix 'cols_to_output_tensors' not setted issue in GroupEmbedding.
  • Fix core dump issue about saving GPU EmbeddingVariable.
  • Fix cuda resource issue in KvResourceImportV3 kernel.
  • Fix loading signature_def with coo_sparse bug and add UT.
  • Fix the bug that the training ends early when the workqueue is enabled.
  • Fix the control edge connection issue in device placement optimization.

ModelZoo

  • Modify GroupEmbedding related function usage.
  • Update masknet example with layernorm.

Tool & Documents

  • Add tools for remove filtered features in checkpoint.
  • Add Arm Compute Library (ACL) user documents.
  • Update Embedding Variable document to fix initializer config example.
  • Update GroupEmbedding document.
  • Update processor documents.
  • Add user documents for intel AMX.
  • Add TensorRT usage documents.
  • Update documents for ParquetDataset.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2304-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2304-gpu-py38-cu116-ubuntu20.04

r1.15.5-deeprec2302

1 year ago

Major Features and Improvements

Embedding

  • Support same saver graph for EmbeddingVariable on GPU/CPU devices.
  • Support save and restore parameters in HBM storage of EmbeddingVariable.
  • Add GPU apply ops of Adam, AdamAsync, AdamW for multi-tier storage of EmbeddingVariable.
  • Place output of KvResourceIsInitializedOp on CPU.
  • Support GroupEmbedding to pack multiple feature columns lookup/apply.
  • Optimize HBM-DRAM storage of EmbeddingVariable with intra parallelism and fine-grained synchronization.
  • Support not saving filtered features when saving checkpoint.
  • Support localized mode fusion in GroupEmbedding.
  • Support to avoid preloaded IDs being eliminated in multi-tier embedding's cache.
  • Support COMPACT layout to reduce memory cost in EmbeddingVariable.
  • Support to ignore version when restore Embedding Variable with TF_EV_RESET_VERSION.
  • Support restore custom dimension of Embedding Variable.
  • Support merge and delete checkpoint files of SSDHash storage.

Graph & Grappler Optimization

  • Optimize SmartStage by prefetching LookupID op.
  • Decouple SmartStage and forward backward joint optimization.
  • Support Sample-awared Graph Compression.
  • Support CUDA multi-stream for Stage.
  • Improve Device Placement Optimization performance.
  • Add TensorBufferPutGpuOp to improve SmartStage performance on GPU device.

Runtime Optimization

  • Enable EVAllocator by default.
  • Optimize executor to eliminate sort latency and reduce memory.

Ops & Hardware Acceleration

  • Add list of GPU Ops for forward backward joint optimization.
  • Optimize FusedBatchNormGrad on CPU device.
  • Support NCHW format input for FusedBatchNormOp.
  • Use new asynchronous evaluation in Eigen to FusedBatchNorm.
  • Add exponential_avg_factor attribute to FusedBatchNorm* kernels.
  • Change AliUniqueGPU kernel implementation to AsyncOpKernel.
  • Support computing exponential running mean and variance in fused_batch_norm.
  • Upgrade oneDNN to 2.7 and ACL to 22.08.
  • Use global cache for MKL primitives for ARM.
  • Disable optimizing batch norm as sequence of post ops on AArch64.
  • Restore re-mapper and fix BatchMatmul and FactoryKeyCreator under AArch64 + ACL.

Distributed

  • Speedup SOK by GroupEmbedding which fuse multiple feature column together.

Serving

  • Support to setup gpu config in SessionGroup.
  • Support to use multiple GPUs in SessionGroup.
  • Support processor to set multi-stream option.
  • Add flag to disable per_session_host_allocator.
  • Run init_op on all sessions in session_group.
  • Skip invalid request and return error msg to client.
  • Use graph signature as the key to get runtime executor.

Environment & Build

  • Optimize compile time for kv_variable_ops module.
  • Add dataset headers for custom op compilation.
  • Add docker images for ARM based on ubuntu22.04.
  • Upgrade BAZEL version to 3.7.2.

BugFix

  • Do not cudaSetDevice to invisible GPU in CreateDevices.
  • Fix concurrency issue caused by not reference to same lock in multi-tier storage.
  • Fix parse input request bug.
  • Fix the bug when saving empty GPU EmbeddingVariable.
  • Fix the concurrency issue between feature eviction and embedding lookup in asynchronous training.

ModelZoo

  • Support Parquet Dataset in list of models.
  • Add GPU benchmark in Modelzoo.
  • Unify the usage of price column in Taobao dataset.
  • Add DeepFM model with int64 categorical id input.
  • Update dataset url in Modelzoo.

Tool & Documents

  • Add checkpoint meta transformer tool.
  • Add list of user documents in English.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2302-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2302-gpu-py38-cu116-ubuntu20.04

r1.15.5-deeprec2212u1

1 year ago

Major Features and Improvements

BugFix

  • Add flag to disable per_session_host_allocator.
  • Fix bug of saving EmbeddingVariable with int32 type.
  • Revert "Support fused batchnorm with any ndims and axis".

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2212u1-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2212u1-gpu-py38-cu116-ubuntu20.04

r1.15.5-deeprec2212

1 year ago

Major Features and Improvements

Embedding

  • Refactor GPU Embedding Variable storage layer.
  • Remove TENSORFLOW_USE_GPU_EV macro from embedding storage layer.
  • Refactor KvResourceGather GPU Op.
  • Add embedding memory pool for HBM storage of EmbeddingVariable.
  • Refine the code HBM storage of EmbeddingVariable.
  • Reuse the embedding files on SSD generated by EmbeddingVariable when save and restore checkpoint.
  • Integrate single HBM EV into multi_tier EmbeddingVariable.

Graph & Grappler Optimization

  • Filter out the 'stream_id' attribute in arithmetic optimizer.
  • Add SimplifyEmbeddingLookupStage optimizer.
  • Add ForwardBackwardJointOptimizationPass to eliminate duplicate hash in Gather and Apply ops for Embedding Variable.

Runtime Optimization

  • Add allocators for each stream_executor in multi-context mode.
  • Set multi-gpu devices in session_group mode.
  • Add blacklist and whitelist to JitCugraph.
  • Optimize CPU EVAllocator to speedup EmbeddingVariable performance.
  • Support independent GPU host allocator for each session.
  • Add GPU EVAllocator to speedup EmbeddingVariable on GPU.

Ops & Hardware Acceleration

  • Add GPU implementation for Unique.
  • Support indices type with DT_INT64 in sparse segment ops.
  • Add list of gradient implementation for the following ops including SplitV, ConcatV2, BroadcastTo, Tile, GatherV2, Cumsum, Cast.
  • Add C++ gradient op for Select.
  • Add gradient implementation for SelectV2.
  • Add C++ gradient op for Atan2.
  • Add C++ gradients for UnsortedSegmentMin/Max/Sum.
  • Refactor KvSparseApplyAdagrad GPU Op.
  • Merge NV-TF r1.15.5+22.12.

Distributed

  • Update seastar to control SDT by macro HAVE_SDT.
  • Update WORKER_DEFAULT_CORE_NUM(8) and PS_EFAULT_CORE_NUM(2) default values.

Serving

  • Support multi-model deployment in SessionGroup.
  • Support user setup cpu-sets for each session_group.
  • Support processor to load multi-models.
  • Support GPU compilation in processor.
  • Optimize independent GPU host allocator for each session.

Environment & Build

  • Update systemtap to a valid source address.
  • Support DeepRec's ABI compatible with TensorFlow 1.15 by configure TF_API_COMPATIBLE_1150.
  • Upgrade base docker images based on ubuntu20.04 and python3.8.10.
  • Update pcre-8.44 urls.
  • Remove systemtap from third party and related dependency.
  • Enable gcc optimization option -O3 by default.

BugFix

  • Fix function definition issue in processor.
  • Fix the hang when insert item into lockless hash map.
  • Fix EmbeddingVariable hang/coredump in GPU mode.
  • Fix memory leak in CUDA multi-stream when merge compute and copy stream.
  • Fix wrong session devices order.
  • Fix hwloc build error on alinux3.
  • Fix double clear resource_mgr bug when use SessionGroup.
  • Fix wrong Shrink causes unit tests to fail randomly.
  • Fix the conflict when the EmbeddingVariable and embedding fusion is enabled simultaneously.
  • Fix EmbeddingVarGPU coredump in destructor.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2212-cpu-py38-ubuntu20.04

GPU Image

alideeprec/deeprec-release:deeprec2212-gpu-py38-cu116-ubuntu20.04

r1.15.5-deeprec2210

1 year ago

Major Features and Improvements

Embedding

  • Support HBM-DRAM-SSD storage in EmbeddingVariable multi-tier storage.
  • Support multi-tier EmbeddingVariable initialized based on frequency when restore model.
  • Support to lookup location of ids of EmbeddingVariable.
  • Support kv_initialized_op for GPU Embedding Variable.
  • Support restore compatibility of EmbeddingVariable using init_from_proto.
  • Improve performance of apply/gather ops for EmbeddingVariable.
  • Add Eviction Manager in EmbeddingVariable Multi-tier storage.
  • Add unified thread pool for cache of Multi-tier storage in EmbeddingVariable.
  • Save frequencies and versions of features in SSDHash and LevelDB storage of EmbeddingVariable.
  • Avoid invalid eviction use HBM-DRAM storage of EmbeddingVariable.
  • Preventing from accessing uninitialized data use EmbeddingVariable.

Graph & Grappler Optimization

  • Optimize Async EmbeddingLookup by placement optimization.
  • Place VarHandlerOp to Compute main graph for SmartStage.
  • Support independent thread pool for stage subgraph to avoid thread contention.
  • Implement device placement optimization.

Runtime Optimization

  • Support CUDA Graph execution by adding CUDA Graph mode session.
  • Support CUDA Graph execution in JIT mode.
  • Support intra task cost estimate in CostModel in Executor.
  • Support tf.stream and tf.colocate python API for CUDA multi-stream.
  • Support embedding subgraphs partition policy when use CUDA multi-stream.
  • Optimize CUDA multi-stream by merging copy stream into compute stream.

Ops & Hardware Acceleration

  • Add a list of Quantized* and _MklQuantized* ops.
  • Implement GPU version of SparseFillEmptyRows.
  • Implement c version of spin_lock to support multi-architectures.
  • Upgrade the OneDNN version to v2.7.

Distributed

  • Support distributed training use SOK based on EmbeddingVariable.
  • Add NETWORK_MAX_CONNECTION_TIMEOUT to support connection timeout configurable in StarServer.
  • Upgrade the SOK version to v4.2.

IO

  • Add TF_NEED_PARQUET_DATASET to enable ParquetDataset.

Serving

  • Optimize embedding lookup performance by disable feature filter when serving.
  • Optimize error code for user when parse request or response failed.
  • Support independent update model threadpool to avoid performance jitter.

ModelZoo

  • Add MaskNet Model.
  • Add PLE Model.
  • Support variable type BF16 in DCN model.

BugFix

  • Fix tf.nn.embedding_lookup interface bug and session hang bug when enabling async embedding.
  • Fix warmup failed bug when user set warmup file path.
  • Fix build failure in ev_allocator.cc and hash.cc on ARM.
  • Fix build failure in arrow when build on ARM
  • Fix redefined error in NEON header file for ARM.
  • Fix _mm_malloc build failure in sparsehash on ARM.
  • Fix warmup failed bug when use session_group.
  • Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.
  • Fix the colocation error when using EmbeddingVariable in distribution.
  • Fix HostNameToIp fails by replacing gethostbyname by getaddrinfo in StarServer.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2210-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2210-gpu-py36-cu116-ubuntu18.04

Thanks to our Contributors

Duyi-Wang, Locke, shijieliu, Honglin Zhu, chenxujun, GosTraight2020, LALBJ, Nanno

r1.15.5-deeprec2208u1

1 year ago

Major Features and Improvements

BugFix

  • Fix a list of Quantized* and _MklQuantized* ops not found issue.
  • Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.
  • Fix warmup failed bug when user set warmup file path.
  • Fix warmup failed bug when use session_group.

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2208u1-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2208u1-gpu-py36-cu116-ubuntu18.04

r1.15.5-deeprec2208

1 year ago

Major Features and Improvements

Embedding

  • Multi-tier of EmbeddingVariable support HBM, add async compactor in SSDHashKV.
  • Support tf.feature_column.shard_embedding_columns, SequenceCategoricalColumn and WeightedCategoricalColumn API for EmbeddingVariable.
  • Support save and restore checkpoint of GPU EmbeddingVariable.
  • Support EmbeddingVariable OpKernel with REAL_NUMBER_TYPES.
  • Support user defined default_value for feature filter.
  • Support feature column API for MultiHash.

Graph & Grappler Optimization

  • Add FP32 fused l2 normalize op and grad op and tf.nn.fused_layer_normalize API.
  • Add Concat+Cast fusion ops.
  • Optimize SmartStage performance on GPU.
  • Add macro to control to optimize mkl_layout_pass.
  • Support asynchronous embedding lookup.

Runtime Optimization

  • CPUAllocator, avoid multiple threads cleanup at the same time.
  • Support independent intra threadpool for each session and intra threadpool be pinned to cpuset.
  • Support multi-stream with virtual device.

Ops & Hardware Acceleration

  • Implement ApplyFtrl, ResourceApplyFtrl, ApplyFtrlV2 and ResourceApplyFtrlV2 GPU kernels.
  • Optimize BatchMatmul GPU kernel.
  • Integrate cuBLASlt into backend and use BlasLtMatmul in batch_matmul_op.
  • Support GPU fusion of matmal+bias+(activation).
  • Merge NV-TF r1.15.5+22.06.

Optimizer

  • Support AdamW optimizer for EmbeddingVariable.

Model Save/Restore

  • Support asynchronously restore EmbeddingVariable from checkpoint.
  • Support EmbeddingVariable in init_from_checkpoint.

Serving

  • Add go/java/python client SDK and demo.
  • Support GPU multi-streams in SessionGroup.
  • Support independent inter thread pool for each session in SessionGroup.
  • Support multi-tiered Embedding.
  • Support immutable EmbeddingVariable.

Quantization

  • Add low precision optimization tool, support BF16, FP16, INT8 for savedmodel and checkpoint.
  • Add embedding variable quantization.

ModelZoo

  • Optimize DIN's BF16 performance.
  • Add DCN & DCNv2 models and MLPerf recommendation benchmark.

Profiler

  • Add detail information for RecvTensor in timeline.

Dockerfile

  • Add ubuntu 22.04 dockerfile and images with gcc11.2 and python3.8.6.
  • Add cuda11.2, cuda11.4, cuda11.6, cuda11.7 docker images and use cuda 11.6 as default GPU image.

Environment & Build

  • Update default TF_CUDA_COMPUTE_CAPABILITIES to 6.0,6.1,7.0,7.5,8.0.
  • Upgrade bazel version to 0.26.1.
  • Support for building DeepRec on ROCm2.10.0.

BugFix

  • Fix build failures with gcc11 & gcc12.
  • StarServer, remove user packet split to avoid multiple user packet out-of-order issue.
  • Fix the 'NodeIsInGpu is not declare' issue.
  • Fix the placement bug of worker devices when distributed training in Modelzoo.
  • Fix out of range issue for BiasAddGrad op when enable AVX512.
  • Avoid loading invalid model when model update in serving.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2208-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2208-gpu-py36-cu116-ubuntu18.04