TorchRec Versions

PyTorch domain library for recommendation systems

v0.3.0

1 year ago

[Prototype] Simplified Optimizer Fusion APIs

We’ve provided a simpler, more intuitive API for configuring fused optimizers via apply_optimizer_in_backward. This approach lets you specify optimizer settings on a per-parameter basis, and sharded modules configure FBGEMM’s TableBatchedEmbedding modules accordingly. It also lets TorchRec’s planner account for optimizer memory usage, which should alleviate reports of sharded jobs running out of memory (OOM) with Adam when using a plan generated by the planner.
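
To see why optimizer state matters for a sharding plan, here is a rough back-of-the-envelope sketch. It is not TorchRec's planner code: TableSpec, weight_bytes, and optimizer_state_bytes are hypothetical names, and the state sizes assume textbook optimizers with FP32 state (plain SGD keeps none, row-wise Adagrad keeps one accumulator per row, Adam keeps two moment estimates per weight).

```python
from dataclasses import dataclass

BYTES_PER_FP32 = 4


@dataclass
class TableSpec:
    """Hypothetical description of one embedding table (illustration only)."""
    num_embeddings: int
    embedding_dim: int


def weight_bytes(table: TableSpec) -> int:
    """Bytes needed for the FP32 embedding weights."""
    return table.num_embeddings * table.embedding_dim * BYTES_PER_FP32


def optimizer_state_bytes(table: TableSpec, optimizer: str) -> int:
    """Rough optimizer state size, assuming textbook optimizers with FP32 state."""
    if optimizer == "sgd":
        return 0  # plain SGD without momentum keeps no per-parameter state
    if optimizer == "rowwise_adagrad":
        return table.num_embeddings * BYTES_PER_FP32  # one accumulator per row
    if optimizer == "adam":
        return 2 * weight_bytes(table)  # first and second moment per weight
    raise ValueError(f"unknown optimizer: {optimizer}")


table = TableSpec(num_embeddings=1_000_000, embedding_dim=64)
for name in ("sgd", "rowwise_adagrad", "adam"):
    total = weight_bytes(table) + optimizer_state_bytes(table, name)
    print(f"{name}: {total / 2**20:.1f} MiB")
```

Under these assumptions, a table trained with Adam needs about three times the memory of its weights alone, so a plan that only budgets for weights can fit on paper and still run out of memory at training time.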

[Prototype] Simplified Sharding APIs

We’re introducing the shard API, which lets you shard only the embedding modules within a model and provides an alternative to the current main entry point, DistributedModelParallel. This gives you finer-grained control over the rest of the model, which can be useful for customized parallelization logic and for inference use cases (which may not require any parallelization of the dense layers). We’re also introducing construct_module_sharding_plan, which provides a simpler interface to the TorchRec sharder.

[Beta] Integration with FBGEMM's Quantized Comms Library

Applying quantization or mixed precision to tensors in a collective call during model-parallel training greatly improves training efficiency, with little to no effect on model quality. TorchRec now integrates with the quantized comms library provided by FBGEMM GPU and provides an interface to construct encoders and decoders (codecs) that surround the all_to_all and reduce_scatter collective calls in the output_dist of a sharded module. We also allow you to construct your own codecs to apply to your sharded module. The codecs provided by FBGEMM support FP16, BF16, FP8, and INT8 compression, and you may use different quantizations for the forward and backward passes.
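
The idea of a codec surrounding a collective call can be sketched in plain Python. This is only an illustration, not the FBGEMM or TorchRec interface: fp16_encode, fp16_decode, simulated_all_to_all, and quantized_all_to_all are hypothetical names, the "collective" is an in-process list shuffle, and FP16 packing uses the standard library's struct module.

```python
import struct


def fp16_encode(values):
    """Pack floats as IEEE 754 half precision, 2 bytes per value.

    Values beyond the FP16 range (about 65504 in magnitude) raise OverflowError.
    """
    return struct.pack(f"<{len(values)}e", *values)


def fp16_decode(payload):
    """Unpack half-precision bytes back into Python floats."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}e", payload))


def simulated_all_to_all(send):
    """In-process stand-in for all_to_all: send[src][dst] arrives at recv[dst][src]."""
    world_size = len(send)
    return [[send[src][dst] for src in range(world_size)] for dst in range(world_size)]


def quantized_all_to_all(send_floats):
    """Encode before the collective, decode after it."""
    encoded = [[fp16_encode(chunk) for chunk in row] for row in send_floats]
    received = simulated_all_to_all(encoded)
    return [[fp16_decode(chunk) for chunk in row] for row in received]


# send[src][dst] is the chunk that rank `src` sends to rank `dst`.
send = [
    [[1.0, 0.5], [0.1, 2.0]],
    [[3.0, -1.5], [0.25, 0.001]],
]
recv = quantized_all_to_all(send)
print(recv[1][0])  # what rank 1 received from rank 0, after the FP16 round trip
```

Each value travels as 2 bytes instead of 4. Values that FP16 represents exactly (such as 3.0 or -1.5) survive unchanged, while others (such as 0.1) come back with a small rounding error, which is the precision-for-bandwidth trade the paragraph above describes.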

Planner

Removed several unnecessary copies inside the planner, which drastically decreases its runtime.
Cleaned up the Topology interface (it no longer takes in unrelated information such as batch size).

v0.2.0

1 year ago

Changelog

PyPI Installation

The recommended install location is now PyPI. Additionally, TorchRec's binary will no longer contain fbgemm_gpu; instead, fbgemm_gpu will be installed as a dependency. See the README for details.

Planner Improvements

We added some features and fixed some bugs:

Variable batch size per feature, to support request-only features
Better calculations for quant UVM Caching
Bug fix for shard storage fitting on device

Single process Batched + Fused Embeddings

Previously, TorchRec’s abstractions (EmbeddingBagCollection/EmbeddingCollection) over FBGEMM kernels, which provide benefits such as table batching, optimizer fusion, and UVM placement, could only be used in conjunction with DistributedModelParallel. We’ve decoupled these notions from sharding and introduced the FusedEmbeddingBagCollection, which can be used as a standalone module with all of the above features, and can also be sharded.

Sharder

We enabled embedding sharding support for variable batch sizes across GPUs.

Benchmarking and Examples

We introduce:

A set of benchmarking tests showing the performance characteristics of TorchRec’s base modules and of research models built out of TorchRec.
An example demonstrating training of a distributed TwoTower (i.e. User-Item) retrieval model that is sharded using TorchRec. The projected item embeddings are added to an IVFPQ FAISS index for candidate generation. The retrieval model and KNN lookup are bundled in a PyTorch model for efficient end-to-end retrieval.
An inference example with Torch Deploy for both single and multi GPU.

Integrations

We demonstrate that TorchRec works out of the box with many components commonly used alongside PyTorch models in production-like systems, such as:

Training a TorchRec model on Ray clusters using the TorchX Ray scheduler
Preprocessing and DataLoading with NVTabular on DLRM
Training a TorchRec model with on-the-fly preprocessing with TorchArrow showcasing RecSys domain UDFs.

Scriptable Unsharded Modules

The unsharded embedding modules (EmbeddingBagCollection/EmbeddingCollection and variants) are now torch scriptable.

EmbeddingCollection Column Wise Sharding

We now support column-wise sharding for EmbeddingCollection, enabling sequence embeddings to be sharded column-wise.

JaggedTensor

Boosted the performance of the to_padded_dense function by implementing it with FBGEMM.
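
For readers unfamiliar with the operation, the sketch below shows what converting jagged data to a padded dense layout means, using plain Python lists. It is not the FBGEMM-backed implementation; pad_jagged and its parameter names are hypothetical and chosen for illustration.

```python
def pad_jagged(values, lengths, desired_length=None, padding_value=0.0):
    """Turn a flat values list plus per-row lengths into equal-width rows.

    Rows shorter than the target width are padded with `padding_value`;
    rows longer than it are truncated.
    """
    if sum(lengths) != len(values):
        raise ValueError("lengths must sum to len(values)")
    width = max(lengths, default=0) if desired_length is None else desired_length
    rows = []
    offset = 0
    for length in lengths:
        row = values[offset:offset + length][:width]  # slice this row, then truncate
        row = row + [padding_value] * (width - len(row))  # pad up to the width
        rows.append(row)
        offset += length
    return rows


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
lengths = [2, 0, 3, 1]  # four rows with 2, 0, 3, and 1 values
dense = pad_jagged(values, lengths)
for row in dense:
    print(row)
```

With these inputs the result has four rows of width 3 (the longest row), with the empty second row consisting entirely of padding.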

Linting

Added lintrunner to allow contributors to lint and format their changes quickly, matching our internal formatter.

v0.1.1

2 years ago

Changelog

pytorch.org Install

The recommended install location is now download.pytorch.org. See the README for details.

RecMetrics

RecMetrics is a metrics library that collects common utilities and optimizations for Recommendation models.

A centralized metrics module that allows users to add new metrics
Commonly used metrics, including AUC, Calibration, CTR, MSE/RMSE, NE & Throughput
Optimization for metrics related operations to reduce the overhead of metric computation
Checkpointing
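
As a rough illustration of two of the metrics listed above, the sketch below computes normalized entropy (NE) and calibration in plain Python. It assumes the commonly used definitions (NE is the model's average log loss divided by the log loss of always predicting the mean label; calibration is the sum of predictions divided by the sum of labels) and is not the RecMetrics implementation. All function names are hypothetical.

```python
import math


def cross_entropy(labels, predictions, eps=1e-12):
    """Average binary log loss; predictions are clamped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def normalized_entropy(labels, predictions):
    """Model log loss relative to a baseline that always predicts the mean label."""
    base_rate = sum(labels) / len(labels)
    baseline = cross_entropy(labels, [base_rate] * len(labels))
    return cross_entropy(labels, predictions) / baseline


def calibration(labels, predictions):
    """Ratio of predicted positives to observed positives (undefined if no positives)."""
    return sum(predictions) / sum(labels)


labels = [1, 0, 0, 1]
predictions = [0.9, 0.2, 0.1, 0.7]
print("NE:", normalized_entropy(labels, predictions))
print("calibration:", calibration(labels, predictions))
```

Under these definitions, an NE below 1 means the model beats the constant baseline, and a calibration close to 1 means the predicted click volume matches the observed one.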

TorchRec Inference

Larger models need GPU support for inference. Also, there is a difference between features used in common training stacks and inference stacks. The goal of this library is to make use of some features seen in training to make inference more unified and easier to use.

EmbeddingTower and EmbeddingTowerCollection

Added a new shardable nn.Module, EmbeddingTower/EmbeddingTowerCollection. This module gives model authors a basic building block for establishing a clear relationship between a set of embedding tables and post-lookup modules.

Examples/tutorials

Inference example

Documentation (installation and example), an updated CMake build, and a gRPC server example

BERT4Rec example

Reproduction of the BERT4Rec paper, showcasing the EmbeddingCollection module (non-pooling)

Sharding Tutorial

Overview of sharding in TorchRec and the five types of sharding: https://pytorch.org/tutorials/advanced/sharding.html

Improved Planner

Updated static estimates for perf
Models full model parallel path
Includes support for sequence embeddings, weighted features, and feature processors
Added grid search proposer

v0.1.0

2 years ago

We are excited to announce TorchRec, a PyTorch domain library for Recommendation Systems. This new library provides common sparsity and parallelism primitives, enabling researchers to build state-of-the-art personalization models and deploy them in production.

Modeling primitives, such as embedding bags and jagged tensors, that enable easy authoring of large, performant multi-device/multi-node models using hybrid data-parallelism and model-parallelism.
Optimized RecSys kernels powered by FBGEMM, including support for sparse and quantized operations.
A sharder which can partition embedding tables with a variety of different strategies including data-parallel, table-wise, row-wise, table-wise-row-wise, and column-wise sharding.
A planner which can automatically generate optimized sharding plans for models.
Pipelining to overlap dataloading device transfer (copy to GPU), inter-device communications (input_dist), and computation (forward, backward) for increased performance.
GPU inference support.
Common modules for RecSys, such as models and public datasets (Criteo & Movielens).
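
The geometry behind three of the sharding strategies named above (table-wise, row-wise, and column-wise) can be pictured with a small sketch. It only computes which slice of a table each rank would hold under an even split. The function names and the dictionary layout are hypothetical, and TorchRec's sharder and planner decide actual sizes and placements differently.

```python
def split_evenly(total, parts):
    """Split `total` into `parts` contiguous (start, size) ranges of near-equal size."""
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # the first `extra` parts get one more
        ranges.append((start, size))
        start += size
    return ranges


def table_wise(rows, cols, rank):
    """The whole table lives on a single rank."""
    return [{"rank": rank, "row_offset": 0, "rows": rows, "col_offset": 0, "cols": cols}]


def row_wise(rows, cols, world_size):
    """Each rank holds a contiguous block of rows with all columns."""
    return [
        {"rank": r, "row_offset": start, "rows": size, "col_offset": 0, "cols": cols}
        for r, (start, size) in enumerate(split_evenly(rows, world_size))
    ]


def column_wise(rows, cols, world_size):
    """Each rank holds all rows but only a contiguous block of embedding dimensions."""
    return [
        {"rank": r, "row_offset": 0, "rows": rows, "col_offset": start, "cols": size}
        for r, (start, size) in enumerate(split_evenly(cols, world_size))
    ]


for shard in row_wise(rows=10, cols=8, world_size=4):
    print(shard)
```

Table-wise-row-wise sharding combines the first two ideas (a table is assigned to one node and split by rows across that node's devices), and data-parallel keeps a full replica on every rank, so neither needs a separate slicing function here.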

See our announcement and docs