Models, data loaders and abstractions for language processing, powered by PyTorch
In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.
We increased the number of datasets in TorchText from 30 to 31 by adding the CNN-DM (paper) dataset. The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect many of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html
TorchText has extended support for TorchScriptable tokenizers by adding a RegexTokenizer that enables splitting based on regular expressions. TorchScriptability support allows users to embed the Regex Tokenizer natively in C++ without needing a Python runtime. As TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate Regex tokenizers for deployment needs.
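To illustrate the general idea of regex-based tokenization (a minimal pure-Python sketch using the `re` module; this is not the torchtext `RegexTokenizer` API, and the rule set shown is a made-up example):

```python
import re

# Minimal sketch of regex-based tokenization: apply a list of
# (pattern, replacement) rules, then split on whitespace.
def regex_tokenize(text, patterns):
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text.split()

# Example rule: pad punctuation with spaces so it splits into tokens
patterns = [(r"([.,!?])", r" \1 ")]
print(regex_tokenize("Hello, world!", patterns))
# -> ['Hello', ',', 'world', '!']
```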
This is a minor release that is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements, and documentation updates. No new features were added.
For the full feature set of v0.13, please refer to the v0.13.0 release notes.
In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.
We increased the number of datasets in TorchText from 22 to 30 by adding the remaining 8 datasets from the GLUE benchmark (SST-2 was already supported). The complete list of GLUE datasets is as follows:
The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect many of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html
TorchText has extended support for TorchScriptable tokenizers by adding the WordPiece tokenizer used in BERT. It is one of the most commonly used algorithms for splitting input text into sub-word units and was introduced in Japanese and Korean Voice Search (Schuster et al., 2012).
TorchScriptability support allows users to embed the BERT text-preprocessing natively in C++ without needing a Python runtime. As TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate BERT tokenizers for deployment needs.
For usage details, please refer to the corresponding documentation.
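The core of WordPiece inference is greedy longest-match-first splitting of a word against a subword vocabulary. A rough sketch of that algorithm with a toy vocabulary (illustrative only; not torchtext's implementation):

```python
# Greedy longest-match-first WordPiece-style splitting (illustrative sketch).
# Continuation pieces are prefixed with "##", as in BERT's vocabulary.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate span until it matches a vocabulary entry
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching subword: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece_tokenize("playing", toy_vocab))   # ['play', '##ing']
print(wordpiece_tokenize("unplayed", toy_vocab))  # ['un', '##play', '##ed']
```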
TorchText has migrated its build system for C++ extensions and third-party libraries to use CMake rather than PyTorch's CppExtension module. This allows end-users to integrate TorchText C++ binaries in their applications without a dependency on libpython, thus allowing them to use TorchText operators in a non-Python environment.
Refer to the GitHub issue for more details.
The RobertaModelBundle introduced in the 0.12 release, which gets pre-trained RoBERTa/XLM-R models and builds custom models with a similar architecture, has been renamed to RobertaBundle (#1653).
The default caching location (cache_dir) has been changed from os.path.expanduser("~/.TorchText/cache") to os.path.expanduser("~/.cache/torch/text"). Furthermore, the default root directory of datasets is cache_dir/datasets (#1740). Users can now control the default cache location via the TORCH_HOME environment variable (#1741)
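One way the TORCH_HOME override could resolve to a cache directory is sketched below (illustrative only; `resolve_cache_dir` is a hypothetical helper, and torchtext's exact precedence rules may differ):

```python
import os

# Sketch of TORCH_HOME-style cache-directory resolution: fall back to
# ~/.cache/torch when the environment variable is not set, then append
# the "text" subdirectory. Not torchtext's actual code.
def resolve_cache_dir():
    torch_home = os.environ.get(
        "TORCH_HOME", os.path.join(os.path.expanduser("~"), ".cache", "torch")
    )
    return os.path.join(torch_home, "text")

os.environ["TORCH_HOME"] = "/tmp/my_torch_home"
print(resolve_cache_dir())  # e.g. /tmp/my_torch_home/text on POSIX systems
```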
Support for GLUE benchmark’s datasets added:
Others
In this release, we have revamped the library to provide a more comprehensive experience for users to do NLP modeling using TorchText and PyTorch.
TorchText has modernized its datasets by migrating from older-style Iterable Datasets to TorchData’s DataPipes. TorchData is a library that provides modular/composable primitives, allowing users to load and transform data in performant data pipelines. These DataPipes work out-of-the-box with the PyTorch DataLoader and enable new functionalities like auto-sharding. Users can now easily do data manipulation and pre-processing using user-defined functions and transformations in a functional style. Datasets backed by DataPipes also enable standard flow-control like batching, collation, shuffling and bucketizing. Collectively, DataPipes provide a comprehensive experience for data preprocessing and tensorization needs in a pythonic and flexible way for model training.
from functools import partial
import torchtext.functional as F
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
from torch.utils.data import DataLoader
from torchtext.datasets import SST2
# Tokenizer to split input text into tokens
encoder_json_path = "https://download.pytorch.org/models/text/gpt2_bpe_encoder.json"
vocab_bpe_path = "https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe"
tokenizer = T.GPT2BPETokenizer(encoder_json_path, vocab_bpe_path)
# vocabulary converting tokens to IDs
vocab_path = "https://download.pytorch.org/models/text/roberta.vocab.pt"
vocab = T.VocabTransform(load_state_dict_from_url(vocab_path))
# Add BOS token to the beginning of sentence
add_bos = T.AddToken(token=0, begin=True)
# Add EOS token to the end of sentence
add_eos = T.AddToken(token=2, begin=False)
# Create SST2 dataset datapipe and apply pre-processing
batch_size = 32
train_dp = SST2(split="train")
train_dp = train_dp.batch(batch_size).rows2columnar(["text", "label"])
train_dp = train_dp.map(tokenizer, input_col="text", output_col="tokens")
train_dp = train_dp.map(partial(F.truncate, max_seq_len=254), input_col="tokens")
train_dp = train_dp.map(vocab, input_col="tokens")
train_dp = train_dp.map(add_bos, input_col="tokens")
train_dp = train_dp.map(add_eos, input_col="tokens")
train_dp = train_dp.map(partial(F.to_tensor, padding_value=1), input_col="tokens")
train_dp = train_dp.map(F.to_tensor, input_col="label")
# create DataLoader
dl = DataLoader(train_dp, batch_size=None)
batch = next(iter(dl))
model_input = batch["tokens"]
target = batch["label"]
TorchData is required in order to use these datasets. Please install it following the instructions at https://github.com/pytorch/data
We have added support for pre-trained RoBERTa and XLM-R models. The models are torchscriptable and hence can be employed for production use-cases. The modeling APIs let users attach custom task-specific heads with pre-trained encoders. The API also comes equipped with data pre-processing transforms to match the pre-trained weights and model configuration.
import torch, torchtext
from torchtext.functional import to_tensor
xlmr_base = torchtext.models.XLMR_BASE_ENCODER
model = xlmr_base.get_model()
transform = xlmr_base.transform()
input_batch = ["Hello world", "How are you!"]
model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape
torch.Size([2, 6, 768])
# add classification head
import torch.nn as nn
class ClassificationHead(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.output_layer = nn.Linear(input_dim, num_classes)

    def forward(self, features):
        # get features from cls token
        x = features[:, 0, :]
        return self.output_layer(x)
binary_classifier = xlmr_base.get_model(head=ClassificationHead(input_dim=768, num_classes=2))
output = binary_classifier(model_input)
output.shape
torch.Size([2, 2])
We have revamped our transforms to provide composable building blocks to do text pre-processing. They support both batched and non-batched inputs. Furthermore, we have added support for a number of commonly used tokenizers including SentencePiece, GPT-2 BPE and CLIP.
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
padding_idx = 1
bos_idx = 0
eos_idx = 2
max_seq_len = 256
xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"
text_transform = T.Sequential(
    T.SentencePieceTokenizer(xlmr_spm_model_path),
    T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
    T.Truncate(max_seq_len - 2),
    T.AddToken(token=bos_idx, begin=True),
    T.AddToken(token=eos_idx, begin=False),
)
text_transform(["Hello World", "How are you"])
We have added an end-to-end tutorial that performs SST-2 binary text classification with the pre-trained XLM-R base architecture and demonstrates the usage of the new datasets, transforms and models.
We have removed the legacy folder in this release which provided access to legacy datasets and abstractions. For additional information, please refer to the corresponding GitHub issue (#1422) and PR (#1437).
Any to support torch-scriptability during transform composability (#1453)
Migration of datasets on top of datapipes
Newly added datasets
Misc
Revamp TorchText dataset testing to use mocked data
Others
Dataset Documentation
This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchtext from 0.11.1. For the full feature set of v0.11.1, please refer to the v0.11.1 release notes.
This is a relatively lightweight release while we are working on revamping the library. Users are encouraged to check various developments on the main branch.
This release depends on PyTorch 1.9.1. There are no functional changes other than minor updates to CI rules.
In this release, we introduce a new Vocab module that replaces the current Vocab class. The new Vocab provides common functional APIs for NLP workflows. This module is backed by an efficient C++ implementation that reduces look-up time by up to ~85% for batch look-up (refer to the summaries of #1248 and #1290 for further information on benchmarks), and provides support for TorchScript. We provide accompanying factory functions that can be used to build the Vocab object either through a Python ordered dictionary or an iterator that yields lists of tokens.
import io
from torchtext.vocab import build_vocab_from_iterator
# generator that yields lists of tokens
def yield_tokens(file_path):
    with io.open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()
# get Vocab object
vocab_obj = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
from torchtext.vocab import vocab
from collections import Counter, OrderedDict
counter = Counter(["a", "a", "b", "b", "b"])
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vocab_obj = vocab(ordered_dict)
# look-up index
vocab_obj["a"]
# batch look-up indices
vocab_obj.lookup_indices(["a","b"])
# support forward API of PyTorch nn Modules
vocab_obj(["a","b"])
# batch look-up tokens
vocab_obj.lookup_tokens([0,1])
# set default index to return when token not found
vocab_obj.set_default_index(0)
vocab_obj["out_of_vocabulary"] #prints 0
# retired Vocab class
from torchtext.legacy.vocab import Vocab as retired_vocab
from collections import Counter
tokens_list = ["a", "a", "b", "b", "b"]
counter = Counter(tokens_list)
vocab_obj = retired_vocab(counter, specials=["<unk>","<pad>"], specials_first=True)
# new Vocab Module
from torchtext.vocab import build_vocab_from_iterator
vocab_obj = build_vocab_from_iterator([tokens_list], specials=["<unk>","<pad>"], specials_first=True)
from torchtext.datasets import IMDB
from torchtext.data import to_map_style_dataset
train_iter = IMDB(split='train')
#convert iterator to map-style dataset
train_dataset = to_map_style_dataset(train_iter)
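Conceptually, this conversion materializes the iterator and serves items by index through `__getitem__` and `__len__`. A rough pure-Python sketch of the idea (not torchtext's actual implementation; the class name and toy data are made up):

```python
# Rough sketch of an iterator-to-map-style conversion: consume the
# iterator once into a list, then serve items by index.
class MapStyleDataset:
    def __init__(self, iterable):
        self._data = list(iterable)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]

# toy iterator standing in for a raw text dataset
train_iter = iter([("pos", "great movie"), ("neg", "boring plot")])
train_dataset = MapStyleDataset(train_iter)
print(len(train_dataset))  # 2
print(train_dataset[0])    # ('pos', 'great movie')
```

A map-style dataset like this can be shuffled and randomly indexed, which is why samplers and `DataLoader(shuffle=True)` require it.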
from torchtext.data.functional import filter_wikipedia_xml
from torchtext.datasets import EnWik9
data_iter = EnWik9(split='train')
# filter data according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
filter_data_iter = filter_wikipedia_xml(data_iter)
# Added datasets for http://www.statmt.org/wmt16/multimodal-task.html#task1
from torchtext.datasets import Multi30k
train_data, valid_data, test_data = Multi30k()
next(train_data)
# prints following
#('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n',
# 'Two young, White males are outside near many bushes.\n')
This is a minor release following PyTorch 1.8.1. Please refer to the torchtext 0.9.0 release notes for more details.
In this release, we’re updating torchtext’s datasets to be compatible with the PyTorch DataLoader, and deprecating torchtext’s own DataLoading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text #664. These new datasets are simple string-by-string iterators over the data, rather than the previously custom set of abstractions such as Field. The legacy datasets and abstractions have been moved into a new legacy folder to ease the migration, and will remain there for two more releases. For guidance about migrating from the legacy abstractions to modern PyTorch data utilities, please refer to our migration guide (link).
The following raw text datasets are available as the replacement of the legacy datasets. Those datasets are iterators which yield the raw text data line-by-line. To apply those datasets in the NLP workflows, please refer to the end-to-end tutorial for the text classification task (link).
We added Python 3.9 support in this release.
The current users of the legacy code will experience BC breakage as we have retired the legacy code (#1172, #1181, #1183). The legacy components are placed in the torchtext.legacy.data folder as follows:
torchtext.data.Pipeline -> torchtext.legacy.data.Pipeline
torchtext.data.Batch -> torchtext.legacy.data.Batch
torchtext.data.Example -> torchtext.legacy.data.Example
torchtext.data.Field -> torchtext.legacy.data.Field
torchtext.data.Iterator -> torchtext.legacy.data.Iterator
torchtext.data.Dataset -> torchtext.legacy.data.Dataset
This means all features are still available, but within torchtext.legacy instead of torchtext.
Table 1: Summary of the legacy datasets and the replacements in 0.9.0 release
| Category | Legacy | 0.9.0 release |
|---|---|---|
| Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2 |
| | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103 |
| | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank |
| | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9 |
| Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS |
| | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews |
| | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia |
| | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity |
| | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull |
| | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers |
| | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity |
| | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull |
| | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB |
| | torchtext.legacy.datasets.SST | deferred |
| | torchtext.legacy.datasets.TREC | deferred |
| Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS |
| | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking |
| Translation | torchtext.legacy.datasets.WMT14 | deferred |
| | torchtext.legacy.datasets.Multi30k | deferred |
| | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017 |
| Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred |
| | torchtext.legacy.datasets.SNLI | deferred |
| | torchtext.legacy.datasets.MultiNLI | deferred |
| Question Answer | torchtext.legacy.datasets.BABI20 | deferred |
- metrics/utils/functional from torchtext.legacy.data (#1229)
- torchtext.datasets, and move torchtext.datasets.common to torchtext.data.dataset_utils (#1188, #1145)
- test_iwslt() (#1192)
- torchtext.experimental.datasets.raw folder to torchtext.datasets folder (#1182, #1202, #1207, #1211, #1212)
- add_docstring_header() to generate docstring (#1185)
- download_from_url() func and skip unnecessary download if the downloaded files are detected (#1158, #1155)
- MD5 and NUM_LINES as the meta information in the __init__ file of torchtext.datasets folder (#1155)
- torchtext.data.utils.get_tokenizer func with the full name when Spacy tokenizers are loaded (#1140)
- assertEqual() in PyTorch TestCase class (#1086)
- TORCH_LIBRARY_FRAGMENT macro (#1102)
- setup_iter() func in RawTextIterableDataset (#1142)
- pytorch/text website (#1227)