PyTorch NLP Versions

Basic Utilities for PyTorch Natural Language Processing (NLP)

0.5.0

4 years ago

Major Updates

Updated my README emoji game to be more ambiguous while maintaining fun and heartwarming vibe. 🐕
Support for Python 3.5
Extensive rewrite of README to focus on new users and building an NLP pipeline.
Support for PyTorch 1.2
Added torchnlp.random for finer-grained control of random state, building on PyTorch's fork_rng. This module controls the random state of torch, numpy and random.

import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
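The idea of forking random state can be illustrated with the standard library alone. The sketch below is not the torchnlp implementation: fork_python_rng is a hypothetical helper that only handles Python's random module, whereas torchnlp.random is described above as also covering torch and numpy.

```python
import contextlib
import random


@contextlib.contextmanager
def fork_python_rng(seed=None):
    """Hypothetical helper: run a block with its own random state, then restore the caller's."""
    saved_state = random.getstate()  # Snapshot the global generator state.
    try:
        if seed is not None:
            random.seed(seed)  # Make the block deterministic.
        yield
    finally:
        random.setstate(saved_state)  # Undo any state changes made inside the block.


random.seed(0)
expected_next = random.random()  # What the outer stream yields with no fork in between.

random.seed(0)
with fork_python_rng(seed=123):
    first = random.randint(1, 2**31)
with fork_python_rng(seed=123):
    second = random.randint(1, 2**31)
actual_next = random.random()  # Unaffected by the draws made inside the forks.
```

Both forks produce the same value because they start from the same seed, and the outer random stream continues as if the forks had never run.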

Refactored torchnlp.samplers enabling pipelining. For example:

from torchnlp.samplers import DeterministicSampler
from torchnlp.samplers import BalancedSampler

data = ['a', 'b', 'c'] + ['c'] * 100
sampler = BalancedSampler(data, num_samples=3)
sampler = DeterministicSampler(sampler, random_seed=12)
print([data[i] for i in sampler])  # ['c', 'b', 'a']
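To see why balanced sampling helps with skewed data like the example above, here is a minimal sketch using only the standard library. It assumes inverse class frequency as the weighting, which is one common choice and not necessarily what BalancedSampler does internally; balanced_weights is a hypothetical helper.

```python
import collections
import random


def balanced_weights(labels):
    """Weight each example by the inverse frequency of its label (illustrative only)."""
    counts = collections.Counter(labels)
    return [1.0 / counts[label] for label in labels]


data = ['a', 'b', 'c'] + ['c'] * 100
weights = balanced_weights(data)

# Each class now carries the same total weight, so each is equally likely to be drawn.
weight_per_class = collections.defaultdict(float)
for label, weight in zip(data, weights):
    weight_per_class[label] += weight

rng = random.Random(12)  # A local generator keeps the draw reproducible.
indices = rng.choices(range(len(data)), weights=weights, k=3)  # Sampling with replacement.
sample = [data[i] for i in indices]
```

Without the weights, 'c' would be drawn about 98% of the time; with them, 'a', 'b' and 'c' each have a one-in-three chance per draw.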

Added torchnlp.samplers.balanced_sampler for balanced sampling extending PyTorch's WeightedRandomSampler.
Added torchnlp.samplers.deterministic_sampler for deterministic sampling based on torchnlp.random.
Added torchnlp.samplers.distributed_batch_sampler for distributed batch sampling.
Added torchnlp.samplers.oom_batch_sampler to sample large batches first in order to force an out-of-memory error.
Added torchnlp.utils.lengths_to_mask to help create masks from a batch of sequences.
Added torchnlp.utils.get_total_parameters to measure the number of parameters in a model.
Added torchnlp.utils.get_tensors to measure the size of an object in number of tensor elements. This is useful for dynamic batch sizing and for torchnlp.samplers.oom_batch_sampler.

from torchnlp.utils import get_tensors

random_object_ = tuple([{'t': torch.tensor([1, 2])}, torch.tensor([2, 3])])
tensors = get_tensors(random_object_)
assert len(tensors) == 2
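As a rough illustration of what a length-based mask looks like, the sketch below builds one with nested lists. It assumes the mask marks real (non-padding) positions as True; lengths_to_mask_sketch is a hypothetical stand-in and says nothing about the return type or options of torchnlp.utils.lengths_to_mask.

```python
def lengths_to_mask_sketch(lengths):
    """Return one row per sequence: True marks a real token, False marks padding."""
    max_length = max(lengths, default=0)  # Width of the padded batch.
    return [[position < length for position in range(max_length)] for length in lengths]


mask = lengths_to_mask_sketch([1, 3, 2])
for row in mask:
    print(row)
```

Each row has as many leading True values as the corresponding sequence has tokens, and every row is padded out to the longest sequence in the batch.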

Added a corporate sponsor to the library: https://wellsaidlabs.com/

Minor Updates

Fixed snli example (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Updated .gitignore to support Python's virtual environments (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Removed requests and pandas dependency. There are only two dependencies remaining. This is useful for production environments. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Added LazyLoader to reduce dependency requirements. (https://github.com/PetrochukM/PyTorch-NLP/commit/4e84780a8a741d6a90f2752edc4502ab2cf89ecb)
Removed unused torchnlp.datasets.Dataset class in favor of basic Python dictionary lists and pandas. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Support for downloading tar.gz files and unpacking them faster. (https://github.com/PetrochukM/PyTorch-NLP/commit/eb61fee854576c8a57fd9a20ee03b6fcb89c493a)
Renamed itos and stoi to index_to_token and token_to_index respectively. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Fixed batch_encode, batch_decode, and enforce_reversible for torchnlp.encoders.text (https://github.com/PetrochukM/PyTorch-NLP/pull/69)
Fixed FastText vector downloads (https://github.com/PetrochukM/PyTorch-NLP/pull/72)
Fixed documentation for LockedDropout (https://github.com/PetrochukM/PyTorch-NLP/pull/73)
Fixed bug in weight_drop (https://github.com/PetrochukM/PyTorch-NLP/pull/76)
stack_and_pad_tensors now returns a named tuple for readability (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Added torchnlp.utils.split_list in favor of torchnlp.utils.resplit_datasets. This is enabled by the modularity of torchnlp.random. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Deprecated torchnlp.utils.datasets_iterator in favor of Python's itertools.chain. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Deprecated torchnlp.utils.shuffle in favor of torchnlp.random. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
Added support for encoding larger datasets after fixing this issue (https://github.com/PetrochukM/PyTorch-NLP/issues/85).
Added torchnlp.samplers.repeat_sampler following up on this issue: https://github.com/pytorch/pytorch/issues/15849
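With datasets represented as plain lists of dictionaries, iterating over several of them at once needs nothing beyond itertools.chain, which is what the datasets_iterator deprecation above points to. The dataset names and row keys in this sketch are made up for illustration.

```python
import itertools

# Hypothetical datasets: each one is a plain list of row dictionaries.
train = [{'text': 'hello', 'label': 0}, {'text': 'world', 'label': 1}]
dev = [{'text': 'goodbye', 'label': 0}]

# chain yields the rows of each dataset in order without copying them into a new list.
texts = [row['text'] for row in itertools.chain(train, dev)]
print(texts)
```

This is handy when, for example, a text encoder needs to be built over the vocabulary of every split.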

0.4.0

5 years ago

Major updates

Rewrote encoders to better support more generic encoders like a LabelEncoder. Furthermore, added broad support for batch_encode, batch_decode and enforce_reversible.
Rearchitected default reserved tokens to ensure configurability while still providing the convenience of good defaults.
Added support to collate sequences with torch.utils.data.dataloader.DataLoader. For example:

from functools import partial
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

collate_fn = partial(collate_tensors, stack_tensors=stack_and_pad_tensors)
torch.utils.data.dataloader.DataLoader(*args, collate_fn=collate_fn, **kwargs)
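For readers unfamiliar with what collating variable-length sequences involves, the following is a dependency-free sketch of the pad-then-stack step. It works on lists instead of tensors, and pad_and_stack, PaddedBatch and padding_index are hypothetical names chosen for this illustration, not the library's API.

```python
import collections

# Hypothetical result type; the field names are chosen for this sketch only.
PaddedBatch = collections.namedtuple('PaddedBatch', ['sequences', 'lengths'])


def pad_and_stack(batch, padding_index=0):
    """Pad every sequence to the longest one and keep the original lengths."""
    lengths = [len(sequence) for sequence in batch]
    max_length = max(lengths, default=0)
    padded = [
        list(sequence) + [padding_index] * (max_length - len(sequence))
        for sequence in batch
    ]
    return PaddedBatch(sequences=padded, lengths=lengths)


result = pad_and_stack([[5, 6, 7], [8], [9, 10]])
print(result.sequences)
print(result.lengths)
```

Keeping the original lengths alongside the padded rows lets downstream code ignore the padding, for instance when packing sequences for a recurrent layer.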

Added doctest support ensuring the documented examples are tested.
Removed SRU support; it is too heavy a module to maintain. Please use https://github.com/taolei87/sru instead. Happy to accept a PR with a better tested and documented SRU module!
Updated version requirements to support Python 3.6 and 3.7, dropping support for Python 3.5.
Updated version requirements to support PyTorch 1.0+.
Merged https://github.com/PetrochukM/PyTorch-NLP/pull/66 reducing the memory requirements for pre-trained word vectors by 2x.

Minor Updates

Formatted the code base with YAPF.
Fixed pandas and collections warnings.
Added invariant assertion to Encoder via enforce_reversible. For example:
```
encoder = Encoder().enforce_reversible()
```
This ensures that Encoder.decode(Encoder.encode(object)) == object.
Fixed the accuracy metric for PyTorch 1.0.
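The reversibility invariant behind enforce_reversible can be sketched with a toy label encoder. ReversibleLabelEncoder below is hypothetical and only mimics the idea described above: once the check is switched on, encoding a value that cannot be decoded back raises an error instead of silently losing information.

```python
class ReversibleLabelEncoder:
    """Hypothetical minimal encoder mapping labels to integer indices and back."""

    def __init__(self, labels):
        self.index_to_token = sorted(set(labels))
        self.token_to_index = {token: index for index, token in enumerate(self.index_to_token)}
        self.check_reversible = False

    def enforce_reversible(self):
        self.check_reversible = True
        return self  # Returning self allows chaining after the constructor.

    def encode(self, label):
        index = self.token_to_index.get(label, -1)  # -1 stands for an unknown label.
        if self.check_reversible and self.decode(index) != label:
            raise ValueError('Encoding is not reversible for %r' % (label,))
        return index

    def decode(self, index):
        if 0 <= index < len(self.index_to_token):
            return self.index_to_token[index]
        return '<unk>'


encoder = ReversibleLabelEncoder(['pos', 'neg']).enforce_reversible()
round_trip = encoder.decode(encoder.encode('pos'))

try:
    encoder.encode('neutral')  # Not in the vocabulary, so it cannot round-trip.
    raised = False
except ValueError:
    raised = True
```

Without the check, 'neutral' would quietly encode to the unknown index and decode to '<unk>'.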

0.3.7.post1

5 years ago

Minor release fixing some issues and bugs.

0.3.0

6 years ago

Release 0.3.0

Major Features And Improvements

Upgraded to PyTorch 0.4.0
Added Byte-Pair Encoding (BPE) pre-trained subword embeddings in 275 languages
Refactored download scripts to torchnlp.downloads
Enabled the spaCy encoder to run in multiple languages.
Added a boolean aligned option to FastText supporting MUSE (Multilingual Unsupervised and Supervised Embeddings)

Bug Fixes and Other Changes

Created non-existent cache dirs for torchnlp.word_to_vector.
Added a set operation to torchnlp.datasets.Dataset with support for slices, columns and rows
Updated biggest_batches_first in torchnlp.samplers to be more efficient at approximating memory than pickle
Enabled torchnlp.utils.pad_tensor and torchnlp.utils.pad_batch to support N-dimensional tensors
Updated to sacremoses to fix the NLTK moses dependency for torchnlp.text_encoders
Added __getitem__ for _PretrainedWordVectors. For example:

from torchnlp.word_to_vector import FastText
vectors = FastText()
tokenized_sentence = ['this', 'is', 'a', 'sentence']
vectors[tokenized_sentence]

Added __contains__ for _PretrainedWordVectors. For example:

>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()

>>> 'the' in vectors
True
>>> 'theqwe' in vectors
False
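The two lookups above rely on Python's container protocol. A minimal sketch of how a vector table might support both is shown below; TinyWordVectors is a hypothetical stand-in, and returning a zero vector for unknown tokens is an assumption made for the example, not a statement about _PretrainedWordVectors.

```python
class TinyWordVectors:
    """Hypothetical stand-in for a pre-trained vector table; not the library's class."""

    def __init__(self, table, dimensions):
        self.table = table
        self.unknown = [0.0] * dimensions  # Assumed fallback for out-of-vocabulary tokens.

    def __contains__(self, token):
        # Enables the `token in vectors` syntax.
        return token in self.table

    def __getitem__(self, tokens):
        # A single token returns one vector; a list of tokens returns a list of vectors.
        if isinstance(tokens, str):
            return self.table.get(tokens, self.unknown)
        return [self[token] for token in tokens]


vectors = TinyWordVectors({'this': [0.1, 0.2], 'is': [0.3, 0.4]}, dimensions=2)
single = vectors['this']
batch = vectors[['this', 'is', 'theqwe']]
known = 'this' in vectors
unknown = 'theqwe' in vectors
```

Checking membership first is useful when unknown tokens should be handled differently from known ones, since indexing alone does not reveal whether a fallback vector was returned.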

0.2.0

6 years ago