PyTorch NLP Versions

Basic Utilities for PyTorch Natural Language Processing (NLP)

0.5.0

4 years ago

Major Updates

  • Updated my README emoji game to be more ambiguous while maintaining fun and heartwarming vibe. 🐕
  • Support for Python 3.5
  • Extensive rewrite of README to focus on new users and building an NLP pipeline.
  • Support for PyTorch 1.2
  • Added torchnlp.random for finer-grained control of random state, building on PyTorch's fork_rng. This module controls the random state of torch, numpy and random.
import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
  • Refactored torchnlp.samplers enabling pipelining. For example:
from torchnlp.samplers import DeterministicSampler
from torchnlp.samplers import BalancedSampler

data = ['a', 'b', 'c'] + ['c'] * 100
sampler = BalancedSampler(data, num_samples=3)
sampler = DeterministicSampler(sampler, random_seed=12)
print([data[i] for i in sampler])  # ['c', 'b', 'a']
  • Added torchnlp.samplers.balanced_sampler for balanced sampling, extending PyTorch's WeightedRandomSampler.
  • Added torchnlp.samplers.deterministic_sampler for deterministic sampling based on torchnlp.random.
  • Added torchnlp.samplers.distributed_batch_sampler for distributed batch sampling.
  • Added torchnlp.samplers.oom_batch_sampler to sample large batches first in order to force an out-of-memory error.
  • Added torchnlp.utils.lengths_to_mask to help create masks from a batch of sequences.
  • Added torchnlp.utils.get_total_parameters to measure the number of parameters in a model.
  • Added torchnlp.utils.get_tensors to measure the size of an object in number of tensor elements. This is useful for dynamic batch sizing and for torchnlp.samplers.oom_batch_sampler.
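import torch

from torchnlp.utils import get_tensors

# A nested object holding two tensors: one inside a dict and one directly in the tuple.
random_object_ = tuple([{'t': torch.tensor([1, 2])}, torch.tensor([2, 3])])
tensors = get_tensors(random_object_)
assert len(tensors) == 2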
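
To illustrate what a length mask looks like, here is a minimal sketch in plain Python of the idea behind torchnlp.utils.lengths_to_mask. It is not the library's implementation: the helper name lengths_to_mask_sketch is hypothetical, and it returns nested lists of booleans instead of a tensor so that it runs without any dependencies.

```python
def lengths_to_mask_sketch(lengths):
    """Return one row per sequence; True marks a real token, False marks padding.

    Hypothetical helper for illustration only, not part of torchnlp.
    """
    max_length = max(lengths, default=0)
    return [[position < length for position in range(max_length)] for length in lengths]


# Three sequences of lengths 1, 2 and 3, padded to the longest length (3).
mask = lengths_to_mask_sketch([1, 2, 3])
for row in mask:
    print(row)
```

Each row has as many leading True values as the sequence has tokens, so the mask can be used to ignore padded positions in a batch.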

Minor Updates

0.4.0

5 years ago

Major Updates

  • Rewrote encoders to better support more generic encoders like a LabelEncoder. Furthermore, added broad support for batch_encode, batch_decode and enforce_reversible.
  • Rearchitected default reserved tokens to ensure configurability while still providing the convenience of good defaults.
  • Added support to collate sequences with torch.utils.data.dataloader.DataLoader. For example:
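from functools import partial

import torch

from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

# Pad and stack variable-length sequences whenever the DataLoader collates a batch.
collate_fn = partial(collate_tensors, stack_tensors=stack_and_pad_tensors)
# *args and **kwargs stand in for your dataset and any other DataLoader options.
torch.utils.data.dataloader.DataLoader(*args, collate_fn=collate_fn, **kwargs)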
  • Added doctest support ensuring the documented examples are tested.
  • Removed SRU support because the module is too heavy to maintain. Please use https://github.com/taolei87/sru instead. Happy to accept a PR with a better tested and documented SRU module!
  • Updated version requirements to support Python 3.6 and 3.7, dropping support for Python 3.5.
  • Updated version requirements to support PyTorch 1.0+.
  • Merged https://github.com/PetrochukM/PyTorch-NLP/pull/66 reducing the memory requirements for pre-trained word vectors by 2x.
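
For readers new to collating sequences, the sketch below shows in plain Python what padding and stacking a batch of variable-length sequences amounts to. It only illustrates the general idea: the function name stack_and_pad_sketch and the padding_index parameter are assumptions made for this example, and it works on lists, not tensors.

```python
def stack_and_pad_sketch(sequences, padding_index=0):
    """Pad every sequence to the longest length and return (batch, lengths).

    Hypothetical helper for illustration only, not part of torchnlp.
    """
    lengths = [len(sequence) for sequence in sequences]
    max_length = max(lengths, default=0)
    batch = [
        list(sequence) + [padding_index] * (max_length - len(sequence))
        for sequence in sequences
    ]
    return batch, lengths


# Three token-id sequences of different lengths become one rectangular batch.
batch, lengths = stack_and_pad_sketch([[5, 6, 7], [8], [9, 10]])
print(batch)
print(lengths)
```

Returning the original lengths alongside the padded batch lets later steps tell real tokens from padding.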

Minor Updates

  • Formatted the code base with YAPF.
  • Fixed pandas and collections warnings.
  • Added an invariant assertion to Encoder via enforce_reversible. For example:
    encoder = Encoder().enforce_reversible()
    
    This ensures that Encoder.decode(Encoder.encode(object)) == object.
  • Fixed the accuracy metric for PyTorch 1.0.
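
The reversibility invariant mentioned above can be sketched with a toy encoder. This is an assumption-laden illustration, not torchnlp's Encoder: the class LowercaseEncoderSketch is hypothetical and deliberately lossy (it lowercases its input) so that the check has something to catch.

```python
class LowercaseEncoderSketch:
    """Hypothetical lossy encoder: lowercases text, then maps characters to code points."""

    def __init__(self, enforce_reversible=False):
        self.enforce_reversible = enforce_reversible

    def encode(self, text):
        encoded = [ord(character) for character in text.lower()]
        # The invariant: decoding the encoded value must give back the original input.
        if self.enforce_reversible and self.decode(encoded) != text:
            raise ValueError('Encoding is not reversible for %r' % text)
        return encoded

    def decode(self, encoded):
        return ''.join(chr(code) for code in encoded)


encoder = LowercaseEncoderSketch(enforce_reversible=True)
encoded = encoder.encode('hello')  # Already lowercase, so the round trip succeeds.

try:
    encoder.encode('Hello')  # The capital letter is lost, so the check raises.
except ValueError as error:
    print(error)
```

Failing loudly at encode time surfaces lossy preprocessing early instead of letting it silently change the data.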

0.3.7.post1

5 years ago

Minor release fixing some issues and bugs.

0.3.0

6 years ago

Release 0.3.0

Major Features And Improvements

  • Upgraded to PyTorch 0.4.0
  • Added Byte-Pair Encoding (BPE) pre-trained subword embeddings in 275 languages
  • Refactored download scripts to torchnlp.downloads
  • Enabled the spaCy encoder to run in multiple languages.
  • Added a boolean aligned option to FastText supporting MUSE (Multilingual Unsupervised and Supervised Embeddings)

Bug Fixes and Other Changes

  • Created non-existent cache directories for torchnlp.word_to_vector.
  • Added a set operation to torchnlp.datasets.Dataset with support for slices, columns and rows.
  • Updated biggest_batches_first in torchnlp.samplers to be more efficient at approximating memory than Pickle.
  • Enabled torchnlp.utils.pad_tensor and torchnlp.utils.pad_batch to support N-dimensional tensors.
  • Updated to sacremoses to fix the NLTK moses dependency for torchnlp.text_encoders.
  • Added __getitem__ for _PretrainedWordVectors. For example:
from torchnlp.word_to_vector import FastText
vectors = FastText()
tokenized_sentence = ['this', 'is', 'a', 'sentence']
vectors[tokenized_sentence]
  • Added __contains__ for _PretrainedWordVectors. For example:
>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()

>>> 'the' in vectors
True
>>> 'theqwe' in vectors
False
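
Returning to the biggest_batches_first change listed above, the general idea of yielding the largest batches first can be sketched in plain Python. This is not the library's code: the function name, the get_size parameter, and the use of len as a stand-in for memory use are all assumptions made for illustration.

```python
def biggest_batches_first_sketch(batches, get_size=len, num_batches=1):
    """Move the `num_batches` largest batches to the front; keep the rest in order.

    Hypothetical helper for illustration only, not part of torchnlp.
    """
    # Indices of all batches, ordered from largest to smallest by `get_size`.
    by_size = sorted(range(len(batches)), key=lambda i: get_size(batches[i]), reverse=True)
    largest = by_size[:num_batches]
    largest_set = set(largest)
    rest = [batch for i, batch in enumerate(batches) if i not in largest_set]
    return [batches[i] for i in largest] + rest


batches = [[0, 1], [2, 3, 4, 5], [6], [7, 8, 9]]
ordered = biggest_batches_first_sketch(batches, num_batches=2)
print(ordered)
```

Running the largest batches at the start of training means that a batch too large for memory fails within the first few steps instead of hours in.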

0.2.0

6 years ago