EasyNMT Versions

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

v2.0.0

2 years ago

mbart50 & m2m models now use huggingface transformers

In version 1, the mbart50 & m2m models required the fairseq library. This caused several issues: fairseq cannot be used on Windows, multi-processing did not work with fairseq models, and loading and using the models was quite complicated.

With this release, the fairseq dependency is removed and the mbart50 / m2m models are loaded with Hugging Face transformers version >= 4.4.0.

From a user perspective, no changes should be visible. But from a developer perspective, this simplifies the architecture of EasyNMT and allows new features to be integrated more easily.

Saving models

Models can now be saved to disk by calling:

model.save(output_path)

Models can be loaded from disk by calling:

model = EasyNMT(output_path)

Loading models from the Hugging Face model hub

Loading any Hugging Face translation model is now simple: just pass the model name or path, as in the following code:

from easynmt import EasyNMT, models
article = """EasyNMT is an open source library for state-of-the-art neural machine translation. Installation is simple using
pip or pre-built docker images. EasyNMT provides access to various neural machine translation models. It can translate 
sentences and documents of any length. Further, it includes code to automatically detect the language of a text."""

model = EasyNMT(translator=models.AutoModel('facebook/mbart-large-en-ro')) 
print(model.translate(article, source_lang='en_XX', target_lang='ro_RO'))

This loads the facebook/mbart-large-en-ro model from the model hub.

Note: Models might use different language codes, e.g. the mbart model uses 'en_XX' instead of 'en' and 'ro_RO' instead of 'ro'. To keep the language codes consistent, you can pass a lang_map:

from easynmt import EasyNMT, models

article = """EasyNMT is an open source library for state-of-the-art neural machine translation. Installation is simple using
pip or pre-built docker images. EasyNMT provides access to various neural machine translation models. It can translate 
sentences and documents of any length. Further, it includes code to automatically detect the language of a text."""

output_path = 'output/mbart-large-en-ro'
model = EasyNMT(translator=models.AutoModel('facebook/mbart-large-en-ro', lang_map={'en': 'en_XX', 'ro': 'ro_RO'}))

# Save the model to disk
model.save(output_path)

# Load the model from disk
model = EasyNMT(output_path)
print(model.translate(article, target_lang='ro'))

v1.1.0

3 years ago

This release brings several improvements and is the first step towards the release of a Docker Image + REST API.

Improvements:

  • Docker REST API: We have published Docker images for a REST API that makes EasyNMT easy to use. Just run the Docker image and start translating via REST API calls: more info
  • Google Colab REST API Hosting: We have published a Colab notebook that shows how to wrap EasyNMT in a REST API and host it on Google Colab with a free GPU. Useful if you need to translate large amounts of text.
  • Long sentences are translated first: Sentences are now sorted by length before translation in order to waste minimal time on padding tokens. In the previous version, the shortest sentences were translated first and the longer ones later; now the order is reversed. This has several advantages: if an OOM error occurs, it happens at the start of the translation process rather than at the end, and the progress bar estimate is more accurate because the longest and slowest sentences are translated first.
  • Improved language detection: Automatic language detection is still an issue, especially for mixed-language texts. Language detection is now performed at the document level instead of the sentence level. If you need sentence-level language detection, you can set document_language_detection=False for the translate method. Also, text is now lower-cased before the language is detected (the language detection scripts had issues with all upper-case text).
  • Max length parameter: When you create your model like this: model = EasyNMT(model_name, max_length=100), all sentences longer than 100 word pieces will be truncated to at most 100 word pieces. This can prevent OOM errors caused by overly long sentences.
  • Load model without translator: If you just want to use the language detection methods, you can now load your model like model = EasyNMT(model_name, load_translator=False). This prevents the translation engine from being loaded.
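The length-sorting trick above can be sketched in plain Python. This is an illustrative standalone helper, not EasyNMT's actual internals: sort sentences by descending length, translate in that order, then scatter the results back into the original order.

```python
def translate_sorted(sentences, translate_batch):
    """Translate the longest sentences first, then restore input order.

    translate_batch: any callable that translates a list of sentences.
    Translating in descending-length order means an OOM surfaces at the
    start of the run, and batches of similar-length sentences waste
    fewer padding tokens.
    """
    # Indices of the sentences, longest first
    order = sorted(range(len(sentences)),
                   key=lambda i: len(sentences[i]), reverse=True)
    translated = translate_batch([sentences[i] for i in order])
    # Scatter the translations back to the original positions
    result = [None] * len(sentences)
    for pos, out in zip(order, translated):
        result[pos] = out
    return result
```

The progress-bar benefit follows from the same ordering: the slowest items are processed first, so the remaining-time estimate only improves as translation proceeds.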

Roadmap

  • As soon as Hugging Face transformers v4.4.0 is released, the dependency on fairseq can be removed, as the mBART50 and m2m models will be available in HF transformers. This will make installation on Windows machines possible.

v1.0.2

3 years ago

fastText is used for automatic language detection, as it provides the highest speed and best accuracy.

However, it can be complicated to install on Windows, as it requires a C/C++ compiler.

This release adds two alternative language identifiers:

If fastText is not available, langid / langdetect will be used as alternative language detection methods.
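The fallback logic can be sketched as a simple priority chain (a hypothetical standalone helper for illustration, not EasyNMT's actual code): try each detector in order of preference and fall back when one is unavailable or fails. The text is lower-cased first, since upper-case input hurts detection accuracy (see v1.0.1).

```python
def detect_language(text, detectors):
    """Run language detectors in priority order.

    detectors: list of (name, callable) pairs, e.g.
    [("fasttext", ft_detect), ("langid", langid_detect), ("langdetect", ld_detect)],
    where each callable takes a string and returns a language code.
    The first detector that succeeds wins.
    """
    text = text.lower()  # upper-case text degrades detection accuracy
    errors = []
    for name, detect in detectors:
        try:
            return detect(text)
        except Exception as exc:  # detector not installed or failed on this input
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All language detectors failed: " + "; ".join(errors))
```

With this structure, a missing fastText installation on Windows simply means the chain falls through to langid or langdetect instead of crashing.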

For installation on Windows, you can run the following commands:

pip install --no-deps easynmt
pip install tqdm transformers numpy nltk sentencepiece langid 

Further, you have to install pytorch as described here: https://pytorch.org/get-started/locally/

If you want to install fastText on Windows, I can recommend this link: https://anaconda.org/conda-forge/fasttext

v1.0.1

3 years ago

fastText language detection did not work well if the text was in UPPERCASE.

Adding lower() to the string before the language identification step significantly improved the performance.

v1.0.0

3 years ago

First release of EasyNMT - easy-to-use, state-of-the-art machine translation using the transformer architecture.