Minimalist NMT for educational purposes
:koala: The Joey NMT framework is developed for educational purposes. It aims to be a clean and minimalistic code base that helps novices quickly find answers to their questions about how NMT is implemented.
In contrast to other NMT frameworks, we do not aim for the most recent features or for speed gained through engineering or training tricks, since this often goes hand in hand with an increase in code complexity and a decrease in readability. :eyes:
However, Joey NMT re-implements baselines from major publications.
Check out the detailed documentation :books: and our paper. :newspaper:
Joey NMT was initially developed and is maintained by Jasmijn Bastings (University of Amsterdam) and Julia Kreutzer (Heidelberg University), now both at Google Research. Mayumi Ohta at Fraunhofer Institute is continuing the legacy.
Welcome to our new contributors :hearts:, please don't hesitate to open a PR or an issue if there's something that needs improvement!
Joey NMT implements the essential features of NMT (aka the minimalist toolkit of NMT :wrench:).
Joey NMT is built on PyTorch. Please make sure you have a compatible environment; Joey NMT v2.3 was tested with PyTorch 2.1.2 and CUDA 12.1 (see the install example below).
:warning: Warning When running on GPU you need to manually install the suitable PyTorch version for your CUDA version. For example, you can install PyTorch 2.1.2 with CUDA v12.1 as follows:
```bash
python -m pip install --upgrade torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
```
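After installing, you can sanity-check the setup, e.g.:

```bash
# check the installed PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```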
You can install Joey NMT either A. via pip or B. from source.
```bash
python -m pip install joeynmt
```
```bash
git clone https://github.com/joeynmt/joeynmt.git  # Clone this repository
cd joeynmt
python -m pip install -e .  # Install Joey NMT and its requirements
python -m unittest          # Run the unit tests
```
:memo: Info For Windows users, we recommend checking that the text files (e.g. `test/data/toy/*`) are encoded in utf-8.
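If a file turns out to have a different encoding, one way to convert it is from Python; the file name and source encoding below are placeholders, so adjust them to your case:

```bash
# re-encode a file to utf-8 (here assuming it was written as cp1252)
python -c "import pathlib; p = pathlib.Path('test/data/toy/some_file.txt'); p.write_text(p.read_text(encoding='cp1252'), encoding='utf-8')"
```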
Notable changes in Joey NMT 2.0:
- `torchtext.legacy` dependencies are completely replaced by `torch.utils.data`.
- `joeynmt/tokenizers.py`: handles tokenization internally (also supports bpe-dropout!).
- `joeynmt/datasets.py`: loads data from plaintext, tsv, and huggingface's datasets.
- `scripts/build_vocab.py`: trains subwords, creates joint vocab.

:warning: Warning The models trained with Joey NMT v1.x can be decoded with Joey NMT v2.0, but there is no guarantee that you can reproduce the same scores as before.
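For illustration, a `data` section wiring the data-loading and tokenization pieces above together might look roughly like this (a sketch only; the exact key names and values should be verified against the example configs and the documentation):

```yaml
data:
    train: "test/data/toy/train"       # plaintext path prefix (or tsv / huggingface dataset)
    dev: "test/data/toy/dev"
    test: "test/data/toy/test"
    dataset_type: "plain"              # "plain", "tsv", or "huggingface"
    src:
        lang: "de"
        level: "bpe"
        tokenizer_type: "subword-nmt"  # or "sentencepiece"
        tokenizer_cfg:
            codes: "test/data/toy/bpe.codes"
            dropout: 0.1               # bpe-dropout, applied during training
    trg:
        lang: "en"
        level: "bpe"
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            codes: "test/data/toy/bpe.codes"
```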
We also updated the documentation thoroughly for Joey NMT 2.0!
For details, follow the tutorials in the `notebooks` directory.
:warning: Warning For Joey NMT v1.x, please refer to the archive here.
Joey NMT has three modes: `train`, `test`, and `translate`, and all of them take a YAML-style config file as an argument. You can find examples in the `configs` directory, where `transformer_small.yaml` contains a detailed explanation of configuration options. Most importantly, the configuration contains the description of the model architecture (e.g. number of hidden units in the encoder RNN), paths to the training, development and test data, and the training hyperparameters (learning rate, validation frequency, etc.).
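As a rough illustration, the overall shape of such a config is sketched below; the values are placeholders, and `configs/transformer_small.yaml` remains the authoritative reference for the available options:

```yaml
name: "my_experiment"           # experiment name (placeholder)
data:
    train: "path/to/train"      # training data
    dev: "path/to/dev"          # development data used for validation
    test: "path/to/test"        # test data
training:
    learning_rate: 0.001        # training hyperparameters
    validation_freq: 1000       # validate every N updates
    model_dir: "models/my_experiment"
model:
    encoder:
        type: "transformer"     # model architecture
        num_layers: 6
        hidden_size: 512
    decoder:
        type: "transformer"
        num_layers: 6
        hidden_size: 512
```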
:memo: Info Note that subword model training and joint vocabulary creation are not included in the three modes above; they have to be done separately. We provide a script that takes care of it: `scripts/build_vocab.py`.

```bash
python scripts/build_vocab.py configs/transformer_small.yaml --joint
```
`train` mode

For training, run

```bash
python -m joeynmt train configs/transformer_small.yaml
```
This will train a model on the training data, validate on the validation data, and store model parameters, vocabularies, and validation outputs. All needed information should be specified in the `data`, `training` and `model` sections of the config file (here `configs/transformer_small.yaml`).
```
model_dir/
├── *.ckpt          # checkpoints
├── *.hyps          # translated texts at validation
├── config.yaml     # config file
├── spm.model       # sentencepiece model / subword-nmt codes file
├── src_vocab.txt   # src vocab
├── trg_vocab.txt   # trg vocab
├── train.log       # train log
└── validation.txt  # validation scores
```
:bulb: Tip To avoid overwriting `model_dir` accidentally, set `overwrite: False` in the config file.
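In config terms, this corresponds to something like the following in the `training` section (illustrative values):

```yaml
training:
    model_dir: "models/transformer_small"  # where checkpoints, vocabs and logs are written
    overwrite: False                       # refuse to overwrite an existing model_dir
```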
`test` mode

This mode will generate translations for the validation and test sets (as specified in the configuration) in `model_dir/out.[dev|test]`.
```bash
python -m joeynmt test configs/transformer_small.yaml
```
You can specify the checkpoint path explicitly in the config file. If `load_model` is not given in the config, the best model in `model_dir` will be used to generate translations. You can also specify, e.g., sacrebleu options in the `test` section of the config file.
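For example, such a `testing` section might look roughly like this (a sketch: `load_model` and the sacrebleu options are described above, while the remaining key names are assumptions to be checked against the example configs):

```yaml
testing:
    load_model: "model_dir/best.ckpt"  # optional; omit to use the best checkpoint in model_dir
    beam_size: 5                       # beam search width
    eval_metrics: ["bleu"]             # evaluation metric(s)
    sacrebleu_cfg:
        tokenize: "13a"                # sacrebleu tokenizer option
```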
:bulb: Tip `scripts/average_checkpoints.py` will generate averaged checkpoints for you.

```bash
python scripts/average_checkpoints.py --inputs model_dir/*00.ckpt --output model_dir/avg.ckpt
```
If you want to output the log-probabilities of the hypotheses or references, you can specify `return_score: 'hyp'` or `return_score: 'ref'` in the testing section of the config, and run `test` with the `--output-path` and `--save-scores` options.
```bash
python -m joeynmt test configs/transformer_small.yaml --output-path model_dir/pred --save-scores
```

This will generate `model_dir/pred.{dev|test}.{scores|tokens}`, which contain the scores and the corresponding tokens.
:memo: Info
- If you set `return_score: 'hyp'` with greedy decoding, token-wise scores will be returned. Beam search will return sequence-level scores, because the scores are summed up per sequence during beam exploration.
- If you set `return_score: 'ref'`, the model looks up the probabilities of the given ground-truth tokens, and both decoding and evaluation will be skipped.
- If you specify `n_best` > 1 in the config, the first translation in the n-best list will be used in the evaluation.
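Putting the scoring options together, the relevant part of the `testing` section could look like this (an illustrative sketch, following the behavior described in the note above):

```yaml
testing:
    return_score: "hyp"  # or "ref" to score the given references (skips decoding and evaluation)
    n_best: 1            # with n_best > 1, only the first hypothesis enters the evaluation
    beam_size: 1         # greedy decoding returns token-wise scores; beam search returns sequence-level scores
```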
`translate` mode

This mode accepts inputs from stdin and generates translations.
File translation
```bash
python -m joeynmt translate configs/transformer_small.yaml < my_input.txt > output.txt
```
Interactive translation
```bash
python -m joeynmt translate configs/transformer_small.yaml
```
You'll be prompted to type an input sentence. Joey NMT will then translate with the model specified in the config file.
:bulb: Tip Interactive `translate` mode doesn't work with multi-GPU setups. Please run it on a single GPU or on CPU.
We trained this multilingual model with Joey NMT v2.3.0 using DDP.
| Direction | Architecture | tok | dev | test | #params | download |
| --------- | ------------ | --- | --- | ---- | ------- | -------- |
| en->de | Transformer | sentencepiece | - | 28.88 | 200M | iwslt14_prompt |
| de->en | | | - | 35.28 | | |
| en->fr | | | - | 38.86 | | |
| fr->en | | | - | 40.35 | | |
sacrebleu signature: nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.4.0
We trained the following models with Joey NMT v2.1.0 from scratch.
cf. the wmt14 de-en leaderboard on paperswithcode
| Direction | Architecture | tok | dev | test | #params | download |
| --------- | ------------ | --- | --- | ---- | ------- | -------- |
| en->de | Transformer | sentencepiece | 24.36 | 24.38 | 60.5M | wmt14_ende.tar.gz (766M) |
| de->en | Transformer | sentencepiece | 30.60 | 30.51 | 60.5M | wmt14_deen.tar.gz (766M) |
sacrebleu signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.2.0
:warning: Warning The following models were trained with Joey NMT v1.x and decoded with Joey NMT v2.0. See `config_v1.yaml` and `config_v2.yaml` in the linked zip, respectively. Joey NMT v1.x benchmarks are archived here.
Pre-processing with Moses decoder tools as in this script.
| Direction | Architecture | tok | dev | test | #params | download |
| --------- | ------------ | --- | --- | ---- | ------- | -------- |
| de->en | RNN | subword-nmt | 31.77 | 30.74 | 61M | rnn_iwslt14_deen_bpe.tar.gz (672MB) |
| de->en | Transformer | subword-nmt | 34.53 | 33.73 | 19M | transformer_iwslt14_deen_bpe.tar.gz (221MB) |
sacrebleu signature: nrefs:1|case:lc|eff:no|tok:13a|smooth:exp|version:2.0.0
:memo: Info For interactive translate mode, you should specify `pretokenizer: "moses"` in both the src's and trg's `tokenizer_cfg`, so that you can input raw sentences. `MosesTokenizer` and `MosesDetokenizer` will then be applied internally. For test mode, we used the preprocessed texts as input and set `pretokenizer: "none"` in the config.
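Assuming the tokenizer settings live under the `src` and `trg` entries of the `data` section (as in the example configs), that would look like:

```yaml
data:
    src:
        tokenizer_cfg:
            pretokenizer: "moses"  # raw input is pre-tokenized with MosesTokenizer
    trg:
        tokenizer_cfg:
            pretokenizer: "moses"  # output is detokenized with MosesDetokenizer
```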
We picked the pretrained models and configs (bpe codes file etc.) from masakhane.io.
| Direction | Architecture | tok | dev | test | #params | download |
| --------- | ------------ | --- | --- | ---- | ------- | -------- |
| af->en | Transformer | subword-nmt | - | 57.70 | 46M | transformer_jw300_afen.tar.gz (525MB) |
| en->af | Transformer | subword-nmt | 47.24 | 47.31 | 24M | transformer_jw300_enaf.tar.gz (285MB) |
sacrebleu signature: nrefs:1|case:mixed|eff:no|tok:intl|smooth:exp|version:2.0.0
For training, we split JParaCrawl v2 into train and dev sets and trained a model on them. Please check the preprocessing script here. We then tested on the KFTT test set and the WMT20 test set, respectively.
| Direction | Architecture | tok | wmt20 | kftt | #params | download |
| --------- | ------------ | --- | ----- | ---- | ------- | -------- |
| en->ja | Transformer | sentencepiece | 17.66 | 14.31 | 225M | jparacrawl_enja.tar.gz (2.3GB) |
| ja->en | Transformer | sentencepiece | 14.97 | 11.49 | 221M | jparacrawl_jaen.tar.gz (2.2GB) |
sacrebleu signature:
- en->ja: `nrefs:1|case:mixed|eff:no|tok:ja-mecab-0.996-IPA|smooth:exp|version:2.0.0`
- ja->en: `nrefs:1|case:mixed|eff:no|tok:intl|smooth:exp|version:2.0.0`
Note: In the wmt20 test set, `newstest2020-enja` has 1000 examples and `newstest2020-jaen` has 993 examples.
In order to keep the code clean and readable, we make use of:
- unit tests, defined in `test/unit/`
- documentation sources in `docs/source/`, which should be updated accordingly when the code changes

To ensure the repository stays clean, unit tests and linters are triggered by GitHub's workflow on every push or pull request to the `main` branch. Before you create a pull request, you can check the validity of your modifications with the following commands:
```bash
make test
make check
make -C docs clean html
```
Since this codebase is supposed to stay clean and minimalistic, contributions that preserve these qualities are welcome.
Code extending the functionalities beyond the basics will most likely not end up in the main branch, but we're curious to learn what you used Joey NMT for.
Here we'll collect projects and repositories that are based on Joey NMT, so you can find inspiration and examples on how to modify and extend the code.
If you used Joey NMT for a project, publication or built some code on top of it, let us know and we'll link it here.
Please open an issue if you have questions or problems with the code.
For general questions, email us at `joeynmt <at> gmail.com`. :love_letter:
If you use Joey NMT in a publication or thesis, please cite the following paper:
```bibtex
@inproceedings{kreutzer-etal-2019-joey,
    title = "Joey {NMT}: A Minimalist {NMT} Toolkit for Novices",
    author = "Kreutzer, Julia and
      Bastings, Jasmijn and
      Riezler, Stefan",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-3019",
    doi = "10.18653/v1/D19-3019",
    pages = "109--114",
}
```
Joeys are infant marsupials. :koala: