Helsinki Neural Machine Translation system
HNMT is a neural network-based machine translation system developed at the University of Helsinki and (now) Stockholm University.
HNMT is the best system for English-to-Finnish translation according to both the manual and automatic evaluations done for the official WMT 2017 results. Our system description paper (camera-ready copy, due to be published in the proceedings of WMT 2017, September 2017) describes the design and implementation of the system in detail, and contains evaluations of different features.
There have been a number of changes to the interface, due to a rewrite of the data loading code so that not all training data is loaded into RAM. This reduces memory consumption considerably.
- Training data is now given as a single file in which each line contains a source sentence and its target sentence separated by |||; this file is passed with the --train argument. The --source and --target arguments should not be used.
- Held-out data for monitoring during training is given with --heldout-source and --heldout-target (as opposed to the training data, these must be contained in two separate files for the source and target language).
- Vocabularies must now be created before training; the new script make_encoder.py does this. One should be created for each of the source and target texts, and loaded with --load-source-vocabulary and --load-target-vocabulary respectively.
- The semantics of --batch-budget have changed a bit, but the acceptable values should be roughly the same as before; they depend on model size and GPU RAM but not on sentence length. --batch-size is only used during translation.
- Use --backwards yes to train a model where all the input is reversed (on the character level). Currently the output is kept reversed, but this is subject to change.

If Theano and BNAS are installed, you should be able to simply run hnmt.py. Run with the --help argument to see the available command-line options.
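For example, to check that the dependencies can be imported and to list all options (this assumes BNAS is installed as the Python module bnas):

python3 -c "import theano, bnas"
python3 hnmt.py --help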
Training a model on the Europarl corpus can be done like this:
python3 make_encoder.py --min-char-count 2 --tokenizer word \
--hybrid --vocabulary 50000 \
--output vocab.en europarl-v7.sv-en.en
python3 make_encoder.py --min-char-count 2 --tokenizer char \
--output vocab.sv europarl-v7.sv-en.sv
python3 hnmt.py --train europarl-v7.sv-en \
--source-tokenizer word \
--target-tokenizer char \
--heldout-source dev.en \
--heldout-target dev.sv \
--load-source-vocabulary vocab.en \
--load-target-vocabulary vocab.sv \
--batch-budget 32 \
--training-time 24 \
--log en-sv.log \
--save-model en-sv.model
This will create a model with a hybrid encoder (a 50k word vocabulary, with character-level encoding for the rest) and a character-based decoder, and train it for 24 hours. Development set cross-entropy and some other statistics are appended to the file given with --log (en-sv.log above), which is usually the best way of monitoring training. Training loss and development set translations are written to stdout, so redirecting this or piping it through tee is recommended.
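For instance, while a model is training you can follow the log file from another terminal:

tail -f en-sv.log

and capture stdout by appending something like | tee train-stdout.txt to the training command above (train-stdout.txt is just an arbitrary file name).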
Note that --heldout-source
and --heldout-target
are mandatory, and that
while the training data contains sentence pairs separated by ||| in the same
file, the heldout sentences (which are only used for monitoring during
training) are separated into two files.
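To illustrate the format (with made-up sentences, and assuming the source side comes first on each line), a line of the file given to --train could look like:

I do not speak Finnish. ||| Jag talar inte finska.

while dev.en and dev.sv would contain the English and Swedish sides of the held-out sentence pairs respectively, one sentence per line and without any ||| separator.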
The resulting model can be used like this:
python3 hnmt.py --load-model en-sv.model \
--translate test.en --output test.sv \
--beam-size 10
Note that when training a model from scratch, parameters can be set on the command line; otherwise the hard-coded defaults are used. When continuing training or doing translation (i.e. whenever the --load-model argument is used), the values stored in the given model file are used as defaults, although some of these (those that do not change the network structure) can still be overridden by command-line arguments.
For instance, the model above will assume that input files need to be tokenized, but passing a pre-tokenized (space-separated) input can be done as follows:
python3 hnmt.py --load-model en-sv.model \
--translate test.en --output test.sv \
--source-tokenizer space \
--beam-size 10
You can resume training by adding the --load-model
argument without using
--translate
(which disables training). For instance, if you want to keep
training the model above for another 48 hours on the same data:
python3 hnmt.py --load-model en-sv.model \
--training-time 48 \
--save-model en-sv-72h.model
Select the tokenizer with the --source-tokenizer and --target-tokenizer arguments. The options used in this document are:

- word: tokenize the input into words (combined with --hybrid in make_encoder.py, words outside the vocabulary are encoded at the character level)
- char: treat the input as a sequence of characters
- space: split on whitespace only, for input that is already tokenized
TODO: support BPE as internal segmentation (apply_bpe to training data)
During training, the *.log
file reports the following information (in order, one column per item):
The *.log.eval
file reports evaluation metrics on the heldout set (in order, one column per item):