Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
Integrate `ohnomore` into SyntaxDot.

`XlmRobertaTokenizer`.
Add support for parallelizing annotation at the batch level. So far, SyntaxDot has used PyTorch inter/intra-op parallelization; this change additionally parallelizes across batches. Annotation-level parallelization can be configured with the `annotation-threads` command-line option of `syntaxdot annotate`.
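For instance, a sketch of an annotation run that uses four batch-level threads (the configuration and CoNLL-U file names are placeholders, and the positional arguments of `syntaxdot annotate` are assumed rather than spelled out in these notes):

```shell
# Hypothetical invocation: annotate with 4 batch-level threads.
# model.conf, input.conllu, and output.conllu are placeholder names.
syntaxdot annotate --annotation-threads 4 model.conf input.conllu output.conllu
```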
Add ReLU (`relu`) as an option for the non-linearity in the feed-forward transformer layers. This is much faster for systems where no vectorized version of the normal distribution CDF is available (currently Apple M1).

The non-linearity that is used in the biaffine feed-forward layers is now configurable. For example:

```toml
[biaffine]
activation = "relu"
```

When this option is absent, the GELU activation (`gelu`) is used as the default.
The license of SyntaxDot has changed from the Blue Oak Model License 1.0 to the MIT License or Apache License version 2.0 (at your option).
SyntaxDot now uses dynamic batch sizes. Before this change, the batch size (`--batch-size`) was specified as the number of sentences per batch. Since sentences are sorted by length before batching, annotation is performed on batches of sequences with roughly equal lengths. However, later batches required more computation per batch due to their longer sequences.

This change replaces the `--batch-size` option by the `--max-batch-pieces` option, which specifies the number of word/sentence pieces that a batch should contain. SyntaxDot creates batches that contain at most that number of pieces. The only exception is a single sentence that is longer than the maximum number of batch pieces.

With this change, annotating each batch is approximately the same amount of work, which leads to a performance increase of roughly 10%.

Since the batch size is not fixed anymore, the readahead (`--readahead`) is now specified as a number of sentences.
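As an illustration, an annotation run with the new options (the values and file names are placeholders, and the positional arguments of `syntaxdot annotate` are assumed):

```shell
# Fill each batch with at most 1000 word/sentence pieces and read
# ahead 512 sentences; all values and file names are examples only.
syntaxdot annotate --max-batch-pieces 1000 --readahead 512 model.conf input.conllu output.conllu
```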
Update to libtorch 1.9.0 and tch 0.5.0.
Change the default number of inter/intra-op threads to 1 and use 4 threads for annotation-level parallelization. This has been shown to be faster for all models, both on AMD Ryzen and Apple M1.
First beta for SyntaxDot 0.4.0, primarily for testing the release build.
You can also download ready-to-use models.
Add biaffine dependency parsing, which is enabled with the `biaffine` configuration option.

Add the `model.pooler` option. When this option is set to `mean`, the representations of a token's pieces are averaged; the old behavior of discarding continuation pieces is used when this option is set to `discard`.
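A configuration sketch for the pooler (assuming, based on the dotted option name, that it is written as a `pooler` key in the `[model]` section of the TOML configuration; other model settings are omitted):

```toml
[model]
# Average the representations of a token's pieces.
# Use "discard" to keep the old behavior of dropping continuation pieces.
pooler = "mean"
```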
Add the `keep-best` option to the `finetune` and `distill` subcommands. With this option, only the parameter files for the N best epochs/steps are retained during finetuning or distillation.

The embeddings of BERT/RoBERTa models are now stored under `embeddings` rather than `encoder`. Warning: this breaks compatibility with BERT and RoBERTa models from prior versions of SyntaxDot and sticker2, which should be retrained.

Implementations of `Tokenizer` are now required to put a piece that marks the beginning of a sentence before the first token piece. `BertTokenizer` was the only tokenizer that did not fulfill this requirement; it is updated to insert the `[CLS]` piece as a beginning-of-sentence marker. Warning: this breaks existing models with `tokenizer = "bert"`, which should be retrained.

Replace panicking Torch calls (in `tch`) by fallible counterparts; this makes exceptions thrown by Torch far easier to read.

Uses of the `eprintln!` macro are replaced by logging using `log` and `env_logger`. The verbosity of the logs can be controlled with the `RUST_LOG` environment variable (e.g. `RUST_LOG=info`).
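For example, a hypothetical annotation run with informational logging enabled (the file names and the positional arguments of `syntaxdot annotate` are placeholders and assumptions, not taken from these notes):

```shell
# Print log messages at the "info" level and above during annotation.
RUST_LOG=info syntaxdot annotate model.conf input.conllu output.conllu
```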
Replace `tfrecord` by our own minimalist TensorBoard summary writing, removing 92 dependencies.

`SequenceClassifiers::top_k`.

Second beta for 0.3.0.
Third beta of 0.3.0.
Add the `keep-best` option to the `finetune` command. With this option, only the parameter files for the N best epochs are retained during finetuning. The same option for `distill` is renamed from `keep-best-steps` to `keep-best`.
Add the `keep-best-steps` option to the `distill` subcommand.