Stanza Versions

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

v1.8.2

3 weeks ago

Add an Old English pipeline, improve MWT handling for cases that should be easy, and improve memory management when using transformers with PEFT adapters.

Old English

MWT improvements

PEFT memory management

Other bugfixes and minor upgrades

Other upgrades

v1.8.1

2 months ago

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate models.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

Features

Bugfixes

Additional 1.8.1 Bugfixes

v1.8.0

2 months ago

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate models.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

Features

Bugfixes

v1.7.0

5 months ago

Neural coref processor added!

The processor is based on:

  • Conjunction-Aware Word-Level Coreference Resolution (https://arxiv.org/abs/2310.06165); original implementation: https://github.com/KarelDO/wl-coref/tree/master
  • an updated form of Word-Level Coreference Resolution (https://aclanthology.org/2021.emnlp-main.605/); original implementation: https://github.com/vdobrovolskii/wl-coref

If you use Stanza's coref module in your work, please be sure to cite both of the above papers.

Special thanks to @vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support in integrating his training enhancements, and to @Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.

Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.

Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.

https://github.com/stanfordnlp/stanza/pull/1309

Interface change: English MWT

English now has an MWT model by default. Text such as won't is now marked as a single token, split into two words, will and not. Previously it was expected to be tokenized into two pieces, but the Sentence object containing that text would not have a single Token object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with for word in sentence.words will continue to work as before, but for token in sentence.tokens will now produce a single Token object for an MWT such as won't, cannot, Stanza's, etc.
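The difference can be sketched with minimal stand-in classes (an illustration of the Token/Word relationship only, not Stanza's actual data objects):

```python
# Minimal stand-ins for Stanza's Token/Word objects, for illustration only.
class Word:
    def __init__(self, text):
        self.text = text

class Token:
    def __init__(self, text, words):
        self.text = text    # surface form, e.g. "won't"
        self.words = words  # underlying syntactic words, e.g. "will", "not"

class Sentence:
    def __init__(self, tokens):
        self.tokens = tokens

    @property
    def words(self):
        return [w for t in self.tokens for w in t.words]

# "I won't go" -> the MWT "won't" is one token covering two words
sent = Sentence([
    Token("I", [Word("I")]),
    Token("won't", [Word("will"), Word("not")]),
    Token("go", [Word("go")]),
])

print([t.text for t in sent.tokens])  # ['I', "won't", 'go']
print([w.text for w in sent.words])   # ['I', 'will', 'not', 'go']
```

Iterating over words still yields four items; iterating over tokens now yields three, with the MWT carrying its two words.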

Pipeline creation will not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package includes MWT.

https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1

Other updates

Updated requirements

  • Support dropped for python 3.6 and 3.7. The peft module used for finetuning the transformer used in the coref processor does not support those versions.
  • Added peft as an optional dependency to transformer based installations
  • Added networkx as a dependency for reading enhanced dependencies. Added toml as a dependency for reading the coref config.

v1.6.1

7 months ago

v1.6.1 patches a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.
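As a purely hypothetical illustration of what the three levels trade off (the table below is invented for this sketch; it is not Stanza's real package definition):

```python
# Hypothetical sketch of how a package level might map processors to
# model variants, following the bullets above. Not Stanza's actual data.
LEVELS = {
    "default":          {"pos": "charlm",      "depparse": "charlm",      "lemma": "nocharlm"},
    "default-fast":     {"pos": "nocharlm",    "depparse": "nocharlm",    "lemma": "nocharlm"},
    "default-accurate": {"pos": "transformer", "depparse": "transformer", "lemma": "charlm"},
}

def choose(level, processor):
    """Look up which model variant a given level would select."""
    return LEVELS[level][processor]

print(choose("default-fast", "pos"))        # 'nocharlm'
print(choose("default-accurate", "lemma"))  # 'charlm'
```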

PR: https://github.com/stanfordnlp/stanza/pull/1287

Addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models can now learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.
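The idea of one shared encoder feeding several output heads can be sketched as follows (a conceptual toy, not Stanza's NER code; the rule tables are invented):

```python
# Conceptual toy: one shared "encoder" pass feeding two tag heads.
def encode(tokens):
    # Stand-in encoder: one feature per token, computed once per sentence.
    return [len(tok) for tok in tokens]

def make_head(rules):
    # Each head has its own output layer (here, a lookup table),
    # but all heads consume the same shared encoding.
    def head(tokens, encoded):
        return [rules.get(tok, "O") for tok in tokens]
    return head

# An 18-class head and an 8-class head sharing one encoder.
ontonotes_head = make_head({"Stanford": "B-ORG", "California": "B-GPE"})
worldwide_head = make_head({"Stanford": "B-ORG"})

tokens = ["Stanford", "is", "in", "California"]
encoded = encode(tokens)                 # encoder runs once...
print(ontonotes_head(tokens, encoded))   # ...and both heads reuse it
print(worldwide_head(tokens, encoded))
```

The payoff is that the expensive encoding step is shared while each dataset keeps its own tagset.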

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                     OntoNotes  WorldWide
original ontonotes     88.71      69.29
simplify-separate      88.24      75.75
simplify-connected     88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Bugfixes

v1.6.0

7 months ago

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287

Addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models can now learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                     OntoNotes  WorldWide
original ontonotes     88.71      69.29
simplify-separate      88.24      75.75
simplify-connected     88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Bugfixes

v1.5.1

8 months ago

Features

depparse can use a transformer as its embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a

The lemmatizer can, given a flag, remember (word, POS) pairs it has seen before https://github.com/stanfordnlp/stanza/issues/1263 https://github.com/stanfordnlp/stanza/commit/a87ffd0a4f43262457cf7eecf5555a621c6dc24e
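Such a cache might look roughly like this (hypothetical names, not Stanza's implementation):

```python
# Hypothetical sketch of caching lemmas keyed by (word, pos) pairs.
class CachingLemmatizer:
    def __init__(self, lemmatize_fn):
        self.lemmatize_fn = lemmatize_fn  # the expensive model call
        self.cache = {}                   # (word, pos) -> lemma
        self.calls = 0                    # how often the model actually ran

    def lemmatize(self, word, pos):
        key = (word, pos)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.lemmatize_fn(word, pos)
        return self.cache[key]

def toy_model(word, pos):
    # Toy rule standing in for the neural lemmatizer.
    return word[:-1] if pos == "VERB" and word.endswith("s") else word

lemmatizer = CachingLemmatizer(toy_model)
print(lemmatizer.lemmatize("runs", "VERB"))  # 'run'
print(lemmatizer.lemmatize("runs", "VERB"))  # cached, model not called again
print(lemmatizer.calls)                      # 1
```

Keying on the POS as well as the word matters because the same surface form can lemmatize differently under different tags.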

Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398

SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285

Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1

Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9
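One way to picture this fallback is normalizing square brackets to parentheses before tokenizing, then restoring the original characters afterwards (a toy sketch, not Stanza's tokenizer):

```python
# Toy sketch: handle "[" and "]" with the rules learned for "(" and ")".
def toy_tokenize(text):
    # Stand-in tokenizer that only knows how to split off parentheses.
    tokens = []
    for chunk in text.split():
        while chunk.startswith("("):
            tokens.append("(")
            chunk = chunk[1:]
        trailing = []
        while chunk.endswith(")"):
            trailing.append(")")
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens

def tokenize_with_brackets(text):
    # Normalize [] -> () so the ()-only rules apply...
    normalized = text.replace("[", "(").replace("]", ")")
    tokens, restored, i = toy_tokenize(normalized), [], 0
    # ...then restore original characters via character offsets, which
    # normalization preserved.
    for tok in tokens:
        i = normalized.find(tok, i)
        restored.append(text[i:i + len(tok)])
        i += len(tok)
    return restored

print(tokenize_with_brackets("see [citation] here"))
# ['see', '[', 'citation', ']', 'here']
```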

Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0

Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74

Bugfixes

The lemmatizer was accidentally left out of CoreNLP 4.5.3; it is included in 4.5.4 https://github.com/stanfordnlp/stanza/commit/4dda14bd585893044708c70e30c1c3efec509863 https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013

prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages https://github.com/stanfordnlp/stanza/commit/78ff85ce7eed596ad195a3f26474065717ad63b3

Fix an empty bulk_process throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278

Unroll the recursion in the Tarjan portion of the Chu-Liu/Edmonds algorithm - this should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb
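The general trick is replacing the call stack with an explicit stack, as in this generic depth-first search sketch (not the actual parser code):

```python
# Generic illustration of unrolling recursion into an explicit stack,
# the same trick used to avoid stack overflows on deep graphs.
def dfs_recursive(graph, node, seen=None):
    # Recursive version: each nested call consumes a call-stack frame,
    # so a long chain of nodes can overflow the interpreter's stack.
    seen = set() if seen is None else seen
    seen.add(node)
    for nxt in graph.get(node, []):
        if nxt not in seen:
            dfs_recursive(graph, nxt, seen)
    return seen

def dfs_iterative(graph, start):
    # Iterative version: an explicit list replaces the call stack,
    # so depth is limited only by available memory.
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# A chain far deeper than Python's default recursion limit (~1000)
deep_chain = {i: [i + 1] for i in range(5000)}
print(len(dfs_iterative(deep_chain, 0)))  # 5001 nodes visited, no overflow
```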

Minor updates

Put NER and POS scores on one line to make it easier to grep for: https://github.com/stanfordnlp/stanza/commit/da2ae33e8ef9e48842685dfed88896b646dba8c4 https://github.com/stanfordnlp/stanza/commit/8c4cb04d38c1101318755270f3aa75c54236e3fe

Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others

Pipeline uses torch.no_grad() for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3

Generalize save names, which eventually allows for putting transformer, charlm or nocharlm in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models

Add the model's flags to the --help for the run scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600

Remove the dependency on six https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you @BLKSerene)

New Models

VLSP constituency https://github.com/stanfordnlp/stanza/commit/500435d3ec1b484b0f1152a613716565022257f2

VLSP constituency -> tagging https://github.com/stanfordnlp/stanza/commit/cb0f22d7be25af0b3b2790e3ce1b9dbc277c13a7

CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31

Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f

Added an Indonesian charlm

Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218

All languages with pretrained charlms now have an option to use that charlm for dependency parsing

French combined models out of GSD, ParisStories, Rhapsodie, and Sequoia https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b

UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058

v1.5.0

1 year ago

Ssurgeon interface

Headlining this release is the debut of Ssurgeon, a rule-based dependency graph editing tool. Along with the existing Semgrex integration with CoreNLP, Ssurgeon allows for rewriting of dependencies, such as in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/

Alongside this, there are two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments which ranged from "ineffective" to "small improvements" and are available for people to experiment with.

CoreNLP integration:

Bugfixes:

Features:

New models:

Conparser experiments:

v1.4.2

1 year ago

Stanza v1.4.2: Minor version bump to improve (python) dependencies

v1.4.1

1 year ago

Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

Overview

We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

New NER models

Other new models

Model improvements

Pipeline interface improvements

Bugfixes

Improved training tools