Stanza Versions

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

v1.8.2

3 weeks ago

Add an Old English pipeline, improve MWT handling for cases that should be easy, and improve memory management when using transformers with PEFT adapters.

Old English

MWT improvements

PEFT memory management

Other bugfixes and minor upgrades

Other upgrades

v1.8.1

2 months ago

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate models.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

Features

Bugfixes

Additional 1.8.1 Bugfixes

v1.8.0

2 months ago

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate models.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

Features

Bugfixes

v1.7.0

5 months ago

Neural coref processor added!

The processor is based on:

  • Conjunction-Aware Word-Level Coreference Resolution (https://arxiv.org/abs/2310.06165); original implementation: https://github.com/KarelDO/wl-coref/tree/master
  • an updated form of Word-Level Coreference Resolution (https://aclanthology.org/2021.emnlp-main.605/); original implementation: https://github.com/vdobrovolskii/wl-coref

If you use Stanza's coref module in your work, please be sure to cite both of the above papers.

Special thanks to @vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support in integrating his training enhancements, and to @Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.

Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.

Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.

https://github.com/stanfordnlp/stanza/pull/1309

Interface change: English MWT

English now has an MWT model by default. Text such as won't is now marked as a single token, split into two words, will and not. Previously it was expected to be tokenized into two pieces, but the Sentence object containing that text would not have a single Token object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with for word in sentence.words will continue to work as before, but for token in sentence.tokens will now produce a single Token object for an MWT such as won't, cannot, Stanza's, etc.
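The difference can be sketched with minimal stand-in classes (an illustration of the Token/Word relationship only, not Stanza's actual data objects):

```python
# Minimal stand-ins for Stanza's Token/Word objects, for illustration only.
class Word:
    def __init__(self, text):
        self.text = text

class Token:
    def __init__(self, text, words):
        self.text = text    # surface form, e.g. "won't"
        self.words = words  # underlying syntactic words, e.g. "will", "not"

class Sentence:
    def __init__(self, tokens):
        self.tokens = tokens

    @property
    def words(self):
        return [w for t in self.tokens for w in t.words]

# "I won't go" -> the MWT "won't" is one token covering two words
sent = Sentence([
    Token("I", [Word("I")]),
    Token("won't", [Word("will"), Word("not")]),
    Token("go", [Word("go")]),
])

print([t.text for t in sent.tokens])  # ['I', "won't", 'go']
print([w.text for w in sent.words])   # ['I', 'will', 'not', 'go']
```

Iterating over words still yields four items; iterating over tokens now yields three, with the MWT carrying its two words.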

Pipeline creation will not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package includes MWT.

https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1

Other updates

Updated requirements

  • Support dropped for python 3.6 and 3.7. The peft module used for finetuning the transformer used in the coref processor does not support those versions.
  • Added peft as an optional dependency to transformer based installations
  • Added networkx as a dependency for reading enhanced dependencies. Added toml as a dependency for reading the coref config.

v1.6.1

7 months ago

v1.6.1 patches a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.
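As a purely hypothetical illustration of what the three levels trade off (the table below is invented for this sketch; it is not Stanza's real package definition):

```python
# Hypothetical sketch of how a package level might map processors to
# model variants, following the bullets above. Not Stanza's actual data.
LEVELS = {
    "default":          {"pos": "charlm",      "depparse": "charlm",      "lemma": "nocharlm"},
    "default-fast":     {"pos": "nocharlm",    "depparse": "nocharlm",    "lemma": "nocharlm"},
    "default-accurate": {"pos": "transformer", "depparse": "transformer", "lemma": "charlm"},
}

def choose(level, processor):
    """Look up which model variant a given level would select."""
    return LEVELS[level][processor]

print(choose("default-fast", "pos"))        # 'nocharlm'
print(choose("default-accurate", "lemma"))  # 'charlm'
```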

PR: https://github.com/stanfordnlp/stanza/pull/1287

Addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models can now learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.
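The idea of one shared encoder feeding several output heads can be sketched as follows (a conceptual toy, not Stanza's NER code; the rule tables are invented):

```python
# Conceptual toy: one shared "encoder" pass feeding two tag heads.
def encode(tokens):
    # Stand-in encoder: one feature per token, computed once per sentence.
    return [len(tok) for tok in tokens]

def make_head(rules):
    # Each head has its own output layer (here, a lookup table),
    # but all heads consume the same shared encoding.
    def head(tokens, encoded):
        return [rules.get(tok, "O") for tok in tokens]
    return head

# An 18-class head and an 8-class head sharing one encoder.
ontonotes_head = make_head({"Stanford": "B-ORG", "California": "B-GPE"})
worldwide_head = make_head({"Stanford": "B-ORG"})

tokens = ["Stanford", "is", "in", "California"]
encoded = encode(tokens)                 # encoder runs once...
print(ontonotes_head(tokens, encoded))   # ...and both heads reuse it
print(worldwide_head(tokens, encoded))
```

The payoff is that the expensive encoding step is shared while each dataset keeps its own tagset.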

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                     OntoNotes  WorldWide
original ontonotes     88.71      69.29
simplify-separate      88.24      75.75
simplify-connected     88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Bugfixes

v1.6.0

7 months ago

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: https://github.com/stanfordnlp/stanza/pull/1287

Addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284

Multiple output heads for one NER model

The NER models can now learn multiple output layers at once.

https://github.com/stanfordnlp/stanza/pull/1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to cross-train the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18-class tagset, even though the WorldWide training data has only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                     OntoNotes  WorldWide
original ontonotes     88.71      69.29
simplify-separate      88.24      75.75
simplify-connected     88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Bugfixes

v1.5.1

8 months ago

Features

depparse can use a transformer as its embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a

The lemmatizer can, given a flag, remember (word, POS) pairs it has seen before https://github.com/stanfordnlp/stanza/issues/1263 https://github.com/stanfordnlp/stanza/commit/a87ffd0a4f43262457cf7eecf5555a621c6dc24e
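Such a cache might look roughly like this (hypothetical names, not Stanza's implementation):

```python
# Hypothetical sketch of caching lemmas keyed by (word, pos) pairs.
class CachingLemmatizer:
    def __init__(self, lemmatize_fn):
        self.lemmatize_fn = lemmatize_fn  # the expensive model call
        self.cache = {}                   # (word, pos) -> lemma
        self.calls = 0                    # how often the model actually ran

    def lemmatize(self, word, pos):
        key = (word, pos)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.lemmatize_fn(word, pos)
        return self.cache[key]

def toy_model(word, pos):
    # Toy rule standing in for the neural lemmatizer.
    return word[:-1] if pos == "VERB" and word.endswith("s") else word

lemmatizer = CachingLemmatizer(toy_model)
print(lemmatizer.lemmatize("runs", "VERB"))  # 'run'
print(lemmatizer.lemmatize("runs", "VERB"))  # cached, model not called again
print(lemmatizer.calls)                      # 1
```

Keying on the POS as well as the word matters because the same surface form can lemmatize differently under different tags.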

Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398

SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285

Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1

Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9
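One way to picture this fallback is normalizing square brackets to parentheses before tokenizing, then restoring the original characters afterwards (a toy sketch, not Stanza's tokenizer):

```python
# Toy sketch: handle "[" and "]" with the rules learned for "(" and ")".
def toy_tokenize(text):
    # Stand-in tokenizer that only knows how to split off parentheses.
    tokens = []
    for chunk in text.split():
        while chunk.startswith("("):
            tokens.append("(")
            chunk = chunk[1:]
        trailing = []
        while chunk.endswith(")"):
            trailing.append(")")
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens

def tokenize_with_brackets(text):
    # Normalize [] -> () so the ()-only rules apply...
    normalized = text.replace("[", "(").replace("]", ")")
    tokens, restored, i = toy_tokenize(normalized), [], 0
    # ...then restore original characters via character offsets, which
    # normalization preserved.
    for tok in tokens:
        i = normalized.find(tok, i)
        restored.append(text[i:i + len(tok)])
        i += len(tok)
    return restored

print(tokenize_with_brackets("see [citation] here"))
# ['see', '[', 'citation', ']', 'here']
```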

Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0

Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74

Bugfixes

The lemmatizer was accidentally left out of CoreNLP 4.5.3; it is included in 4.5.4 https://github.com/stanfordnlp/stanza/commit/4dda14bd585893044708c70e30c1c3efec509863 https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013

prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages https://github.com/stanfordnlp/stanza/commit/78ff85ce7eed596ad195a3f26474065717ad63b3

Fix an empty bulk_process throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278

Unroll the recursion in the Tarjan portion of the Chu-Liu/Edmonds algorithm - this should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb
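The general trick is replacing the call stack with an explicit stack, as in this generic depth-first search sketch (not the actual parser code):

```python
# Generic illustration of unrolling recursion into an explicit stack,
# the same trick used to avoid stack overflows on deep graphs.
def dfs_recursive(graph, node, seen=None):
    # Recursive version: each nested call consumes a call-stack frame,
    # so a long chain of nodes can overflow the interpreter's stack.
    seen = set() if seen is None else seen
    seen.add(node)
    for nxt in graph.get(node, []):
        if nxt not in seen:
            dfs_recursive(graph, nxt, seen)
    return seen

def dfs_iterative(graph, start):
    # Iterative version: an explicit list replaces the call stack,
    # so depth is limited only by available memory.
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# A chain far deeper than Python's default recursion limit (~1000)
deep_chain = {i: [i + 1] for i in range(5000)}
print(len(dfs_iterative(deep_chain, 0)))  # 5001 nodes visited, no overflow
```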

Minor updates

Put NER and POS scores on one line to make it easier to grep for: https://github.com/stanfordnlp/stanza/commit/da2ae33e8ef9e48842685dfed88896b646dba8c4 https://github.com/stanfordnlp/stanza/commit/8c4cb04d38c1101318755270f3aa75c54236e3fe

Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others

Pipeline uses torch.no_grad() for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3

Generalize save names, which eventually allows for putting transformer, charlm or nocharlm in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models

Add the model's flags to the --help for the run scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600

Remove the dependency on six https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you @BLKSerene)

New Models

VLSP constituency https://github.com/stanfordnlp/stanza/commit/500435d3ec1b484b0f1152a613716565022257f2

VLSP constituency -> tagging https://github.com/stanfordnlp/stanza/commit/cb0f22d7be25af0b3b2790e3ce1b9dbc277c13a7

CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31

Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f

Added an Indonesian charlm

Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218

All languages with pretrained charlms now have an option to use that charlm for dependency parsing

French combined models out of GSD, ParisStories, Rhapsodie, and Sequoia https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b

UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058

v1.5.0

1 year ago

Ssurgeon interface

Headlining this release is the debut of Ssurgeon, a rule-based dependency graph editing tool. Along with the existing Semgrex integration with CoreNLP, Ssurgeon allows for rewriting of dependencies, such as in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/

Alongside this, there are two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments which ranged from "ineffective" to "small improvements" and are available for people to experiment with.

CoreNLP integration:

Bugfixes:

Features:

New models:

Conparser experiments:

v1.4.2

1 year ago

Stanza v1.4.2: Minor version bump to improve (python) dependencies

v1.4.1

1 year ago

Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

Overview

We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

New NER models

Other new models

Model improvements

Pipeline interface improvements

Bugfixes

Improved training tools