Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
This release adds an Old English pipeline, improves MWT handling for cases that should be easy, and improves memory management when using transformers with adapters.
Fix words ending with `-nna` split into MWT https://github.com/stanfordnlp/handparsed-treebank/commit/2c48d4093daddc790bf89d7b35c47ee4d7d272d1 https://github.com/stanfordnlp/stanza/issues/1366
Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) https://github.com/stanfordnlp/stanza/issues/1371 https://github.com/stanfordnlp/stanza/pull/1378
Mark `start_char` and `end_char` on an MWT if it is composed of exactly its subwords https://github.com/stanfordnlp/stanza/commit/23840891c37d54a5cf491ea58b0702987dd4a6d7 https://github.com/stanfordnlp/stanza/issues/1361
Fix crash when trying to load previously unknown language https://github.com/stanfordnlp/stanza/issues/1360 https://github.com/stanfordnlp/stanza/commit/381736f8fb9b60a929002cc750bd0df3d7dad03a
Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: https://github.com/stanfordnlp/stanza/commit/d180ae02b278dd09dff53bc910e7aa43656e944d https://github.com/stanfordnlp/stanza/issues/1367
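The guard amounts to checking for the attribute before trusting it; a schematic of the idea (not Stanza's exact code):

```python
import sys

def stderr_is_interactive():
    # A monkeypatched sys.stderr may lack isatty entirely, so check for
    # the attribute before calling it rather than assuming a real stream.
    return hasattr(sys.stderr, "isatty") and sys.stderr.isatty()

# Only let tqdm draw progress bars on a real terminal
disable_progress = not stderr_is_interactive()
```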
Try to avoid OOM in the POS tagger in the Pipeline by reducing its max batch length https://github.com/stanfordnlp/stanza/commit/42718135e2ab4b145bbb5861d55bb9424ca3549f
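Capping batches by total length can be sketched like this hypothetical helper (the function and parameter names are illustrative, not the actual POS tagger code):

```python
def batch_by_max_length(sentences, max_batch_words):
    # Greedily group sentences so that no batch exceeds max_batch_words
    # tokens in total; an oversized single sentence still gets its own batch.
    batches, current, current_len = [], [], 0
    for sent in sentences:
        if current and current_len + len(sent) > max_batch_words:
            batches.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += len(sent)
    if current:
        batches.append(current)
    return batches

sents = [["a"] * 5, ["b"] * 5, ["c"] * 8, ["d"] * 2]
print([len(batch) for batch in batch_by_max_length(sents, 10)])  # → [2, 2]
```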
Fix usage of gradient checkpointing & a weird interaction with PEFT (thanks to @Jemoka) https://github.com/stanfordnlp/stanza/commit/597d48f1ead89fa9a0cca86cf9f0b530ed249792
Add `*` to the list of functional tags to drop in the constituency parser, helping Icelandic annotation https://github.com/stanfordnlp/stanza/commit/57bfa8bbd8d3d42d4ee29d4a406640b126ce0f46 https://github.com/stanfordnlp/stanza/issues/1356#issuecomment-1981216912
Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: https://github.com/stanfordnlp/stanza/commit/4048caed1b89030082d23b8f71d23bae6c9c54f1 https://github.com/stanfordnlp/stanza/commit/15b136bb30dda272d318a61a5f602e7fc81e7a31
Add a constituency model for German https://github.com/stanfordnlp/stanza/commit/7a4f48c738f0db8923aa5da88d0a9743eaee4c6a https://github.com/stanfordnlp/stanza/commit/86ddaab31c73a7d0a389d0557f3696c29d441657 https://github.com/stanfordnlp/stanza/issues/1368
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the `default_accurate` model. The biggest gains observed are with the constituency parser and the sentiment classifier. Previously, the `default_accurate` package used transformers where the head was trained but the transformer itself was not finetuned.
`download_resources_json` was broken: https://github.com/stanfordnlp/stanza/pull/1318 https://github.com/stanfordnlp/stanza/issues/1317 Thank you @ider-zh
Fix for `.get()`: https://github.com/stanfordnlp/stanza/commit/13ee3d5cbc2c9174c3e0c67ca75b580e4fe683b1 https://github.com/stanfordnlp/stanza/issues/1357
The `device` arg in `MultilingualPipeline` would crash if `device` was passed for an individual `Pipeline`: https://github.com/stanfordnlp/stanza/commit/44058a0ec296c6da5997bfaf8911a26d425d2cec
Conjunction-Aware Word-Level Coreference Resolution https://arxiv.org/abs/2310.06165 original implementation: https://github.com/KarelDO/wl-coref/tree/master
Updated form of Word-Level Coreference Resolution https://aclanthology.org/2021.emnlp-main.605/ original implementation: https://github.com/vdobrovolskii/wl-coref
If you use Stanza's coref module in your work, please be sure to cite both of the above papers.
Special thanks to vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for supporting the integration of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.
Currently there is one model provided: a transformer-based English model trained on OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.
https://github.com/stanfordnlp/stanza/pull/1309
English now has an MWT model by default. Text such as `won't` is now marked as a single token, split into two words, `will` and `not`. Previously it was expected to be tokenized into two pieces, but the `Sentence` object containing that text would not have a single `Token` object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.
Code that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce one object for MWTs such as `won't`, `cannot`, `Stanza's`, etc.
Pipeline creation will not change, as MWT is automatically (but not silently) added at `Pipeline` creation time if the language and package include MWT.
https://github.com/stanfordnlp/stanza/pull/1314/commits/f22dceb93275fc724536b03b31c08a94617880ca https://github.com/stanfordnlp/stanza/pull/1314/commits/27983aefe191f6abd93dd49915d2515d7c3973d1
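Schematically, an MWT ties one `Token` to several `Word`s; a minimal mock of the two iteration views (illustration only, using stand-in classes rather than stanza's own):

```python
from dataclasses import dataclass, field

# Minimal stand-ins for stanza's Token/Word relationship (illustration only)
@dataclass
class Word:
    text: str

@dataclass
class Token:
    text: str
    words: list = field(default_factory=list)

# "won't" is one Token spanning two syntactic Words
sentence_tokens = [
    Token("I", [Word("I")]),
    Token("won't", [Word("will"), Word("not")]),
    Token("go", [Word("go")]),
]

token_texts = [t.text for t in sentence_tokens]
word_texts = [w.text for t in sentence_tokens for w in t.words]
print(token_texts)  # → ['I', "won't", 'go']
print(word_texts)   # → ['I', 'will', 'not', 'go']
```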
`conll_as_string` and `doc2conll_text` have been removed. Use `"{:C}".format(doc)` instead https://github.com/stanfordnlp/stanza/commit/e01650f9c56382495082a9a24fa0310414c46651
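The `"{:C}"` spec works through Python's normal formatting protocol: the text after the colon is handed to the object's `__format__` method. A minimal mock showing the mechanism (not stanza's actual `Document` implementation, which produces real CoNLL output):

```python
class FakeDoc:
    # Python passes the format spec (the text after ":") to __format__,
    # which is how a Document object can react to the "C" spec.
    def __format__(self, spec):
        if spec == "C":
            return "# CoNLL output would go here"
        return "FakeDoc"

doc = FakeDoc()
print("{:C}".format(doc))  # → # CoNLL output would go here
```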
`Sentence` objects now have a `doc_id` field if the document they are created from has a `doc_id`. https://github.com/stanfordnlp/stanza/pull/1314/commits/8e2201f42cb99a5a3d8358ce59501c1d88f2585e
Python 3.6 and 3.7 are no longer supported, as the `peft` module used for finetuning the transformer in the coref processor does not support those versions.
Added `peft` as an optional dependency for transformer-based installations.
Added `networkx` as a dependency for reading enhanced dependencies.
Added `toml` as a dependency for reading the coref config.
V1.6.1 is a patch of a bug in the Arabic POS tagger.
We also mark Python 3.11 as supported in the `setup.py` classifiers. This will be the last release that supports Python 3.6.
The `package` parameter for building the `Pipeline` now has three default settings:
- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- `default-fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- `default-accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into `-fast` and `-accurate` versions for each UD dataset.
PR: https://github.com/stanfordnlp/stanza/pull/1287
addresses https://github.com/stanfordnlp/stanza/issues/1259 and https://github.com/stanfordnlp/stanza/issues/1284
The NER models now can learn multiple output layers at once.
https://github.com/stanfordnlp/stanza/pull/1289
Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

| model | OntoNotes | WorldWide |
|---|---|---|
| original | 88.71 | 69.29 |
| simplify-separate | 88.24 | 75.75 |
| simplify-connected | 88.32 | 75.47 |
We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.
Future plans include using multiple NER datasets for other models as well.
Postprocessing of proposed tokenization is possible with dependency injection on the Pipeline (thanks to @Jemoka). When creating a `Pipeline`, you can now provide a callable via the `tokenize_postprocessor` parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline`. https://github.com/stanfordnlp/stanza/pull/1290
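The exact interface of the `tokenize_postprocessor` callable is defined in PR #1290; as a rough, hypothetical sketch of the idea, a postprocessor rewrites a candidate token list before the rest of the pipeline sees it (the function name and the list-of-strings shape here are illustrative assumptions, not stanza's actual signature):

```python
def merge_hyphen_tokens(candidate_tokens):
    # Hypothetical postprocessor: rejoin "e - mail" into "e-mail".
    # The real interface expected by tokenize_postprocessor is documented
    # in stanza PR #1290; this only illustrates rewriting the candidate
    # token list before the rest of the Pipeline runs.
    merged = []
    i = 0
    while i < len(candidate_tokens):
        if i + 2 < len(candidate_tokens) and candidate_tokens[i + 1] == "-":
            merged.append("".join(candidate_tokens[i:i + 3]))
            i += 3
        else:
            merged.append(candidate_tokens[i])
            i += 1
    return merged

print(merge_hyphen_tokens(["e", "-", "mail", "is", "fast"]))
# → ['e-mail', 'is', 'fast']
```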
Finetuning for transformers in the NER models: we have not yet found helpful settings, though https://github.com/stanfordnlp/stanza/commit/45ef5445f44222df862ed48c1b3743dc09f3d3fd
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https://github.com/stanfordnlp/stanza/issues/1279 https://github.com/stanfordnlp/stanza/commit/88cd0df5da94664cb04453536212812dc97339bb
charlm for PT (improves accuracy on non-transformer models): https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a
build models with transformers for a few additional languages: MR, AR, PT, JA https://github.com/stanfordnlp/stanza/commit/45b387531c67bafa9bc41ee4d37ba0948daa9742 https://github.com/stanfordnlp/stanza/commit/0f3761ee63c57f66630a8e94ba6276900c190a74 https://github.com/stanfordnlp/stanza/commit/c55472acbd32aa0e55d923612589d6c45dc569cc https://github.com/stanfordnlp/stanza/commit/c10763d0218ce87f8f257114a201cc608dbd7b3a
V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: https://github.com/stanfordnlp/stanza/commit/b56f442d4d179c07411a44a342c224408eb6a6a9
Scenegraph CoreNLP connection needed to be checked before sending messages: https://github.com/stanfordnlp/CoreNLP/issues/1346#issuecomment-1713267522 https://github.com/stanfordnlp/stanza/commit/c71bf3fdac8b782a61454c090763e8885d0e3824
`run_ete.py` was not correctly processing the charlm, meaning the whole thing wouldn't actually run https://github.com/stanfordnlp/stanza/commit/16f29f3dcf160f0d10a47fec501ab717adf0d4d7
Chinese NER model was pointing to the wrong pretrain https://github.com/stanfordnlp/stanza/issues/1285 https://github.com/stanfordnlp/stanza/commit/82a02151da17630eb515792a508a967ef70a6cef
depparse can have a transformer as an embedding https://github.com/stanfordnlp/stanza/pull/1282/commits/ee171cd167900fbaac16ff4b1f2fbd1a6e97de0a
With a flag, the lemmatizer can remember (word, POS) pairs it has seen before https://github.com/stanfordnlp/stanza/issues/1263 https://github.com/stanfordnlp/stanza/commit/a87ffd0a4f43262457cf7eecf5555a621c6dc24e
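The idea behind that flag can be sketched as a simple (word, POS) → lemma cache; this is an illustrative mock, not Stanza's lemmatizer code:

```python
class CachingLemmatizer:
    # Hypothetical illustration: once a (word, pos) pair has been
    # lemmatized, reuse the stored answer instead of re-running the
    # underlying model on it.
    def __init__(self, lemmatize_fn):
        self.lemmatize_fn = lemmatize_fn
        self.seen = {}

    def lemmatize(self, word, pos):
        key = (word.lower(), pos)
        if key not in self.seen:
            self.seen[key] = self.lemmatize_fn(word, pos)
        return self.seen[key]

calls = []
def slow_model(word, pos):
    # Stand-in for the real seq2seq lemmatizer
    calls.append((word, pos))
    return word.lower().rstrip("s") if pos == "NOUN" else word.lower()

lemmatizer = CachingLemmatizer(slow_model)
lemmatizer.lemmatize("cats", "NOUN")
lemmatizer.lemmatize("cats", "NOUN")  # served from the cache
print(len(calls))  # → 1, the underlying model ran only once
```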
Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) https://github.com/stanfordnlp/stanza/pull/1282/commits/63dc212b467cd549039392743a0be493cc9bc9d8 https://github.com/stanfordnlp/stanza/pull/1282/commits/c42aed569f9d376e71708b28b0fe5b478697ba05 https://github.com/stanfordnlp/stanza/pull/1282/commits/eab062341480e055f93787d490ff31d923a68398
SceneGraph connection for the CoreNLP client https://github.com/stanfordnlp/stanza/pull/1282/commits/d21a95cc90443ec4737de6d7ba68a106d12fb285
Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance https://github.com/stanfordnlp/stanza/pull/1282/commits/f753a4f35b7c2cf7e8e6b01da3a60f73493178e1
Tokenize [] based on () rules if the original dataset doesn't have [] in it https://github.com/stanfordnlp/stanza/pull/1282/commits/063b4ba3c6ce2075655a70e54c434af4ce7ac3a9
Attempt to finetune the charlm when building models (have not found effective settings for this yet) https://github.com/stanfordnlp/stanza/pull/1282/commits/048fdc9c9947a154d4426007301d63d920e60db0
Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate https://github.com/stanfordnlp/stanza/pull/1282/commits/e811f52b4cf88d985e7dbbd499fe30dbf2e76d8d https://github.com/stanfordnlp/stanza/pull/1282/commits/66add6d519deb54ca9be5fe3148023a5d7d815e4 https://github.com/stanfordnlp/stanza/pull/1282/commits/f086de2359cce16ef2718c0e6e3b5deef1345c74
Forgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 https://github.com/stanfordnlp/stanza/commit/4dda14bd585893044708c70e30c1c3efec509863 https://github.com/bjascob/LemmInflect/issues/14#issuecomment-1470954013
`prepare_ner_dataset` was always creating an Armenian pipeline, even for non-Armenian languages https://github.com/stanfordnlp/stanza/commit/78ff85ce7eed596ad195a3f26474065717ad63b3
Fix an empty `bulk_process` throwing an exception https://github.com/stanfordnlp/stanza/pull/1282/commits/5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e https://github.com/stanfordnlp/stanza/issues/1278
Unroll the recursion in the Tarjan part of the Chuliu-Edmonds algorithm - should remove stack overflow errors https://github.com/stanfordnlp/stanza/pull/1282/commits/e0917b0967ba9752fdf489b86f9bfd19186c38eb
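The underlying technique is generic: replace recursive calls with an explicit stack so deep inputs cannot overflow the interpreter's call stack. A minimal illustration on depth-first search (not Stanza's Tarjan implementation):

```python
def dfs_iterative(graph, start):
    # Replace recursion with an explicit stack so deep graphs cannot
    # overflow Python's call stack -- the same idea applied to the
    # Tarjan part of the Chuliu-Edmonds algorithm.
    visited, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbors in reverse so they pop in their natural order
        stack.extend(reversed(graph.get(node, [])))
    return order

# A chain far deeper than Python's default recursion limit
chain = {i: [i + 1] for i in range(50000)}
print(dfs_iterative(chain, 0)[:3])  # → [0, 1, 2]
```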
Put NER and POS scores on one line to make it easier to grep for: https://github.com/stanfordnlp/stanza/commit/da2ae33e8ef9e48842685dfed88896b646dba8c4 https://github.com/stanfordnlp/stanza/commit/8c4cb04d38c1101318755270f3aa75c54236e3fe
Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https://github.com/stanfordnlp/stanza/pull/1282/commits/d1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others
Pipeline uses `torch.no_grad()` for a slight speed boost https://github.com/stanfordnlp/stanza/pull/1282/commits/36ab82edfc574d46698c5352e07d2fcb0d68d3b3
Generalize save names, which eventually allows for putting `transformer`, `charlm` or `nocharlm` in the save name - this lets us distinguish different complexities of model https://github.com/stanfordnlp/stanza/pull/1282/commits/cc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models
Add the model's flags to the `--help` for the `run` scripts, such as https://github.com/stanfordnlp/stanza/pull/1282/commits/83c0901c6ca2827224e156477e42e403d330a16e https://github.com/stanfordnlp/stanza/pull/1282/commits/7c171dd8d066c6973a8ee18a016b65f62376ea4c https://github.com/stanfordnlp/stanza/pull/1282/commits/8e1d112bee42f2211f5153fcc89083b97e3d2600
Remove the dependency on `six` https://github.com/stanfordnlp/stanza/pull/1282/commits/6daf97142ebc94cca7114a8cda5a20bf66f7f707 (thank you @BLKSerene)
VLSP constituency https://github.com/stanfordnlp/stanza/commit/500435d3ec1b484b0f1152a613716565022257f2
VLSP constituency -> tagging https://github.com/stanfordnlp/stanza/commit/cb0f22d7be25af0b3b2790e3ce1b9dbc277c13a7
CTB 5.1 constituency https://github.com/stanfordnlp/stanza/pull/1282/commits/f2ef62b96c79fcaf0b8aa70e4662d33b26dadf31
Add support for CTB 9.0, although those models are not distributed yet https://github.com/stanfordnlp/stanza/pull/1282/commits/1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f
Added an Indonesian charlm
Indonesian constituency from ICON treebank https://github.com/stanfordnlp/stanza/pull/1218
All languages with pretrained charlms now have an option to use that charlm for dependency parsing
French combined models out of `GSD`, `ParisStories`, `Rhapsodie`, and `Sequoia` https://github.com/stanfordnlp/stanza/pull/1282/commits/ba64d37d3bf21af34373152e92c9f01241e27d8b
UD 2.12 support https://github.com/stanfordnlp/stanza/pull/1282/commits/4f987d2cd708ce4ca27935d347bb5b5d28a78058
Headlining this release is the initial release of Ssurgeon, a rule-based dependency graph editing tool. Along with the existing Semgrex integration with CoreNLP, Ssurgeon allows for rewriting of dependencies such as in the UD datasets. More information is in the GURT 2023 paper, https://aclanthology.org/2023.tlt-1.7/
In addition, there are two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments which ranged from "ineffective" to "small improvements" and are available for people to experiment with.
`detach().cpu()` speeds things up significantly in some cases https://github.com/stanfordnlp/stanza/commit/ccfbc56b3b312fdde1350104a0d0d5645c9c80cc
"{:C}"
for document objects which prints out documents as CoNLL: https://github.com/stanfordnlp/stanza/pull/1169
`en_combined_bert` model, others to come https://github.com/stanfordnlp/stanza/pull/1132
Pipeline cache in Multilingual is a single OrderedDict https://github.com/stanfordnlp/stanza/issues/1115#issuecomment-1239759362 https://github.com/stanfordnlp/stanza/commit/ba3f64d5f571b1dc70121551364fc89d103ca1cd
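The single-`OrderedDict` cache pattern looks roughly like this sketch (the class and method names are hypothetical, not the actual `MultilingualPipeline` internals):

```python
from collections import OrderedDict

class PipelineCache:
    # Keep at most max_size pipelines, evicting the least recently
    # used language when the cache is full.
    def __init__(self, max_size=3):
        self.max_size = max_size
        self.cache = OrderedDict()

    def get(self, lang, build_fn):
        if lang in self.cache:
            self.cache.move_to_end(lang)        # mark as recently used
        else:
            if len(self.cache) >= self.max_size:
                self.cache.popitem(last=False)  # evict the oldest entry
            self.cache[lang] = build_fn(lang)
        return self.cache[lang]

cache = PipelineCache(max_size=2)
cache.get("en", lambda lang: f"<{lang} pipeline>")
cache.get("fr", lambda lang: f"<{lang} pipeline>")
cache.get("de", lambda lang: f"<{lang} pipeline>")  # evicts "en"
print(list(cache.cache))  # → ['fr', 'de']
```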
Don't require `pytest` for all installations unless needed for testing https://github.com/stanfordnlp/stanza/issues/1120 https://github.com/stanfordnlp/stanza/commit/8c1d9d80e2e12729f60f05b81e88e113fbdd3482
hide SiLU and Mish imports if the version of torch installed doesn't have those nonlinearities https://github.com/stanfordnlp/stanza/issues/1120 https://github.com/stanfordnlp/stanza/commit/6a90ad4bacf923c88438da53219c48355b847ed3
Reorder & normalize installations in setup.py https://github.com/stanfordnlp/stanza/pull/1124
We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.
New Polish NER model based on NKJP from Karol Saputa and ryszardtuora https://github.com/stanfordnlp/stanza/issues/1070 https://github.com/stanfordnlp/stanza/pull/1110
Make GermEval2014 the default German NER model, including an optional Bert version https://github.com/stanfordnlp/stanza/issues/1018 https://github.com/stanfordnlp/stanza/pull/1022
Japanese conversion of GSD by Megagon https://github.com/stanfordnlp/stanza/pull/1038
Marathi NER dataset from L3Cube. Includes a Sentiment model as well https://github.com/stanfordnlp/stanza/pull/1043
Thai conversion of LST20 https://github.com/stanfordnlp/stanza/commit/555fc0342decad70f36f501a7ea1e29fa0c5b317
Kazakh conversion of KazNERD https://github.com/stanfordnlp/stanza/pull/1091/commits/de6cd25c2e5b936bc4ad2764b7b67751d0b862d7
Sentiment conversion of Tass2020 for Spanish https://github.com/stanfordnlp/stanza/pull/1104
VIT constituency dataset for Italian https://github.com/stanfordnlp/stanza/pull/1091/commits/149f1440dc32d47fbabcc498cfcd316e53aca0c6 ... and many subsequent updates
Combined UD models for Hebrew https://github.com/stanfordnlp/stanza/issues/1109 https://github.com/stanfordnlp/stanza/commit/e4fcf003feb984f535371fb91c9e380dd187fd12
For UD models with small train dataset & larger test dataset, flip the datasets UD_Buryat-BDT UD_Kazakh-KTB UD_Kurmanji-MG UD_Ligurian-GLT UD_Upper_Sorbian-UFAL https://github.com/stanfordnlp/stanza/issues/1030 https://github.com/stanfordnlp/stanza/commit/9618d60d63c49ec1bfff7416e3f1ad87300c7073
Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF https://github.com/stanfordnlp/stanza/commit/47740c6252a6717f12ef1fde875cf19fa1cd67cc
Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost https://github.com/stanfordnlp/stanza/pull/1086
Pretrained charlm integrated into Sentiment. Improves English, others not so much https://github.com/stanfordnlp/stanza/pull/1025
LSTM, 2d maxpool as optional items in the Sentiment model, from the paper "Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling" https://github.com/stanfordnlp/stanza/pull/1098
First learn with AdaDelta, then with another optimizer in conparse training. Very helpful https://github.com/stanfordnlp/stanza/commit/b1d10d3bdd892c7f68d2da7f4ba68a6ae3087f52
Grad clipping in conparse training https://github.com/stanfordnlp/stanza/commit/365066add019096332bcba0da4a626f68b70d303
GPU memory savings: charlm reused between different processors in the same pipeline https://github.com/stanfordnlp/stanza/pull/1028
Word vectors not saved in the NER models. Saves bandwidth & disk space https://github.com/stanfordnlp/stanza/pull/1033
Functions to return tagsets for NER and conparse models https://github.com/stanfordnlp/stanza/issues/1066 https://github.com/stanfordnlp/stanza/pull/1073 https://github.com/stanfordnlp/stanza/commit/36b84db71f19e37b36119e2ec63f89d1e509acb0 https://github.com/stanfordnlp/stanza/commit/2db43c834bc8adbb8b096cf135f0fab8b8d886cb
displaCy integration with NER and dependency trees https://github.com/stanfordnlp/stanza/commit/20714137d81e5e63d2bcee420b22c4fd2a871306
Fix that it takes forever to tokenize a single long token (catastrophic backtracking in regex). Thanks to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook) https://github.com/stanfordnlp/stanza/pull/1056
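Catastrophic backtracking in general (not the specific pattern Stanza fixed): nested quantifiers such as `(a+)+b` force the engine to try exponentially many splits of the input before failing, while an equivalent pattern without the nesting fails in linear time:

```python
import re

# A classic backtracking trap: (a+)+b tries exponentially many ways to
# split a long run of "a"s before concluding there is no match.
# The equivalent unambiguous pattern a+b fails in linear time.
safe = re.compile(r"a+b")
text = "a" * 10000  # a long token with no final "b"

assert safe.match(text) is None          # returns immediately
assert safe.match(text + "b") is not None
```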
Starting a new CoreNLP client without a server shouldn't wait for the server to be available. Thanks to Mariano Crosetti https://github.com/stanfordnlp/stanza/issues/1059 https://github.com/stanfordnlp/stanza/pull/1061
Read raw glove word vectors (they have no header information) https://github.com/stanfordnlp/stanza/pull/1074
Ensure that illegal languages are not chosen by the LangID model https://github.com/stanfordnlp/stanza/issues/1076 https://github.com/stanfordnlp/stanza/pull/1077
Fix cache in Multilingual pipeline https://github.com/stanfordnlp/stanza/issues/1115 https://github.com/stanfordnlp/stanza/commit/cdf18d8b19c92b0cfbbf987e82b0080ea7b4db32
Fix loading of previously unseen languages in Multilingual pipeline https://github.com/stanfordnlp/stanza/issues/1101 https://github.com/stanfordnlp/stanza/commit/e551ebe60a4d818bc5ba8880dda741cc8bd1aed7
Fix that conparse would occasionally train to NaN early in the training https://github.com/stanfordnlp/stanza/commit/c4d785729e42ac90f298e0ef4ab487d14fa35591
W&B integration for all models: can be activated with --wandb flag in the training scripts https://github.com/stanfordnlp/stanza/pull/1040
New webpages for building charlm, NER, and Sentiment https://stanfordnlp.github.io/stanza/new_language_charlm.html https://stanfordnlp.github.io/stanza/new_language_ner.html https://stanfordnlp.github.io/stanza/new_language_sentiment.html
Script to download Oscar 2019 data for charlm from HF (requires the `datasets` module) https://github.com/stanfordnlp/stanza/pull/1014
Unify sentiment training into a Python script, replacing the old shell script https://github.com/stanfordnlp/stanza/pull/1021 https://github.com/stanfordnlp/stanza/pull/1023
Convert sentiment to use .json inputs. In particular, this helps with languages with spaces in words such as Vietnamese https://github.com/stanfordnlp/stanza/pull/1024
Slightly faster charlm training https://github.com/stanfordnlp/stanza/pull/1026
Data conversion of WikiNER generalized for retraining / add new WikiNER models https://github.com/stanfordnlp/stanza/pull/1039
XPOS factory now determined at start of POS training. Makes addition of new languages easier https://github.com/stanfordnlp/stanza/pull/1082
Checkpointing and continued training for charlm, conparse, sentiment https://github.com/stanfordnlp/stanza/pull/1090 https://github.com/stanfordnlp/stanza/commit/0e6de808eacf14cd64622415eeaeeac2d60faab2 https://github.com/stanfordnlp/stanza/commit/e5793c9dd5359f7e8f4fe82bf318a2f8fd190f54
Option to write the results of a NER model to a file https://github.com/stanfordnlp/stanza/pull/1108
Add fake dependencies to a conllu formatted dataset for better integration with evaluation tools https://github.com/stanfordnlp/stanza/commit/6544ef3fa5e4f1b7f06dbcc5521fbf9b1264197a
Convert an AMT NER result to Stanza .json https://github.com/stanfordnlp/stanza/commit/cfa7e496ca7c7662478e03c5565e1b2b2c026fad
Add a ton of language codes, including 3 letter codes for languages we generally treat as 2 letters https://github.com/stanfordnlp/stanza/commit/5a5e9187f81bd76fcd84ad713b51215b64234986 https://github.com/stanfordnlp/stanza/commit/b32a98e477e9972737ad64deea0bda8d6cebb4ec and others