CoreNLP Versions

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.

v4.5.7

2 weeks ago

UD converter upgrades

Inspired by https://github.com/UniversalDependencies/docs/issues/717, although the work is not finished

Add an option to use the PTBCorrector, which fixes many (although not all) incorrect POS tags https://github.com/stanfordnlp/CoreNLP/commit/5e57eaba40897ee93b69ed3f11bda511f6b427d8
Treat "sort of" the same as "kind of" https://github.com/stanfordnlp/CoreNLP/commit/bc4acf11d165c4185121ff501c26b354a05a2477
"en masse" is flat https://github.com/stanfordnlp/CoreNLP/commit/cb338cd57fdcd9ef0fc1aa1fe2fa563d578fea15
"dinna" is an MWT https://github.com/stanfordnlp/CoreNLP/commit/1dd746cfea4f82e3b1c161bcc95c457f0d8a2618
Use AUX as the POS in the converter when appropriate https://github.com/stanfordnlp/CoreNLP/commit/30f2f8e7d92492a152dd5fc8b85327860b44cc2a
Fix (heh) "all but" and "whether or not" https://github.com/stanfordnlp/CoreNLP/commit/25136768ee22e5431051d756c4c63c41af00de99
Dependency dep -> ccomp for fronted say verbs https://github.com/stanfordnlp/CoreNLP/commit/a76a854ce249ae028eec010b1a48d68748d59a61

Parser evaluation improvements

Include the F1 scores of each tree when scoring a constituency dataset https://github.com/stanfordnlp/CoreNLP/commit/2725b06fa96400e9c25e314b5e16b18720764ab2
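
For readers unfamiliar with the metric, a per-tree F1 in constituency evaluation is usually the harmonic mean of labeled-bracket precision and recall. The sketch below shows only that arithmetic. The class name `BracketF1` and the `"LABEL:start-end"` string encoding are hypothetical, and treating brackets as a set (rather than a multiset, as a full scorer would) is a simplifying assumption; this is not CoreNLP's evaluation code.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative per-tree labeled-bracket F1; not CoreNLP's scorer. */
final class BracketF1 {
  /**
   * Each bracket is encoded as "LABEL:start-end" (a hypothetical encoding).
   * Returns F1 in [0, 1], or 0 when no bracket matches.
   */
  static double f1(Set<String> gold, Set<String> guess) {
    Set<String> matched = new HashSet<>(guess);
    matched.retainAll(gold); // brackets that appear in both trees
    if (matched.isEmpty()) {
      return 0.0; // also covers an empty gold or guess set
    }
    double precision = (double) matched.size() / guess.size();
    double recall = (double) matched.size() / gold.size();
    return 2 * precision * recall / (precision + recall);
  }

  public static void main(String[] args) {
    Set<String> gold = Set.of("S:0-3", "NP:0-1", "VP:1-3", "NP:2-3");
    Set<String> guess = Set.of("S:0-3", "NP:0-1", "VP:1-3", "PP:2-3");
    // Three of four brackets match on each side.
    System.out.println(f1(gold, guess));
  }
}
```

Reporting the score per tree, rather than only the corpus-level aggregate, makes it easier to find the individual sentences a parser handles badly.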

v4.5.6

3 months ago

v4.5.5

8 months ago

Ssurgeon updates beyond the capabilities listed in the GURT paper

MergeNodes operation: combine two words into one word in a graph. One word must be a leaf headed by the other for this to work https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/0660fa9d0932fa35b4da1b9817f567b968c2a8ec
CombineMWT operation: mark MWT on two or more words. Stanza will treat these as Token https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/010a955f6faafcfcf0e9a2a42302073ae34cb27b
DeleteLeaf operation: remove a leaf, renumber the subsequent words https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/429f61aafd89c9c5406873f908380e6fc61f23c8

Bugfixes

fix graph serialization for sentences longer than 128 words (IdentityHashSet doesn't work for integers beyond 128) https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/d8d9d9fdded4fc2a578258cd78bd15462c004b1b
fix valueOf for SemanticGraph if a word is just a dash https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/203eb065cbd86e34ae9388fe6515ef278d580374
fix memory usage of evaluating a PCFG model, which would run out of memory because it was saving all of the charts while evaluating https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/b2e67b000461af663c22962940b795961adbe7aa
Tregex pattern would not correctly display when using optional patterns: https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/a9965b2bbca615c8838cc2b24ca1403ec545c98c https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/8659653dc81827b249277d531f781bc926540743
Tregex would infinite loop on certain optional patterns which were theoretically legal https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/cc7983ec267b77a0eb9b2df1f8b5467cf47a1cd9
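
The graph serialization fix above most likely rests on a general Java detail: identity-based collections compare elements with `==`, and boxed `Integer` objects are only guaranteed to be shared within the small-value cache (-128 to 127). The sketch below illustrates that behavior with a JDK `IdentityHashMap`-backed set standing in for CoreNLP's `IdentityHashSet`; the class and method names are hypothetical and this is not the code that was changed.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

/** Shows why identity-based sets are unreliable for boxed integers. */
final class IdentityBoxingDemo {
  /** A JDK stand-in for an identity-based set (compares with ==). */
  static Set<Integer> identitySet() {
    return Collections.newSetFromMap(new IdentityHashMap<>());
  }

  /** Adds one boxed copy of value, then looks up a separately boxed copy. */
  static boolean foundAfterReboxing(Set<Integer> set, int value) {
    set.add(Integer.valueOf(value));
    return set.contains(Integer.valueOf(value));
  }

  public static void main(String[] args) {
    // Inside the Integer cache: both boxings are the same object, so this is found.
    System.out.println(foundAfterReboxing(identitySet(), 100));
    // Outside the cache: on a default JVM the two boxings are usually
    // distinct objects, so an identity lookup typically misses.
    System.out.println(foundAfterReboxing(identitySet(), 1000));
    // An equals-based set finds the value regardless of boxing.
    System.out.println(foundAfterReboxing(new HashSet<>(), 1000));
  }
}
```

A sentence with more words than the cache covers would hit the second case, which is consistent with the failure only appearing on long sentences.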

Security fixes

update xom to 1.3.9, which should avoid unwanted, potentially vulnerable transitive dependencies https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/c8772b740dbde0e50a1f4cbc941b368710c9de16
remove bz2 zip & unzip, which used a shell command and therefore could be hijacked https://nvd.nist.gov/vuln/detail/CVE-2023-39020

English dependency converter fixes

addressing issue https://github.com/stanfordnlp/CoreNLP/discussions/1363
fix (QP up to ...) https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/8c46648e452e2f074cda695b5d32ad09a40f363a https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/9a86ece4dd8c4b823b5c5f40b22352489ccd8835
fix "up to 1700 kilograms" if misparsed in a predictable manner https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/6e145278f82156575ec53782f802dff3d5ae507b
better LST coverage https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/5745de5b4309ed3090ecd785fc3e5bfe6f696cf5
vmod/acl when the parser misinterprets NP vs NML https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/ad4556d8c1146f3ee6c89c52770e8d4a4a072394
treat lists of NML as repeated modifiers of a noun, instead of a list, as that is the likely meaning of NML. example: a 72-game, three-month season from PTB https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/61ef545efac3eda7c46f29b3c01a38c8aa26a924 https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/5e748dcfd7eabd04009d450c60f29f8d097d9570

Server features

Scenegraph endpoint https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/8b40947914884381ce54fa35c0ced9f0a26e764b https://github.com/stanfordnlp/CoreNLP/issues/1346
remove one json library to reduce number of json libraries we depend on https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/357b1bb22251fa220b4ee6337c329e0c8244122d

Small changes

allow "fourty" as a number in SUTime https://github.com/stanfordnlp/CoreNLP/commit/7fbb7b81d37c24512677f82169ade111c1e023b3
capture "forty (40) days" as a duration in SUTime https://github.com/stanfordnlp/CoreNLP/commit/b3c47a05395b2d515e0f75ca9fafada0099ee758
feature to print out the feature index of an NER model as a text file https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/f6366737dbca53f0f8312ea508c1cd3607d4e263
clarify the INTJ rule for the ChineseHeadFinder https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/56cd6bb3e71d5a434ddc8b8a6f25b3ff50f85436
consider { } as punctuation when scoring English constituency treebanks https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/a606afa9e2906ebad3d860107350f204d7d357d8
fix error in test case, from @tanloong https://github.com/stanfordnlp/CoreNLP/pull/1373 https://github.com/stanfordnlp/CoreNLP/issues/1372
dead code cleanup https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/86b6a03cd61d2fdc011849c795da12a6a24dcec6

v4.5.4

1 year ago

Minor Ssurgeon bugfixes (make it harder to infinite loop with EditNode or RelabelNamedEdge)
Add a ReattachNamedEdge which is a combination of RemoveNamedEdge and AddEdge with new endpoints
include the Morphology CLI for using the CoreNLP lemmatizer from elsewhere, such as Python

v4.5.3

1 year ago

Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as AnnotationLookup. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.

Ssurgeon / Semgrex

Update Semgrex and Ssurgeon to match the paper published at GURT: https://aclanthology.org/2023.tlt-1.7/

Bugfixes

Fix "Could not match" errors which occurred when scoring treebanks using a tagger that produces non-gold punct tags: https://github.com/stanfordnlp/CoreNLP/pull/1344
Fix typo in KBP children rules: https://github.com/stanfordnlp/CoreNLP/commit/dbdb55b32fc623228b244035eb3c90b85088dbbd

Minor features

Add the choice of dependency graph to output to the TextOutputter https://github.com/stanfordnlp/CoreNLP/commit/33e6c42b37a0e99c7ab6b37551fcc44c2fb2651b https://github.com/stanfordnlp/CoreNLP/issues/1339
Hopefully minor interface change: make relation in SemanticGraphEdge final, get rid of setRelation https://github.com/stanfordnlp/CoreNLP/commit/e7a7657713e6feb2b048eb717a28ba82f2a64fdd

v4.5.2

1 year ago

Bugfixes

Tokenize c'mon and $$$ https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/1e216deaca90c16fdffa396aebbe9d128778c29d
Tokenize 'email' https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/76b5a6b3c20e518041f988638606cb1e60070be3 https://github.com/stanfordnlp/CoreNLP/issues/1316
Return empty mentions for empty document https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/da086643647907da3cce036ba0c389f078a90342 https://github.com/stanfordnlp/CoreNLP/issues/1322
Fix CLI protobuf tools running too fast for some network conditions: https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/412da5c4e1d457418a07ba639ffc2b2e080d0df1

CLI protobuf tools

Add output of lemmatizer to words https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/71bc95dfaf984f7056e0856414738be0706cf9e3
Convert constituency trees to dependencies https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/b118082c1403c8ad7d3c18fc7a211d24fcccb173

Dependency updates

Protobuf 3.19.6 https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/0439b623a4828dfaa4902d2eaac6cc2b9a46973c
xom 1.3.8, which no longer automatically includes xalan https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/3ded6f03f3956e83439755351048401dae85bf72

Semgraph / Semgrex improvements

Allow reuse of indices in SemanticGraph.valueOf https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/cf97e3647582ea492ca9e4aff2ed9268231201db
Add Semgrex relations to match the capabilities introduced in Spacy https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/98be52a88a996feb6896bddb5f61ea8074eb3ae0

v4.5.1

1 year ago

CoreNLP 4.5.1

Bugfixes!

Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word https://github.com/stanfordnlp/CoreNLP/commit/974383ab7336a254d260264885186dd77df0cf81
Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. https://github.com/stanfordnlp/CoreNLP/issues/1289 https://github.com/stanfordnlp/CoreNLP/commit/655018895e2f2870ce721de42d31b845fa991335
Fix \r\n not being properly processed on Windows: #1291 https://github.com/stanfordnlp/CoreNLP/commit/9889f4ef4ee9feb8b70f577db8353c8d6c896ae3
Handle one half of surrogate character pairs in the tokenizer w/o crashing https://github.com/stanfordnlp/CoreNLP/issues/1298 https://github.com/stanfordnlp/CoreNLP/commit/1b12faa64b9ea85f808b27ab74ccf9f79ccb01f4
Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: https://github.com/stanfordnlp/CoreNLP/issues/1296 https://github.com/stanfordnlp/CoreNLP/issues/1229 https://github.com/stanfordnlp/CoreNLP/issues/1169 https://github.com/stanfordnlp/CoreNLP/commit/f99b5ab87f073118a971c4d1e39df85ab9abbab1
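
The `LinkedHashMap` change above relies on a standard JDK property: `Properties` is hash-table based and makes no promise about iteration order, while `LinkedHashMap` iterates in insertion order. A minimal sketch of the difference follows; the option names and the `OptionOrderDemo` class are hypothetical and are not taken from the PTBTokenizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Compares key iteration order of LinkedHashMap and Properties. */
final class OptionOrderDemo {
  /** Returns the keys in whatever order the map iterates over them. */
  static List<String> keysInIterationOrder(Map<?, ?> options) {
    List<String> keys = new ArrayList<>();
    for (Object key : options.keySet()) {
      keys.add(String.valueOf(key));
    }
    return keys;
  }

  public static void main(String[] args) {
    // Hypothetical option names, chosen only for illustration.
    String[] names = {"optionC", "optionA", "optionB"};
    Map<String, String> linked = new LinkedHashMap<>();
    Properties props = new Properties();
    for (String name : names) {
      linked.put(name, "true");
      props.setProperty(name, "true");
    }
    // Insertion order is preserved: [optionC, optionA, optionB]
    System.out.println(keysInIterationOrder(linked));
    // Order is unspecified and may differ from insertion order.
    System.out.println(keysInIterationOrder(props));
  }
}
```

When later options can override earlier ones, a predictable processing order is what makes the final configuration reproducible.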

v4.5.0

1 year ago

CoreNLP 4.5.0

Main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex

All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example https://github.com/stanfordnlp/CoreNLP/commit/d46fecd93c6964f635efe85d9b7c327ee8880fb9
log4j removed entirely from public CoreNLP (internal "research" branch still has a use) https://github.com/stanfordnlp/CoreNLP/commit/f05cb54ec0a4f3c90395771817f44a81eb549baf
Fix NumberFormatException showing up in NER models: https://github.com/stanfordnlp/CoreNLP/issues/547 https://github.com/stanfordnlp/CoreNLP/commit/5ee2c391104109a338a28f35c647b7684b00ad41
Fix "seconds" in the lemmatizer: https://github.com/stanfordnlp/CoreNLP/commit/e7a073bde9ba7bbdb40ba81ed96d379455629e44
Fix double escaping of & in the online demos: https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb
Report the cause of an error if "tregex" is asked for but no parse annotator is added: https://github.com/stanfordnlp/CoreNLP/commit/4db80c051322697c983ecda873d8d38f808cb96c
Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): https://github.com/stanfordnlp/CoreNLP/pull/1259
Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: https://github.com/stanfordnlp/CoreNLP/pull/1263
Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: https://github.com/stanfordnlp/CoreNLP/commit/3c40ba32ca51af02936b907d03406e2158883f7b https://github.com/stanfordnlp/CoreNLP/commit/58a2288239f631df47fac3eed105fe78c08b1a5d https://github.com/stanfordnlp/CoreNLP/commit/8b97d64e48e6d4161f62a8635d2bb4cee2e95553
Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas https://github.com/stanfordnlp/CoreNLP/commit/9476a8eb724e01df4b05bce38789dd8a7e61397c https://github.com/stanfordnlp/CoreNLP/commit/6193934af8ae0abb0b4c6a2522d7efdfa426e5b3 https://github.com/stanfordnlp/CoreNLP/commit/afb1ea89c874acd58bab584f1e29a059c44dfd20 https://github.com/stanfordnlp/CoreNLP/commit/7c84960df4ac9d391ef37855572e2f8bc301ee17
Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases https://github.com/stanfordnlp/CoreNLP/pull/1266
Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) https://github.com/stanfordnlp/CoreNLP/commit/45b47e245c367663bba2e81a26ea7c29262ad0d8
Trim words in the NER training process. spaces can still be inside a word, but random whitespace won't ruin the performance of the models https://github.com/stanfordnlp/CoreNLP/commit/0d9e9c829bfa75bb661cccea03fc682a0f955f0d
Fix NBSP in the Chinese segmenter https://github.com/stanfordnlp/stanza/issues/1052 https://github.com/stanfordnlp/CoreNLP/pull/1279

v4.4.0

2 years ago

Enhancements

added -preTokenized option, which assumes text should be tokenized on whitespace and sentence-split on newlines
tsurgeon CLI - python side added to stanza https://github.com/stanfordnlp/CoreNLP/pull/1240
sutime WORKDAY definition https://github.com/stanfordnlp/CoreNLP/commit/0dfb11817c2b46a532985c24289e128fbb81a2c0
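
To make the -preTokenized behavior described above concrete, here is a minimal sketch of the splitting rule as the note states it: one sentence per line, one token per whitespace-separated chunk. It uses only the JDK; the class name `PreTokenizedSketch` is hypothetical, skipping blank lines is an assumption, and this is not CoreNLP's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrates whitespace tokenization with newline sentence splitting. */
final class PreTokenizedSketch {
  /** Returns one token list per non-blank input line. */
  static List<List<String>> split(String text) {
    List<List<String>> sentences = new ArrayList<>();
    for (String line : text.split("\\R")) { // \R matches any line break
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // assumption: blank lines produce no sentence
      }
      sentences.add(Arrays.asList(trimmed.split("\\s+")));
    }
    return sentences;
  }

  public static void main(String[] args) {
    // Input that was already tokenized by some other tool.
    String text = "Do n't stop .\nIt 's fine .";
    // prints [[Do, n't, stop, .], [It, 's, fine, .]]
    System.out.println(split(text));
  }
}
```

An option like this is useful when the tokenization comes from an external source, such as a treebank, and must not be altered by the pipeline.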

Fixes

rebuilt Italian dependency parser using CoreNLP predicted tags
XML security issue: https://github.com/stanfordnlp/CoreNLP/pull/1241
NER server security issue: https://github.com/stanfordnlp/CoreNLP/commit/5ee097dbede547023e88f60ed3f430ff09398b87
fix infinite loop in tregex: https://github.com/stanfordnlp/CoreNLP/pull/1238
json utf-8 output on windows https://github.com/stanfordnlp/CoreNLP/pull/1231 https://github.com/stanfordnlp/stanza/issues/894
fix openie crash in certain unusual graphs https://github.com/stanfordnlp/CoreNLP/pull/1230 https://github.com/stanfordnlp/CoreNLP/issues/1082
fix nondeterministic results in certain SemanticGraph structures https://github.com/stanfordnlp/CoreNLP/pull/1228 https://github.com/stanfordnlp/CoreNLP/commit/cc806f265292977b69fd55f36408fe5ad3a695a0
workaround for NLTK sending % unescaped to the server https://github.com/stanfordnlp/CoreNLP/issues/1226 https://github.com/stanfordnlp/CoreNLP/commit/20fe1e996455b1c1434022d6e7f0b8524f41f253
make TimingTest function on Windows https://github.com/stanfordnlp/CoreNLP/commit/4aafb84f6ea5c0102c921a503cbfb8e3d34f3e22
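
The NLTK item above concerns requests that contain a literal `%` that was never URL-encoded; the JDK's `URLDecoder` rejects such input with an `IllegalArgumentException`. One possible workaround, sketched below, is to escape any `%` that is not followed by two hex digits before decoding. This is a heuristic under stated assumptions, not the change CoreNLP made: the class `LenientPercentDecoder` is hypothetical, and the `URLDecoder.decode(String, Charset)` overload requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

/** Decodes URL-encoded text while tolerating stray '%' characters. */
final class LenientPercentDecoder {
  /** Rewrites any '%' not followed by two hex digits as "%25". */
  static String escapeStrayPercents(String raw) {
    return raw.replaceAll("%(?![0-9a-fA-F]{2})", "%25");
  }

  /** Decodes as UTF-8 after escaping stray percent signs. */
  static String decode(String raw) {
    return URLDecoder.decode(escapeStrayPercents(raw), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // A stray percent sign survives decoding instead of causing an error.
    System.out.println(decode("100% sure"));
    // Properly encoded input is decoded as usual.
    System.out.println(decode("a%20b"));
  }
}
```

The heuristic cannot tell a stray `%` that happens to precede two hex digits (for example in "50%20 off") from a real escape, so it only helps with clearly malformed input.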

v4.3.2

2 years ago

Fixes

fix issues with default Italian pipeline
