CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
Inspired by https://github.com/UniversalDependencies/docs/issues/717, although the work is not finished
`sort of` the same as `kind of` https://github.com/stanfordnlp/CoreNLP/commit/bc4acf11d165c4185121ff501c26b354a05a2477
`en masse` is flat https://github.com/stanfordnlp/CoreNLP/commit/cb338cd57fdcd9ef0fc1aa1fe2fa563d578fea15
`dinna` is an MWT https://github.com/stanfordnlp/CoreNLP/commit/1dd746cfea4f82e3b1c161bcc95c457f0d8a2618
`AUX` as the POS in the converter when appropriate https://github.com/stanfordnlp/CoreNLP/commit/30f2f8e7d92492a152dd5fc8b85327860b44cc2a
`all but` and `whether or not` https://github.com/stanfordnlp/CoreNLP/commit/25136768ee22e5431051d756c4c63c41af00de99
`dep` -> `ccomp` for fronted `say` verbs https://github.com/stanfordnlp/CoreNLP/commit/a76a854ce249ae028eec010b1a48d68748d59a61
`IntervalTree` https://github.com/stanfordnlp/CoreNLP/issues/1405 https://github.com/stanfordnlp/CoreNLP/commit/6d17c2390bcf745f919134a5725629783086f712
`yield`, which became a keyword: https://github.com/stanfordnlp/CoreNLP/commit/e5c9d443984e1f7434f588e07e0e3212c33f8841 https://github.com/stanfordnlp/CoreNLP/commit/b084233fd6d5da6474d27c6d6832fd35b3a9cb8b
`Token` https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/010a955f6faafcfcf0e9a2a42302073ae34cb27b
`IdentityHashSet` doesn't work for integers beyond 128 https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/d8d9d9fdded4fc2a578258cd78bd15462c004b1b
`valueOf` for SemanticGraph if a word is just a dash https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/203eb065cbd86e34ae9388fe6515ef278d580374
`(QP up to ...)` https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/8c46648e452e2f074cda695b5d32ad09a40f363a https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/9a86ece4dd8c4b823b5c5f40b22352489ccd8835
`up to 1700 kilograms` if misparsed in a predictable manner https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/6e145278f82156575ec53782f802dff3d5ae507b
`LST` coverage https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/5745de5b4309ed3090ecd785fc3e5bfe6f696cf5
`vmod/acl` when the parser misinterprets `NP` vs `NML` https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/ad4556d8c1146f3ee6c89c52770e8d4a4a072394
`NML` as repeated modifiers of a noun, instead of a list, as that is the likely meaning of `NML`. Example: `a 72-game, three-month season` from PTB https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/61ef545efac3eda7c46f29b3c01a38c8aa26a924 https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/5e748dcfd7eabd04009d450c60f29f8d097d9570
`fourty` as a number in SUTime https://github.com/stanfordnlp/CoreNLP/commit/7fbb7b81d37c24512677f82169ade111c1e023b3
`forty (40) days` as a duration in SUTime https://github.com/stanfordnlp/CoreNLP/commit/b3c47a05395b2d515e0f75ca9fafada0099ee758
`{` `}` as punctuation when scoring English constituency treebanks https://github.com/stanfordnlp/CoreNLP/pull/1385/commits/a606afa9e2906ebad3d860107350f204d7d357d8
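
The `IdentityHashSet` item above comes down to how Java boxes integers: `Integer.valueOf` is only guaranteed to return the same object for values from -128 to 127, so a set that compares elements with `==` can miss larger values. The sketch below illustrates this with a JDK identity set built from `IdentityHashMap`; it is a stand-in for illustration, not CoreNLP's own `IdentityHashSet`, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class IdentitySetDemo {
  /** Builds a set that compares elements with == instead of equals(). */
  static Set<Integer> identitySet() {
    return Collections.newSetFromMap(new IdentityHashMap<Integer, Boolean>());
  }

  /** Adds a boxed value, then looks it up with a separately boxed copy. */
  static boolean foundByIdentity(int value) {
    Set<Integer> set = identitySet();
    set.add(Integer.valueOf(value));
    return set.contains(Integer.valueOf(value));
  }

  /** Same experiment with an equals()-based HashSet. */
  static boolean foundByEquals(int value) {
    Set<Integer> set = new HashSet<>();
    set.add(Integer.valueOf(value));
    return set.contains(Integer.valueOf(value));
  }

  public static void main(String[] args) {
    // Inside the guaranteed cache range both boxings are the same object.
    System.out.println("identity, 100:  " + foundByIdentity(100));
    // Outside the cache range the two boxings are usually distinct objects,
    // so this is typically false with default JVM settings.
    System.out.println("identity, 1000: " + foundByIdentity(1000));
    // An equals()-based set finds the value regardless of boxing.
    System.out.println("equals, 1000:   " + foundByEquals(1000));
  }
}
```

The result for 1000 depends on the JVM's boxing cache size, which is why an identity-based collection is a risky choice for integer keys.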
Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as `AnnotationLookup`. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.
Make `relation` in `SemanticGraphEdge` final, get rid of `setRelation` https://github.com/stanfordnlp/CoreNLP/commit/e7a7657713e6feb2b048eb717a28ba82f2a64fdd
`c'mon` and `$$$` https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/1e216deaca90c16fdffa396aebbe9d128778c29d
`'email'` https://github.com/stanfordnlp/CoreNLP/pull/1332/commits/76b5a6b3c20e518041f988638606cb1e60070be3 https://github.com/stanfordnlp/CoreNLP/issues/1316
`LinkedHashMap` in the PTBTokenizer instead of `Properties`. Keeps the option processing order predictable. https://github.com/stanfordnlp/CoreNLP/issues/1289 https://github.com/stanfordnlp/CoreNLP/commit/655018895e2f2870ce721de42d31b845fa991335
`\r\n` not being properly processed on Windows: #1291 https://github.com/stanfordnlp/CoreNLP/commit/9889f4ef4ee9feb8b70f577db8353c8d6c896ae3
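
The `LinkedHashMap` item above matters because `Properties` is hash-based and makes no promise about iteration order, while `LinkedHashMap` iterates in insertion order. The following is a minimal sketch of order-preserving option parsing, assuming a comma-separated `key=value` option string; the `parseOptions` helper is hypothetical and is not the PTBTokenizer's actual parsing code, and the option names are used only as sample input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OrderedOptionsDemo {
  /**
   * Parses "a,b=false,c" into a map whose iteration order matches the
   * order in which the options were written. A bare option name is
   * treated as "true" (an assumption made for this sketch).
   */
  static Map<String, String> parseOptions(String options) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String option : options.split(",")) {
      String trimmed = option.trim();
      if (trimmed.isEmpty()) {
        continue; // ignore empty entries such as a trailing comma
      }
      int eq = trimmed.indexOf('=');
      if (eq < 0) {
        parsed.put(trimmed, "true");
      } else {
        parsed.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
      }
    }
    return parsed;
  }

  public static void main(String[] args) {
    Map<String, String> opts =
        parseOptions("ptb3Escaping=false, invertible, normalizeParentheses=true");
    // Entries are visited in the order they appeared in the option string.
    opts.forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```

When one option overrides or depends on another, processing them in a fixed order makes the outcome reproducible from run to run.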
Main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex
All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example https://github.com/stanfordnlp/CoreNLP/commit/d46fecd93c6964f635efe85d9b7c327ee8880fb9
log4j removed entirely from public CoreNLP (internal "research" branch still has a use) https://github.com/stanfordnlp/CoreNLP/commit/f05cb54ec0a4f3c90395771817f44a81eb549baf
Fix NumberFormatException showing up in NER models: https://github.com/stanfordnlp/CoreNLP/issues/547 https://github.com/stanfordnlp/CoreNLP/commit/5ee2c391104109a338a28f35c647b7684b00ad41
Fix "seconds" in the lemmatizer: https://github.com/stanfordnlp/CoreNLP/commit/e7a073bde9ba7bbdb40ba81ed96d379455629e44
Fix double escaping of & in the online demos: https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb
Report the cause of an error if "tregex" is asked for but no parse annotator is added: https://github.com/stanfordnlp/CoreNLP/commit/4db80c051322697c983ecda873d8d38f808cb96c
Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): https://github.com/stanfordnlp/CoreNLP/pull/1259
Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: https://github.com/stanfordnlp/CoreNLP/pull/1263
Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: https://github.com/stanfordnlp/CoreNLP/commit/3c40ba32ca51af02936b907d03406e2158883f7b https://github.com/stanfordnlp/CoreNLP/commit/58a2288239f631df47fac3eed105fe78c08b1a5d https://github.com/stanfordnlp/CoreNLP/commit/8b97d64e48e6d4161f62a8635d2bb4cee2e95553
Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas https://github.com/stanfordnlp/CoreNLP/commit/9476a8eb724e01df4b05bce38789dd8a7e61397c https://github.com/stanfordnlp/CoreNLP/commit/6193934af8ae0abb0b4c6a2522d7efdfa426e5b3 https://github.com/stanfordnlp/CoreNLP/commit/afb1ea89c874acd58bab584f1e29a059c44dfd20 https://github.com/stanfordnlp/CoreNLP/commit/7c84960df4ac9d391ef37855572e2f8bc301ee17
Significant lemmatizer improvements: adjectives & adverbs, along with various other special cases https://github.com/stanfordnlp/CoreNLP/pull/1266
Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) https://github.com/stanfordnlp/CoreNLP/commit/45b47e245c367663bba2e81a26ea7c29262ad0d8
Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models https://github.com/stanfordnlp/CoreNLP/commit/0d9e9c829bfa75bb661cccea03fc682a0f955f0d
Fix NBSP in the Chinese segmenter https://github.com/stanfordnlp/stanza/issues/1052 https://github.com/stanfordnlp/CoreNLP/pull/1279
added -preTokenized option, which assumes text should be tokenized on whitespace and sentence split on newlines
tsurgeon CLI - python side added to stanza https://github.com/stanfordnlp/CoreNLP/pull/1240
sutime WORKDAY definition https://github.com/stanfordnlp/CoreNLP/commit/0dfb11817c2b46a532985c24289e128fbb81a2c0
rebuilt Italian dependency parser using CoreNLP predicted tags
XML security issue: https://github.com/stanfordnlp/CoreNLP/pull/1241
NER server security issue: https://github.com/stanfordnlp/CoreNLP/commit/5ee097dbede547023e88f60ed3f430ff09398b87
fix infinite loop in tregex: https://github.com/stanfordnlp/CoreNLP/pull/1238
json utf-8 output on windows https://github.com/stanfordnlp/CoreNLP/pull/1231 https://github.com/stanfordnlp/stanza/issues/894
fix openie crash in certain unusual graphs https://github.com/stanfordnlp/CoreNLP/pull/1230 https://github.com/stanfordnlp/CoreNLP/issues/1082
fix nondeterministic results in certain SemanticGraph structures https://github.com/stanfordnlp/CoreNLP/pull/1228 https://github.com/stanfordnlp/CoreNLP/commit/cc806f265292977b69fd55f36408fe5ad3a695a0
workaround for NLTK sending % unescaped to the server https://github.com/stanfordnlp/CoreNLP/issues/1226 https://github.com/stanfordnlp/CoreNLP/commit/20fe1e996455b1c1434022d6e7f0b8524f41f253
make TimingTest function on Windows https://github.com/stanfordnlp/CoreNLP/commit/4aafb84f6ea5c0102c921a503cbfb8e3d34f3e22
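
The -preTokenized item in the list above describes a simple contract: one sentence per line, tokens separated by whitespace. The sketch below shows what that contract implies for input text. It is only an illustration of the described behavior, not the CoreNLP implementation; the class name is hypothetical, and skipping blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PreTokenizedDemo {
  /** Splits text into sentences on line breaks and into tokens on whitespace. */
  static List<List<String>> split(String text) {
    List<List<String>> sentences = new ArrayList<>();
    // \R matches any line break, including \r\n on Windows.
    for (String line : text.split("\\R")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // assumption: blank lines do not produce empty sentences
      }
      sentences.add(Arrays.asList(trimmed.split("\\s+")));
    }
    return sentences;
  }

  public static void main(String[] args) {
    // Punctuation is already separated by spaces, as pretokenized input would be.
    System.out.println(split("The quick fox jumped .\nIt landed safely .\n"));
  }
}
```

With input prepared this way, token boundaries come entirely from the caller, which is useful when the tokens must line up with an existing annotation.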