💫 Industrial-strength Natural Language Processing (NLP) in Python
spacy.ConsoleLogger.v2 optionally saves training logs to JSONL (#11214).
DependencyMatcher to include matching parents or children to the left or the right of the node (#10371).
cuda11x and cuda-autodetect (using cupy-wheel) (#11279).
Doc.to_json() and Doc.from_json() (#11125).
enable and disable options for spacy.load() more consistent (#11459).
disable/enable/exclude for spacy.load() (#11406).
--url flag for spacy info to print the direct download URL for a pipeline (#11175).
spacy project CLI (#11226).
spacy debug data CLI for spancat data (#11504).
spacy_version in spacy package metadata (#11552).
spacy project assets (#11458).
spacy pretrain command (#11210).
natto-py for the ko extra (#11222).

This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, earlier pipeline versions did not consistently predict entities that appear in the sections of the OntoNotes training data without NER annotation, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).
Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:
NAME SPACY VERSION
en_core_web_md >=3.4.0,<3.5.0 3.4.1 ✔
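Because trained pipelines are installed as regular Python packages, the installed version can also be checked from Python. The following is a minimal sketch using only the standard library; the helper name `installed_pipeline_version` is hypothetical, and it assumes the pipeline was installed under its package name (for example via spacy download).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_pipeline_version(package: str) -> Optional[str]:
    """Return the installed version of a pipeline package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string when the package is installed in this
    # environment, otherwise None.
    print(installed_pipeline_version("en_core_web_md"))
```

This only reports what is installed; spacy validate additionally checks compatibility with the installed spaCy version.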
SetPredicate.
Doc.__init__.
pymorphy2_lookup lemmatizer mode for Russian and Ukrainian.
Doc type, an error will now be raised (#11424).
spacy.models_and_pipes_with_nvtx_range.v1 callback.
Example API documentation.
displacy docs.
spacy project dvc.
spacy-wordnet.
initialize() function for pipeline components.
@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy
@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic
{n,m} operator for Matcher patterns (#10981).
saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
Example.get_aligned_parse and Example.get_aligned (#10952).
StringStore lookups (#10938).
spacy project clone to try both main and master branches by default (#10843).
init_config_cli (#10788).
debug data (#10960).
TrainablePipe components (#10965).
SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).

We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.
Package | UPOS | Parser LAS | NER F |
---|---|---|---|
hr_core_news_sm | 96.6 | 77.5 | 76.1 |
hr_core_news_md | 97.3 | 80.1 | 81.8 |
hr_core_news_lg | 97.5 | 80.4 | 83.0 |
🙏 Special thanks to @gtoffoli for help with the new pipelines!
The English pipelines have new word vectors:
Package | Model Version | TAG | Parser LAS | NER F |
---|---|---|---|---|
en_core_web_md | v3.3.0 | 97.3 | 90.1 | 84.6 |
en_core_web_md | v3.4.0 | 97.2 | 90.3 | 85.5 |
en_core_web_lg | v3.3.0 | 97.4 | 90.1 | 85.3 |
en_core_web_lg | v3.4.0 | 97.3 | 90.2 | 85.6 |
All CNN pipelines have been extended to add whitespace augmentation.
Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
get_array_module in textcat.
Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere
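The Doc.has_vector change noted in this release can be pictured with a small, library-free model. This is only a sketch of the described behavior, not spaCy's implementation: `ToyDoc`, its fields and both method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyDoc:
    """Simplified stand-in for a Doc: token texts plus a shared vector table."""

    words: List[str]
    vocab_vectors: Dict[str, List[float]]

    def has_vector_vocab_only(self) -> bool:
        # Earlier behavior as described: only checks that the vocab has vectors.
        return len(self.vocab_vectors) > 0

    def has_vector_per_token(self) -> bool:
        # Behavior as described for this release: at least one token has a vector.
        return any(word in self.vocab_vectors for word in self.words)


vectors = {"apple": [0.1, 0.2, 0.3]}
known = ToyDoc(["apple", "pie"], vectors)
unknown = ToyDoc(["xyzzy", "plugh"], vectors)

print(known.has_vector_vocab_only(), known.has_vector_per_token())      # True True
print(unknown.has_vector_vocab_only(), unknown.has_vector_per_token())  # True False
```

The second document is the case that changes: the vocab contains vectors, but none of its tokens has one.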
Doc.spans[spans_key].
Doc objects.
debug data.
Doc objects.
SpanGroup objects that share the same name within one SpanGroups container.
walk_head_nodes to avoid acquiring the GIL.
StringStore.__getitem__ return type dependent on its parameter type.
PhraseMatcher.
SpanGroups.setdefault to also support Iterable[SpanGroup] as the default.
ROOT is in the glossary.
Doc.has_annotation and Matcher.
Doc inputs passed to Language.pipe().
Doc.

Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline component itself through the name attribute. For example, the following pipeline component:
[components.transformer]
factory = "transformer"
name = "custom_transformer_name"
would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.
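The rule can be sketched in plain Python. This is an illustration of the behavior described above, not spaCy's code: `resolve_component_name`, the warning text and the dict-shaped config are assumptions made for the example.

```python
import warnings
from typing import Any, Dict


def resolve_component_name(section_name: str, component_config: Dict[str, Any]) -> str:
    """Return the name a component is registered under.

    The section name (e.g. "transformer" from [components.transformer]) always
    wins; a differing "name" entry in the config is ignored with a warning.
    """
    override = component_config.get("name")
    if override is not None and override != section_name:
        warnings.warn(
            f"Ignoring name override '{override}' for component '{section_name}'."
        )
    return section_name


config = {"factory": "transformer", "name": "custom_transformer_name"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    registered = resolve_component_name("transformer", config)

print(registered)   # transformer
print(len(caught))  # 1
```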
@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg
spacy.Tagger.v2 to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
Ragged with faster AlignmentArray in Example for training (#10319).
Matcher speed (#10659).
Doc.spans (#10250).
spacy init config -p trainable_lemmatizer or using the quickstart.
thinc v8.0.14+ and thinc-bigendian-ops.
spacy debug diff-config.
SpanCategorizer.set_candidates for debugging span suggesters.
spancat and trainable_lemmatizer components.

v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
---|---|---|
da_core_news_md | 84.9 | 94.8 |
de_core_news_md | 73.4 | 97.7 |
el_core_news_md | 56.5 | 88.9 |
fi_core_news_md | - | 86.2 |
it_core_news_md | 86.6 | 97.2 |
ko_core_news_md | - | 90.0 |
lt_core_news_md | 71.1 | 84.8 |
nb_core_news_md | 76.7 | 97.1 |
nl_core_news_md | 81.5 | 94.0 |
pl_core_news_md | 87.1 | 93.7 |
pt_core_news_md | 76.7 | 96.9 |
ro_core_news_md | 81.8 | 95.5 |
sv_core_news_md | - | 95.5 |
Scorer.score_cats for missing labels.
_ value for UPOS in CoNLL-U converter.
Span attributes consistently.
"spans" to the output of doc.to_json.
Matcher handling for all special cases.
Example to align whitespace annotation.
Tok2Vec for empty batches.
rehearse.
Vectors.n_keys for floret vectors.
meta in util.load_model_from_config.
Example.get_matching_ents.
Tokenizer.explain.
KoreanTokenizer tag map.
init vectors.
Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
(<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
Doc.from_docs now includes Doc.tensor by default and supports excludes with an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996
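The span ordering change noted above (#9956) can be pictured as tuple comparison over all four attributes. The sketch below is a library-free illustration, not spaCy's implementation: `ToySpan` is hypothetical, and it compares labels and KB IDs as plain strings, which spaCy itself may not do.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class ToySpan:
    """Hypothetical span; order=True compares fields in declaration order."""

    start: int
    end: int
    label: str = ""
    kb_id: str = ""


spans = [
    ToySpan(0, 2, "PERSON", "Q2"),
    ToySpan(0, 2, "ORG", "Q1"),
    ToySpan(0, 1, "PERSON"),
]

# Spans with identical boundaries are ordered by label, then by KB ID.
for span in sorted(spans):
    print(span.start, span.end, span.label, span.kb_id)
```

In this model, two spans with the same start and end no longer tie, which is why a sort over existing spans may come out in a slightly different order than before.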
spancat for empty docs and zero suggestions.
Lexeme.rank.
Tok2Vec for empty batches.
@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz