💫 Industrial-strength Natural Language Processing (NLP) in Python
- `parser` and `ner` speeds on long documents (see technical details in #10019).
- `spancat` components in `debug data`.
- `ENT_IOB` as a `Matcher` token pattern key.
- `ENT_IOB`.
- `debug data`.
- `Lexeme.rank`.
- `spacy project`.
- `Doc.from_docs()` for empty docs.
- `debug data` for components with custom names.
- `Underscore` and `DependencyMatcher` and improve types in `Language`, `Matcher` and `PhraseMatcher`.
- `Tokenizer.explain` when infixes appear as prefixes.
- `spancat` initialization.
- `IS_SENT_END` in `Doc.has_annotation`.
- `spacy package`.
- `PhraseMatcher`.
- `Dockerfile` for repeatable website builds and easier local development.

@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav
- `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce size of output docs.
- `ENT_ID` and `ENT_KB_ID` to `Matcher` pattern attributes.
- `kb_id` for entities in displaCy from `Doc` input.
- `Span.sents` property for spans spanning over more than one sentence.
- `EntityRuler.remove` to remove patterns by id.
- `Tagger` `neg_prefix` configurable.
- `Language.pipe` in `Language.evaluate` for more efficient processing.
- `JsonlCorpus` path optional again.
- `spancat` for empty docs and zero suggestions.
- `.jsonl` paths in `EntityRuler`.
- `Scorer.score_spans` to handle predicted docs with missing annotation.
- `parser` from reference parse rather than aligned example.
- `tagger` and `morphologizer`.
- `init_tok2vec` after pretraining, batch contract for listeners.
- `eng-spacysentiment`: Sentiment analysis for English.

@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar
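
As a rough illustration of the `doc_cleaner` component listed above, the sketch below adds it as the last step of a blank pipeline. The `fake_tensor` component is a hypothetical stand-in for a `tok2vec` or transformer component that would normally populate `doc.tensor`; it is not part of spaCy.

```python
import numpy
import spacy
from spacy.language import Language


@Language.component("fake_tensor")  # hypothetical component, for illustration only
def fake_tensor(doc):
    # Attach a dummy per-token tensor, as a tok2vec-style component might.
    doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("fake_tensor")
# Reset doc.tensor at the end of the pipeline to shrink the output docs.
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

# Calling the stand-in component directly shows what the tensor looks like
# before cleaning.
raw = fake_tensor(nlp.make_doc("This is a short sentence."))
print(raw.tensor.shape)

# Running the full pipeline leaves the tensor cleared.
doc = nlp("This is a short sentence.")
print(doc.tensor)
```

Keys in `attrs` name the `Doc` attributes to reset, so a transformer pipeline would presumably list `"_.trf_data"` here as well.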
- `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- `spacy.blank()` including IETF language tags, for example `fra` for French and `zh-Hans` for Chinese.
- `spacy-loggers` for additional loggers.
- `sudachipy` are annotated as `Token.morph` features.
- `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
- `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- `Token.pos` and `Token.morph`.

For more details, see the New in v3.2 usage guide.
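
The `Doc` input support for `nlp()` and `nlp.pipe()` mentioned above can be sketched roughly as follows. The example builds a `Doc` with hand-picked tokenization and passes it through a blank pipeline with a `sentencizer`; the sentence and token boundaries are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Custom tokenization: keep "New York" as a single token.
words = ["I", "love", "New York", "."]
spaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The pipeline runs on the existing Doc instead of re-tokenizing text.
processed = nlp(doc)
print([token.text for token in processed])
print([sent.text for sent in processed.sents])

# nlp.pipe() accepts Doc objects and plain strings side by side.
docs = list(nlp.pipe([Doc(nlp.vocab, words=["Hello", "world"]), "Plain text input."]))
print([len(d) for d in docs])
```

Because the tokenizer is skipped for `Doc` input, the custom token boundaries survive the pipeline unchanged.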
- `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- `Tokenizer`.
- In `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
- `pipelines/floret_vectors_demo`: basic floret vector training and importing.
- `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
- `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
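
To make the `DocBin` note above more concrete, here is a minimal sketch of how `store_user_data` affects a serialization round trip. The `"source"` key and the `round_trip` helper are hypothetical names used only for this example; the same option is set on both the writing and the reading `DocBin` to keep the behavior unambiguous.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Hello world")
doc.user_data["source"] = "example"  # hypothetical user data entry


def round_trip(doc, store_user_data):
    """Serialize a Doc to bytes and back with the given store_user_data setting."""
    doc_bin = DocBin(store_user_data=store_user_data)
    doc_bin.add(doc)
    loaded = DocBin(store_user_data=store_user_data).from_bytes(doc_bin.to_bytes())
    return list(loaded.get_docs(doc.vocab))[0]


restored = round_trip(doc, store_user_data=True)
plain = round_trip(doc, store_user_data=False)

print(restored.user_data)  # user data survives the round trip
print(plain.user_data)     # user data is dropped
```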
- `AppleOps`: `pip install spacy[apple]`.
- `spacy.models_with_nvtx_range.v1`.
- `mypy` integration in the CI and many type fixes across the code base.
- `Protocol` classes in `ty.py` to define behavior of pipeline components.
- `displacy`.
- `spacy project assets`.
- `train` function to run the training from Python scripts just like the `spacy train` CLI.
- `spacy-transformers>=1.1.0` with improved IO.
- `thinc>=8.0.11` with improved gradient clipping.
- `KnowledgeBase.set_entities`.
- `DocBin` constructor.
- `spacy project` title.
- `DependencyMatcher`.
- `textcat` and `textcat_multilabel` configurations.
- `Doc` object creation.
- `convert` CLI.
- `.pyi` files in the distributed package.
- `deplacy`: CUI-based dependency visualizer
- `ipymarkup`: Visualizations for NER and syntax trees
- `PhruzzMatcher`: Find fuzzy matches
- `spacy-huggingface-hub`: Push spaCy pipelines to the Hugging Face Hub
- `spaCyOpenTapioca`: Entity Linking on Wikidata
- `spacy-clausie`: Clause-based information extraction system

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker
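
The `train` function listed above can be called from a script along the following lines. This is only a sketch: `run_training` is a hypothetical wrapper, and the config, output and corpus paths in the usage note are placeholders that would need to point at real files.

```python
import inspect

from spacy.cli.train import train


def run_training(config_path, output_dir, train_corpus, dev_corpus):
    """Hypothetical wrapper: train a pipeline from Python instead of the CLI."""
    train(
        config_path,
        output_dir,
        # Overrides use the same dotted names as on the command line.
        overrides={"paths.train": str(train_corpus), "paths.dev": str(dev_corpus)},
    )


# Inspect the parameters the function accepts in the installed version.
print(list(inspect.signature(train).parameters))
```

A call such as `run_training("./config.cfg", "./output", "./train.spacy", "./dev.spacy")` would then correspond to running `spacy train` with the same config and path overrides.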
- `v3` of `WandbLogger` now supports optional `run_name` and `entity` parameters.
- `pos` values for a `Doc` or `Token`.
- `Matcher` callbacks.
- `config` in `create_pipe`.
- `typer` 0.4 to provide support for both Click 7 and Click 8.
- `spacy project` workflows.
- `repo` and `path` arguments in `spacy project`.
- `epoch_resume` in `spacy pretrain`.
- `spacy-legacy` in `spacy package` dependency detection.
- `spacy package`.
- `StringStore` and the `Vocab`.

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker
- `SpanCategorizer` predictions.
- `.pyi` stub files.
- `spacy package`.
- `INTERSECTS` operator for the Matcher.
- `spacy project` `push` and `pull` commands.
- `Span.as_doc` calls.
- `da` transformer is now the same as the one from the trained pipelines (`Maltehb/danish-bert-botxo`).
- `debug data` runs correctly with a custom tokenizer.
- `ISSUBSET` and `ISSUPERSET` in schema and docs.
- `no_skip` value for `spacy project run`.
- `ConsoleLogger` flush after each logging line.
- `exclude` when serializing the vocab.
- `allow_overlap` default for span categorizer scoring.
- `_SP`.

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker
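
One way the `INTERSECTS` operator listed above might be used is on the `MORPH` attribute, where a token matches if its morphological features share at least one value with the given list. The sketch below assumes this set-style behavior for `MORPH`; the words, features and the `SING_OR_DUAL` pattern name are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Build a Doc with hand-assigned morphological features.
doc = Doc(
    nlp.vocab,
    words=["cat", "cats", "sleep"],
    morphs=["Number=Sing", "Number=Plur", "VerbForm=Inf"],
)

matcher = Matcher(nlp.vocab)
# Match tokens whose features overlap with the listed values.
pattern = [{"MORPH": {"INTERSECTS": ["Number=Sing", "Number=Dual"]}}]
matcher.add("SING_OR_DUAL", [pattern])

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```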
- `debug data`.
- `load_lookups` return type and docstring.
- `EntityLinker` robust for `nO=None`.
- `minn` is not set.
- `debug model` for transformers.
- `ENT_KB_ID` in `ner` annotation.
- `Matcher(as_spans)` on spans.
- `Doc.from_docs()` for all empty docs.
- `textcat` with listener.
- `ENT_ID` and `NORM` to `DocBin` strings.
- `Span.as_doc`.
- `Span` attrs writable.
- `debug data` for `textcat`.
- `DocBin` is too large.
- `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- `Span.get_lca_matrix`.
- `attrs.IDS`.
- `spacy.batch_by_words.v1`.
- `EntityRuler`: `ent_ids` returns `None` for phrases.
- `EntityRuler`.
- `pymorphy2` requirement to `pymorphy2` mode in Russian and Ukrainian lemmatizers.
- `Doc`.
- `Span.lemma_`.
- `JsonlReader` path optional.
- `Example.from_dict`.
- `Doc.from_docs`.
- `textcat` with <2 labels.

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD
- `noun_chunk` iterator for Dutch.
- `black` & `flake8` as pre-commit hooks.
- `spacy.ngram_range_suggester.v1` for suggesting a range of n-gram sizes for the `spancat` component.
- `ru` and `uk` multiprocessing (with `spawn`).
- `meta` information with `spacy package`.
- `replace_pipe` takes disabled components into account.

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe
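
To give a feel for what the `spacy.ngram_range_suggester.v1` function mentioned above produces, the sketch below builds a suggester for n-gram sizes 1 to 2 and applies it to a three-token doc. It imports the builder directly from `spacy.pipeline.spancat`, which is an assumption about where the registered function lives; in a training config it would be referenced by its registry name instead.

```python
import spacy
from spacy.pipeline.spancat import build_ngram_range_suggester

nlp = spacy.blank("en")
doc = nlp("Berlin is nice")

# Suggest all spans of 1 and 2 tokens (min_size and max_size are inclusive).
suggester = build_ngram_range_suggester(min_size=1, max_size=2)
spans = suggester([doc])

# The result is a ragged array: one row of (start, end) token offsets per
# candidate span, plus the number of candidates for each doc.
candidates = [tuple(row) for row in spans.dataXd.tolist()]
print(int(spans.lengths[0]))
print([doc[start:end].text for start, end in candidates])
```

For a doc of three tokens this yields three unigrams and two bigrams as candidate spans for the `spancat` component to score.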
- `SpanCategorizer` component for labeling arbitrary and potentially overlapping spans of text.
- `[training.annotating_components]` config setting.
- `EntityRecognizer` with known incorrect span annotations.
- `README.md` based on the meta in `spacy package`.

For more details, see the New in v3.1 usage guide.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
`ca_core_news_sm` | Catalan | 98.2 | 87.4 | 79.8 |
`ca_core_news_md` | Catalan | 98.3 | 88.2 | 84.0 |
`ca_core_news_lg` | Catalan | 98.5 | 88.4 | 84.2 |
`ca_core_news_trf` | Catalan | 98.9 | 93.0 | 91.2 |
`da_core_news_trf` | Danish | 98.0 | 85.0 | 82.9 |
- `spacy_version` in your model package meta to `">=3.0.0,<3.2.0"`. If you run into degraded performance, retrain your pipeline with v3.1.
- `spacy init fill-config` to update a v3.0 config for v3.1.
- `[initialize.vectors]`.
- `warnings.filterwarnings` or the new helper method `spacy.errors.filter_warning(action, error_msg='')` to manage warnings.

For more information, see Notes on upgrading from v3.0.
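
The `spacy.errors.filter_warning` helper mentioned above can be tried out roughly like this. To keep the sketch self-contained, the warnings are raised by hand with `warnings.warn`; the `[W036]` text imitates a spaCy warning message and `[W999]` is a made-up code, so neither comes from a real pipeline run.

```python
import warnings

from spacy.errors import filter_warning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore warnings whose message starts with the given code.
    filter_warning("ignore", error_msg="[W036]")

    # Stand-in messages, raised manually for illustration.
    warnings.warn("[W036] The component 'matcher' does not have any patterns defined.")
    warnings.warn("[W999] Some other warning.")

print([str(w.message) for w in caught])
```

Only the second warning is recorded, since the first one matches the `ignore` filter.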
- `spacy ray` command works.
- `debug data`.
- `EntityLinker` robust for `nO=None`.
- `minn` is not set.
- `debug model` for transformers.
- `ENT_KB_ID` in `ner` annotation.
- `Doc.from_docs()` for all empty docs.
- `textcat` with listener.
- `ENT_ID` and `NORM` to `DocBin` strings.
- `Span.as_doc`.
- `Span` attrs writable.
- `debug data` for `textcat`.
- `DocBin` is too large.
- `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- `Span.get_lca_matrix`.
- `attrs.IDS`.
- `spacy.batch_by_words.v1`.
- `EntityRuler`: `ent_ids` returns `None` for phrases.
- `EntityRuler`.
- `Doc`.
- `Span.lemma_`.
- `Example.from_dict`.
- `Language.pipe` return values.
- `Doc.from_docs`.
- `textcat` with <2 labels.

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD