spaCy Versions

💫 Industrial-strength Natural Language Processing (NLP) in Python

v3.0.8

2 years ago

🔴 Bug fixes

Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines

v3.2.2

2 years ago

✨ New features and improvements

Improved parser and ner speeds on long documents (see technical details in #10019).
Support for spancat components in debug data.
Support for ENT_IOB as a Matcher token pattern key.
Extended and improved types for many classes.
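
As a rough illustration of the ENT_IOB pattern key mentioned above, the following sketch matches tokens that begin an entity. It assumes spaCy v3.2.2 or later is installed; the sentence, the entity labels and the rule name ENTITY_START are made up for the example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained components required.
nlp = spacy.blank("en")
doc = nlp("Alice Smith visited Berlin")

# Set entities by hand so the tokens carry IOB values (illustrative labels).
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="GPE")]

matcher = Matcher(nlp.vocab)
# "B" marks the first token of an entity; string IOB values are assumed
# to be accepted here, as described for issue #9738.
matcher.add("ENTITY_START", [[{"ENT_IOB": "B"}]])

entity_starts = [doc[start:end].text for _, start, end in matcher(doc)]
print(entity_starts)
```

Matching on "I" or "O" instead would select entity-internal tokens or tokens outside any entity.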

🔴 Bug fixes

Fix issue #9735: Make floret murmurhash endian-neutral.
Fix issue #9738: Support string IOB values for ENT_IOB.
Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
Fix issue #9979: Fix type for Lexeme.rank.
Fix issue #10026: Check for 0-size assets in spacy project.
Fix issue #10051: Consistently return scalars from similarity methods.
Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
Fix issue #10079: Fix label detection in debug data for components with custom names.
Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
Fix issue #10143: Use simple suggester in spancat initialization.
Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
Fix issue #10192: Detect invalid package names in spacy package.
Fix issue #10223: Support mixed case in package names.
Fix issue #10234: Fix type in PhraseMatcher.

📖 Documentation and examples

Various documentation updates.
New spaCy version tags in spaCy universe.
New Dockerfile for repeatable website builds and easier local development.
New additions to spaCy universe:
- Augmenty: a text augmentation library
- Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
- spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
- spacypdfreader: easy PDF to text to spaCy text extraction
- textnets: text analysis with networks

👥 Contributors

@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

v3.2.1

2 years ago

✨ New features and improvements

NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce size of output docs.
NEW: ENT_ID and ENT_KB_ID to Matcher pattern attributes.
Support kb_id for entities in displaCy from Doc input.
Add Span.sents property for spans spanning over more than one sentence.
Add EntityRuler.remove to remove patterns by id.
Make the Tagger neg_prefix configurable.
Use Language.pipe in Language.evaluate for more efficient processing.
Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.

🔴 Bug fixes

Fix issue #9638: Make JsonlCorpus path optional again.
Fix issue #9654: Fix spancat for empty docs and zero suggestions.
Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
Fix issue #9674: Fix language-specific factory handling in package CLI.
Fix issue #9694: Convert labels to strings for README in package CLI.
Fix issue #9697: Exclude strings from source vector checks.
Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
Fix issue #9722: Initialize parser from reference parse rather than aligned example.
Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

📖 Documentation and examples

Various documentation updates: init_tok2vec after pretraining, batch contract for listeners.
New additions to the spaCy universe:
- eng-spacysentiment: Sentiment analysis for English.
- Applied Language Technology course: NLP for newcomers using spaCy and Stanza.

👥 Contributors

@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

v3.2.0

2 years ago

✨ New features and improvements

NEW: Registered scoring functions for each component in the config.
NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
New overwrite config setting for entity_linker, morphologizer, tagger, sentencizer and senter.
New extend config setting for morphologizer that controls whether existing feature types are preserved.
Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
New package spacy-loggers for additional loggers.
New Irish lemmatizer.
New Portuguese noun chunks and updated Spanish noun chunks.
Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
Japanese reading and inflection from sudachipy are annotated as Token.morph features.
Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
LIKE_URL attribute includes the tokenizer URL pattern.
--n-save-epoch option for spacy pretrain.
Trained pipelines:
- New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
- Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
- Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Universal Dependencies corpora updated to v2.8.
- Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.
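
Two of the features above can be sketched in a few lines, assuming spaCy v3.2 or later: passing a pre-tokenized Doc to nlp(), and creating a blank pipeline from a three-letter language code. The example words are arbitrary, and the sentencizer is only there to show that downstream components still run on the supplied Doc.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Build the Doc with custom tokenization, then hand it to the pipeline.
words = ["Hello", "world", "!", "Bye", "."]
pretokenized = Doc(nlp.vocab, words=words)
doc = nlp(pretokenized)

# The tokenization is kept as provided; the sentencizer adds boundaries.
sentences = [sent.text for sent in doc.sents]
print([token.text for token in doc], len(sentences))

# A three-letter code is resolved to the matching language class.
french = spacy.blank("fra")
print(french.lang)
```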

🔴 Bug fixes

Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
In DocBin, user data is now always serialized according to the store_user_data option, see #9190.
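
A minimal sketch of the DocBin behavior, assuming spaCy v3.2 or later: user data survives a round trip when store_user_data is enabled on both the writing and the reading side. The "source" key and its value are placeholders chosen for the example.

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["Hello", "world"])
doc.user_data["source"] = "demo"  # arbitrary illustrative entry

# Serialize with user data included.
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
data = doc_bin.to_bytes()

# Deserialize; the reader also opts in to restoring user data.
restored_bin = DocBin(store_user_data=True).from_bytes(data)
restored = list(restored_bin.get_docs(nlp.vocab))[0]
print(restored.user_data)
```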

📖 Documentation and examples

Demo projects for floret vectors:
- pipelines/floret_vectors_demo: basic floret vector training and importing.
- pipelines/floret_fi_core_demo: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
- pipelines/floret_ko_ud_demo: Korean UD vector and pipeline training, comparing standard vs. floret vectors.

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

v3.1.4

2 years ago

✨ New features and improvements

NEW: Binary wheels for Python 3.10.
NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
GPU profiling with spacy.models_with_nvtx_range.v1.
Full mypy integration in the CI and many type fixes across the code base.
Added custom Protocol classes in ty.py to define behavior of pipeline components.
Support for entity linking visualization in displacy.
Allow overriding vars in spacy project assets.
Standalone train function to run the training from Python scripts just like the spacy train CLI.
Support for spacy-transformers>=1.1.0 with improved IO.
Support for thinc>=8.0.11 with improved gradient clipping.

🔴 Bug fixes

Fix issue #5507: Improve UX for multiprocessing on GPU.
Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
Fix issue #9244: Fix vectors for 0-length spans.
Fix issue #9247: Improve UX for the DocBin constructor.
Fix issue #9254: Allow unicode in a spacy project title.
Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
Fix issue #9305: Restore tokenization timing during evaluation.
Fix issue #9335: Sync vocab in vectors and sourced components.
Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
Fix issue #9437: Improve UX around Doc object creation.
Fix issue #9465: Fix minor issues with convert CLI.
Fix issue #9500: Include .pyi files in the distributed package.

📖 Documentation and examples

Various updates to the documentation.
New additions to the spaCy universe:
- deplacy: CUI-based dependency visualizer
- ipymarkup: Visualizations for NER and syntax trees
- PhruzzMatcher: Find fuzzy matches
- spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
- spaCyOpenTapioca: Entity Linking on Wikidata
- spacy-clausie: Clause-based information extraction system
- "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
- "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

👥 Contributors

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

v3.1.3

2 years ago

✨ New features and improvements

The v3 of WandbLogger now supports optional run_name and entity parameters.
Improved UX when providing invalid pos values for a Doc or Token.

🔴 Bug fixes

Fix issue #9001: Pass alignments to Matcher callbacks.
Fix issue #9009: Include component factories in third-party dependencies resolver.
Fix issue #9012: Correct type of config in create_pipe.
Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
Fix issue #9033: Fix verbs list for French tokenizer exceptions.
Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
Fix issue #9074: Improve UX around repo and path arguments in spacy project.
Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

📖 Documentation and examples

Various updates to the documentation.
A few additions and updates to the spaCy universe.
Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

👥 Contributors

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker

v3.1.2

2 years ago

✨ New features and improvements

NEW: Provide scores for the SpanCategorizer predictions.
NEW: Broader compatibility with type checkers thanks to .pyi stub files.
NEW: Auto-detect package dependencies in spacy package.
New INTERSECTS operator for the Matcher.
More debugging info for spacy project push and pull commands.
Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
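
The INTERSECTS operator listed above can be tried on morphological features. This is a small sketch assuming spaCy v3.1.2 or later; the words, the feature values and the rule name PLURAL_OR_PAST are invented for the example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-assigned morphological features, one string per token.
doc = Doc(vocab, words=["cats", "sleep"], morphs=["Number=Plur", "Tense=Pres"])

matcher = Matcher(vocab)
# A token matches if its features share at least one value with the list.
pattern = [{"MORPH": {"INTERSECTS": ["Number=Plur", "Tense=Past"]}}]
matcher.add("PLURAL_OR_PAST", [pattern])

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Only "cats" shares a feature with the list, so "sleep" is not matched.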

🔴 Bug fixes

Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
Fix issue #8796: Respect the no_skip value for spacy project run.
Fix issue #8810: Make ConsoleLogger flush after each logging line.
Fix issue #8819: Pass exclude when serializing the vocab.
Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
Fix issue #8982: Add glossary entry for _SP.
Fix issue #9007: Fix span categorizer training on nested entities.

📖 Documentation and examples

New developer documentation covering spaCy's internals and code conventions.
Added a documentation section on preparing training data in spaCy's binary format.
Updated some error/log messages to be more informative.
Various updates to the documentation.
A few new additions to the spaCy universe.

👥 Contributors

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

v3.0.7

2 years ago

✨ New features and improvements

Alpha tokenization support for Azerbaijani.
Updates for French stop words.

🔴 Bug fixes

Fix issue #7629: Fix scoring normalization.
Fix issue #7886: Fix unknown tokens percentage in debug data.
Fix issue #7907: Update load_lookups return type and docstring.
Fix issue #7930: Make EntityLinker robust for nO=None.
Fix issue #7925: Skip vector ngram backoff if minn is not set.
Fix issue #7973: Fix debug model for transformers.
Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
Fix issue #8004: Handle errors while multiprocessing.
Fix issue #8009: Fix Doc.from_docs() for all empty docs.
Fix issue #8012: Fix ensemble textcat with listener.
Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
Fix issue #8055: Handle partial entities in Span.as_doc.
Fix issue #8062: Make all Span attrs writable.
Fix issue #8066: Update debug data for textcat.
Fix issue #8069: Custom warning if DocBin is too large.
Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
Fix issue #8116: Fix offsets in Span.get_lca_matrix.
Fix issue #8132: Remove unsupported attrs from attrs.IDS.
Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
Fix issue #8208: Address missing config overrides post load of models.
Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
Fix issue #8216: Don't add duplicate patterns in EntityRuler.
Fix issue #8244: Use context manager when reading model file.
Fix issue #8245: Fix other open calls without context managers.
Fix issue #8265: Address mypy errors.
Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
Fix issue #8335: Raise error if deps not provided with heads in Doc.
Fix issue #8368: Preserve whitespace in Span.lemma_.
Fix issue #8396: Make JsonlReader path optional.
Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
Fix issue #8426: Fix setting empty entities in Example.from_dict.
Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
Fix issue #8584: Raise an error for textcat with <2 labels.
Fix issue #8551: Fix duplicate spacy package CLI opts.

👥 Contributors

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

v3.1.1

2 years ago

✨ New features and improvements

Alpha tokenization support for Ancient Greek.
Implementation of a noun_chunk iterator for Dutch.
Support for black & flake8 as pre-commit hooks.
New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component.
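
To see what the n-gram range suggester proposes, it can be called directly from Python. This sketch assumes spaCy v3.1.1 or later and imports the function registered under spacy.ngram_range_suggester.v1 from spacy.pipeline.spancat; the sample text is arbitrary.

```python
import spacy
from spacy.pipeline.spancat import build_ngram_range_suggester

nlp = spacy.blank("en")
doc = nlp("floret vectors rock")

# Suggest every span of 1 to 2 tokens.
suggester = build_ngram_range_suggester(min_size=1, max_size=2)
candidates = suggester([doc])

# Each row of the ragged array holds a (start, end) token offset pair.
spans = [doc[int(start):int(end)].text for start, end in candidates.dataXd]
print(spans)
```

For a three-token doc this yields five candidates: three unigrams and two bigrams.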

🔴 Bug fixes

Fix issue #8638: Fix Azerbaijani initialization.
Fix issue #8639: Use 0-vector for OOV lexemes.
Fix issue #8640: Update lexeme ranks for loaded vectors.
Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
Fix issue #8663: Preserve existing meta information with spacy package.
Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

v3.1.0

2 years ago

✨ New features and improvements

NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
NEW: Experimental SpanCategorizer component for labeling arbitrary and potentially overlapping spans of text.
NEW: Use predicted annotations during training via the [training.annotating_components] config setting.
Alpha tokenization support for Azerbaijani.
Part-of-speech tag-based lemmatizers for Catalan and Italian.
The TextCatCNN and TextCatBOW architectures are now resizable.
Support updating the EntityRecognizer with known incorrect span annotations.
Auto-generate a pretty README.md based on the meta in spacy package.

For more details, see the New in v3.1 usage guide.

📦 New trained pipelines

| Package | Language | UPOS | Parser LAS | NER F |
| --- | --- | --- | --- | --- |
| `ca_core_news_sm` | Catalan | 98.2 | 87.4 | 79.8 |
| `ca_core_news_md` | Catalan | 98.3 | 88.2 | 84.0 |
| `ca_core_news_lg` | Catalan | 98.5 | 88.4 | 84.2 |
| `ca_core_news_trf` | Catalan | 98.9 | 93.0 | 91.2 |
| `da_core_news_trf` | Danish | 98.0 | 85.0 | 82.9 |

⚠️ Upgrading from v3.0

Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1; however, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
Use spacy init fill-config to update a v3.0 config for v3.1.
When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings.

For more information, see Notes on upgrading from v3.0.
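
Because these are now regular Python warnings, the standard library filters apply. The sketch below uses only the warnings module and filters by a code prefix in the message; the codes W998 and W999 and the helper emit_demo_warning are invented stand-ins, not real spaCy warnings.

```python
import warnings


def emit_demo_warning():
    # Hypothetical stand-in for a library warning; "W999" is a made-up code.
    warnings.warn("[W999] demo warning for illustration", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore any warning whose message starts with the chosen code.
    warnings.filterwarnings("ignore", message=r"\[W999\]")
    emit_demo_warning()
    warnings.warn("[W998] another demo warning", UserWarning)

remaining = [str(w.message) for w in caught]
print(remaining)
```

The message argument is a regular expression matched against the start of the warning text, so the square brackets need escaping.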

🔴 Bug fixes

Fix issue #7036: Use a context manager when reading model.
Fix issue #7629: Fix scoring normalization.
Fix issue #7799: Ensure spacy ray command works.
Fix issue #7807: Show warning if entity ruler runs without patterns.
Fix issue #7886: Fix unknown tokens percentage in debug data.
Fix issue #7930: Make EntityLinker robust for nO=None.
Fix issue #7925: Skip vector ngram backoff if minn is not set.
Fix issue #7973: Fix debug model for transformers.
Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
Fix issue #8004: Handle errors while multiprocessing.
Fix issue #8009: Fix Doc.from_docs() for all empty docs.
Fix issue #8012: Fix ensemble textcat with listener.
Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
Fix issue #8055: Handle partial entities in Span.as_doc.
Fix issue #8062: Make all Span attrs writable.
Fix issue #8066: Update debug data for textcat.
Fix issue #8069: Custom warning if DocBin is too large.
Fix issue #8099: Update Vietnamese tokenizer.
Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
Fix issue #8116: Fix offsets in Span.get_lca_matrix.
Fix issue #8132: Remove unsupported attrs from attrs.IDS.
Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
Fix issue #8208: Address missing config overrides post load of models.
Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
Fix issue #8216: Don't add duplicate patterns in EntityRuler.
Fix issue #8265: Address mypy errors.
Fix issue #8335: Raise error if deps not provided with heads in Doc.
Fix issue #8368: Preserve whitespace in Span.lemma_.
Fix issue #8388: Don't clobber vectors when loading components from source models.
Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
Fix issue #8426: Fix setting empty entities in Example.from_dict.
Fix issue #8441: Add correct types for Language.pipe return values.
Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
Fix issue #8559: Fix vectors check for sourced components.
Fix issue #8584: Raise an error for textcat with <2 labels.

👥 Contributors

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD