NLP, before and after spaCy
Took a (longer than expected) break from NLP, so this release is mostly just maintenance and bug fixes — but in anticipation of more interesting updates to come.
- Modernized package configuration and dev tooling:
  - Consolidated most packaging and tool configuration into a `pyproject.toml` file
  - Switched to `ruff`, plus newer versions of `mypy` and `black`, and their use in GitHub Actions CI has been consolidated
  - Reduced `mypy` complaints by ~80% (PR #372)
- Big thanks to @jonwiggins, @Hironsan, and @kevinbackhouse for the fixes!
- Refactored text-stats functions to accept a `Doc` as their first positional arg, suitable for use as custom doc extensions (see below)
- Removed the `TextStats` class, since other methods for accessing the underlying functionality were made more accessible and convenient, and there's no longer a need for a third method
- Now, custom extensions are accessed by name, and users have more control over the process:
```python
>>> import textacy
>>> from textacy import extract, text_stats
>>> textacy.set_doc_extensions("extract")
>>> textacy.set_doc_extensions("text_stats.readability")
>>> textacy.remove_doc_extensions("extract.matches")
>>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
118.17500000000001
```
- Moved top-level extensions into `spacier.core` and `extract.bags`
- Standardized `extract` and `text_stats` subpackage extensions to use the new setup, and made them more customizable
- Added a `pytest` conftest file to improve maintainability and consistency of the unit test suite (PR #353)
- Removed `setup.py` and switched from `setuptools` to `build` for builds, with package configuration specified in `pyproject.toml` and the `Makefile` updated accordingly
- Fixed issues in the `TextStats` docs (PR #331, Issue #334)
- Fixed an issue with loading `ConceptNet` data on Windows systems (Issue #345)
- Thanks to @austinjp, @scarroll32, @MirkoLenz for their help!
This is probably the largest single update in textacy's history. The changes necessary for upgrading to spaCy v3 prompted a cascade of additional updates, quality-of-life improvements, expansions and retractions of scope, and general package cleanup to better align textacy with its primary dependency and set it up for future updates. Note that this version includes a number of breaking changes; most are minor and have easy fixes, but some represent actual shifts in functionality. Read on for details!
- Text preprocessing (`textacy.preprocessing`):
  - Added new functions for normalizing bullet points in lists (`normalize.bullet_points()`), removing HTML tags (`remove.html_tags()`), and removing bracketed contents such as in-line citations (`remove.brackets()`)
  - Added a `make_pipeline()` function for combining multiple preprocessors applied sequentially to input text into a single callable (see the sketch below)
  - Renamed functions for clarity and consistency, e.g. `preprocessing.normalize_whitespace()` => `preprocessing.normalize.whitespace()`
  - Renamed and standardized some args, e.g. `replace_with` => `repl`, and `remove.punctuation(text, marks=".?!")` => `remove.punctuation(text, only=[".", "?", "!"])`
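For example, here's a minimal sketch of chaining several of the renamed preprocessors into one callable with `make_pipeline()` (the input string is just an illustration):

```python
from textacy import preprocessing

# chain several preprocessors into a single callable, applied in order
preproc = preprocessing.make_pipeline(
    preprocessing.remove.html_tags,
    preprocessing.normalize.bullet_points,
    preprocessing.normalize.whitespace,
)
clean = preproc("<p>•  a bullet-pointed   item</p>")
```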
- Structured information extraction (`textacy.extract`); a usage sketch follows this list:
  - Consolidated and restructured functionality previously spread across the `extract.py` and `text_utils.py` modules and the `ke` subpackage. For the latter two, imports have changed:
    - `from textacy import ke; ke.textrank()` => `from textacy import extract; extract.keyterms.textrank()`
    - `from textacy import text_utils; text_utils.keywords_in_context()` => `from textacy import extract; extract.keywords_in_context()`
  - Added new extraction functions:
    - `extract.regex_matches()`: For matching regex patterns in a document's text that cross spaCy token boundaries, with various options for aligning matches back to tokens.
    - `extract.acronyms()`: For extracting acronym-like tokens, without looking around for related definitions.
    - `extract.terms()`: For flexibly combining n-grams, entities, and noun chunks into a single collection, with optional deduplication.
  - Improved the generality and quality of extracted "triples" such as Subject-Verb-Object relationships. Previously, each element had to be a contiguous span of tokens, so a sentence like "I did not like the movie" would produce ("I", "like", "movie"), which is... misleading. The new approach uses lists of tokens that need not be adjacent; in this case, it produces (["I"], ["did", "not", "like"], ["movie"]). For convenience, triple results are all named tuples, so elements may be accessed by name or index (e.g. `svo.subject` == `svo[0]`).
  - Changed `extract.keywords_in_context()` to always yield results, with optional padding of contexts, leaving printing of contexts up to users; also extended it to accept `Doc` or `str` objects as input.
  - Removed the deprecated `extract.pos_regex_matches()` function, which is superseded by the more powerful `extract.token_matches()`.
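A small sketch of the reorganized API (assuming the `en_core_web_sm` pipeline is installed; exact outputs will vary by model version):

```python
import textacy
from textacy import extract

doc = textacy.make_spacy_doc(
    "Natural language processing (NLP) helps computers parse human language.",
    lang="en_core_web_sm",
)
acronyms = list(extract.acronyms(doc))             # acronym-like tokens, e.g. "NLP"
keyterms = extract.keyterms.textrank(doc, topn=3)  # (term, score) pairs
```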
- String and sequence similarity metrics (`textacy.similarity`); a usage sketch follows this list:
  - Refactored the top-level `similarity.py` module into a subpackage, with metrics split out into categories: edit-, token-, and sequence-based approaches, as well as hybrid metrics
  - Added new metrics: edit-based Jaro (`similarity.jaro()`); token-based Cosine (`similarity.cosine()`), Bag (`similarity.bag()`), and Tversky (`similarity.tversky()`); sequence-based matching-subsequences ratio (`similarity.matching_subsequences_ratio()`); and hybrid Monge-Elkan (`similarity.monge_elkan()`)
  - For word- and document-vector similarity, use spaCy's built-in `Doc.similarity` instead.
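A brief sketch of a couple of the new metrics; per the categories above, edit-based metrics compare strings while token-based metrics compare sequences of strings (the inputs here are arbitrary examples):

```python
from textacy import similarity

# edit-based: compares two strings, returns a score in [0.0, 1.0]
score_edit = similarity.jaro("color", "colour")
# token-based: compares two sequences of tokens, same [0.0, 1.0] range
score_token = similarity.cosine(
    ["natural", "language", "processing"],
    ["language", "processing", "tools"],
)
```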
- Document representations (`textacy.representations`); a usage sketch follows this list:
  - Consolidated and reworked network functionality in the `representations.network` module:
    - Added a `build_cooccurrence_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes for each unique string and edges to other strings that co-occurred.
    - Added a `build_similarity_network()` function to represent a sequence of strings (or a sequence of such sequences) as a graph with nodes as top-level elements and edges to all others weighted by pairwise similarity.
    - Removed the obsolete `network.py` module and the duplicative `extract.keyterms.graph_base.py` module.
  - Moved `vsm.vectorizers` to the `representations.vectorizers` module.
    - In `Vectorizer` and `GroupVectorizer`, applying global inverse document frequency weights is now handled by a single arg, `idf_type: Optional[str]`, rather than a combination of `apply_idf: bool, idf_type: str`; similarly, applying document-length weight normalizations is handled by `dl_type: Optional[str]` instead of `apply_dl: bool, dl_type: str`.
  - Added a `representations.sparse_vec` module for higher-level access to document vectorization via `build_doc_term_matrix()` and `build_grp_term_matrix()` functions, for cases when a single fit+transform is all you need.
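A sketch of the higher-level entry points; the two-value return from `build_doc_term_matrix()` (sparse matrix plus vocabulary mapping) is my reading of the new API, so treat it as illustrative:

```python
from textacy.representations import network, sparse_vec

tokenized_docs = [
    ["nlp", "python", "spacy"],
    ["python", "data", "nlp"],
]
# networkx graph: one node per unique term, edges weighted by co-occurrence
graph = network.build_cooccurrence_network(tokenized_docs)
# one-shot fit+transform: sparse doc-term matrix plus fitted vocabulary mapping
dtm, vocab = sparse_vec.build_doc_term_matrix(tokenized_docs)
```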
- Automatic language identification (`textacy.lang_id`); a usage sketch follows this list:
  - Moved functionality from the `lang_utils.py` module into a subpackage, and added the primary user interface (`identify_lang()` and `identify_topn_langs()`) as package-level imports
  - Implemented and trained a new `thinc`-based language identification model that's closer to the original CLD3 inspiration, replacing the simpler `sklearn`-based pipeline
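Usage is roughly as follows (a sketch; the `topn` arg name is how I recall the interface, and the model data is fetched separately on first use):

```python
from textacy import lang_id

# single best guess at the text's language
lang = lang_id.identify_lang("Ceci n'est pas une pipe.")
# ranked guesses with confidence scores
top_langs = lang_id.identify_topn_langs("Ceci n'est pas une pipe.", topn=3)
```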
- Updated the interface with spaCy for v3:
  - Restricted `textacy.load_spacy_lang()` to only accept full spaCy language pipeline names or paths, in accordance with v3's removal of pipeline aliases and general tightening-up on this front. Unfortunately, textacy can no longer play fast and loose with automatic language identification => pipeline loading...
  - Extended `textacy.make_spacy_doc()` to accept a `chunk_size` arg that splits input text into chunks, processes each individually, then joins them into a single `Doc`; supersedes `spacier.utils.make_doc_from_text_chunks()`, which is now deprecated (see the sketch below)
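For long texts that would otherwise blow past spaCy's `max_length`, a sketch (the text here is just a stand-in, and the chunk size is arbitrary):

```python
import textacy

# stand-in for a text too long for spaCy's default max_length
big_text = " ".join(["All work and no play makes Jack a dull boy."] * 50_000)
# process in 100k-char chunks, then join the pieces back into one Doc
doc = textacy.make_spacy_doc(big_text, lang="en_core_web_sm", chunk_size=100_000)
```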
- Moved core `Doc` extensions into a top-level `extensions.py` module, and improved/streamlined the collection:
  - Refactored and improved performance of `Doc._.to_bag_of_words()` and `Doc._.to_bag_of_terms()`, leveraging related functionality in `extract.words()` and `extract.terms()`
  - Removed redundant or awkward extensions:
    - `Doc._.lang` => use `Doc.lang_`
    - `Doc._.tokens` => use `iter(Doc)`
    - `Doc._.n_tokens` => use `len(Doc)`
    - `Doc._.to_terms_list()` => use `extract.terms(doc)` or `Doc._.extract_terms()`
    - `Doc._.to_tagged_text()` => NA; this was an old holdover that's not used in practice anymore
    - `Doc._.to_semantic_network()` => NA; use a function in `textacy.representations.networks`
- Added `Doc` extensions for `textacy.extract` functions (see above for details), with most functions having direct analogues; for example, to extract acronyms, use either `textacy.extract.acronyms(doc)` or `doc._.extract_acronyms()`. Keyterm extraction functions share a single extension: `textacy.extract.keyterms.textrank(doc)` <> `doc._.extract_keyterms(method="textrank")`. A short sketch follows.
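A sketch of the equivalent call styles, with extensions registered by name as in the example near the top of this page (the sample text is arbitrary):

```python
import textacy
from textacy import extract

textacy.set_doc_extensions("extract")  # register extract functions as Doc extensions
doc = textacy.make_spacy_doc("The FBI tracked the IP address.", lang="en_core_web_sm")

acronyms_fn = list(extract.acronyms(doc))      # function style
acronyms_ext = list(doc._.extract_acronyms())  # extension style, same results
# keyterm algorithms share a single extension, selected via `method`
keyterms = doc._.extract_keyterms(method="textrank")
```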
- Leveraged spaCy's new `DocBin` for efficiently saving/loading `Doc`s in binary format, with corresponding arg changes in `io.write_spacy_docs()` and `Corpus.save()` + `.load()`
- Updated dependencies: removed `pyemd` and `srsly`; relaxed version requirements for `numpy` and `scikit-learn`; and bumped minimum versions of `cytoolz`, `jellyfish`, `matplotlib`, `pyphen`, and `spacy` (v3.0+ only!)
- Removed the `textacy.export` module, which had functions for exporting spaCy docs into other external formats; this was a soft dependency on `gensim` and CONLL-U that wasn't enforced or guaranteed, so better to remove
- Added a `types.py` module for shared types, and used them everywhere. Also added/fixed type annotations throughout the code base.
- Many thanks to @timgates42, @datanizing, @8W9aG, @0x2b3bfa0, and @gryBox for submitting PRs, either merged or used as inspiration for my own rework-in-progress.
- Refactored the `text_stats` module into a sub-package with the same name and top-level API, but restructured under the hood for better consistency
- Improved the `TextStats` class, and improved documentation on many of the individual stats functions (see the sketch below)
- Removed the `TextStats.basic_counts` and `TextStats.readability_stats` attributes, since typically only one or a couple of stats are needed for a given use case; also, some of the readability tests are language-specific, which meant bad results could get mixed in with good ones
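For orientation, a sketch of the class-based API as of this release; the attribute names are from memory of that era's docs, so treat them as illustrative (on current versions, the stats are standalone functions instead):

```python
import textacy
from textacy import text_stats

doc = textacy.make_spacy_doc(
    "Many years later, he was to remember that distant afternoon.",
    lang="en_core_web_sm",
)
ts = text_stats.TextStats(doc)
n_words = ts.n_words                  # stats are computed lazily, one at a time,
readability = ts.flesch_reading_ease  # instead of via the removed bulk attributes
```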
- Standardized error messages via a top-level `errors.py` module
- Replaced `str.format()` with f-strings (almost) everywhere, for performance and readability
- Adopted the `pyproject.toml` package configuration standard; updated and streamlined `setup.py` and `setup.cfg` accordingly; and removed `requirements.txt`
- Moved all source code into a `/src` directory, for technical reasons
- Added a `mypy`-specific config file to reduce output noisiness when type-checking
- Switched to markdown-based docs with `recommonmark` instead of `m2r`, and migrated all "narrative" docs from `.rst` to equivalent `.md` files
- Fixed an incompatibility with `scikit-learn==0.23.0`, and bumped the upper bound on that dependency's version accordingly
- Fixed deprecated `pytest` functionality (PR #306)
- Put `textacy` versions 0.9.1 and 0.10.0 up on `conda-forge` (Issue #294)
- Fixed `pandas.DataFrame` handling in the viz functionality, and otherwise tidied up the defaults for nice-looking plots (PR #295)
- Fixed a bug in the `delete_words()` augmentation transform (Issue #308)
- Special thanks to @tbsexton, @marius-mather, and @rmax for their contributions! 💐
- Improved `Corpus` functionality using recent additions to spacy (PR #285):
  - Reimplemented `Corpus.save()` and `Corpus.load()` using spacy's new `DocBin` class, which resolved a few bugs/issues (Issue #254)
  - Added an `n_process` arg to `Corpus.add()` to set the number of parallel processes used when adding many items to a corpus, following spacy's updates to `nlp.pipe()` (Issue #277); see the sketch below
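A sketch of the updated round-trip and parallelized adds (the file name and process count here are arbitrary):

```python
import textacy

corpus = textacy.Corpus("en_core_web_sm")
# spaCy's nlp.pipe() fans the work out across parallel processes
corpus.add(["First short document.", "Second short document."], n_process=2)
corpus.save("corpus.bin.gz")  # backed by spaCy's DocBin under the hood
loaded = textacy.Corpus.load("en_core_web_sm", "corpus.bin.gz")
```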
- Fixed a bug in the `normalize_whitespace()` function (Issue #278)
- Retrained the `LangIdentifier` model using `scikit-learn==0.22`, to prevent ambiguous errors when trying to load a file that didn't exist (Issues #291, #292)
- Updated the `TopicModel` class to work with newer versions of `scikit-learn`, and updated version requirements accordingly from `>=0.18.0,<0.21.0` to `>=0.19`
- Fixed a compatibility issue with `scikit-learn==0.19`, to prevent errors for users on that version

Note: `textacy` is now PY3-only! 🎉 Specifically, support for PY2.7 has been dropped, and the minimum PY3 version has been bumped to 3.6 (PR #261). See below for related changes.
- Added an `augmentation` subpackage for basic text data augmentation (PR #268, #269)
  - Includes an `Augmenter` class for combining multiple transforms and applying them to spaCy `Doc`s in a randomized but configurable manner; see the sketch below
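A sketch of the `Augmenter` in action; the transform names and arg shapes are per my recollection of the API, so treat this as illustrative rather than definitive:

```python
import textacy
from textacy import augmentation

doc = textacy.make_spacy_doc(
    "The quick brown fox jumps over the lazy dog.", lang="en_core_web_sm"
)
augmenter = augmentation.Augmenter(
    # two word-level transforms, each applied with 50% probability
    [augmentation.transforms.swap_words, augmentation.transforms.delete_words],
    num=[0.5, 0.5],
)
augmented_doc = augmenter.apply_transforms(doc, lang="en_core_web_sm")
```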
- Added a `resources` subpackage for standardized access to linguistic resources (PR #265), replacing the `lexicon_methods.py` module's previous implementation
- Added the `UDHR` dataset, a collection of translations of the Universal Declaration of Human Rights (PR #271)
- I/O functions now accept `pathlib.Path` objects, with `pathlib` adopted widely under the hood
- Bumped version requirements for `jellyfish`, `networkx`, and `numpy`
- Refactored and improved `text_stats` (PR #263)
- Moved core functionality into `spacier.core`, out of `cache.py` and `doc.py`
- Moved shared functions from `dataset.utils` to `io.utils` and `utils.py`
- Moved functionality out of `cache.py` and into `text_stats.py`, where it's used
- Removed the `textacy.io.split_record_fields()` function
- Fixed `preprocessing.replace_urls()` to properly handle certain edge case URLs (Issue #267)
- Thanks much to @hugoabonizio for the contribution. 🤝
- Moved the `preprocess` module into a `preprocessing` sub-package, and reorganized it in the process
  - Added new functions:
    - `replace_hashtags()` to replace hashtags like `#FollowFriday` or `#spacyIRL2019` with `_TAG_`
    - `replace_user_handles()` to replace user handles like `@bjdewilde` or `@spacy_io` with `_USER_`
    - `replace_emojis()` to replace emoji symbols like 😉 or 🚀 with `_EMOJI_`
    - `normalize_hyphenated_words()` to join hyphenated words back together, like `antici- pation` => `anticipation`
    - `normalize_quotation_marks()` to replace "fancy" quotation marks with simple ascii equivalents, like `“the god particle”` => `"the god particle"`
  - Changed some functions for clarity and consistency:
    - `replace_currency_symbols()` now replaces all dedicated ascii and unicode currency symbols with `_CUR_`, rather than just a subset thereof, and no longer provides for replacement with the corresponding currency code (like `$` => `USD`)
    - `remove_punct()` now has a `fast (bool)` kwarg rather than `method (str)`
  - Removed the `normalize_contractions()`, `preprocess_text()`, and `fix_bad_unicode()` functions, since they were bad/awkward and more trouble than they were worth
- Moved the `keyterms` module into a `ke` sub-package, and cleaned it up / standardized arg names / better shared functionality in the process (see the sketch below)
  - Added new unsupervised keyterm extraction algorithms: YAKE (`ke.yake()`), sCAKE (`ke.scake()`), and PositionRank (`ke.textrank()`, with non-default parameter values)
  - Added new methods for selecting candidate keyterms: longest matching subsequence candidates (`ke.utils.get_longest_subsequence_candidates()`) and pattern-matching candidates (`ke.utils.get_pattern_matching_candidates()`)
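A sketch of the cleaned-up interface, with shared args like `topn` per the standardization mentioned above (sample text is arbitrary):

```python
import textacy
from textacy import ke

doc = textacy.make_spacy_doc(
    "Keyterm extraction surfaces the most important terms in a document.",
    lang="en_core_web_sm",
)
yake_terms = ke.yake(doc, topn=5)          # lists of (term, score) pairs
textrank_terms = ke.textrank(doc, topn=5)  # graph-based ranking
```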
- Added a character ngram-based similarity measure (`similarity.character_ngrams()`), for something that's useful in different contexts than the other measures
- Removed the Jaro-Winkler similarity measure (`similarity.jaro_winkler()`), since it didn't add much beyond other measures
- Replaced the `python-levenshtein` dependency with `jellyfish`, for its active development, better documentation, and actually-compliant license
- Added options to `Doc._.to_bag_of_words()` and `Corpus.word_counts()` for filtering out stop words, punctuation, and/or numbers (PR #249)
- Allowed for custom `sklearn`-style topic modeling classes to be passed into `tm.TopicModel()` (PR #248)
- Fixed a bug with `matplotlib` when drawing a "termite" plot in `viz.draw_termite_plot()` (PR #248)
- Fixed bugs in `io.utils.get_filenames()` and `spacier.components.merge_entities()`
- Huge thanks to @kjoshi and @zf109 for the PRs! 🙌
- Reworked the language identification model: built it with `scikit-learn` and trained it on ~1.5M texts in ~130 different languages spanning a wide variety of subject matter and stylistic formality; overall, speed and performance compare favorably to other open-source options (`langid`, `langdetect`, `cld2-cffi`, and `cld3`)
- Removed the `cld2-cffi` dependency [Issue #246]
- Added an `extract.matches()` function to extract spans from a document matching one or more patterns of per-token (attribute, value) pairs, with optional quantity qualifiers; this is a convenient interface to spaCy's rule-based `Matcher` and a more powerful replacement for textacy's existing (now deprecated) `extract.pos_regex_matches()` (see the sketch below)
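A sketch using the shorthand string pattern syntax; the attribute names follow spaCy's `Matcher`, and the specific pattern here is just an illustration:

```python
import textacy
from textacy import extract

doc = textacy.make_spacy_doc(
    "The quick brown fox jumps over the lazy dog.", lang="en_core_web_sm"
)
# one or more adjectives followed by a noun, as per-token attribute:value pairs
spans = list(extract.matches(doc, "POS:ADJ:+ POS:NOUN"))
```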
- Added a `preprocess.normalize_unicode()` function to transform unicode characters into their canonical forms; this is a less-intensive consolation prize for the previously-removed `fix_unicode()` function
- Added the ability to load blank spaCy `Language` pipelines (tokenization only -- no model-based tagging, parsing, etc.) via `load_spacy_lang(name, allow_blank=True)`, for use cases that don't rely on annotations; disabled by default to avoid unwelcome surprises
- Changed the (admittedly confusing) `entities` arg in `to_terms_list()` [Issues #169, #179]:
  - `entities = True` => include entities, and drop exact duplicate ngrams
  - `entities = False` => don't include entities, and also drop exact duplicate ngrams
  - `entities = None` => use ngrams as-is without checking against entities
- Moved the `to_collection()` function from the `datasets.utils` module to the top-level `utils` module, for use throughout the code base
- Added a `quoting` option to `io.read_csv()` and `io.write_csv()`, for problematic cases
- Deprecated the `spacier.components.merge_entities()` pipeline component, an implementation of which has since been added into spaCy itself
- Migrated docs from `.rst` to `.md` format
- The `NotImplementedError` previously added to `preprocess.fix_unicode()` is now raised rather than returned [Issue #243]

New and Changed:
- Removed `textacy.Doc`, and split its functionality into two parts:
  - Added `textacy.make_spacy_doc()` as a convenient and flexible entry point for making spaCy `Doc`s from text or (text, metadata) pairs, with optional spaCy language pipeline specification. It's similar to `textacy.Doc.__init__`, with the exception that text and metadata are passed in together as a 2-tuple; see the sketch below.
  - Added a variety of custom extensions to the global `spacy.tokens.Doc` class, accessible via its `Doc._` "underscore" property. These are similar to the properties/methods on `textacy.Doc`, they just require an interstitial underscore. For example, `textacy.Doc.to_bag_of_words()` => `spacy.tokens.Doc._.to_bag_of_words()`.
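A sketch of the 2-tuple entry point and the metadata extension it sets (the text, metadata, and pipeline name here are examples):

```python
import textacy

doc = textacy.make_spacy_doc(
    ("The year was 1984, and it was a cold day in April.", {"author": "G. Orwell"}),
    lang="en_core_web_sm",
)
author = doc._.meta["author"]  # metadata lands on the Doc's "underscore" property
```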
- Simplified and improved performance of `textacy.Corpus`:
  - Documents are now added through `Corpus.__init__` or `Corpus.add()`; they may be one or a stream of texts, (text, metadata) pairs, or existing spaCy `Doc`s. When adding many documents, the spaCy language processing pipeline is used in a faster and more efficient way.
  - `Corpus` is now a collection of spaCy `Doc`s rather than `textacy.Doc`s.
- Simplified, standardized, and added `Dataset` functionality:
  - Added an `IMDB` dataset, built on the classic 2011 dataset commonly used to train sentiment analysis models.
  - Added a base `Wikimedia` dataset, from which a reworked `Wikipedia` dataset and a separate `Wikinews` dataset inherit. The underlying data source has changed, from XML db dumps of raw wiki markup to JSON db dumps of (relatively) clean text and metadata; now, the code is simpler, faster, and totally language-agnostic.
  - `Dataset.records()` now streams (text, metadata) pairs rather than a dict containing both text and metadata, so users don't need to know field names and split them into separate streams before creating `Doc` or `Corpus` objects from the data; see the sketch below.
  - Filtering and limiting are now applied consistently to the `.texts()` and `.records()` methods on a given `Dataset` --- and more performant!
  - Moved shared functionality into a `datasets.utils` module.
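A sketch of streaming records straight into a corpus; dataset downloads are nontrivial in size, so `limit` keeps the example small:

```python
import textacy
import textacy.datasets

ds = textacy.datasets.IMDB()
ds.download()  # one-time fetch of the underlying data
# (text, metadata) pairs stream straight into corpus construction
corpus = textacy.Corpus("en_core_web_sm", data=ds.records(limit=100))
```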
- Quality of life improvements:
  - Reduced load time for `import textacy` from ~2-3 seconds to ~1 second, by lazy-loading expensive variables, deferring a couple heavy imports, and dropping a couple dependencies. Specifically:
    - `ftfy` was dropped, and a `NotImplementedError` is now raised in textacy's wrapper function, `textacy.preprocess.fix_bad_unicode()`. Users with bad unicode should now directly call `ftfy.fix_text()`.
    - `ijson` was dropped, and the behavior of `textacy.read_json()` is now simpler and consistent with other functions for line-delimited data.
    - `mwparserfromhell` was dropped, since the reworked `Wikipedia` dataset no longer requires complicated and slow parsing of wiki markup.
  - Renamed certain functions and variables for clarity, and for consistency with existing conventions:
    - `textacy.load_spacy()` => `textacy.load_spacy_lang()`
    - `textacy.extract.named_entities()` => `textacy.extract.entities()`
    - `textacy.data_dir` => `textacy.DEFAULT_DATA_DIR`
    - `filename` => `filepath` and `dirname` => `dirpath` when specifying full paths to files/dirs on disk, and `textacy.io.utils.get_filenames()` => `textacy.io.utils.get_filepaths()` accordingly
    - `SpacyDoc` => `Doc`, `SpacySpan` => `Span`, `SpacyToken` => `Token`, `SpacyLang` => `Language` as variables and in docs
    - Compiled regular expressions now consistently start with `RE_`
  - Removed deprecated functionality:
    - Top-level `spacy_utils.py` and `spacy_pipelines.py` are gone; use equivalent functionality in the `spacier` subpackage instead
    - `math_utils.py` is gone; it was long neglected, and never actually used
  - Replaced `textacy.compat.bytes_to_unicode()` and `textacy.compat.unicode_to_bytes()` with `textacy.compat.to_unicode()` and `textacy.compat.to_bytes()`, which are safer and accept either binary or text strings as input.
  - Moved and renamed language detection functionality: `textacy.text_utils.detect_language()` => `textacy.lang_utils.detect_lang()`. The idea is to add more/better lang-related functionality here in the future.
  - Updated and cleaned up documentation throughout the code base.
  - Added and refactored many tests, for both new and old functionality, significantly increasing test coverage while significantly reducing run-time. Also, added a proper coverage report to CI builds. This should help prevent future errors and inspire better test-writing.
  - Bumped the minimum required spaCy version: v2.0.0 => v2.0.12, for access to their full set of custom extension functionality.
Fixed:
os.path.isfile()
or os.path.isdir()
, rather than os.path.exists()
.