Scispacy Versions

A full spaCy pipeline and models for scientific/biomedical documents.

v0.5.4

2 months ago

Update for spacy 3.7.x

Full Changelog: https://github.com/allenai/scispacy/compare/v0.5.3...v0.5.4

v0.5.3

7 months ago

Retrains the models with spacy 3.6.x to be compatible with the latest spacy version

Full Changelog: https://github.com/allenai/scispacy/compare/v0.5.2...v0.5.3

v0.5.2

1 year ago

This release includes an update of the entity linkers to use the latest UMLS release (2022AB), which includes information about newer entities like COVID-19.

In [10]: doc = nlp("COVID-19 is a global pandemic.")

In [11]: linker = nlp.get_pipe('scispacy_linker')

In [12]: linker.kb.cui_to_entity[doc.ents[0]._.kb_ents[0][0]]
Out[12]:
CUI: C5203670, Name: COVID19 (disease)
Definition: A viral disorder generally characterized by high FEVER; COUGH; DYSPNEA; CHILLS; PERSISTENT TREMOR; MUSCLE PAIN; HEADACHE; SORE THROAT; a new loss of taste and/or smell (see AGEUSIA and ANOSMIA) and other symptoms of a VIRAL PNEUMONIA. In severe cases, a myriad of coagulopathy associated symptoms often correlating with COVID-19 severity is seen (e.g., BLOOD COAGULATION; THROMBOSIS; ACUTE RESPIRATORY DISTRESS SYNDROME; SEIZURES; HEART ATTACK; STROKE; multiple CEREBRAL INFARCTIONS; KIDNEY FAILURE; catastrophic ANTIPHOSPHOLIPID ANTIBODY SYNDROME and/or DISSEMINATED INTRAVASCULAR COAGULATION). In younger patients, rare inflammatory syndromes are sometimes associated with COVID-19 (e.g., atypical KAWASAKI SYNDROME; TOXIC SHOCK SYNDROME; pediatric multisystem inflammatory disease; and CYTOKINE STORM SYNDROME). A coronavirus, SARS-CoV-2, in the genus BETACORONAVIRUS is the causative agent.
TUI(s): T047
Aliases (abbreviated, total: 47):
         2019 Novel Coronavirus Infection, SARS-CoV-2 Disease, Human Coronavirus 2019 Infection, SARS-CoV-2 Infection, Disease caused by severe acute respiratory syndrome coronavirus 2 (disorder), Disease caused by SARS-CoV-2, 2019 nCoV Disease, 2019 Novel Coronavirus Disease, COVID-19 Virus Disease, Virus Disease, COVID-19

It also includes a small bug fix to the abbreviation detector.

Note: The models (e.g. en_core_sci_sm) are still labeled as version v0.5.1, as this release did not involve retraining the base models, only the entity linkers.
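The lookup in the session above indexes straight into `kb_ents[0][0]`, which fails with an IndexError when a mention has no linked candidates. The following is a minimal sketch of a safer lookup, assuming `kb_ents` is a list of (CUI, score) pairs as the indexing above suggests. The helper name `top_candidate` is hypothetical, and plain Python stand-ins replace the real linker objects so the snippet runs without any models installed.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple


def top_candidate(
    kb_ents: Sequence[Tuple[str, float]],
    cui_to_entity: Mapping[str, Any],
) -> Optional[Any]:
    """Return the KB entry for the first candidate, or None if there is none."""
    if not kb_ents:
        return None  # the mention had no linked candidates
    cui, _score = kb_ents[0]
    return cui_to_entity.get(cui)


# Stand-in data: the CUI and name come from the output above;
# the score is a made-up placeholder, not a real linker score.
cui_to_entity = {"C5203670": "COVID19 (disease)"}
kb_ents = [("C5203670", 0.98)]

print(top_candidate(kb_ents, cui_to_entity))  # COVID19 (disease)
print(top_candidate([], cui_to_entity))       # None
```

With a real pipeline, the same function could presumably be called as `top_candidate(ent._.kb_ents, linker.kb.cui_to_entity)` for each entity in `doc.ents`.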

Full Changelog: https://github.com/allenai/scispacy/compare/v0.5.1...v0.5.2

v0.5.1

1 year ago

Retrains the models with spacy 3.4.x to be compatible with the latest spacy version

v0.5.0

2 years ago

Updates scispacy to be compatible with the latest spacy version (3.2.3)

v0.4.0

3 years ago

This release of scispacy is compatible with spaCy 3. It also includes a new model 🥳, en_core_sci_scibert, which uses SciBERT base uncased for parsing and POS tagging (but not NER yet; this will come in a later release).

v0.3.0

3 years ago

New Features

Hearst Patterns

This component implements Automatic Acquisition of Hyponyms from Large Text Corpora using the spaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of Hearst patterns, which includes higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).

This component produces a doc-level attribute on the spaCy doc: doc._.hearst_patterns, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

  • The relation rule used to extract the hyponym (type: str)
  • The more general concept (type: spacy.Span)
  • The more specific concept (type: spacy.Span)

Usage:

import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
# spaCy 2.x API (current at the time of this release): add the component instance directly.
hyponym_pipe = HyponymDetector(nlp, extended=True)
nlp.add_pipe(hyponym_pipe, last=True)

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
# Output: [('such_as', Keystone plant species, fig trees)]
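As a rough sketch of how the (rule, general, specific) tuples described above might be consumed, the snippet below groups specific concepts under their general concept. The function name `group_hyponyms` is hypothetical, and plain strings stand in for spaCy spans so that it runs without a model; with real spans it reads their `.text`.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def group_hyponyms(patterns: Iterable[Tuple[str, Any, Any]]) -> Dict[str, List[str]]:
    """Map each general concept to the specific concepts extracted for it."""
    grouped = defaultdict(list)
    for _rule, general, specific in patterns:
        # spaCy spans expose .text; plain strings are used as-is.
        general_text = getattr(general, "text", general)
        specific_text = getattr(specific, "text", specific)
        grouped[general_text].append(specific_text)
    return dict(grouped)


# Stand-in for doc._.hearst_patterns, using the example output above.
patterns = [("such_as", "Keystone plant species", "fig trees")]
print(group_hyponyms(patterns))
# {'Keystone plant species': ['fig trees']}
```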

OntoNotes Mixin: Clear Format > UD

Thanks to Yoav Goldberg for this fix! Yoav noticed that the dependency labels for the OntoNotes data use a different format than the converted GENIA Trees. Yoav wrote some scripts to convert between them, including normalising some syntactic phenomena that were being treated inconsistently between the two corpora.

Bug Fixes

  • #252 - removed duplicated aliases in the entity linkers, reducing the size of the UMLS linker by ~10%
  • #249 - fix the path to the rxnorm linker

v0.2.5

3 years ago

New Features 🥇

New Models

  • Models compatible with Spacy 2.3.0 🥳

Entity Linkers

#246, #233

  • Updated the UMLS KB to use the 2020AA release, categories 0,1,2,9.

  • umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.

  • mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in PubMed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.

  • rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It comprises several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

  • go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.

  • hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

Bug Fixes 🐛

#217 - Fixes a bug in the Abbreviation detector

API Changes

  • Entity Linkers now modify the Span._.kb_ents rather than the Span._.umls_ents to reflect the fact that we now have more than one entity linker. Span._.umls_ents will be deprecated in v1.0.
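Code that has to work on both sides of this change could read whichever attribute is present. The following is a minimal sketch, not part of scispacy: the helper name `linked_entities` is hypothetical, and `SimpleNamespace` objects stand in for spaCy spans so it runs without a pipeline. It assumes that reading an unregistered extension attribute raises AttributeError, which is how spaCy's `._` namespace behaves.

```python
from types import SimpleNamespace


def linked_entities(span):
    """Return linker candidates from kb_ents, falling back to the older umls_ents."""
    extensions = span._
    for name in ("kb_ents", "umls_ents"):
        # getattr with a default absorbs the AttributeError for a missing extension.
        ents = getattr(extensions, name, None)
        if ents is not None:
            return ents
    return []


# Stand-in spans; the (CUI, score) pair uses a made-up placeholder score.
new_style = SimpleNamespace(_=SimpleNamespace(kb_ents=[("C5203670", 0.98)]))
old_style = SimpleNamespace(_=SimpleNamespace(umls_ents=[("C5203670", 0.98)]))
unlinked = SimpleNamespace(_=SimpleNamespace())

print(linked_entities(new_style))  # [('C5203670', 0.98)]
print(linked_entities(old_style))  # [('C5203670', 0.98)]
print(linked_entities(unlinked))   # []
```

Checking kb_ents first means the newer attribute wins when both are present.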

v0.2.4

4 years ago

Retrains the models to be compatible with spacy 2.2.1 and rewrites the optional sentence splitting pipe to use pysbd. This pipe is experimental at this point and may be rough around the edges.

v0.2.2

4 years ago

Adds entity linking and abbreviation detection.