CSO Classifier Releases

Python library that classifies content from scientific papers with the topics of the Computer Science Ontology (CSO).

v3.1

1 year ago

This release brings two main changes. The first concerns the library (and the code) used to compute the Levenshtein similarity: we previously relied on python-Levenshtein, which required python3-devel. This version uses rapidfuzz instead, which is as fast as the previous library and much easier to install across systems. The second change is an updated list of dependencies, including a newer version of igraph.
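For reference, the metric in question can be sketched in plain Python as follows (an illustrative implementation only; the classifier itself delegates this computation to rapidfuzz, which is far faster):

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1] (1.0 = identical).

    A plain-Python sketch of the metric; the library relies on
    rapidfuzz for a much faster implementation of the same idea.
    """
    if not a and not b:
        return 1.0
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```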

v3.0

2 years ago

This release welcomes some improvements under the hood. In particular:

  • we refactored the code, reorganising scripts into more elegant classes
  • we added functionalities to automatically set up and update the classifier to the latest version of CSO
  • we added the explanation feature, which returns the chunks of text that allowed the classifier to infer a given topic
  • the syntactic module now takes advantage of the spaCy POS tagger (previously used only by the semantic module)
  • the grammar for the chunk parser is now more robust: {<JJ.*>*<HYPH>*<JJ.*>*<HYPH>*<NN.*>*<HYPH>*<NN.*>+}
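To give an intuition of what that grammar accepts, here is a simplified, pure-Python stand-in for the chunk parser: the tag pattern is translated into a regular expression over space-separated POS tags (the real classifier uses a proper chunker over spaCy's tagging, not this helper):

```python
import re

# The grammar {<JJ.*>*<HYPH>*<JJ.*>*<HYPH>*<NN.*>*<HYPH>*<NN.*>+} rendered
# as a regex over space-separated POS tags: optional adjectives and hyphens,
# followed by at least one noun tag.
GRAMMAR = re.compile(
    r"(?:JJ\S*\s)*(?:HYPH\s)*(?:JJ\S*\s)*(?:HYPH\s)*"
    r"(?:NN\S*\s)*(?:HYPH\s)*(?:NN\S*\s)+"
)

def extract_chunks(tagged):
    """Return the word sequences whose POS-tag sequence matches the grammar.

    `tagged` is a list of (word, tag) pairs, e.g. from a POS tagger.
    """
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in GRAMMAR.finditer(tags):
        # map character offsets back to token indices: each token
        # contributes exactly one trailing space to the tag string
        start = tags[:match.start()].count(" ")
        end = tags[:match.end()].count(" ")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks
```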

In addition, we added an outlier detection component to the post-processing module. This component improves the accuracy of the result set by removing erroneous topics that are conceptually distant from the others. It is enabled by default and can be disabled by setting delete_outliers = False when calling the CSO Classifier (see Parameters).
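One way such a filter can work (an illustrative sketch only, not the classifier's actual implementation; the metric and threshold here are hypothetical) is to drop any topic whose average distance to the other inferred topics exceeds a threshold:

```python
def remove_outliers(topics, distance, max_avg_distance=0.8):
    """Keep only topics whose average distance to the other inferred
    topics is within the threshold.

    `distance` is any pairwise metric in [0, 1], e.g. one derived from
    word embeddings; both it and `max_avg_distance` are illustrative.
    """
    if len(topics) < 2:
        return list(topics)
    kept = []
    for topic in topics:
        others = [t for t in topics if t != topic]
        avg = sum(distance(topic, t) for t in others) / len(others)
        if avg <= max_avg_distance:
            kept.append(topic)
    return kept
```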

Please be aware that, since we substantially restructured the code into classes, the way of running the classifier has changed too. If you are using a previous version of the classifier, we encourage you to update it (pip install -U cso-classifier) and modify your calls to the classifier accordingly. Read our usage examples.
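Under the new class-based API, a call looks roughly like this (parameter values are illustrative and the metadata fields are placeholders; see the usage examples for the authoritative form):

```python
from cso_classifier import CSOClassifier

# instantiate the classifier once, then reuse it across papers
cc = CSOClassifier(modules="both", enhancement="first", explanation=True)

paper = {
    "title": "...",     # placeholders: supply your paper's metadata
    "abstract": "...",
    "keywords": "..."
}
result = cc.run(paper)
```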

We would like to thank James Dunham @jamesdunham from CSET (Georgetown University) for suggesting how to improve the code.

v2.3.2

4 years ago

Version alignment with PyPI. Similar to version v2.3.1.

v2.3.1

4 years ago

Bug fix: added some exception handlers.

v2.3

4 years ago

This new release contains a bug fix and the latest version of the CSO ontology.

Bug fix: when running in batch mode, the classifier treated the keyword field as an array instead of a string. As a result, instead of processing the keywords (separated by commas), it processed each single letter, hence inferring wrong topics. This has now been fixed. In addition, if the keyword field actually is an array, the classifier will first 'stringify' it and then process it.
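The fix can be sketched with a small normalisation helper (a hypothetical illustration, not the library's actual code):

```python
def normalize_keywords(keywords):
    """Accept the keywords field as either a comma-separated string or
    an array; return a clean list of keyword strings."""
    if isinstance(keywords, (list, tuple)):
        # 'stringify' the array first, as the classifier now does,
        # so both input shapes go through the same code path
        keywords = ", ".join(str(k) for k in keywords)
    return [k.strip() for k in keywords.split(",") if k.strip()]
```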

We also downloaded and packed the latest version of the CSO ontology.

v2.2

4 years ago

In this version (release v2.2), we (i) updated the requirements needed to run the classifier, (ii) removed all unnecessary warnings, and (iii) enabled multiprocessing. In particular, we removed all unneeded requirements that were installed in development mode by cleaning the requirements.txt file.

When processing certain research papers, the classifier displayed warnings raised by the kneed library. Since the classifier can automatically adapt to the conditions that trigger these warnings, we decided to hide them and prevent users from being concerned about this output.

This version of the classifier provides improved scalability through multiprocessing. Once the number of workers is set (i.e., num_workers >= 1), each worker is given a copy of the CSO Classifier and a chunk of the corpus to process. The results are then aggregated once all processes are completed. Please be aware that this feature is only available in batch mode.
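The scheme can be sketched with Python's multiprocessing module (a minimal illustration; the worker function here is a stand-in, not the classifier's actual code):

```python
from multiprocessing import Pool

def classify_paper(item):
    """Stand-in for running a single paper through the classifier."""
    paper_id, metadata = item
    # a real worker would return the CSO topics inferred for the paper
    return paper_id, {"topics": []}

def classify_batch(papers, num_workers=2):
    """Distribute the corpus across num_workers processes and
    aggregate the results once every worker has finished."""
    with Pool(num_workers) as pool:
        results = pool.map(classify_paper, list(papers.items()))
    return dict(results)
```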

v2.1

5 years ago

The CSO Classifier is an application that takes as input the text from the abstract, title, and keywords of a research paper and outputs a list of relevant concepts from CSO. This new release (version v2.1) aims at improving its scalability. Compared to the previous version (v2.0), the classifier relies on a cached word2vec model which connects the words within the model vocabulary directly with the CSO topics. Thanks to this cache, the classifier is able to quickly retrieve all the CSO topics that could be inferred from given tokens, speeding up processing. In addition, this cache is lighter (~64MB) than the actual word2vec model (~366MB), which saves additional loading time. Thanks to this improvement, the CSO Classifier is around 24x faster and can easily be run on a large corpus of scholarly data.
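Conceptually, the cache replaces repeated similarity queries against the word2vec model with a direct dictionary lookup. The shape below is a hypothetical illustration (the entries and scores are invented, and the real cache format may differ):

```python
# Hypothetical token-to-topic cache: each vocabulary word points straight
# at the CSO topics it can trigger, with a precomputed relevance score,
# so no word2vec similarity query is needed at classification time.
CACHE = {
    "neural":   [("neural networks", 0.82), ("deep learning", 0.76)],
    "ontology": [("ontology", 0.95), ("semantic web", 0.71)],
}

def topics_for(tokens, cache=CACHE, threshold=0.7):
    """Look up candidate CSO topics for a list of tokens via the cache,
    keeping the best score seen for each topic."""
    found = {}
    for token in tokens:
        for topic, score in cache.get(token, []):
            if score >= threshold:
                found[topic] = max(score, found.get(topic, 0.0))
    return found
```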

v2.0

5 years ago

Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this repository, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.

v1.0

5 years ago

The CSO Classifier is an application that classifies the content of scientific papers (i.e., full-text, abstract, and title) according to CSO. Specifically, given a research paper, the classifier takes as input text from its abstract or full-text and outputs a list of relevant concepts from CSO. It does so by mapping the n-grams in the text to concepts in the CSO and then inferring their super concepts. It accepts four optional parameters:

  • min_similarity controls the minimum similarity value for mapping n-grams to concepts.
  • infer_super_topics controls whether, given a topic (e.g., Linked Data), the classifier infers only its direct super topics (e.g., Semantic Web) or all of its super topics (e.g., Semantic Web, WWW, Computer Science).
  • num_children controls the number of sub-concepts necessary for inferring a super concept. For example, when this factor is set to three, the topic Semantic Web will be inferred only if at least three of its sub-topics (e.g., OWL, RDF, Linked Data) are present.
  • verbose is a flag controlling the verbosity level of the result.

The CSO Classifier removes English stop words and gathers unigrams, bigrams, and trigrams. Then, for each n-gram, it computes the Levenshtein similarity with the labels of the topics in CSO. Research topics whose similarity with an n-gram is equal to or higher than the minimum similarity threshold are added to the final set of topics. To further enrich the set of inferred topics, the CSO Classifier also infers their super topics by exploiting the skos:broaderGeneric relationships within CSO [1]. The output of this process can contain equivalent topics linked by relatedEquivalent relationships in CSO, e.g., Ontology Matching and Ontology Mapping; therefore, the CSO Classifier also cleans up these redundant concepts by preserving only one of them.
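The num_children rule described above can be sketched as follows (an illustrative helper under the stated assumptions; `broader` stands in for CSO's skos:broaderGeneric relationships):

```python
def infer_super_topics(topics, broader, num_children=3):
    """Infer a super topic when at least num_children of its sub-topics
    were detected in the paper.

    `broader` maps each topic to its super topics, mimicking CSO's
    skos:broaderGeneric relationships.
    """
    counts = {}
    for topic in topics:
        for sup in broader.get(topic, []):
            counts[sup] = counts.get(sup, 0) + 1
    # keep only super topics supported by enough detected sub-topics
    return {sup for sup, n in counts.items() if n >= num_children}
```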

The algorithm produces two kinds of results, depending on the verbose parameter. When it is set to true, the algorithm returns a detailed list of topics, with the matched n-grams and the computed similarity scores. Conversely, when verbose is set to false, the algorithm returns a more concise list of topics.

More info: http://oro.open.ac.uk/55908/