Textpipe: clean and extract metadata from text
THIS REPOSITORY IS NO LONGER MAINTAINED
textpipe is a Python package for converting raw text into clean, readable text
and extracting metadata from that text. Its functionality includes removing
HTML tags to make raw text readable and extracting metadata such as the
number of words and the named entities in the text.
Cleaning removes HTML and other unreadable constructs.
It is recommended that you install textpipe in a virtual environment.
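First, create your virtual environment using venv, virtualenv, or virtualenvwrapper.
Using venv, if your default interpreter is Python 3.6:
python3 -m venv .venv
Using virtualenv:
virtualenv venv -p python3.6
Using virtualenvwrapper:
mkvirtualenv textpipe -p python3.6
Then install textpipe using pip:
pip install textpipe
Finally, install the required packages from requirements.txt:
pip install -r requirements.txt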
Although the requirements.txt file that ships with the package specifies spaCy's en_core_web_sm model, you can change it to whichever model and language your use case requires. See spaCy's documentation on its available models for more information.
>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
Sample text!
>>> print(document.language)
en
>>> print(document.nwords)
2
>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3}
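To make the clean and nwords values in the session above more concrete, here is a minimal standard-library sketch of that kind of processing. It is not textpipe's implementation: the names strip_markup and count_words are hypothetical, and it assumes that words are simply whitespace-separated tokens, which may differ from how textpipe counts them.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores tags, comments and declarations."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of plain text between markup constructs.
        self.chunks.append(data)


def strip_markup(raw):
    """Hypothetical helper: drop markup and collapse runs of whitespace."""
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()  # flush any text still buffered by the parser
    return " ".join("".join(collector.chunks).split())


def count_words(raw):
    """Hypothetical helper: count whitespace-separated tokens in the cleaned text."""
    return len(strip_markup(raw).split())


sample_text = 'Sample text! <!DOCTYPE>'
print(strip_markup(sample_text))  # Sample text!
print(count_words(sample_text))   # 2
```

This sketch keeps any text found inside script or style elements, so it is only a rough stand-in for a real cleaning step.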
To extend the existing textpipe operations with your own operations, register them on a pipeline:
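from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

# A custom operation takes the document plus optional context and settings.
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

custom_argument = {'argument': 1}
# Register the operation under a name, then append it as a step with its arguments.
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))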
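The registration pattern used above can be sketched with a toy class. This is an illustration only, under stated assumptions: MiniPipeline and n_words are hypothetical names, the document is a plain string rather than a textpipe Doc, and textpipe's own Pipeline may work differently internally.

```python
class MiniPipeline:
    """Toy pipeline (hypothetical, not textpipe's class): named steps run in order."""

    def __init__(self, steps):
        # Each step is stored as a (name, settings) tuple, as in the snippet above.
        self.steps = [(name, None) for name in steps]
        self._operations = {}  # operation name -> callable
        self.context = {}      # data shared by all operations

    def register_operation(self, name, func):
        """Make a callable available under the given step name."""
        self._operations[name] = func

    def __call__(self, doc):
        results = {}
        for name, settings in self.steps:
            func = self._operations[name]  # KeyError if the step was never registered
            results[name] = func(doc, context=self.context, settings=settings)
        return results


def n_words(doc, context=None, settings=None, **kwargs):
    # Assumption: words are whitespace-separated tokens.
    return len(doc.split())


def custom_op(doc, context=None, settings=None, **kwargs):
    # Return the configured argument so the effect of settings is visible.
    return settings['argument']


pipe = MiniPipeline(['NWords'])
pipe.register_operation('NWords', n_words)
pipe.register_operation('CUSTOM_STEP', custom_op)
pipe.steps.append(('CUSTOM_STEP', {'argument': 1}))
print(pipe('Sample text!'))  # {'NWords': 2, 'CUSTOM_STEP': 1}
```

Keeping operations in a dictionary keyed by name means a step can be added or replaced without touching the pipeline class itself.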
See CONTRIBUTING for guidelines for contributors.
Changes
0.12.1
0.12.0
0.11.9: ents properties
0.11.8: cats attribute
0.11.7
0.11.6
0.11.5
0.11.4
0.11.1
0.11.0
0.9.0
0.8.6
0.8.5
0.8.4
0.8.3
0.8.2
0.8.1
0.8.0
0.7.2
0.7.0: context kwarg; register_operation in pipeline