Module for automatic summarization of text documents and HTML pages.
Python & command-line tool to gather text on the Web: web crawling/scrap...
Golang PDF library for creating and processing PDF files (pure Go)
Tika-Python is a Python binding to the Apache Tika™ REST services allowi...
A general list of resources for image text localization and recognition ...
This repository has moved! https://github.com/unidoc/unipdf
Heuristic-based boilerplate removal tool
A self-hosted search engine for documents.
Text Extraction, Rendering and Converting of PDF Documents
A simple library and set of tools for parsing, modifying, and composing ...
[UNMAINTAINED] Extract values from strings and fill your structs with NLP.
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Reworked https://www.readability.com/ parsing library (now https://mercu...
AWS Lambda functions to extract text from various binary formats.
Entity Disambiguation as text extraction (ACL 2022)