Best 24 Text Extraction Open Source Projects

Module for automatic summarization of text documents and HTML pages.

Python & command-line tool to gather text on the Web: web crawling/scrap...

Golang PDF library for creating and processing PDF files (pure Go)

Tika-Python is a Python binding to the Apache Tika™ REST services allowi...

A general list of resources for image text localization and recognition ...

This repository has moved! https://github.com/unidoc/unipdf

Heuristic-based boilerplate removal tool

A self-hosted search engine for documents.

Text extraction, rendering, and conversion of PDF documents

A simple library and set of tools for parsing, modifying, and composing ...

[UNMAINTAINED] Extract values from strings and fill your structs with NLP.

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Reworked https://www.readability.com/ parsing library (now https://mercu...

AWS Lambda functions to extract text from various binary formats.

Entity Disambiguation as text extraction (ACL 2022)