Pdf Text Extraction Benchmark

A project that benchmarks and evaluates how well existing PDF extraction tools can semantically identify and extract the body text of PDF documents, especially scientific articles.

Project README

A Benchmark & Evaluation for Text Extraction from PDF

This project benchmarks and evaluates how well existing PDF extraction tools can semantically identify and extract the body text of PDF documents, especially scientific articles. It provides (1) a benchmark generator, (2) a ready-to-use benchmark and (3) an extensive evaluation with meaningful evaluation criteria.

The Benchmark Generator

The benchmark generator constructs high-quality benchmarks from TeX source files.
It identifies the following 16 logical text blocks: title, author(s), affiliation(s), date, abstract, headings, paragraphs of the body text, formulas, figures, tables, captions, listing-items, footnotes, acknowledgements, references, appendices.
It serializes the desired logical text blocks to plain text, XML or JSON format.

For more details and usage, see benchmark-generator/.
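
To make the serialization step concrete, here is a minimal Python sketch of selecting logical text blocks and writing them as plain text, JSON or XML. It is an illustration only and not the generator's actual code or output schema: the `(role, text)` representation, the function names and the XML element names are all assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical in-memory representation: (role, text) pairs in reading order.
# The real generator derives such blocks from TeX sources; these are made up.
blocks = [
    ("title", "A Sample Article"),
    ("heading", "1 Introduction"),
    ("paragraph", "PDF files store glyph positions, not logical structure."),
    ("formula", "E = mc^2"),
]


def select(blocks, roles):
    """Keep only the blocks whose role is in the desired set."""
    return [(role, text) for role, text in blocks if role in roles]


def to_plain_text(blocks):
    """Join block texts, separating blocks by a blank line."""
    return "\n\n".join(text for _, text in blocks)


def to_json(blocks):
    """Serialize blocks as a JSON array of {role, text} objects."""
    return json.dumps([{"role": r, "text": t} for r, t in blocks], indent=2)


def to_xml(blocks):
    """Serialize blocks as <block role="..."> elements under <document>."""
    root = ET.Element("document")
    for role, text in blocks:
        element = ET.SubElement(root, "block", role=role)
        element.text = text
    return ET.tostring(root, encoding="unicode")


# Keep title, headings and body paragraphs; the formula is dropped.
wanted = select(blocks, {"title", "heading", "paragraph"})
print(to_plain_text(wanted))
print(to_json(wanted))
print(to_xml(wanted))
```

Selecting by role before serializing mirrors the idea of "desired logical text blocks": the same block list can feed any of the three output formats.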

The Benchmark

The benchmark consists of 12,099 ground truth files and 12,099 PDF files of scientific articles, randomly selected from arXiv.org. Each ground truth file contains the title, the headings and the body text paragraphs of a particular scientific article.
It was generated using the benchmark generator described above.

For more details, see benchmark/.

The Evaluation

The evaluation assesses the following 13 PDF extraction tools: pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, Icecite.
It provides meaningful evaluation criteria to assess how well a tool semantically identifies (1) words, (2) the reading order, (3) paragraph boundaries and (4) the semantic roles of text elements in a PDF.

For more details, see evaluation/.
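
As a rough illustration of what criteria (1) to (3) could measure, the following Python sketch compares a tool's output against a ground truth text. These are simplified stand-ins chosen for this example, not the project's actual evaluation measures; the function names and sample texts are hypothetical.

```python
from collections import Counter
import difflib


def words(text):
    """Split text into whitespace-separated words."""
    return text.split()


def paragraphs(text):
    """Split text into paragraphs at blank lines."""
    return [p for p in text.split("\n\n") if p.strip()]


def word_diff(ground_truth, extracted):
    """Count words missing from, and spurious in, the extracted text."""
    gt, ex = Counter(words(ground_truth)), Counter(words(extracted))
    missing = sum((gt - ex).values())
    spurious = sum((ex - gt).values())
    return missing, spurious


def order_similarity(ground_truth, extracted):
    """Similarity of the two word sequences; 1.0 means identical order."""
    matcher = difflib.SequenceMatcher(
        None, words(ground_truth), words(extracted), autojunk=False
    )
    return matcher.ratio()


# Hypothetical ground truth and tool output: the tool swapped the two
# paragraphs, merged them into one, and kept a stray page number.
truth = "First paragraph here.\n\nSecond paragraph follows."
output = "Second paragraph follows. First paragraph here. 12"

print(word_diff(truth, output))
print(order_similarity(truth, output))
print(len(paragraphs(truth)), len(paragraphs(output)))
```

A bag-of-words comparison ignores order on purpose, so the swapped paragraphs surface only in the sequence similarity and the merged paragraphs only in the paragraph count. This keeps the three kinds of error separable.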

Open Source Agenda is not affiliated with "Pdf Text Extraction Benchmark" Project. README Source: ckorzen/pdf-text-extraction-benchmark


Repository

ckorzen/pdf-text-extraction-benchmark

License

MIT
