Python PDF parser for scientific publications: content and figures

SciPDF Parser

A Python parser for scientific PDF based on GROBID.

Installation

Use pip to install from the GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

The parser also needs the en_core_web_sm model for spaCy. Run python -m spacy download en_core_web_sm to download it.
To test the parser against a newer GROBID release, change the GROBID version in serve_grobid.sh.
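
As a quick sanity check after installation, the sketch below reports whether the required packages can be found in the current environment. It uses only the standard library. The helper name is_installed is hypothetical, and the check assumes the spaCy model is installed as an importable package named en_core_web_sm.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module or package can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the spaCy model is importable under the name en_core_web_sm.
    for name in ("scipdf", "spacy", "en_core_web_sm"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```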

Usage

Run GROBID using the provided bash script before parsing a PDF.

NOTE: The recommended way to run GROBID is via Docker, so make sure Docker is running on your machine. Update the script to use the latest GROBID version, since new releases generally bring substantial improvements.

bash serve_grobid.sh
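
Before parsing, it can help to confirm that the GROBID server is reachable. The sketch below is not part of scipdf. It assumes the server listens on localhost port 8070 and queries GROBID's /api/isalive health endpoint. The helper names isalive_url and grobid_is_alive are hypothetical.

```python
import urllib.error
import urllib.request


def isalive_url(base_url: str = "http://localhost:8070") -> str:
    """Build the health-check URL for a GROBID server (assumed base URL)."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: treat as not alive.
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```

If the check fails, start the server with the script above and wait until the container has finished loading before retrying.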

This script runs GROBID on the default port 8070. To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf')  # returns a dictionary

# You can also parse directly from a URL to a PDF. If as_list is set to True,
# the 'text' of each parsed section is a list of paragraphs instead.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
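
Once you have the dictionary, ordinary Python is enough to inspect it. The following minimal sketch lists each section heading with a word count. It runs on a hand-written sample shaped like the output example above, not on real parser output, and the helper name section_overview is hypothetical. It accepts 'text' either as a string or as a list of paragraphs, matching the two as_list settings.

```python
def section_overview(article_dict):
    """Return (heading, word_count) pairs for each parsed section."""
    overview = []
    for section in article_dict.get("sections", []):
        text = section.get("text", "")
        # With as_list=True the text is a list of paragraphs; join them first.
        if isinstance(text, list):
            text = " ".join(text)
        overview.append((section.get("heading", ""), len(text.split())))
    return overview


# Hand-written sample for illustration only, not real parser output.
sample = {
    "title": "Example title",
    "sections": [
        {"heading": "Introduction", "text": "one two three"},
        {"heading": "Methods", "text": ["four five", "six seven eight"]},
    ],
}

for heading, n_words in section_overview(sample):
    print(f"{heading}: {n_words} words")
```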

To parse figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

You can see example output figures in the figures folder.
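
Because the input folder should contain only PDF files, a small pre-check can catch stray entries before calling parse_figures. This is a sketch using only the standard library. The helper name non_pdf_entries is hypothetical and not part of scipdf.

```python
from pathlib import Path


def non_pdf_entries(folder):
    """Return sorted names of entries in folder that are not PDF files."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if not (p.is_file() and p.suffix.lower() == ".pdf")
    )


if __name__ == "__main__":
    folder = Path("example_data")
    if folder.is_dir():
        extras = non_pdf_entries(folder)
        print("Non-PDF entries:", extras if extras else "none")
```

An empty result means every entry in the folder is a file with a .pdf extension.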

Open Source Agenda is not affiliated with "Scipdf Parser" Project. README Source: titipata/scipdf_parser

Repository

titipata/scipdf_parser

License

MIT
