PDF Crawler

This is SimFin's open source PDF crawler. It can be used to crawl all PDFs from a website.

You specify a starting page, and all pages linked from that page are crawled. Links that lead to other websites are ignored, but PDFs that are linked on the original page and hosted on a different domain are still fetched.
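The scope rule above can be sketched as a small link classifier. This is an illustration only, not the crawler's actual code: the function name `classify_link` and the rule that a URL whose path ends in `.pdf` counts as a PDF are assumptions made for the example.

```python
from urllib.parse import urljoin, urlparse


def classify_link(start_url, href):
    """Return 'pdf', 'page' or 'skip' for a link found on a crawled page.

    Hypothetical helper: PDFs are fetched from any domain, while ordinary
    pages are followed only when they share the start URL's host.
    """
    absolute = urljoin(start_url, href)  # resolve relative links
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return "skip"  # mailto:, javascript:, etc.
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"  # assumption: detect PDFs by file extension
    if parsed.netloc == urlparse(start_url).netloc:
        return "page"
    return "skip"
```

A real crawler would likely also check the response's content type, since not every PDF URL ends in `.pdf`.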

It can also crawl files "hidden" with JavaScript: the crawler can render the page and click on all elements to make new links appear.

Built-in proxy support.

We use this crawler to gather PDFs from company websites and find financial reports that are then uploaded to SimFin, but it can be used for other documents too.

Development

How to install pdf-crawler for development:

$ git clone https://github.com/SimFin/pdf-crawler.git
$ cd pdf-crawler

# Make a virtual environment with the tool of your choice. Please use Python version 3.6+.
# Here is an example based on pyenv (with the pyenv-virtualenv plugin):
$ pyenv virtualenv 3.6.6 pdf-crawler
$ pyenv activate pdf-crawler

$ pip install -e .

Usage Example

After installing pdf-crawler as described in the "Development" section, you can import the crawler module and call its crawl function like so:

import crawler

crawler.crawl(url="https://simfin.com/crawlingtest/", output_dir="crawling_test", method="rendered-all")
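If the call is made from a larger script, it may help to check the arguments before starting a crawl. The following is a minimal sketch, assuming the keyword names listed under "Parameters". The names `crawl_site`, `crawl_fn` and `VALID_METHODS` are hypothetical and not part of pdf-crawler, and the default of `"normal"` for `method` is a choice made for this example.

```python
VALID_METHODS = ("normal", "rendered", "rendered-all")


def crawl_site(url, output_dir, method="normal", depth=2, crawl_fn=None):
    """Validate the arguments, then delegate to crawler.crawl.

    Hypothetical wrapper; `crawl_fn` exists only so that the crawl function
    can be swapped out, for example in tests.
    """
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    if depth < 0:
        raise ValueError("depth must not be negative")
    if crawl_fn is None:
        import crawler  # the package installed in the "Development" section
        crawl_fn = crawler.crawl
    return crawl_fn(url=url, output_dir=output_dir, method=method, depth=depth)
```

Calling `crawl_site("https://simfin.com/crawlingtest/", "crawling_test", method="normal", depth=1)` would then run a plain HTML crawl that stays close to the starting page.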

Parameters

url - the URL to crawl
output_dir - the directory where the files should be saved
method - the crawling method, one of three values: normal (plain HTML crawling), rendered (renders the HTML page so that content built by frontend SPA frameworks such as Angular or Vue is read properly) and rendered-all (renders the HTML page and clicks on all clickable elements, such as buttons, to reveal links that are hidden somewhere)
depth - the "depth" to crawl, that is, how many levels of sub-pages the crawler follows from the starting page before it stops. Default is 2.
gecko_path - if you choose the crawling method "rendered-all", you have to install geckodriver, the WebDriver executable for Firefox (whose engine is called Gecko). Specify the path to the downloaded executable here.
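To illustrate what a depth limit means, here is a minimal sketch of a depth-limited breadth-first walk over an in-memory link graph. It is not the crawler's implementation: the function `pages_within_depth` and the sample `site` dictionary are made up for the example, and it assumes that the starting page counts as level 0.

```python
from collections import deque


def pages_within_depth(links, start, depth=2):
    """Breadth-first walk over `links`, a dict mapping page -> linked pages.

    Assumption: the start page is level 0, and pages at level `depth`
    are still visited, but their links are not followed.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if level == depth:
            continue  # depth limit reached: do not follow further links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"]}
print(pages_within_depth(site, "/", depth=2))  # ['/', '/a', '/b', '/a/1']
```

With `depth=2`, the page `/a/1/x` is never reached because it sits three links away from the start.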

License

Available under the MIT license.

Credits

@gwaramadze, @q7v6rhgfzc8tnj3d, @thf24

Open Source Agenda is not affiliated with "Pdf Crawler" Project. README Source: SimFin/pdf-crawler
