SimFin's open source PDF crawler
This is SimFin's open-source PDF crawler. It can be used to crawl all PDFs from a website.
You specify a starting page, and all pages linked from it are crawled. Links that lead to other sites are ignored, but PDFs that are linked on the original page and hosted on a different domain are still fetched.
It can also crawl files "hidden" with JavaScript (the crawler can render the page and click on all elements to make new links appear).
Built-in proxy support.
We use this crawler to gather PDFs from company websites to find financial reports that are then uploaded to SimFin, but it can be used for other documents too.
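The scoping rule described above (follow links on the same site, ignore other sites, but still fetch PDFs wherever they are hosted) can be sketched roughly as follows. This is an illustration only, not the crawler's actual code: `classify_link` is a hypothetical helper, and detecting PDFs by the `.pdf` suffix is an assumption made for brevity. The sketch also treats PDFs found on any crawled page the same way, which is a simplification.

```python
from urllib.parse import urljoin, urlparse


def classify_link(start_url, page_url, href):
    """Return 'pdf', 'follow', or 'skip' for a link found on page_url.

    Hypothetical helper illustrating the scoping rule; not part of pdf-crawler.
    """
    target = urljoin(page_url, href)  # resolve relative links against the page
    parsed = urlparse(target)
    if parsed.scheme not in ("http", "https"):
        return "skip"  # mailto:, javascript:, and similar
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"  # assumption: a .pdf suffix marks a PDF, on any host
    if parsed.hostname == urlparse(start_url).hostname:
        return "follow"  # same host as the starting page
    return "skip"  # ordinary page on another site
```

For instance, with `https://example.com/` as the starting page, a link to `/about` would be followed, a link to a PDF on another host would be fetched, and a link to a regular page on another host would be skipped.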
How to install pdf-crawler for development:
$ git clone https://github.com/SimFin/pdf-crawler.git
$ cd pdf-crawler
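# Make a virtual environment with the tool of your choice. Please use Python version 3.6+
# Here is an example based on pyenv (with the pyenv-virtualenv plugin):
$ pyenv virtualenv 3.6.6 pdf-crawler
# Activate the new environment so the package is installed into it
$ pyenv activate pdf-crawler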
$ pip install -e .
After installing pdf-crawler as described in the "Development" section, you can import the crawler module and use it like so:
import crawler
crawler.crawl(url="https://simfin.com/crawlingtest/", output_dir="crawling_test", method="rendered-all")
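If you want to crawl several sites in one run, a thin wrapper around the call above may be enough. The sketch below uses only the `url`, `output_dir`, and `method` keyword arguments shown in the example. `crawl_sites`, `output_dir_for`, and the `crawls` base directory are hypothetical names introduced here for illustration, and the crawl function is passed in as a parameter so the wrapper can be tried without the package installed.

```python
from urllib.parse import urlparse


def output_dir_for(url, base_dir="crawls"):
    """Build a per-site output directory name from the URL's host (hypothetical helper)."""
    host = urlparse(url).hostname or "unknown-host"
    return f"{base_dir}/{host}"


def crawl_sites(urls, crawl_fn, base_dir="crawls", method="rendered-all"):
    """Call crawl_fn once per starting URL and return the output directories used."""
    output_dirs = []
    for url in urls:
        out = output_dir_for(url, base_dir)
        # Only the keyword arguments from the example above are passed on.
        crawl_fn(url=url, output_dir=out, method=method)
        output_dirs.append(out)
    return output_dirs
```

With pdf-crawler installed, this could be called as `crawl_sites(["https://simfin.com/crawlingtest/"], crawl_fn=crawler.crawl)`.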
Available under the MIT license.