🗃 Open source self-hosted web archiving. Takes URLs/browser history/book...
Heritrix is the Internet Archive's open-source, extensible, web-scale, a...
Collect and revisit web pages.
The archivist's web crawler: WARC output, dashboard for all crawls, dyna...
InterPlanetary Wayback: A distributed and persistent archive replay syst...
Run a high-fidelity browser-based crawler in a single Docker container
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron...
WarcDB: Web crawl data as SQLite databases.
Streaming WARC/ARC library for fast web archive IO
🐋 Web Archiving Integration Layer: One-Click User Instigated Pres...
Bitextor generates translation memories from multilingual websites
Chrome extension to "Create WARC files from any webpage"
CoCrawler is a versatile web crawler built using modern tools and concur...
A toolkit for CDX indices such as Common Crawl and the Internet Archive'...
An Apache Spark framework for easy data processing, extraction as well a...