The simple, easy to use command line web crawler.
Much update!
Spidy now parses <base> tags to grab links that wouldn't have been recognized before. Thanks lxml!
Content-Length header.
spidy.zip contains just crawler.py and config/, while the source code archives contain all files.
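The `<base>` behavior can be illustrated with a small stdlib-only sketch (this is illustrative code, not spidy's actual implementation, which relies on lxml): when a `<base href>` tag is present, it replaces the page's own URL as the base for resolving relative links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sketch, not spidy's real code: honor a <base href> tag
# when turning relative links into absolute ones.
class LinkCollector(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url  # default base is the page's own URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]  # <base> overrides the page URL
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base, attrs["href"]))

parser = LinkCollector("https://example.com/index.html")
parser.feed('<head><base href="https://example.com/docs/"></head>'
            '<a href="page.html">docs</a>')
print(parser.links)  # ['https://example.com/docs/page.html']
```

Without the `<base>` check, the link would resolve against the page URL instead and point to the wrong place.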
Final 1.3.0 release. Error handling has been added back in; no other changes were needed.
Optimized all file creation and loading. Everything is now saved with UTF-8 encoding, allowing for foreign characters and EMOJI in pages.
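A minimal sketch of the idea, using plain built-in file I/O: passing an explicit encoding on both write and read lets accented characters and emoji round-trip intact.

```python
# Illustrative only, not spidy's actual save/load code: always pass an
# explicit UTF-8 encoding so non-ASCII page content survives a round trip.
text = "Résumé ☕ 🚀"

with open("saved_page.html", "w", encoding="utf-8") as f:
    f.write(text)

with open("saved_page.html", encoding="utf-8") as f:
    restored = f.read()

print(restored == text)  # True
```

Relying on the platform default encoding instead would raise or mangle these characters on some systems.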
In alpha while the error-handling system is being slightly redesigned. Still functional, however!
Added domain restrictions. Crawling can now be limited to a certain domain, such as wsj.com, https://www.wsj.com, or https://www.wsj.com/article. This can be set when entering configuration settings or in the config files.
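A rough sketch of how such a restriction check could work (the function name and logic here are hypothetical, not spidy's actual code): a bare domain matches any subdomain, while a full URL acts as a prefix filter.

```python
from urllib.parse import urlparse

# Hypothetical helper, illustrating the idea rather than spidy's API.
def within_restriction(url, restriction):
    """True if url falls under restriction, which may be a bare domain
    like 'wsj.com' or a URL prefix like 'https://www.wsj.com/article'."""
    if restriction.startswith("http"):
        return url.startswith(restriction)
    host = urlparse(url).netloc
    return host == restriction or host.endswith("." + restriction)

print(within_restriction("https://www.wsj.com/article/abc", "wsj.com"))  # True
print(within_restriction("https://www.wsj.com/article/abc",
                         "https://www.wsj.com/article"))                 # True
print(within_restriction("https://example.com/x", "wsj.com"))            # False
```

Links failing the check would simply be skipped instead of queued for crawling.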
Also more bugfixes and MIME types because those are cool.
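For a generic illustration of the MIME handling involved (this is stdlib `mimetypes`, not spidy's code), a crawler can map between file names and MIME types when deciding how to save a response:

```python
import mimetypes

# Illustrative only: translating between file names and MIME types,
# the kind of Content-Type bookkeeping a crawler needs when saving pages.
print(mimetypes.guess_type("report.pdf"))  # ('application/pdf', None)

ext = mimetypes.guess_extension("application/json")
print(ext)  # '.json'
```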
The first official release of spidy! A GUI is in the works, as well as many more awesome features.
spidy.zip contains only the files necessary to run the crawler, while the source code downloads contain all the things.