A Powerful Spider (Web Crawler) System in Python.
- Pause the project for `scheduler.PAUSE_TIME` (default: 5min) when the last `scheduler.FAIL_PAUSE_NUM` (default: 10) tasks failed, and dispatch `scheduler.UNPAUSE_CHECK_NUM` (default: 3) tasks after `scheduler.PAUSE_TIME`. The project resumes if any one of the last `scheduler.UNPAUSE_CHECK_NUM` tasks succeeds.
- `--splash-endpoint=http://splash:8050/execute` option.
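The pause/resume policy above can be sketched as follows. This is an illustrative model only, not the scheduler's actual code; the helper names `should_pause` and `should_resume` are hypothetical:

```python
# Sketch of the pause/resume policy: pause after FAIL_PAUSE_NUM
# consecutive failures, then dispatch UNPAUSE_CHECK_NUM probe tasks
# and resume if any of them succeeds.
FAIL_PAUSE_NUM = 10     # default: 10
UNPAUSE_CHECK_NUM = 3   # default: 3

def should_pause(recent_results):
    # pause when the last FAIL_PAUSE_NUM task results are all failures
    tail = recent_results[-FAIL_PAUSE_NUM:]
    return len(tail) == FAIL_PAUSE_NUM and not any(tail)

def should_resume(check_results):
    # after PAUSE_TIME, UNPAUSE_CHECK_NUM tasks are dispatched;
    # resume if any one of them succeeded
    return any(check_results[-UNPAUSE_CHECK_NUM:])

assert should_pause([False] * 10)
assert not should_pause([True] + [False] * 9)
assert should_resume([False, False, True])
```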
- `from projects import project` support.
- `user_agent` parameter support in `self.crawl`; you could already set the user agent via `headers`, though.
- Fix `connection_timeout` not working issue.
- Fix `need_auth` option not applied on webdav issue.
- Fix `cancel` so it can stop an active task that has `auto_recrawl` enabled.
- `Handler.crawl_config` is now applied to the task when fetching. (It used to be applied when the task was created; this means proxy/headers can be changed afterward.) See http://docs.pyspider.org/en/latest/apis/self.crawl/#handlercrawl_config
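A minimal sketch of how `crawl_config` is declared. The header and proxy values are made-up placeholders, and a stand-in `BaseHandler` replaces `pyspider.libs.base_handler.BaseHandler` so the snippet is self-contained:

```python
class BaseHandler:
    """Stand-in for pyspider.libs.base_handler.BaseHandler."""

class Handler(BaseHandler):
    # crawl_config is merged into every self.crawl() call; since this
    # release it is applied at fetch time, so editing proxy/headers here
    # also affects tasks that were created earlier.
    crawl_config = {
        'headers': {'User-Agent': 'my-crawler/1.0'},  # placeholder UA
        'proxy': 'localhost:8080',                    # placeholder proxy
    }
```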
- Fix `connect to scheduler rpc error: error(10061, '')` error when everything runs with `--run-in=thread` (the default on the Windows platform).
- Fix `response.save` lost when fetch failed issue.
- `on_finished` callback, see http://docs.pyspider.org/en/latest/About-Projects/#on_finished-callback
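A minimal sketch of an `on_finished` callback, following the `(self, response, task)` signature shown in the linked docs; the body is illustrative, and a stand-in `BaseHandler` keeps the snippet self-contained:

```python
class BaseHandler:
    """Stand-in for pyspider.libs.base_handler.BaseHandler."""

class Handler(BaseHandler):
    def on_finished(self, response, task):
        # called when the project's task queue drains, e.g. to
        # trigger post-processing of the collected results
        print('crawl finished:', task)
```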
- `retry_delay` is a dict specifying retry intervals. The items in the dict are `{retried: seconds}`, and the special key `''` (empty string) sets the default retry delay for retry counts not listed.
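For example, the delay table below (values are illustrative) and a tiny lookup helper, `delay_for`, which is not part of pyspider, show how the `''` key acts as the fallback:

```python
# retried -> seconds to wait before the next attempt
retry_delay = {
    0: 30,             # first retry after 30 seconds
    1: 1 * 60 * 60,    # second retry after an hour
    '': 24 * 60 * 60,  # default for any retry count not listed
}

def delay_for(retried, table):
    # hypothetical helper: fall back to the '' entry when the
    # retry count has no explicit delay
    return table.get(retried, table[''])

assert delay_for(0, retry_delay) == 30
assert delay_for(7, retry_delay) == 24 * 60 * 60
```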
- `max_redirects` in `self.crawl` to control the maximum number of redirects when doing the fetch, thanks to @AtaLuZiK.
- `validate_cert` in `self.crawl` to ignore errors of the server's certificate.
- `etree` for Response; `etree` is a cached `lxml.html.HtmlElement` object, thanks to @waveyeung.
- `age`
- `self.save`
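Putting the fetch options above together in one hedged sketch: the URL and XPath are placeholders, and `BaseHandler.crawl` is stubbed to echo its arguments so the snippet runs without pyspider installed:

```python
class BaseHandler:
    """Stand-in for pyspider's BaseHandler; crawl() just echoes its args."""
    def crawl(self, url, **kwargs):
        return url, kwargs

class Handler(BaseHandler):
    def on_start(self):
        return self.crawl(
            'https://example.com/',  # placeholder URL
            validate_cert=False,     # ignore certificate errors
            max_redirects=5,         # follow at most 5 redirects
            callback=self.index_page,
        )

    def index_page(self, response):
        # response.etree is a cached lxml.html.HtmlElement,
        # so lxml APIs such as xpath() work directly on it
        return [a.get('href') for a in response.etree.xpath('//a')]
```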
- `--logging-config` to specify a custom logging config (to disable werkzeug logs, for instance). You can get a sample config from `pyspider/logging.conf`.
- `group` info is added to the task package now.
- Show `exetime` of a task in the task page.
- `limit` and `offset` parameter support in result dump.
- `send_message`: you can use the command `pyspider send_message [project] [message]` to send a message to a project via the command line.
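On the receiving side, messages sent this way are delivered to the project's `on_message` callback. A minimal sketch with an illustrative return value and a stand-in `BaseHandler` so it is self-contained:

```python
class BaseHandler:
    """Stand-in for pyspider.libs.base_handler.BaseHandler."""

class Handler(BaseHandler):
    def on_message(self, project, msg):
        # called with messages sent via
        # `pyspider send_message [project] [message]`
        return {'from': project, 'message': msg}
```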
One mode not only means all-in-one: it runs everything in one process over tornado.ioloop. One mode is designed for debugging. You can test scripts written in local files and use `--interactive` to choose a task to be tested.

With one mode you can use `pyspider.libs.utils.python_console()` to open an interactive shell in your script context to test your code.

Full documentation: http://docs.pyspider.org/en/latest/Command-Line/#one