A Powerful Spider (Web Crawler) System in Python.
- Pause the project for `scheduler.PAUSE_TIME` (default: 5min) when the last `scheduler.FAIL_PAUSE_NUM` (default: 10) tasks failed, and dispatch `scheduler.UNPAUSE_CHECK_NUM` (default: 3) tasks after `scheduler.PAUSE_TIME`. The project resumes if any one of the last `scheduler.UNPAUSE_CHECK_NUM` tasks succeeds.
- `--splash-endpoint=http://splash:8050/execute` option.
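The pause/resume policy above can be sketched as follows. This is an illustrative model only, not the scheduler's actual code; the helper names `should_pause` and `should_resume` are hypothetical:

```python
# Sketch of the pause/resume policy: pause after FAIL_PAUSE_NUM
# consecutive failures, then dispatch UNPAUSE_CHECK_NUM probe tasks
# and resume if any of them succeeds.
FAIL_PAUSE_NUM = 10     # default: 10
UNPAUSE_CHECK_NUM = 3   # default: 3

def should_pause(recent_results):
    # pause when the last FAIL_PAUSE_NUM task results are all failures
    tail = recent_results[-FAIL_PAUSE_NUM:]
    return len(tail) == FAIL_PAUSE_NUM and not any(tail)

def should_resume(check_results):
    # after PAUSE_TIME, UNPAUSE_CHECK_NUM tasks are dispatched;
    # resume if any one of them succeeded
    return any(check_results[-UNPAUSE_CHECK_NUM:])

assert should_pause([False] * 10)
assert not should_pause([True] + [False] * 9)
assert should_resume([False, False, True])
```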
- `from projects import project` support.
- `user_agent` parameter support in `self.crawl`; you could already set the user agent via `headers`, though.
- Fix `connection_timeout` not working issue.
- Fix `need_auth` option not applied on webdav issue.
- Fix `cancel` so it can stop an active task that has `auto_recrawl` enabled.
- `Handler.crawl_config` is now applied to the task when fetching. (It used to be applied when the task was created; this means proxy/headers can be changed afterward.) See http://docs.pyspider.org/en/latest/apis/self.crawl/#handlercrawl_config
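A minimal sketch of how `crawl_config` is declared. The header and proxy values are made-up placeholders, and a stand-in `BaseHandler` replaces `pyspider.libs.base_handler.BaseHandler` so the snippet is self-contained:

```python
class BaseHandler:
    """Stand-in for pyspider.libs.base_handler.BaseHandler."""

class Handler(BaseHandler):
    # crawl_config is merged into every self.crawl() call; since this
    # release it is applied at fetch time, so editing proxy/headers here
    # also affects tasks that were created earlier.
    crawl_config = {
        'headers': {'User-Agent': 'my-crawler/1.0'},  # placeholder UA
        'proxy': 'localhost:8080',                    # placeholder proxy
    }
```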
- Fix `connect to scheduler rpc error: error(10061, '')` error when everything runs with `--run-in=thread` (the default on the Windows platform).
- Fix `response.save` lost when fetch failed issue.
- `on_finished` callback, see http://docs.pyspider.org/en/latest/About-Projects/#on_finished-callback
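A minimal sketch of an `on_finished` callback, following the `(self, response, task)` signature shown in the linked docs; the body is illustrative, and a stand-in `BaseHandler` keeps the snippet self-contained:

```python
class BaseHandler:
    """Stand-in for pyspider.libs.base_handler.BaseHandler."""

class Handler(BaseHandler):
    def on_finished(self, response, task):
        # called when the project's task queue drains, e.g. to
        # trigger post-processing of the collected results
        print('crawl finished:', task)
```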
- `retry_delay` is a dict specifying retry intervals. The items in the dict are `{retried: seconds}`, and the special key `''` (empty string) sets the default retry delay for retry counts not listed.
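For example, the delay table below (values are illustrative) and a tiny lookup helper, `delay_for`, which is not part of pyspider, show how the `''` key acts as the fallback:

```python
# retried -> seconds to wait before the next attempt
retry_delay = {
    0: 30,             # first retry after 30 seconds
    1: 1 * 60 * 60,    # second retry after an hour
    '': 24 * 60 * 60,  # default for any retry count not listed
}

def delay_for(retried, table):
    # hypothetical helper: fall back to the '' entry when the
    # retry count has no explicit delay
    return table.get(retried, table[''])

assert delay_for(0, retry_delay) == 30
assert delay_for(7, retry_delay) == 24 * 60 * 60
```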
- `max_redirects` in `self.crawl` to control the maximum number of redirects when doing the fetch, thanks to @AtaLuZiK.
- `validate_cert` in `self.crawl` to ignore errors of the server's certificate.
- `etree` for Response; `etree` is a cached `lxml.html.HtmlElement` object, thanks to @waveyeung.
- `age`
- `self.save`
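Putting the fetch options above together in one hedged sketch: the URL and XPath are placeholders, and `BaseHandler.crawl` is stubbed to echo its arguments so the snippet runs without pyspider installed:

```python
class BaseHandler:
    """Stand-in for pyspider's BaseHandler; crawl() just echoes its args."""
    def crawl(self, url, **kwargs):
        return url, kwargs

class Handler(BaseHandler):
    def on_start(self):
        return self.crawl(
            'https://example.com/',  # placeholder URL
            validate_cert=False,     # ignore certificate errors
            max_redirects=5,         # follow at most 5 redirects
            callback=self.index_page,
        )

    def index_page(self, response):
        # response.etree is a cached lxml.html.HtmlElement,
        # so lxml APIs such as xpath() work directly on it
        return [a.get('href') for a in response.etree.xpath('//a')]
```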
- `--logging-config` to specify a custom logging config (to disable werkzeug logs, for instance). You can get a sample config from `pyspider/logging.conf`.
- `group` info is added to the task package now.
- Show `exetime` of a task in the task page.
- `limit` and `offset` parameter support in result dump.
- `send_message`: you can use the command `pyspider send_message [project] [message]` to send a message to a project via the command line.
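On the receiving side, messages sent this way are delivered to the project's `on_message` callback. A minimal sketch with an illustrative return value and a stand-in `BaseHandler` so it is self-contained:

```python
class BaseHandler:
    """Stand-in for pyspider.libs.base_handler.BaseHandler."""

class Handler(BaseHandler):
    def on_message(self, project, msg):
        # called with messages sent via
        # `pyspider send_message [project] [message]`
        return {'from': project, 'message': msg}
```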
One mode not only means all-in-one: it runs everything in one process over tornado.ioloop. One mode is designed for debugging. You can test scripts written in local files and use `--interactive` to choose a task to be tested.

With one mode you can use `pyspider.libs.utils.python_console()` to open an interactive shell in your script context to test your code.

Full documentation: http://docs.pyspider.org/en/latest/Command-Line/#one