Transistor, a Python web scraping framework for intelligent use cases.
If you follow the documentation example that uses newt.db, you must install RelStorage<=2.1.1.
Fixed a bug in the BaseWorker.load_items() method which previously resulted
in losing scrape data when the number of workers did not equal the number
of tasks. Now, using any number of workers or pool size will result in
consistent export/save results, while scrape time will vary with the number
of workers assigned. Wrote tests to ensure the same.
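The behavior described above can be illustrated with a minimal sketch. It is not Transistor's implementation: it uses the standard library's threading and queue modules as a stand-in, and run_workers and the "scraped:" placeholder are hypothetical names invented for this example. Every worker pulls from one shared queue until it is empty, so the results do not depend on how many workers there are.

```python
import queue
import threading


def run_workers(tasks, num_workers):
    """Process every task exactly once, whatever num_workers is (hypothetical helper)."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        # Keep pulling tasks until the shared queue is empty.
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return
            item = f"scraped:{task}"  # placeholder for the real scrape work
            with lock:
                results.append(item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Sort so the output is comparable across different worker counts.
    return sorted(results)
```

Calling run_workers(["a", "b", "c"], 1) and run_workers(["a", "b", "c"], 5) returns the same list of items; only the elapsed time would differ in a real scrape.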
Added a url parameter to the WorkGroup, which makes for a somewhat more
attractive API than including the URL in a kwarg. The URL was originally
included as a kwarg because, depending on how the custom Spider is set up,
the URL may already be specified, and specifying it again is redundant.
For the sake of API clarity, the URL must now be specified in the
WorkGroup. At the least, it is easier to read at a quick glance.
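As a rough illustration of why an explicit url field reads more clearly, here is a small sketch. WorkGroupSketch is a hypothetical stand-in written for this example, not Transistor's actual WorkGroup class, and its fields other than url are assumptions.

```python
from typing import NamedTuple, Optional


class WorkGroupSketch(NamedTuple):
    """Hypothetical stand-in for a work group; not Transistor's WorkGroup."""
    name: str
    url: str                       # explicit and required, visible at a glance
    kwargs: Optional[dict] = None  # any remaining spider options (assumed field)


# The target URL is stated up front instead of being hidden inside kwargs.
group = WorkGroupSketch(
    name="books",
    url="https://example.com/",
    kwargs={"timeout": 5000},
)
```

Because url has no default here, leaving it out raises a TypeError at construction time, which is one way to insist that the URL is always specified.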
Many API breaking changes. See README at https://github.com/bomquote/transistor/blob/master/CHANGES
auth, baseurl, browser, cookies, crawlera_user, http_session_timeout,
http_session_valid, LUA_SOURCE, max_retries, name, number, referrer,
searchurl, splash_args, user_agent.
**kwargs if desired
**kwarg called keywords to set the name of the spreadsheet column heading
which contains the target search terms. For example: keywords='titles' or
keywords='part_numbers'. Defaults to "item".
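The idea behind the keywords setting can be sketched as follows. This is only an assumption about the general mechanism, not Transistor's code: it reads CSV text with the standard library instead of a real spreadsheet, and read_search_terms is a hypothetical helper name.

```python
import csv
import io


def read_search_terms(csv_text, keywords="item"):
    """Return the values found under the column heading named by `keywords`.

    Hypothetical helper: CSV text stands in for the spreadsheet here.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the chosen column is missing or empty.
    return [row[keywords] for row in reader if row.get(keywords)]


sheet = "titles,price\nDune,9\nEmma,7\n"
terms = read_search_terms(sheet, keywords="titles")
```

With the default heading, a sheet whose first column is named item would be read without passing keywords at all.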