Scrapy, a fast high-level web crawling & scraping framework for Python.
Highlights:
Feed exports now support Google Cloud Storage as a storage backend
The new FEED_EXPORT_BATCH_ITEM_COUNT setting allows delivering output items in batches of up to the specified number of items. It also serves as a workaround for delayed file delivery, which causes Scrapy to only start item delivery after the crawl has finished when using certain storage backends (S3, FTP, and now GCS). See the settings sketch after this list.
The base implementation of item loaders has been moved into a separate library, itemloaders, allowing usage from outside Scrapy and a separate release schedule (standalone usage sketch after this list).
The startproject command no longer makes unintended changes to the permissions of files in the destination folder, such as removing execution permissions.
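For illustration, a minimal settings.py sketch combining the two feed-export features above; the bucket, path, and project ID are placeholders, and the google-cloud-storage client library is assumed to be installed and authenticated:

```python
# settings.py -- minimal sketch; bucket, path, and project ID are placeholders
GCS_PROJECT_ID = "my-gcp-project"

FEEDS = {
    # With batching enabled, the URI must contain a %(batch_id)d or
    # %(batch_time)s placeholder so each batch is delivered to its own object
    "gs://my-bucket/exports/items-%(batch_id)d.json": {
        "format": "json",
    },
}

# Deliver a new file after every 100 items instead of waiting for the
# crawl to finish
FEED_EXPORT_BATCH_ITEM_COUNT = 100
```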
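And a short sketch of using the itemloaders package without Scrapy; ProductLoader and its fields are made up for the example:

```python
from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()  # keep a single value per field
    name_in = MapCompose(str.strip)         # input processor for "name"

loader = ProductLoader()                    # loads into a plain dict by default
loader.add_value("name", "  Widget  ")
loader.add_value("price", "9.99")
print(loader.load_item())                   # {'name': 'Widget', 'price': '9.99'}
```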
Highlights:
TextResponse.json method (see the sketch after this list)
bytes_received signal that allows canceling a response download (example after this list)
CookiesMiddleware fixes
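A minimal sketch of the new method; the endpoint and the shape of the JSON payload are assumptions:

```python
import scrapy

class ApiSpider(scrapy.Spider):
    name = "api"
    start_urls = ["https://example.com/api/items"]  # placeholder endpoint

    def parse(self, response):
        data = response.json()  # deserializes the JSON response body
        for entry in data.get("items", []):  # assumed payload shape
            yield {"id": entry["id"]}
```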
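And a sketch of canceling a download from a bytes_received handler; raising StopDownload(fail=False) hands the partial response to the normal callback, and the URL is a placeholder:

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload

class HeadersOnlySpider(scrapy.Spider):
    name = "headers_only"
    start_urls = ["https://example.com"]  # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_bytes_received,
                                signal=signals.bytes_received)
        return spider

    def on_bytes_received(self, data, request, spider):
        # Cancel the download after the first chunk arrives
        raise StopDownload(fail=False)

    def parse(self, response):
        self.logger.info("got %d bytes", len(response.body))
```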
Highlights:
New FEEDS setting to export to multiple feeds (settings sketch after this list)
New Response.ip_address attribute (spider sketch after this list)
Response.follow_all now supports an empty URL iterable as input (#4408, #4420)
Removed top-level reactor imports to prevent errors about the wrong Twisted reactor being installed when setting a different Twisted reactor using TWISTED_REACTOR (#4401, #4406)
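A settings sketch exercising the two new settings mentioned above; the output file names are placeholders:

```python
# settings.py -- write the same items to two feeds in a single crawl
FEEDS = {
    "items.json": {"format": "json"},
    "items.csv": {"format": "csv"},
}

# Opt in to the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```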
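And a spider sketch touching Response.ip_address and Response.follow_all; the URL and CSS selector are placeholders:

```python
import scrapy

class PaginationSpider(scrapy.Spider):
    name = "pagination"
    start_urls = ["https://example.com"]  # placeholder

    def parse(self, response):
        # New attribute: the IP address the response was fetched from
        self.logger.info("served from %s", response.ip_address)
        # follow_all builds one request per matched link; an empty match
        # (or an empty iterable passed as urls) simply yields no requests
        yield from response.follow_all(css="a.next-page", callback=self.parse)
```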
Highlights:
Revert the fix for #3804 (#3819), which has a few undesired side effects (#3897, #3976).
Enforce lxml 4.3.5 or lower for Python 3.4 (#3912, #3918)
Fix Python 2 support (#3889, #3893, #3896)