Data parameter option conversion issue for crawlData API.
🐞 漏洞修复
crawlData API 的 data 参数选项转换问题。
v7.1.1
11 months ago
🐞 Bug fixes
Correctly handle the header of the post method configured by the crawlData API.
🐞 漏洞修复
正确处理 crawlData API 配置 post 方法的 header 。
v7.1.0
1 year ago
🚀 Features
The flexibility of crawlFile API has been upgraded again, mainly adjusting fileName, storeDir, and extension.
The storeDir and extension of the advanced writing method (CrawlFileAdvancedConfig) are changed to storeDirs and extensions respectively, and the type is string or (string | null)[], and the fileNames option is added, and the type is (string | null)[] . If it is an array, it will be assigned to the crawling targets in order.
The fileName of the detailed target notation (CrawlFileDetailTargetConfig) adds a null type, which is used to use the default file name instead of the fileName corresponding to the advanced notation (CrawlFileAdvancedConfig) fileNames.
🚀 特征
crawlFile API 灵活度再次升级,主要对 fileName、storeDir、extension 进行调整。
The params configuration option for the crawlData API is not working.
🐞 漏洞修复
crawlData API 的 params 配置选项不起作用。
v7.0.0
1 year ago
🚨 Breaking Changes
Fingerprint upgrade:
The fingerprint of the advanced writing method is renamed to fingerprints, which is an array writing method, which stores objects of the DetailTargetFingerprintCommon type, which is convenient for customization. Internally, the objects inside will be randomly assigned to the target.
Adjustment of crawlPage fingerprint options: the maximum width and height of the fingerprint configuration of advanced writing and detailed target writing are changed to optional.
Proxy upgrade: create a crawler instance, change the proxy of the advanced writing method and the detailed target writing method to the object writing method, with three attributes: urls, switchByHttpStatus and switchByErrorCount, urls can set multiple proxy URLs, and the internal default uses the first one first, switchByHttpStatus Set which non-compliant response status codes need to switch the proxy, and switchByErrorCount sets how many times the proxy needs to be switched when errors such as timeouts arrive. The proxy rotation feature needs to be used with error retries.
Return value type adjustment: CrawlCommonRes, CrawlPageSingleRes, CrawlDataSingleRes and CrawlFileSingleRes are renamed to CrawlCommonResult, CrawlPageSingleResult, CrawlDataSingleResult and CrawlFileSingleResult respectively
🚀 Features
It is possible to cancel the configuration of the upper-level unified setting by setting null in the option.
The userAgent option in DetailTargetFingerprintCommon overrides the object notation and allows customization of the maximum and minimum values of the major version, minor version, and revision number inside. Each crawl target gets a new userAgent .
A new proxyDetails property is added to the crawling results to record the proxy status.
Added 'random' attribute value to mobile option of fingerprint configuration, allowing internal randomization.
Terminal prompts are simplified and color adjusted.
🐞 Bug fixes
Unable to create multiple levels of non-existent folders on linux systems.
For configuration, major changes have been made to each crawling API configuration, and the same API supports more crawling configuration methods, each of which has its own significance.
For the result, the result of each request will be wrapped in an object, which provides information about the result of this request, such as: id, result, success, maximum retry, number of retries, collected error information, etc. . Automatically determine whether the return value is wrapped in an array according to the configuration method you choose, and the type is perfectly matched in TS.
For obtaining results through the callback function, the callback is no longer executed after a single request is completed like the v4.x version, but will be executed sequentially after the crawling is completed, which will not block subsequent crawling.
🚀 Features
Added a retry mechanism, which can be set for all crawling requests, for a single crawling request, and for a single request.
A new priority queue is added to use priority crawling according to the priority of a single request.
For more configurations that may be reused, you can set the baseConfig settings passed in when requesting configuration, API crawling configuration, and generating crawler instances, such as: timeout, proxy, intervalTime, etc., and the weight is: requestConfig > APIConfig > baseConfig.
For crawlFile API, file path, name, suffix and other information can be set individually for each file. Added the beforeSave life cycle function before saving the file. You can get the file data of the Buffer type, and you can perform operations such as compression on the data in the callback. The returned new Buffer data will replace the original data and write it into the file.
Update the output of crawling on the console, and collect the error information generated by crawling into an error queue. After the crawling is completed, you can get the error message queue through the return value.
🚨 重大改变
对于配置,每个爬取 API 配置发生重大改变,同一个 API 支持更多爬取配置方式,每种方式都有其存在的意义。
About the result processing of each crawling target: it will start processing after a single target is completed, saving time and improving performance. Originally, it waited for all targets to be completed before processing, and there would be free time during the crawling process.
About the execution timing of the second parameter callback function of the crawlPage, crawlData, and crawlFile APIs: it will be executed at the end, and the result obtained is the same as the result of the Promise method.
About the type: PageRequestConfig, DataRequestConfig and FileRequestConfig are changed to CrawlPageDetailTargetConfig, CrawlDataDetailTargetConfig and CrawlFileDetailTargetConfig respectively, the purpose is to not only add the configuration of the request, but also expand more, called detailed target usage. CrawlPageConfigObject, CrawlDataConfigObject, and CrawlFileConfigObject changed to CrawlPageAdvancedConfig, CrawlDataAdvancedConfig, and CrawlFileAdvancedConfig respectively, named Advanced Usage.
Configuration options in fileConfig of crawlFile: can be set directly in the root object configuration. The beforeSave lifecycle function changed to onBeforeSaveItemFile.
About the object results of crawlPage, crawlData and crawlFile: remove the crawlCount attribute, and get the number of times by retryCount + 1. errorQueue was renamed to crawlErrorQueue.
🚀 Features
Added device fingerprint to avoid identifying and tracking us from different locations through fingerprint recognition. You can use the default with a switch, and if you need to specify it, you can set it uniformly for all crawling targets in the advanced usage, or you can specify the settings through the detailed target usage.
Adding multiple attributes for each advanced usage can be configured in an advanced way to set the object uniformly, without having to set it repeatedly for each target configuration. The new onCrawlItemComplete lifecycle function will be executed after each crawling goal is completed, and the result of the crawling goal will be passed to the callback function.
Added crawlPage in the configuration of creating a crawler application, you can set the configuration of creating a browser in the crawlPage.launchBrowser option (type is PuppeteerLaunchOptions from Puppeteer).
crawlPage adds viewport option, which is used to set the viewport of the page.