WebMagic Versions

A scalable web crawler framework for Java.

webmagic-parent-0.6.1

7 years ago

This update fixes some problems in 0.6.0 and includes a few small optimizations.

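Changed the default policy to trust all HTTPS certificates #444 @ckex
Fixed a NullPointerException that occurred when URLs were added via startUrls while cookies were in use #438
PhantomJSDownloader now supports a custom crawl.js path #414 @jsbd
POST requests now support 302 redirects #443 @xbynet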

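Note: trusting all certificates by default carries a risk of forged content, but it was added for the convenience of crawling. Users need to judge the safety of the content themselves.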
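
To make the risk in the note above concrete, here is a minimal sketch of what a "trust all certificates" policy looks like with the plain JDK TLS API. It is not WebMagic's actual code: the class name TrustAllSketch and its members are hypothetical, and how the framework wires such a policy into its downloader is not shown here.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical illustration, not part of the WebMagic API.
class TrustAllSketch {

    // A trust manager that accepts every certificate chain without validation.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: nothing is rejected.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: a forged server certificate passes here too.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    // Builds a TLS context whose only trust manager is TRUST_ALL.
    static SSLContext trustAllContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] {TRUST_ALL}, null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("TLS is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trustAllContext().getProtocol());
    }
}
```

Because checkServerTrusted never throws, any server can present any certificate, which is why the content of a crawled page cannot be assumed authentic under this policy.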

WebMagic-0.6.0

7 years ago

This update mainly consists of dependency upgrades and bug fixes.

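#290 Add username/password authentication for proxies @hepan
#194 Refactor parts of the proxy pool code to support custom proxy pools @EdwardsBean
#314 Fix an error caused by an older json-path version depending on the 2.x StringUtils
#380 Upgrade fastjson to 1.2.21
#301 Fix JsonPath not working in annotation mode @Salon-sai
#377 Fix the monitoring module failing when the URL contains a port
#400 Fix a NullPointerException in FileCacheQueueScheduler
#407 Add a new constructor to PhantomJSDownloader that supports a custom phantomjs command @jsbd
#419 Fix threads not terminating when crawling https links, which kept the process running @cpaladin
#374 Upgrade HttpClient to 4.5.2, fixing some security issues
#424 Remove the Guava dependency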

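Because different Guava versions are poorly compatible with each other and often prevented the demo from running, I finally decided to remove the Guava dependency. Users of BloomFilterDuplicateRemover need to add the Guava dependency manually.

#426 Remove Avalon-related packages

Avalon was a previously planned one-stop crawling platform. Because a friend built a similar implementation on top of WebMagic, Gather Platform, Avalon has been abandoned in favor of supporting that project. The WebMagic core will focus on being an in-application framework.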

webmagic-0.5.3

8 years ago

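After a year and a half, the author is finally back. This release mainly fixes some earlier bugs; features will continue to be improved gradually.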

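Upgrade Xsoup to 0.3.1, adding support for the //div[contains(@id,'te')] syntax.
#245 Upgrade Jsoup to 1.8.3, resolving the binary incompatibility of the n-th selector.
#139 Fix the save path problem in JsonFilePipeline
#144 Fix links not being extracted when SourceRegion is added to @TargetUrl
#157 Fix deduplication occasionally not working in FileCacheQueueScheduler @zhugw
#188 Add an interval between retries, 1 second by default @edwardsbean
#193 Fix a possible concurrency problem in the pagination feature MultiPagePipeline @edwardsbean
#198 Fix the bug where site.setHttpProxy() had no effect @okuc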
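
For readers unfamiliar with the contains() predicate mentioned in the Xsoup upgrade above, the following sketch evaluates the same expression with the JDK's built-in XPath engine. This is a swapped-in technique for illustration only: it runs against well-formed XML rather than HTML and does not use Xsoup. The class name, helper method, and sample markup are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration using the JDK XPath API instead of Xsoup.
class XPathContainsSketch {

    // Returns how many nodes in the XML text match the XPath expression.
    static int count(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><div id=\"test\"/><div id=\"other\"/><div id=\"latest\"/></root>";
        // Selects the div elements whose id attribute contains the substring "te".
        System.out.println(count(xml, "//div[contains(@id,'te')]"));
    }
}
```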

WebMagic-0.5.2

9 years ago

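This release mainly includes a refactoring of the Selector part, along with some feature improvements and bug fixes.

Refactored the Selector part so that the structure is clearer and chained XPath extraction is better supported. [Issue #113]

Selected results can now be iterated externally. For example: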

List<Selectable> divs = html.xpath("//div").nodes();
for (Selectable div : divs) {
    System.out.println(div.xpath("//h2").get());
}

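Enhanced automatic encoding detection: in addition to the HTTP headers, the encoding is now also determined from the HTML meta information. Thanks to @fengwuze and @sebastian1118 for the code and suggestions. [Issue #126]
Upgraded Xsoup to 0.2.4, adding an API to determine whether the final result of an XPath extraction is an element or an attribute, and improving the handling of some special characters.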
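
The encoding detection described above can be pictured with the sketch below. It is only an approximation under stated assumptions: the class MetaCharsetSketch, the regular expressions, and the "header first, then meta tag" priority are all hypothetical and are not taken from WebMagic's source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not WebMagic's actual detection code.
class MetaCharsetSketch {

    // Matches charset=... in a Content-Type header value.
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Matches charset=... inside a <meta> tag, covering both <meta charset="...">
    // and <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*?charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Assumed priority: the HTTP header wins, the HTML meta tag is the fallback.
    // Returns null when neither source declares a charset.
    static String detectCharset(String contentType, String html) {
        String fromHeader = firstGroup(HEADER_CHARSET, contentType);
        return fromHeader != null ? fromHeader : firstGroup(META_CHARSET, html);
    }

    private static String firstGroup(Pattern pattern, String input) {
        if (input == null) {
            return null;
        }
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(detectCharset("text/html", "<head><meta charset=\"gbk\"></head>"));
    }
}
```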

Added PageMapper, so annotation mode can now be used anywhere to parse a page! [Issue #120] For example:

public void process(Page page) {
    // Create a mapper; GithubRepo is an annotated POJO
    PageMapper<GithubRepo> githubRepoPageMapper = new PageMapper<GithubRepo>(GithubRepo.class);
    // Parse the page directly to get the extracted result
    GithubRepo githubRepo = githubRepoPageMapper.get(page);
    page.putField("repo", githubRepo);
}

Added support for multiple proxies with smart switching. Thanks to @yxssfxwzy for contributing the code. Use Site.setHttpProxyPool to enable this feature. [Pull #128]

public void process(Page page) {
    Site site = Site.me().setHttpProxyPool(
            Lists.newArrayList(
                    new String[]{"192.168.0.2", "8080"},
                    new String[]{"192.168.0.3", "8080"}));
}

Bugfix:

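Fixed JsonFilePipeline failing to create directories automatically. [Issue #122]
Fixed improper matching of special characters by Jsonp in removePadding. [Issue #124]
Fixed a type conversion error when the result selected by JsonPathSelector is not a String. [Issue #129]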

WebMagic-0.5.1

10 years ago

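This update mainly includes some changes to Scheduler. Upgrading is strongly recommended for users who have customized their own Scheduler.

Fixed a bug where RedisScheduler could not deduplicate. Thanks to @codev777 for careful testing and for finding the problem. #117
Refactored Scheduler and added the DuplicateRemover interface, which abstracts deduplication separately so that different deduplication methods can be chosen within the same Scheduler. #118
Added BloomFilter deduplication. A BloomFilter is a data structure that can deduplicate a large number of URLs with very little memory. The drawback is that a small number of non-duplicate URLs are judged to be duplicates, which causes URL loss (less than 0.5%).

The default HashSet deduplication can be switched to BloomFilter deduplication as follows: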

spider.setScheduler(new QueueScheduler()
        .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000))); // 10000000 is the estimated number of pages
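
To show why a BloomFilter can lose a few URLs but never lets a real duplicate through, here is a minimal, self-contained sketch. It is not the BloomFilterDuplicateRemover implementation: the class name, the double-hashing scheme, and the sizing parameters are assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration of the data structure, not WebMagic's implementation.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Returns true if the URL was possibly seen before; otherwise records it and returns false.
    // A true result may be a false positive, but a false result is always correct.
    boolean isDuplicate(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            // Double hashing derives the i-th bit position from two base hashes.
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        return allSet;
    }

    // 32-bit FNV-1a, used here only as a second, independent hash.
    private static int fnv1a(String s) {
        int hash = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 16, 3);
        System.out.println(filter.isDuplicate("http://example.com/a"));
        System.out.println(filter.isDuplicate("http://example.com/a"));
    }
}
```

The memory cost is fixed by the bit array size rather than by the number of URLs, which is the trade-off the release note describes: less memory in exchange for a small false-positive rate.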

WebMagic-0.5.0

10 years ago

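This update mainly adds monitoring and rewrites the multi-threading part, which greatly improves multi-threaded performance. It also includes some annotation-mode optimizations, multi-page support, and other features.

Overall project progress:

The official website webmagic.io is online! Launched alongside it is the detailed official documentation at http://webmagic.io/docs, which makes the framework easier to use.
Three new co-developers, @ccliangbo, @ouyanghuangzheng, and @linkerlin, have joined to help maintain the project.
The official forum http://bbs.webmagic.io/ and the official QQ group 373225642 are online. More attention will be paid to community building from now on.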

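Monitoring:

Added monitoring: JMX can be used to monitor page counts and spider status, and to start and stop spiders. Documentation: http://webmagic.io/docs/posts/ch4-basic-page-processor/monitor.html #98

Multi-threading:

Rewrote the multi-threading part and fixed the problem of the main dispatch thread being blocked by worker threads, which greatly improves multi-threaded efficiency. All users are recommended to upgrade. #110
Added a timeout to the wait/notify mechanism used while the main thread waits for new URLs, preventing the spider from hanging in rare cases. #111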
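
The timeout added in #111 guards against a missed notification leaving the waiting thread blocked forever. The sketch below shows the general pattern with Object.wait(long) and notifyAll(). It is a simplified illustration: the class TimedWaitSketch and its methods are hypothetical and do not mirror the Spider internals.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of wait/notify with a timeout, not WebMagic's Spider code.
class TimedWaitSketch {
    private final Queue<String> urls = new ArrayDeque<>();

    // Adds a URL and wakes up any thread waiting in poll().
    synchronized void push(String url) {
        urls.add(url);
        notifyAll();
    }

    // Waits at most timeoutMillis for a URL. Returns null on timeout so the
    // caller can re-check its own state instead of blocking indefinitely.
    synchronized String poll(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (urls.isEmpty()) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    break; // Timed out; wait(0) would block forever, so stop here.
                }
                wait(remainingMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return urls.poll();
    }

    public static void main(String[] args) {
        TimedWaitSketch queue = new TimedWaitSketch();
        System.out.println(queue.poll(20));
        queue.push("http://example.com");
        System.out.println(queue.poll(20));
    }
}
```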

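Extraction API:

Added JSON support: page.getJson().jsonPath() can now be used to parse AJAX requests with jsonPath, and page.getJson().removePadding().jsonPath() to parse JSONP requests. #101
Fixed a Selectable caching problem that caused two reads to return inconsistent results. #73 Thanks to @seveniu for finding the problem.
Added support for multiple PageProcessors on one Spider, distinguished by URL. Thanks to @sebastian1118 for submitting the patch. Usage example: PatternProcessorExample #86
Fixed uncommon tags not being selectable with nth-of-type (for example //div/svg[2]). #75
Fixed XPath expressions containing special characters failing to parse even when escaped. #77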
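
As background for the JSONP support listed above: a JSONP response wraps a JSON document in a callback call, and removing that padding leaves plain JSON that jsonPath can handle. The sketch below illustrates the idea in plain Java. The class JsonpSketch and its regular expression are assumptions, and the real removePadding method may behave differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of JSONP unwrapping, not WebMagic's Json class.
class JsonpSketch {

    // Matches callbackName( ... ) with an optional trailing semicolon.
    private static final Pattern PADDING =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    // Returns the JSON inside the callback wrapper, or the input unchanged
    // when it does not look like JSONP.
    static String removePadding(String text) {
        Matcher matcher = PADDING.matcher(text);
        return matcher.matches() ? matcher.group(1).trim() : text;
    }

    public static void main(String[] args) {
        System.out.println(removePadding("callback({\"name\":\"webmagic\"});"));
    }
}
```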

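Annotation mode:

Annotation mode now supports inheritance! Annotations on a parent class also apply to its subclasses. #103
Fixed models possibly not taking effect in annotation mode when one Spider uses multiple Models. Thanks to @ccliangbo for finding the problem. #85
Fixed only one URL being extracted from sourceRegion. Thanks to @jsinak for finding the problem. #107
Fixed a bug in the automatic type conversion Formatter; custom Formatters are now supported. If you are not familiar with Formatter, see: Type conversion of results in annotation mode #100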

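Other components:

Downloader now supports HTTP methods other than GET, including POST, HEAD, PUT, DELETE, and TRACE. Thanks to @usenrong for the suggestion. #108
When setting a Cookie in Site, a domain can now be specified instead of only the default domain. #109
When setScheduler() is called, any URLs already in the previous Scheduler are first transferred to the new Scheduler to avoid losing URLs. #104
Removed log4j.xml from the release package to avoid conflicts with user programs. Thanks to @cnjavaer for finding the problem. #82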
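
The setScheduler() behavior from #104 amounts to draining the old queue into the new one before switching. A minimal sketch, assuming a plain java.util.Queue stands in for the Scheduler (the class and method names here are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration; a Queue<String> stands in for WebMagic's Scheduler.
class SchedulerSwapSketch {
    private Queue<String> scheduler = new ArrayDeque<>();

    void addUrl(String url) {
        scheduler.add(url);
    }

    // Moves every pending URL into the new queue, then switches to it,
    // so URLs added before the swap are not lost.
    void setScheduler(Queue<String> newScheduler) {
        String url;
        while ((url = scheduler.poll()) != null) {
            newScheduler.add(url);
        }
        scheduler = newScheduler;
    }

    int pending() {
        return scheduler.size();
    }

    public static void main(String[] args) {
        SchedulerSwapSketch spider = new SchedulerSwapSketch();
        spider.addUrl("http://example.com/a");
        spider.setScheduler(new ArrayDeque<>());
        System.out.println(spider.pending());
    }
}
```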

webmaigc-0.4.3

10 years ago

Bugfix:

Fix cycleRetryTimes not working #58 #60 #62 @yxssfxwzy
Fix NullPointerException in FileCachedQueueScheduler #53 @xuchaoo
Fix Selenium not quitting #57 @d0ngw

Enhancement:

Enhance RegexSelector group check #51 @SimpleExpress
Add XPath syntax support: contains, or/and, "|" #64
Add text attribute selection to CssSelector #66
Change logger to slf4j #55
Update HttpClient version to 4.3.3 #59

webmagic-0.4.2

10 years ago

Enhancement:

#45 Remove the multi option in ExtractBy. Whether a field is multi-valued is now detected automatically from the field type.

Bugfix:

#46 Fix the Downloader thread sometimes hanging.
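
The automatic multi detection from #45 can be pictured as a reflection check on the declared field type. The sketch below is an assumption about the general approach, not the framework's code: the class MultiDetectionSketch, the sample model Repo, and the "List means multi" rule are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.List;

// Hypothetical illustration of type-based multi detection.
class MultiDetectionSketch {

    // Sample model: one single-valued field and one multi-valued field.
    static class Repo {
        String name;
        List<String> tags;
    }

    // Treats a field as multi-valued when its declared type is a List.
    static boolean isMulti(Class<?> model, String fieldName) {
        try {
            Field field = model.getDeclaredField(fieldName);
            return List.class.isAssignableFrom(field.getType());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No such field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isMulti(Repo.class, "name"));
        System.out.println(isMulti(Repo.class, "tags"));
    }
}
```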

webmagic-0.4.1

10 years ago

Fix some concurrency problems that kept the spider from exiting after all pages were downloaded. #36
#38 Use the algorithm from https://code.google.com/p/cx-extractor/.

More support for AJAX:

#39 Parsing html after page.getHtml()
#42 Add jsonpath support in annotation mode
#35 Add more http info to page
#41 Add more status monitor method to Spider

webmagic-0.4.0

10 years ago

Improve performance of Downloader.

Update HttpClient to 4.3.1 and rewrite the code of HttpClientDownloader #32.
Use gzip by default to reduce the transport cost #31.
Enable HTTP Keep-Alive and connection persistence, fix the wrong usage of PoolConnectionManager #30.

The performance of Downloader is improved by 90% in my test. Test code: Kr36NewsModel.java.

Add a synchronous API for small tasks #28.

        OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), BaiduBaike.class);
        BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
        System.out.println(baike);

More config for site

HTTP proxy support via Site.setHttpProxy #22.
More HTTP header customization via Site.addHeader #27.
Allow disabling gzip via Site.setUseGzip(false).
Move Site.addStartUrl to Spider.addUrl, because I think startUrl is more a property of Spider than of Site.

Code refactor in Spider

Refactor the multi-threading part of Spider and fix some concurrency problems.
Import the Google Guava API for simpler code.
Allow adding requests with more information via Spider.addRequest() instead of addUrl #29.
Allow downloading only the start URLs, without spawning extracted URLs, via Spider.setSpawnUrl(false).