JsoupXpath Versions

A pure-Java HTML parser that supports the W3C XPath 1.0 standard syntax, built on Jsoup and Antlr4. It may be the best of its kind in Java; give it a try.

v2.2.1

5 years ago

v2.2

5 years ago
  • Removed the dependency on the class scanner fast-classpath-scanner to improve stability
  • Fixed a semantic issue with the | operator
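
The note above does not say what was wrong with | before the fix. For reference, in XPath 1.0 the | operator computes the union of two node-sets, so a node matched by both operands appears only once. The following is a minimal sketch of that standard behaviour, using the JDK's built-in javax.xml.xpath engine rather than JsoupXpath; the class name UnionDemo and the XML snippet are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK XPath 1.0 engine, not JsoupXpath.
final class UnionDemo {
    // Illustrative, well-formed sample document (an assumption, not from the release notes).
    static final String XML =
        "<ul><li class='a'>one</li><li class='b'>two</li><li class='a'>three</li></ul>";

    // Evaluates expr against XML and returns the text content of each selected node.
    static List<String> select(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Both operands match the class='a' items; the union still holds each node once.
        System.out.println(select("//li | //li[@class='a']"));
        // Union of two disjoint node-sets.
        System.out.println(select("//li[@class='b'] | //li[@class='a']"));
    }
}
```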

2.1-Beta

6 years ago
  • Optimizations

2.0-Beta

6 years ago

Official public beta release

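  • JsoupXpath's syntax parsing has been rebuilt on Antlr4, so it now supports the complete W3C XPath 1.0 standard syntax and offers more powerful parsing and processing. W3C specification: http://www.w3.org/TR/1999/REC-xpath-19991116 ; JsoupXpath grammar file: Xpath.g4
  • The architecture has been reworked so that developers can more easily contribute standard functions that JsoupXpath does not yet implement, and adding custom functions in your own project is just as easy.
  • Added a tool package, jsoupxpath-tool-1.0, for trying out XPath syntax directly. The tool is built with spring-boot and spring-shell and requires JDK 8 or later; JsoupXpath itself requires JDK 7 or later. A usage example follows; on Windows, enable UTF-8 encoding in the console. The tool is only meant for quick tests when setting up your own project is inconvenient; calling JsoupXpath directly from your own code is still the best way to get a feel for it.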
bash-4.1$ ./jsoupxpath-tool-1.0.jar 

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.0.1.RELEASE)

2018-04-12 00:02:20.112  INFO 14642 --- [           main] c.w.boot.JsoupXpathApplication           : Starting JsoupXpathApplication v1.0 on localhost with PID 14642 (/opt/vhost/dev/spring-boot-xpath/target/jsoupxpath-tool-1.0.jar started by resin in /opt/vhost/dev/spring-boot-xpath/target)
2018-04-12 00:02:20.120  INFO 14642 --- [           main] c.w.boot.JsoupXpathApplication           : No active profile set, falling back to default profiles: default
2018-04-12 00:02:20.176  INFO 14642 --- [           main] s.c.a.AnnotationConfigApplicationContext : Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@5679c6c6: startup date [Thu Apr 12 00:02:20 CST 2018]; root of context hierarchy
2018-04-12 00:02:21.516  INFO 14642 --- [           main] o.s.j.e.a.AnnotationMBeanExporter        : Registering beans for JMX exposure on startup
2018-04-12 00:02:21.530  INFO 14642 --- [           main] c.w.boot.JsoupXpathApplication           : Started JsoupXpathApplication in 1.85 seconds (JVM running for 2.435)
shell:>help
AVAILABLE COMMANDS

Built-In Commands
        clear: Clear the shell screen.
        exit, quit: Exit the shell.
        help: Display help about available commands.
        script: Read and execute commands from a file.
        stacktrace: Display the full stacktrace of the last error.

Xpath Extra
        get: init JXDocument by url
        xpath: extract by xpath


shell:>get https://book.douban.com/tag/%E4%BA%92%E8%81%94%E7%BD%91
Document init done.
shell:>xpath //ul[@class=\'subject-list\']/li[self::li/div/div/span[@class=\'pl\']/num()>10000][-1]/div/h2/allText()    
2018-04-12 00:03:45.597  INFO 14642 --- [           main] cn.wanghaomiao.boot.cmd.XpathExtra       : xpath = //ul[@class='subject-list']/li[self::li/div/div/span[@class='pl']/num()>10000][-1]/div/h2/allText()
长尾理论

shell:>xpath //*[@id=\"subject_list\"]/ul[1]/li[8]/div[2]/div[2]/span[3]/num()  
2018-04-12 00:04:23.420  INFO 14642 --- [           main] cn.wanghaomiao.boot.cmd.XpathExtra       : xpath = //*[@id="subject_list"]/ul[1]/li[8]/div[2]/div[2]/span[3]/num()
4333.0

shell:>

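Below are sample Antlr4 parse trees for JsoupXpath expressions, giving a quick overview of JsoupXpath's syntax support and of how an expression is parsed and executed.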

  • //ul[@class='subject-list']/li[./div/div/span[@class='pl']/num()>(1000+90*(2*50))][last()][1]/div/h2/allText() This example mainly shows how nested expressions are parsed (parse tree image: muti_expr)

  • //ul[@class='subject-list']/li[not(contains(self::li/div/div/span[@class='pl']//text(),'14582'))]/div/h2//text() This example shows how built-in functions are parsed (parse tree image: functions)
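
To make the semantics of the two expressions above concrete, here is a minimal sketch that evaluates simplified variants of them with the JDK's built-in XPath 1.0 engine (javax.xml.xpath) rather than JsoupXpath. The class name NestedExprDemo and the XML snippet are assumptions made up for illustration. num() and allText() are JsoupXpath functions that the JDK engine does not provide, so the sketch compares the span's string value directly and reads the h2 text content instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK XPath 1.0 engine, not JsoupXpath.
final class NestedExprDemo {
    // Illustrative, well-formed sample data (an assumption, not the page used in the transcript).
    static final String XML =
        "<ul class='subject-list'>"
      + "<li><h2>A</h2><span class='pl'>9000</span></li>"
      + "<li><h2>B</h2><span class='pl'>12000</span></li>"
      + "<li><h2>C</h2><span class='pl'>14582</span></li>"
      + "</ul>";

    // Nested arithmetic in a predicate: 1000+90*(2*50) evaluates to 10000,
    // then [last()] keeps the last matching li and [1] keeps the first of that result.
    static final String NESTED =
        "//ul[@class='subject-list']/li[span[@class='pl'] > (1000+90*(2*50))][last()][1]/h2";

    // Built-in functions: keep every li whose span text does not contain '14582'.
    static final String FUNCTIONS =
        "//ul[@class='subject-list']/li[not(contains(span[@class='pl'], '14582'))]/h2";

    // Evaluates expr against XML and returns the text content of each selected node.
    static List<String> select(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(NESTED));
        System.out.println(select(FUNCTIONS));
    }
}
```

With this sample data, the first expression selects only the heading of the last item whose count exceeds 10000, and the second selects the headings of the two items whose count does not contain 14582.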

2.0.2-alpha

6 years ago
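  • Fixed known bugs
  • Removed the dependency on guava
  • Adapted to fat-jar scenarios, such as Spring Boot projects
  • Added a tool package, jsoupxpath-tool-1.0, for trying out XPath syntax directly. The tool is built with spring-boot and spring-shell and requires JDK 8 or later; JsoupXpath itself requires JDK 7 or later. A usage example follows; on Windows, enable UTF-8 encoding in the console. The tool is only meant for quick tests when setting up your own project is inconvenient; calling JsoupXpath directly from your own code is still the best way to get a feel for it.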
<dependency>
   <groupId>cn.wanghaomiao</groupId>
   <artifactId>JsoupXpath</artifactId>
   <version>2.0.2-alpha</version>
</dependency>
bash-4.1$ ./jsoupxpath-tool-1.0.jar 

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.0.1.RELEASE)

2018-04-12 00:02:20.112  INFO 14642 --- [           main] c.w.boot.JsoupXpathApplication           : Starting JsoupXpathApplication v1.0 on localhost with PID 14642 (/opt/vhost/dev/spring-boot-xpath/target/jsoupxpath-tool-1.0.jar started by resin in /opt/vhost/dev/spring-boot-xpath/target)
2018-04-12 00:02:20.120  INFO 14642 --- [           main] c.w.boot.JsoupXpathApplication           : No active profile set, falling back to default profiles: default
2018-04-12 00:02:20.176  INFO 14642 --- [           main] s.c.a.AnnotationConfigApplicationContext : Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@5679c6c6: startup date [Thu Apr 12 00:02:20 CST 2018]; root of context hierarchy
2018-04-12 00:02:21.516  INFO 14642 --- [           main] o.s.j.e.a.AnnotationMBeanExporter        : Registering beans for JMX exposure on startup
2018-04-12 00:02:21.530  INFO 14642 --- [           main] c.w.boot.JsoupXpathApplication           : Started JsoupXpathApplication in 1.85 seconds (JVM running for 2.435)
shell:>help
AVAILABLE COMMANDS

Built-In Commands
        clear: Clear the shell screen.
        exit, quit: Exit the shell.
        help: Display help about available commands.
        script: Read and execute commands from a file.
        stacktrace: Display the full stacktrace of the last error.

Xpath Extra
        get: init JXDocument by url
        xpath: extract by xpath


shell:>get https://book.douban.com/tag/%E4%BA%92%E8%81%94%E7%BD%91
Document init done.
shell:>xpath //ul[@class=\'subject-list\']/li[self::li/div/div/span[@class=\'pl\']/num()>10000][-1]/div/h2/allText()    
2018-04-12 00:03:45.597  INFO 14642 --- [           main] cn.wanghaomiao.boot.cmd.XpathExtra       : xpath = //ul[@class='subject-list']/li[self::li/div/div/span[@class='pl']/num()>10000][-1]/div/h2/allText()
长尾理论

shell:>xpath //*[@id=\"subject_list\"]/ul[1]/li[8]/div[2]/div[2]/span[3]/num()  
2018-04-12 00:04:23.420  INFO 14642 --- [           main] cn.wanghaomiao.boot.cmd.XpathExtra       : xpath = //*[@id="subject_list"]/ul[1]/li[8]/div[2]/div[2]/span[3]/num()
4333.0

shell:>

2.0-alpha

6 years ago
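  • JsoupXpath's syntax parsing has been rebuilt on Antlr4, so it now supports the complete W3C XPath 1.0 standard syntax and offers more powerful parsing and processing. W3C specification: http://www.w3.org/TR/1999/REC-xpath-19991116 ; JsoupXpath grammar file: Xpath.g4
  • The architecture has been reworked so that developers can more easily contribute standard functions that JsoupXpath does not yet implement, and adding custom functions in your own project is just as easy.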

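Below are sample Antlr4 parse trees for JsoupXpath expressions, giving a quick overview of JsoupXpath's syntax support and of how an expression is parsed and executed.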

  • //ul[@class='subject-list']/li[./div/div/span[@class='pl']/num()>(1000+90*(2*50))][last()][1]/div/h2/allText() This example mainly shows how nested expressions are parsed (parse tree image: muti_expr)

  • //ul[@class='subject-list']/li[not(contains(self::li/div/div/span[@class='pl']//text(),'14582'))]/div/h2//text() This example shows how built-in functions are parsed (parse tree image: functions)

v0.3.2

7 years ago
  • Improved predicate extraction