JsoupXpath Versions

A pure-Java HTML parser that supports the W3C XPath 1.0 standard syntax, built on Jsoup and Antlr4. It may be the best of its kind in Java, so give it a try.

v2.5.3

1 year ago

Improved the behavior of the following-sibling, following, preceding-sibling, and preceding axes to better fit text-extraction scenarios, for example:

    @Test
    public void issue64And65(){
        String content = "<div class='a'>1</div>" +
                "<div>2</div>\n" +
                "<div class='a'>3</div>\n" +
                "<div>4</div>\n" +
                "<div>5</div>11" +
                "<tag>6</tag>12" +
                "<div>7<span>8</span></div>" +
                "";
        JXDocument j = JXDocument.create(content);
        Assert.assertEquals("7", j.selNOne("//div[text()='5']/following-sibling::div/text()").asString());
        Assert.assertEquals("6", j.selNOne("//div[text()='5']/following-sibling::tag/text()").asString());
        Assert.assertEquals("11", j.selNOne("//div[text()='5']/following-sibling::text()").asString());
        Assert.assertEquals("12", j.selNOne("//div[text()='7']/preceding-sibling::text()").asString());
        Assert.assertEquals("5", j.selNOne("//div[text()='7']/preceding-sibling::div/text()").asString());
        Assert.assertEquals("6", j.selNOne("//div[text()='7']/preceding-sibling::tag/text()").asString());
        Assert.assertEquals("6", j.selNOne("//div[text()='7']/preceding-sibling::tag/text()").asString());
        Assert.assertEquals("11 6 12 7 8", j.selN("//div[text()='5']/following::text()").stream().map(Objects::toString).collect(Collectors.joining(" ")).trim());
        Assert.assertEquals("6", j.selN("//div[text()='5']/following::tag/text()").stream().map(Objects::toString).collect(Collectors.joining(" ")).trim());
        Assert.assertEquals("8", j.selN("//div[text()='5']/following::span/text()").stream().map(Objects::toString).collect(Collectors.joining(" ")).trim());
        Assert.assertEquals("5 7", j.selN("//div[text()='4']/following::div/text()").stream().map(Objects::toString).collect(Collectors.joining(" ")).trim());
        Assert.assertEquals("2 1", j.selN("//div[text()='3']/preceding::text()").stream().map(Objects::toString).collect(Collectors.joining(" ")).trim());
        Assert.assertEquals("3  2 1", j.selN("//div[text()='4']/preceding::text()").stream().map(Objects::toString).collect(Collectors.joining(" ")).trim());
    }

And a test that extracts fields from a Douban detail page:

    @Test
    public void testDoubanDetailInfoExtra() throws Exception{
        JXDocument doc = createFromResource("d_detail_page.html");
        JXNode score = doc.selNOne("//*[@id=\"interest_sectl\"]/div/div[2]/strong/text()");
        logger.info("{}", score.asString());
        JXNode title = doc.selNOne("//*[@id=\"wrapper\"]/h1/span/text()");
        logger.info("{}", title.asString());
        JXNode pageNum = doc.selNOne("//*[@id=\"info\"]/span[contains(text(),'页数')]/following-sibling::text()");
        logger.info("{}", pageNum.asString());
        Assert.assertEquals("956", pageNum.asString());
        JXNode price = doc.selNOne("//*[@id=\"info\"]/span[contains(text(),'定价')]/following-sibling::text()");
        logger.info("{}", price.asString());
        Assert.assertEquals("139.00元", price.asString());
    }
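The Douban test above depends on a resource file that is not shown here. The same "label span followed by a bare text node" pattern can be sketched in a self-contained form. This is only an illustration under stated assumptions: it uses the JDK's built-in javax.xml.xpath on a small well-formed XML snippet as a stand-in for JsoupXpath, and the class name LabelValueDemo, the method valueAfterLabel, and the sample markup are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK XPath engine, not JsoupXpath.
class LabelValueDemo {

    // Returns the trimmed text node that directly follows the <span> whose text contains the label.
    // The label is spliced into the expression unescaped, which is acceptable only in a sketch.
    static String valueAfterLabel(String xml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String xpath = "//*[@id='info']/span[contains(text(),'" + label + "')]"
                    + "/following-sibling::text()[1]";
            // With a string result, the JDK engine returns the string value of the first matched node.
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up markup that mimics a label/value info block.
        String xml = "<div id=\"info\"><span>Pages:</span> 956<br/>"
                + "<span>Price:</span> 139.00</div>";
        System.out.println(valueAfterLabel(xml, "Pages")); // 956
        System.out.println(valueAfterLabel(xml, "Price")); // 139.00
    }
}
```

The explicit [1] picks the nearest following text sibling. In the test above, selNOne plays a similar role by returning a single node.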

v2.5.2

1 year ago
  • Optimized last()
  • Upgraded the jsoup dependency to avoid security vulnerabilities

v2.5.1

2 years ago
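  • Fixed an issue where PrecedingSiblingOneSelector had no effect. Thanks to @s24963386 for the contribution!
  • Fixed https://github.com/zhegexiaohuozi/JsoupXpath/issues/66 , where the context used by function-argument expressions was incomplete
  • Improved the attribute information of text() block nodes to better support reverse-order indexing
  • Added the double/long sum(node-set) function, which computes the sum of the numeric node values in the given node set. If the set contains non-numeric content, the calculation is invalid.
  • Improved num() results to better match user intuition: integers are returned as integers and floating-point numbers as floating-point numbers, instead of always returning a floating-point number.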
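The sum(node-set) and integer-versus-float behavior described above can be illustrated with a rough sketch. It makes several assumptions: it evaluates sum() with the JDK's built-in javax.xml.xpath engine on well-formed XML rather than with JsoupXpath, and the names SumDemo, sum, and narrow are hypothetical. Standard XPath 1.0 yields NaN when a node is not numeric. How JsoupXpath itself reports an invalid sum is not stated in these notes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK XPath engine, not JsoupXpath.
class SumDemo {

    // Evaluates a numeric XPath expression such as "sum(//li)" against an XML string.
    static double sum(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns a Long for whole numbers of moderate size and a Double otherwise,
    // mirroring the "integers stay integers" idea from the release note.
    static Number narrow(double value) {
        boolean whole = !Double.isNaN(value) && !Double.isInfinite(value)
                && value == Math.rint(value) && Math.abs(value) < 1e15;
        if (whole) {
            return Long.valueOf((long) value);
        }
        return Double.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(narrow(sum("<ul><li>1</li><li>2</li></ul>", "sum(//li)")));   // 3
        System.out.println(narrow(sum("<ul><li>1</li><li>2.5</li></ul>", "sum(//li)"))); // 3.5
        System.out.println(narrow(sum("<ul><li>1</li><li>x</li></ul>", "sum(//li)")));   // NaN
    }
}
```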

v2.5.0

2 years ago

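Upgraded some dependencies to their latest versions. There are no functional changes or adjustments, so you can decide whether to upgrade to this version based on your own usage.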

  • Jsoup upgraded from 1.10.3 to 1.14.1
  • commons-lang3 upgraded from 3.3.2 to 3.12.0
  • slf4j-api upgraded from 1.7.25 to 1.7.32

v2.4.3

3 years ago

Fixed an issue with text() in text comparisons.

fix https://github.com/zhegexiaohuozi/JsoupXpath/issues/53

v2.4.2

3 years ago

test: https://github.com/zhegexiaohuozi/JsoupXpath/blob/94fc9c79095c1909c552e1e7e6ef545d3271bdf4/src/test/java/org/seimicrawler/xpath/JXDocumentTest.java#L246

    @Test
    public void fixTextElNoParentTest(){
        String test="<div class='a'> a <div>need</div> <div class='e'> not need</div> c </div>";
        JXDocument j = JXDocument.create(test);
        List<JXNode> l = j.selN("//div[@class='a']//text()[not(ancestor::div[@class='e'])]");
        Set<String> finalRes = new HashSet<>();
        for (JXNode i : l){
            logger.info("{}",i.toString());
            finalRes.add(i.asString());
        }
        Assert.assertFalse(finalRes.contains("not need"));
        Assert.assertTrue(finalRes.contains("need"));
        Assert.assertEquals(4, finalRes.size());
    }

v2.4.1

3 years ago

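The text() implementation in earlier versions was somewhat simplified, and in some special cases it could not extract a specific text block precisely by index. This release refactors text() to support all of the standard behavior within the supported syntax.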

    @Test
    public void FixTextBehaviorTest(){
        String html = "<p><span class=\"text-muted\">分类:</span>动漫<span class=\"split-line\"></span><span class=\"text-muted hidden-xs\">地区:</span>日本<span class=\"split-line\"></span><span class=\"text-muted hidden-xs\">年份:</span>2010</p>";
        JXDocument jxDocument = JXDocument.create(html);
        List<JXNode> jxNodes = jxDocument.selN("//text()[3]");
        String actual = StringUtils.join(jxNodes,"");
        logger.info("actual = {}",actual);
        Assert.assertEquals("2010", actual);
    }

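Impact on existing code

text() no longer simply returns all of the text under a node. It now identifies the individual text blocks according to the standard semantics and returns them as a list, for example: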

<p> one <span> two</span> three </p>
  • //text() returns ["one", "two", "three"]
  • //text()[2] returns ["three"]
  • Leading and trailing whitespace is trimmed from each text block automatically

allText() behaves as it did before, so use it where appropriate.
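It can be surprising that //text()[2] selects "three" rather than "two". The index applies to the text children of each parent separately, and only <p> has a second text child. The sketch below reproduces this with the JDK's built-in javax.xml.xpath engine as a stand-in for JsoupXpath. The class TextBlockDemo and the method select are hypothetical, and the explicit trim() only imitates the whitespace trimming described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK XPath engine, not JsoupXpath.
class TextBlockDemo {

    // Selects text nodes with the given XPath and returns their trimmed values.
    static List<String> select(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> blocks = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                blocks.add(nodes.item(i).getNodeValue().trim());
            }
            return blocks;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p> one <span> two</span> three </p>";
        System.out.println(select(xml, "//text()"));    // [one, two, three]
        System.out.println(select(xml, "//text()[2]")); // [three]
    }
}
```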

v2.3.0

5 years ago
  • Fixed a bug where axis selector results were not guaranteed to be in order

  • Added the substring-before-last and substring-after-last functions. Thanks to @zzldnl for the contribution.
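These notes do not spell out the exact semantics of the two functions. Going by their names, they presumably split a string at the last occurrence of a separator. The plain-Java sketch below shows that assumed behavior only. The class and method names are hypothetical, and the handling of a missing separator (whole input for "before", empty string for "after") is an assumption, not confirmed library behavior.

```java
// Hypothetical helpers illustrating the assumed semantics; this is not JsoupXpath code.
class SubstringLastDemo {

    // Text before the last occurrence of sep; the whole input if sep does not occur (assumed).
    static String substringBeforeLast(String s, String sep) {
        int i = s.lastIndexOf(sep);
        return i < 0 ? s : s.substring(0, i);
    }

    // Text after the last occurrence of sep; empty if sep does not occur (assumed).
    static String substringAfterLast(String s, String sep) {
        int i = s.lastIndexOf(sep);
        return i < 0 ? "" : s.substring(i + sep.length());
    }

    public static void main(String[] args) {
        System.out.println(substringBeforeLast("a/b/c.html", "/")); // a/b
        System.out.println(substringAfterLast("a/b/c.html", "/"));  // c.html
    }
}
```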