Jsoup Versions

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

jsoup-1.17.2

4 months ago

Improvements

Attribute object accessors: Added Element.attribute(String) and Attributes.attribute(String) to more simply obtain an Attribute object. #2069
Attribute source tracking: If source tracking is on, and an Attribute's key is changed (via Attribute.setKey(String)), the source range is now still tracked in Attribute.sourceRange(). #2070
Wildcard attribute selector: Added support for the [*] selector, which matches elements that have any attribute. Also restored support for selecting by an empty attribute name prefix ([^]). #2079
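
As a minimal sketch of the two additions above, assuming jsoup 1.17.2 or later on the classpath: the class and method names here (AttributeAccessSketch, describe, countWithAnyAttribute) are hypothetical and only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class AttributeAccessSketch {
    // Returns "key=value" for the named attribute on the first <p>, or null if it is absent.
    static String describe(String html, String key) {
        Document doc = Jsoup.parse(html);
        Element p = doc.expectFirst("p");
        Attribute attr = p.attribute(key); // null when the element has no such attribute
        return attr == null ? null : attr.getKey() + "=" + attr.getValue();
    }

    // Counts <p> elements that carry at least one attribute, using the [*] selector.
    static int countWithAnyAttribute(String html) {
        Elements matches = Jsoup.parse(html).select("p[*]");
        return matches.size();
    }

    public static void main(String[] args) {
        String html = "<p id=\"intro\" class=\"lead\">One</p><p>Two</p>";
        System.out.println(describe(html, "id"));
        System.out.println(describe(html, "title"));
        System.out.println(countWithAnyAttribute(html));
    }
}
```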

Bug Fixes

Mixed-cased source position: When tracking the source position of attributes, if the source attribute name was mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. #2067
Source position NPE: When tracking the source position of a body fragment parse, a null pointer exception was thrown. #2068
Multi-point emoji entity: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. #2074
Selector sub-expressions: (Regression) in a selector like parent [attr=va], other, the , OR was binding to [attr=va] instead of parent [attr=va], causing incorrect selections. The fix includes an EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. #2073
XML CData output: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData section. #2078
Thread safety: The :has evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. #2088
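
The intended grouping for the selector sub-expression fix can be illustrated with a short sketch. It assumes jsoup 1.17.2 or later; the class name and sample HTML are made up for the example. The query should select the titled descendant of the div plus every span, rather than binding the OR to the attribute test alone.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorGroupingSketch {
    // Reads as: (descendants of <div> with title=a) OR (any <span>).
    static Elements select(String html) {
        return Jsoup.parse(html).select("div [title=a], span");
    }

    public static void main(String[] args) {
        String html = "<div><p title=\"a\">1</p><p>2</p></div><span>3</span>";
        for (Element el : select(html)) {
            System.out.println(el.tagName() + ": " + el.text());
        }
    }
}
```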

jsoup-1.17.1

5 months ago

jsoup 1.17.1 is out now with support for request-level authentication, attribute name & value source ranges, stream() iterable support, and a bunch of other improvements and bug fixes.

Many thanks to everyone who contributed to this release!

Improvements

Request-Level Authentication: Added support for request-level authentication in Jsoup.connect(), enabling authentication to proxies and servers.
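
A minimal sketch of request-level authentication, assuming jsoup 1.17.1 or later. The class name, credentials, and URL are placeholders, and no request is actually sent here; the authenticator is only invoked if a server or proxy challenges a request.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestAuthSketch {
    // Builds a session whose requests answer authentication challenges
    // with the supplied (placeholder) credentials.
    static Connection authenticatedSession(String user, String password) {
        return Jsoup.newSession()
            .auth(auth -> auth.credentials(user, password));
    }

    public static void main(String[] args) {
        Connection session = authenticatedSession("demo-user", "demo-pass");
        // Hypothetical URL: prepare (but do not execute) a request in this session.
        Connection request = session.newRequest().url("https://example.com/private");
        System.out.println(request.request().url());
    }
}
```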

Elements DOM Mutators: In the Elements list, added direct support for Elements#set(int, Element), Elements#remove(int), Elements#remove(Object), Elements#clear(), Elements#removeAll(), Elements#retainAll(), Elements#removeIf(), Elements#replaceAll(). These methods update the original DOM, as well as the Elements list.
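
To show what "update the original DOM" means in practice, here is a small sketch assuming jsoup 1.17.1 or later. The class name and the "ad" class are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ElementsMutatorSketch {
    // Removes every <p class="ad"> via the Elements list and returns how many
    // <p> elements remain in the document itself.
    static int removeAds(String html) {
        Document doc = Jsoup.parse(html);
        Elements paragraphs = doc.select("p");
        paragraphs.removeIf(p -> p.hasClass("ad")); // also detaches the matches from the DOM
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        System.out.println(removeAds("<p>a</p><p class=\"ad\">b</p><p>c</p>"));
    }
}
```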

Stream Interface: Introduced the NodeIterator class for efficient node tree traversal using the Iterator interface. Added Stream<Element> Element#stream() and Stream<Node> Node#nodeStream() methods for fluent composable stream pipelines of node traversals.
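
One possible stream pipeline, sketched under the assumption of jsoup 1.17.1 or later. The class and method names are hypothetical. Element#stream() walks the element and its descendants, so calling it on the document covers the whole tree.

```java
import java.util.List;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StreamSketch {
    // Collects the href value of every element that has one, in document order.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        return doc.stream()
            .filter(el -> el.hasAttr("href"))
            .map(el -> el.attr("href"))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(linkTargets("<a href=\"/a\">A</a><p>x</p><a href=\"/b\">B</a>"));
    }
}
```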

XML OutputSettings: Automatically sets the xhtml EscapeMode as default when changing the OutputSettings syntax to XML.
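
A short check of that behavior, assuming jsoup 1.17.1 or later; the class name is made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

class XmlSyntaxSketch {
    // Switches the output syntax to XML and reports the escape mode now in effect.
    static Entities.EscapeMode escapeModeAfterXml(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        return doc.outputSettings().escapeMode();
    }

    public static void main(String[] args) {
        System.out.println(escapeModeAfterXml("<p>caf&eacute;</p>"));
    }
}
```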

is() Selector: Added the :is(selector list) pseudo-selector to find elements that match any selectors in the selector list. This enhances readability for large ORed selectors.
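
For instance, a sketch assuming jsoup 1.17.1 or later (class name and sample HTML are illustrative): the :is() form below should select the same elements as the ORed query "h1, h2, h3".

```java
import org.jsoup.Jsoup;

class IsSelectorSketch {
    // Counts elements matching any selector in the :is() list.
    static int countHeadings(String html) {
        return Jsoup.parse(html).select(":is(h1, h2, h3)").size();
    }

    public static void main(String[] args) {
        System.out.println(countHeadings("<h1>A</h1><p>x</p><h2>B</h2><h4>C</h4>"));
    }
}
```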

JPMS Module Support: Repackaged the library with native JPMS module support.

Source Position Fidelity: Improved fidelity of source positions when tracking is enabled. Implicitly created or closed elements are now trackable via Range.isImplicit().

Attribute Source Positions: Enabled source position for attribute names and values when source tracking is on. Attribute#sourceRange() provides the ranges.
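
A sketch of reading attribute source ranges, assuming jsoup 1.17.1 or later. The class name and sample markup are hypothetical. Position tracking has to be enabled on the parser before parsing; otherwise the ranges are untracked.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Range;
import org.jsoup.parser.Parser;

class AttributeRangeSketch {
    // Parses with position tracking on and returns the source range
    // of the first attribute of the first <a> element.
    static Range.AttributeRange firstAttributeRange(String html) {
        Parser parser = Parser.htmlParser().setTrackPosition(true);
        Document doc = Jsoup.parse(html, "", parser);
        Element link = doc.expectFirst("a");
        Attribute first = link.attributes().iterator().next();
        return first.sourceRange();
    }

    public static void main(String[] args) {
        Range.AttributeRange range = firstAttributeRange("<a href=\"/docs\">Docs</a>");
        System.out.println("name:  " + range.nameRange());
        System.out.println("value: " + range.valueRange());
    }
}
```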

Virtual Threads: Enhanced performance under Java 21+ Virtual Threads by replacing the internal ConstrainableInputStream with ControllableInputStream.

XML Mimetype Support: Extended XML mimetype support in Jsoup.connect() to include any XML mimetype.

Bug Fixes

XML Data Nodes: Fixed a bug where HTML elements parsed as data nodes were not correctly emitted as CDATA nodes when outputting with XML syntax.

Immediate Parent Selector: Corrected a bug where the Immediate Parent selector > could match elements above the root context element.

Sub-Query Parsing: Resolved a bug where combinators following the , Or combinator in a sub-query were incorrectly skipped.

Empty Doctype: Fixed a bug in W3CDom where the conversion would fail if the jsoup input document contained an empty doctype. The doctype is now discarded, and the conversion continues.

SVG Elements Cleaning: Fixed incorrect nesting when cleaning a document containing SVG elements or other foreign elements with preserved-case names.

Unknown Self-Closing Tags: Preserved the output style of unknown self-closing tags from the input when cleaning a document.

Build Improvements

Local Test Proxy: Added a local test proxy implementation for proxy integration tests.

HTTPS Request Tests: Added tests for HTTPS request support using a local self-signed certificate. Includes proxy tests.

Changes

Response BodyStream: The InputStream returned in Connection.Response.bodyStream() is now a plain BufferedInputStream.

jsoup-1.16.2

6 months ago

Improvements

Optimized the performance of complex CSS selectors, by adding a cost-based query planner. Evaluators are sorted by their relative execution cost, and executed in order of lower to higher cost. This speeds the matching process by ensuring that simpler evaluations (such as a tag name match) are conducted prior to more complex evaluations (such as an attribute regex, or a deep child scan with a :has).

Added support for <svg> and <math> tags (and their children). This includes tag namespaces and case preservation on applicable tags and attributes. #2008

When converting jsoup Documents to W3C Documents in W3CDom, HTML documents will be placed in the http://www.w3.org/1999/xhtml namespace by default, per the HTML5 spec. This can be controlled by calling W3CDom#namespaceAware(false). #1848
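
A sketch of inspecting the resulting namespace, assuming jsoup 1.16.2 or later. The class and method names are made up for the example; the two Document types are written fully qualified because they share a simple name.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

class W3cNamespaceSketch {
    // Converts parsed HTML to a W3C DOM and returns the namespace URI of the root element.
    static String rootNamespace(String html, boolean namespaceAware) {
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        org.w3c.dom.Document w3cDoc = new W3CDom()
            .namespaceAware(namespaceAware)
            .fromJsoup(jsoupDoc);
        return w3cDoc.getDocumentElement().getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println(rootNamespace("<p>Hello</p>", true));
        System.out.println(rootNamespace("<p>Hello</p>", false));
    }
}
```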

Speed optimized the Structural Evaluators by memoizing previous evaluations. Particularly the ~ (any preceding sibling) and :nth-of-type selectors are improved. #1956

Tweaked the performance of the Element nextElementSibling, previousElementSibling, firstElementSibling, lastElementSibling, firstElementChild, and lastElementChild accessors. They now filter/skip in place in the child-node list, vs having to allocate and scan a complete Element filtered list.

Optimized internal methods that previously called Element.children() to use filter/skip child-node list accessors instead, reducing new Element List allocations.

Tweaked the performance of parsing :pseudo selectors.

When using the :empty pseudo-selector, blank textnodes are now considered empty. Previously, an element containing any whitespace was not considered empty. #1976
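
A quick way to observe the changed :empty behavior, sketched for jsoup 1.16.2 or later (class name and markup are illustrative). Under the new rule, a div holding only whitespace counts as empty alongside a div with no children at all.

```java
import org.jsoup.Jsoup;

class EmptySelectorSketch {
    // Counts <div> elements that :empty considers empty.
    static int countEmptyDivs(String html) {
        return Jsoup.parse(html).select("div:empty").size();
    }

    public static void main(String[] args) {
        System.out.println(countEmptyDivs("<div> </div><div>text</div><div></div>"));
    }
}
```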

In forms, <input type="image"> is now excluded from Element.formData() (and hence from form submissions). #2010

In Safelist, made isSafeTag() and isSafeAttribute() public methods, for extensibility. #1780
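
One way the now-public methods might be used is to subclass Safelist. The sketch below assumes jsoup 1.16.2 or later; DataAttributeSafelist is a hypothetical class, and allowing every data-* attribute is an illustrative policy choice, not a recommendation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

class DataAttributeSafelist extends Safelist {
    DataAttributeSafelist() {
        super(Safelist.basic()); // start from a copy of the built-in basic rules
    }

    // Additionally accept any data-* attribute; defer everything else to the base rules.
    @Override
    public boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
        return attr.getKey().startsWith("data-") || super.isSafeAttribute(tagName, el, attr);
    }

    static String clean(String html) {
        return Jsoup.clean(html, new DataAttributeSafelist());
    }

    public static void main(String[] args) {
        System.out.println(clean("<p data-id=\"7\" onclick=\"x()\">Hi</p>"));
    }
}
```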

Bug Fixes

Form elements and empty elements (such as img) did not have their attributes de-duplicated. #1950

If Document.OutputSettings was cloned from a clone, an NPE would be thrown when used. #1964

In Jsoup.connect(String url), URL paths containing a %2B were incorrectly recoded to a '+', or a '+' was recoded to a ' '. Fixed by reverting to the previous behavior of not encoding supplied paths, other than normalizing to ASCII. #1952

In Jsoup.connect(String url), strings containing supplemental characters (e.g. emoji) were not URL escaped correctly.

In Jsoup.connect(String url), the ConstrainableInputStream would clear Thread interrupts when reading the body. This precluded callers from spawning a thread, running a number of requests for a length of time, then joining that thread after interrupting it. #1991

When tracking HTML source positions, the closing tags for H1...H6 elements were not tracked correctly. #1987

In Jsoup.connect(), a DELETE method request did not support a request body. #1972

When calling Element.cssSelector() on an extremely deeply nested element, a StackOverflowError could occur. Further, a StackOverflowError may occur when running the query. #2001

Appending a node back to its original Element after empty() would throw an Index out of bounds exception. Also, now the child nodes that were removed have their parent node cleared, fully detaching them from the original parent. #2013

In Connection when adding headers, the value may have been assumed to be an incorrectly decoded ISO_8859_1 string, and re-encoded as UTF-8. The value is now left as-is.

Changes

Removed previously deprecated methods Document.normalise(), Element.forEach(org.jsoup.helper.Consumer<>), Node.forEach(org.jsoup.helper.Consumer<>), and the org.jsoup.helper.Consumer interface; the latter being a previously required compatibility shim prior to Android's de-sugaring support.

The previous compatibility shim org.jsoup.UncheckedIOException is deprecated in favor of the now supported java.io.UncheckedIOException. If you are catching the former, modify your code to catch the latter instead. #1989

Blocked noscript tags from being added to Safelists, due to incompatibilities between parsers with and without script-mode enabled.

jsoup-1.16.1

1 year ago

jsoup Java HTML Parser release 1.16.1

Improvements

In Jsoup.connect(String url), natively support URLs with Unicode characters in the path or query string, without having to be escaped by the caller. #1914

Calling Node.remove() on a node with no parent is now a no-op, vs a validation error. #1898

Bug Fixes

Aligned the HTML Tree Builder processing steps for AfterBody and AfterAfterBody to the updated WHATWG standard, to not pop the stack to close <body> or <html> elements. This prevents an errant </html> closing the preceding structure. Also added appropriate error message outputs in this case. #1851

Corrected support for ruby elements (<ruby>, <rp>, <rt>, and <rtc>) to current spec. #1294

When using Node.before(Node) or Node.after(Node), if the incoming node was a sibling of the context node, the incoming node may be inserted into the wrong relative location. #1898

In Jsoup.connect(String url), if the input URL had components that were already % escaped, they would be escaped again, causing errors when fetched. #1902

When tracking input source positions, text in tables that was fostered had invalid positions. #1927

If the Document.OutputSettings class was initialized, and then Entities.escape(String) called, an NPE may be thrown due to a class loading circular dependency. #1910

When pretty-printing, the first inline Element or Comment in a block would not be wrap-indented if it were preceded by a blank text node. #1906

When pretty-printing a <pre> containing block tags, those tags were incorrectly indented. #1891

When pretty-printing nested inlineable blocks (such as a <p> in a <td>), the inner element should be indented. #1926

<br> tags should be wrap-indented when in block tags (and not when in inline tags). #1911

The contents of a sufficiently large <textarea> with un-escaped HTML closing tags may be incorrectly parsed to an empty node. #1929

jsoup-1.15.4

1 year ago

jsoup Java HTML Parser release 1.15.4

jsoup 1.15.4 is out now, and includes a bunch of improvements, particularly when pretty-printing HTML, and bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Download jsoup now.

Improvements

Added the ability to escape CSS selectors (tags, IDs, classes) to match elements that don't follow regular CSS syntax. For example, to match by classname <p class="one.two">, use document.select("p.one\\.two"); #838
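
Expanding on the example above, a small sketch assuming jsoup 1.15.4 or later (class name and sample HTML are illustrative). Note the doubled backslash: Java string escaping turns "\\." into the CSS escape \. that the selector parser sees.

```java
import org.jsoup.Jsoup;

class EscapedSelectorSketch {
    // Returns the combined text of the elements matched by the CSS query.
    static String matchedText(String html, String css) {
        return Jsoup.parse(html).select(css).text();
    }

    public static void main(String[] args) {
        String html = "<p class=\"one.two\">A</p><p class=\"one two\">B</p>";
        // Escaped dot: a single class literally named "one.two".
        System.out.println(matchedText(html, "p.one\\.two"));
        // Unescaped dot: two separate classes, "one" and "two".
        System.out.println(matchedText(html, "p.one.two"));
    }
}
```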

When pretty-printing, wrap text that follows a <br> tag. #1858

When pretty-printing, normalize newlines that follow self-closing tags in custom tags. #1852

When pretty-printing, collapse non-significant whitespace between a block and an inline tag. #1802

In Element.forEach() and Node.forEachNode(), use java.util.function.Consumer instead of the previous Android compatibility shim org.jsoup.helper.Consumer. Subsequently, the latter has been deprecated. #1870

Added a new method Document.forms(), to conveniently retrieve a List<FormElement> containing the <form> elements in a document.

Added a new method Document.expectForm(), to find the first matching FormElement, or blow up trying.
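
The two form helpers might be used as follows. This is a sketch assuming jsoup 1.15.4 or later; the class name, form id, and field names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

class FormsSketch {
    // Lists the submittable field names of the form matched by the selector.
    // expectForm() throws if no matching form exists, so no null check is needed.
    static List<String> fieldNames(String html, String formSelector) {
        Document doc = Jsoup.parse(html);
        FormElement form = doc.expectForm(formSelector);
        List<String> names = new ArrayList<>();
        for (Connection.KeyVal field : form.formData()) {
            names.add(field.key());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<form id=\"login\"><input name=\"user\" value=\"ann\">"
            + "<input type=\"password\" name=\"pass\" value=\"secret\"></form>";
        System.out.println(Jsoup.parse(html).forms().size());
        System.out.println(fieldNames(html, "#login"));
    }
}
```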

Bug Fixes

URLs containing certain characters were not escaped correctly, and would throw a MalformedURLException when fetched. #1873

Element.cssSelector() would create invalid selectors for elements where the tag name, ID, or classnames needed to be escaped (e.g. if a class name contained a : or .). #1742

Element.text() should have a space between a block and an inline element. #1877

If a Node or an Element was replaced with itself, that node would incorrectly be orphaned. #1843

Form data on a previous request was copied to a new request in newRequest(), resulting in an accumulation of form data when executing multi-step form submissions, or data sent to later requests incorrectly. Now, newRequest() only copies session related settings (cookies, proxy settings, user-agent, etc) but not the request data nor the body. #1778
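
The corrected newRequest() behavior can be checked without touching the network. This sketch assumes jsoup 1.15.4 or later; the class name, URLs, user agent, and form field are placeholders, and no request is executed.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class SessionRequestSketch {
    // Prepares two requests in sequence and returns how many form fields
    // the second request inherited from the first.
    static int carriedOverFields() {
        Connection session = Jsoup.newSession().userAgent("example-agent/1.0");
        Connection first = session.newRequest()
            .url("https://example.com/step1")
            .data("token", "abc");
        Connection second = first.newRequest().url("https://example.com/step2");
        return second.request().data().size();
    }

    public static void main(String[] args) {
        System.out.println(carriedOverFields());
    }
}
```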

Fixed an issue in Safelist.removeAttributes() which could throw a ConcurrentModificationException when using the :all pseudo-attribute.

Given extremely deeply nested HTML, a number of methods in Element could throw a StackOverflowError due to excessive recursion. Namely: #data(), #hasText(), #parents(), and #wrap(html). #1864

Changes

Deprecated the unused Document.normalise() method. Normalization occurs during the HTML tree construction, and no longer as a distinct phase.

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me on Mastodon / Fediverse to receive occasional notes about jsoup releases.

jsoup-1.15.3

1 year ago

jsoup 1.15.3 is out now, and includes a security fix for potential XSS attacks, along with other bug fixes and improvements, including more descriptive validation error messages.

jsoup-1.15.2

1 year ago

jsoup 1.15.2 is out now with a bunch of improvements and bug fixes.

jsoup-1.15.1

2 years ago

jsoup 1.15.1 is out now with a bunch of improvements and bug fixes.

jsoup-1.14.3

2 years ago

jsoup 1.14.3 is out now, adding native XPath selector support, improved <template> support, and also includes a bunch of bug fixes, improvements, and performance enhancements.

See the release announcement for the full changelog.
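
As a brief sketch of the XPath support mentioned above, assuming jsoup 1.14.3 or later (class name and sample HTML are illustrative):

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class XPathSketch {
    // Selects anchors that have an href attribute via XPath and returns the href values.
    static List<String> hrefs(String html) {
        Elements links = Jsoup.parse(html).selectXpath("//a[@href]");
        return links.eachAttr("href");
    }

    public static void main(String[] args) {
        System.out.println(hrefs("<a href=\"/x\">X</a><a>none</a><a href=\"/y\">Y</a>"));
    }
}
```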

jsoup-1.14.2

2 years ago

Caught by the fuzz! jsoup 1.14.2 is out now, and includes a set of parser bug fixes and improvements for handling rough HTML and XML, as identified by the Jazzer JVM fuzzer. This release also includes other fixes and improvements.

See the release announcement for the full changelog.