jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
Added Element.attribute(String) and Attributes.attribute(String) to more simply obtain an Attribute object. #2069
When an attribute is renamed (via Attribute.setKey(String)), the source range is now still tracked in Attribute.sourceRange(). #2070
Added support for the [*] element with any attribute selector, and restored support for selecting by an empty attribute name prefix ([^]). #2079
In a selector like parent [attr=va], other, the , OR was binding to [attr=va] instead of parent [attr=va], causing incorrect selections. The fix includes an EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. #2073
The :has evaluator held a non-thread-safe Iterator, so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception could be thrown and the selected results could be incorrect. The iterator object is now a thread-local. #2088
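The thread-local approach described in the last item can be illustrated in isolation. The following is a minimal sketch and not jsoup's actual code: SharedEvaluatorSketch and its Cursor are hypothetical names, standing in for an evaluator that reuses one piece of mutable traversal state between calls. Keeping that state in a ThreadLocal gives every thread its own copy, so a single shared instance can be called concurrently.

```java
import java.util.stream.IntStream;

/** Hypothetical stand-in for an evaluator that reuses a mutable cursor between calls. */
final class SharedEvaluatorSketch {
    /** Mutable traversal state; not safe to share between threads. */
    private static final class Cursor {
        int index;
    }

    private final int[] values;

    // Each thread lazily gets its own Cursor, so concurrent callers never share state.
    private final ThreadLocal<Cursor> cursor = ThreadLocal.withInitial(Cursor::new);

    SharedEvaluatorSketch(int[] values) {
        this.values = values.clone();
    }

    /** Walks all values with the calling thread's cursor and returns their sum. */
    int sum() {
        Cursor c = cursor.get();
        c.index = 0; // reuse the per-thread cursor instead of allocating a new one
        int total = 0;
        while (c.index < values.length) {
            total += values[c.index++];
        }
        return total;
    }

    public static void main(String[] args) {
        SharedEvaluatorSketch shared = new SharedEvaluatorSketch(new int[] {1, 2, 3, 4});
        // Call the single shared instance from many threads at once.
        boolean allCorrect = IntStream.range(0, 10_000)
                .parallel()
                .allMatch(i -> shared.sum() == 10);
        System.out.println(allCorrect);
    }
}
```

If the Cursor were a plain field instead, two threads could advance the same index at once, which is the kind of interference the fix above removes.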
jsoup 1.17.1 is out now with support for request-level authentication, attribute name & value source ranges, stream() iterable support, and a bunch of other improvements and bug fixes.
Many thanks to everyone who contributed to this release!
In the Elements list, added direct support for Elements#set(int, Element), Elements#remove(int), Elements#remove(Object), Elements#clear(), Elements#removeAll(), Elements#retainAll(), Elements#removeIf(), and Elements#replaceAll(). These methods update the original DOM, as well as the Elements list.
Added the NodeIterator class for efficient node tree traversal using the Iterator interface. Added Stream Element#stream() and Node#nodeStream() methods for fluent composable stream pipelines of node traversals.
EscapeMode as default when changing the OutputSettings syntax to XML.
Added the :is(selector list) pseudo-selector to find elements that match any selectors in the selector list. This enhances readability for large ORed selectors.
Range.isImplicit().
Attribute#sourceRange() provides the ranges.
Replaced ConstrainableInputStream with ControllableInputStream.
Jsoup.connect() to include any XML mimetype.
CDATA nodes when outputting with XML syntax.
> could match elements above the root context element.
, Or combinator in a sub-query were incorrectly skipped.
W3CDom where the conversion would fail if the jsoup input document contained an empty doctype. The doctype is now discarded, and the conversion continues.
Connection.Response.bodyStream() is now a plain BufferedInputStream.
Added support for <svg> and <math> tags (and their children). This includes tag namespaces and case preservation on applicable tags and attributes. #2008
In W3CDom, HTML documents will be placed in the http://www.w3.org/1999/xhtml namespace by default, per the HTML5 spec. This can be controlled by setting W3CDom#namespaceAware(false). #1848
~ (any preceding sibling) and :nth-of-type selectors are improved. #1956
Element nextElementSibling, previousElementSibling, firstElementSibling, lastElementSibling, firstElementChild, and lastElementChild. They now filter/skip in place in the child-node list, vs having to allocate and scan a complete Element filtered list.
Element.children() to use filter/skip child-node list accessors instead, reducing new Element List allocations.
:pseudo selectors.
In the :empty pseudo-selector, blank text nodes are now considered empty. Previously, an element containing any whitespace was not considered empty. #1976
<input type="image"> should be excluded from Element.formData() (and hence from form submissions). #2010
form elements and empty elements (such as img) did not have their attributes de-duplicated. #1950
If a Document.OutputSettings was cloned from a clone, an NPE would be thrown when used. #1964
In Jsoup.connect(String url), URL paths containing a %2B were incorrectly recoded to a '+', or a '+' was recoded to a ' '. Fixed by reverting to the previous behavior of not encoding supplied paths, other than normalizing to ASCII. #1952
In Jsoup.connect(String url), strings containing supplemental characters (e.g. emoji) were not URL escaped correctly.
In Jsoup.connect(String url), the ConstrainableInputStream would clear Thread interrupts when reading the body. This precluded callers from spawning a thread, running a number of requests for a length of time, then joining that thread after interrupting it. #1991
H1...H6 elements were not tracked correctly. #1987
In Jsoup.connect(), a DELETE method request did not support a request body. #1972
When calling Element.cssSelector() on an extremely deeply nested element, a StackOverflowError could occur. Further, a StackOverflowError may occur when running the query. #2001
Element after empty() would throw an Index out of bounds exception. Also, now the child nodes that were removed have their parent node cleared, fully detaching them from the original parent. #2013
In Connection, when adding headers, the value may have been assumed to be an incorrectly decoded ISO_8859_1 string, and re-encoded as UTF-8. The value is now left as-is.
Document.normalise(), Element.forEach(org.jsoup.helper.Consumer<>), Node.forEach(org.jsoup.helper.Consumer<>), and the org.jsoup.helper.Consumer interface; the latter being a previously required compatibility shim prior to Android's de-sugaring support.
org.jsoup.UncheckedIOException is deprecated in favor of the now supported java.io.UncheckedIOException. If you are catching the former, modify your code to catch the latter instead. #1989
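For the UncheckedIOException change, the catch clause after migration might look like the sketch below. It is only an illustration under stated assumptions: failingRead() is a hypothetical stand-in for any call that wraps an I/O failure in java.io.UncheckedIOException, and the notes above do not say which jsoup methods throw it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

final class UncheckedIoSketch {
    /** Hypothetical call that fails by wrapping an IOException in the JDK's unchecked type. */
    static String failingRead() {
        throw new UncheckedIOException(new IOException("connection reset"));
    }

    /** Runs the reader, catching java.io.UncheckedIOException (not the deprecated org.jsoup type). */
    static String readOrFallback(Supplier<String> reader, String fallback) {
        try {
            return reader.get();
        } catch (UncheckedIOException e) {
            // getCause() returns the wrapped IOException
            return fallback + ": " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readOrFallback(UncheckedIoSketch::failingRead, "fallback"));
        System.out.println(readOrFallback(() -> "ok", "fallback"));
    }
}
```

Because the import names the java.io class explicitly, the catch clause cannot silently bind to the deprecated jsoup type of the same simple name.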
noscript tags from being added to Safelists, due to incompatibilities between parsers with and without script-mode enabled.
In Jsoup.connect(String url), natively support URLs with Unicode characters in the path or query string, without having to be escaped by the caller. #1914
Node.remove() on a node with no parent is now a no-op, vs a validation error. #1898
AfterBody and AfterAfterBody to the updated WHATWG standard, to not pop the stack to close <body> or <html> elements. This prevents an errant </html> closing the preceding structure. Also added appropriate error message outputs in this case. #1851
(<ruby>, <rp>, <rt>, and <rtc>) to current spec. #1294
In Node.before(Node) or Node.after(Node), if the incoming node was a sibling of the context node, the incoming node may be inserted into the wrong relative location. #1898
In Jsoup.connect(String url), if the input URL had components that were already % escaped, they would be escaped again, causing errors when fetched. #1902
If the Document.OutputSettings class was initialized, and then Entities.escape(String) called, an NPE may be thrown due to a class loading circular dependency. #1910
Element or Comment in a block would not be wrap-indented if it were preceded by a blank text node. #1906
In a <pre> containing block tags, those tags were incorrectly indented. #1891
(<p> in a <td>), the inner element should be indented. #1926
<br> tags should be wrap-indented when in block tags (and not when in inline tags). #1911
<textarea> with un-escaped HTML closing tags may be incorrectly parsed to an empty node. #1929
jsoup 1.15.4 is out now, and includes a bunch of improvements, particularly when pretty-printing HTML, and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
To match <p class="one.two">, use document.select("p.one\\.two"); #838
<br> tag. #1858
In Element.forEach() and Node.forEachNode(), use java.util.function.Consumer instead of the previous Android compatibility shim org.jsoup.helper.Consumer. Subsequently, the latter has been deprecated. #1870
Added Document.forms(), to conveniently retrieve a List<FormElement> containing the <form> elements in a document.
Added Document.expectForm(), to find the first matching FormElement, or blow up trying.
and <code> were not escaped correctly, and would throw a MalformedURLException when fetched. #1873
Element.cssSelector() would create invalid selectors for elements where the tag name, ID, or classnames needed to be escaped (e.g. if a class name contained a : or .). #1742
Element.text() should have a space between a block and an inline element. #1877
newRequest(), resulting in an accumulation of form data when executing multi-step form submissions, or data sent to later requests incorrectly. Now, newRequest() only copies session related settings (cookies, proxy settings, user-agent, etc) but not the request data nor the body. #1778
Safelist.removeAttributes() which could throw a ConcurrentModificationException when using the :all pseudo-attribute.
Element could throw a StackOverflowError due to excessive recursion. Namely: #data(), #hasText(), #parents(), and #wrap(html). #1864
Document.normalise() method. Normalization occurs during the HTML tree construction, and no longer as a distinct phase.
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me on Mastodon / Fediverse to receive occasional notes about jsoup releases.
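Several of the Jsoup.connect() fixes in the notes above (#1952, #1902) concern percent-escaping. The sketch below shows why recoding a URL component is lossy, using the JDK's form-style URLEncoder and URLDecoder as a stand-in. That choice is an assumption made purely for illustration: it is not how jsoup escapes URLs, and PercentEscapeSketch is a hypothetical name.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PercentEscapeSketch {
    /** Decodes with form-style rules, where '+' means a space. */
    static String decode(String component) {
        return URLDecoder.decode(component, StandardCharsets.UTF_8);
    }

    /** Encodes with form-style rules; an existing '%' is itself escaped to "%25". */
    static String encode(String component) {
        return URLEncoder.encode(component, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An escaped plus becomes a literal plus...
        System.out.println(decode("1%2B1"));
        // ...while a literal plus becomes a space, so the two inputs are no longer distinguishable.
        System.out.println(decode("1+1"));
        // Encoding an already escaped component escapes it a second time.
        System.out.println(encode("a%20b"));
    }
}
```

Leaving caller-supplied escapes untouched, as the notes describe, avoids both the '+' ambiguity and the double escaping.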
jsoup 1.15.3 is out now, and includes a security fix for potential XSS attacks, along with other bug fixes and improvements, including more descriptive validation error messages.
jsoup 1.15.2 is out now with a bunch of improvements and bug fixes.
jsoup 1.15.1 is out now with a bunch of improvements and bug fixes.
jsoup 1.14.3 is out now, adding native XPath selector support, improved <template> support, and also includes a bunch of bug fixes, improvements, and performance enhancements.
See the release announcement for the full changelog.
Caught by the fuzz! jsoup 1.14.2 is out now, and includes a set of parser bug fixes and improvements for handling rough HTML and XML, as identified by the Jazzer JVM fuzzer. This release also includes other fixes and improvements.
See the release announcement for the full changelog.