Lua library for sanitizing, parsing, and editing untrusted HTML
luarocks install web_sanitize
https://luarocks.org/modules/leafo/web_sanitize
The /> syntax for immediately closing an opening tag is now only accepted as valid if the tag type is listed in the self_closing object in the whitelist. Previously it was possible to write something like <div/> and have it pass through the sanitizer. Browsers treat <div/> as an ordinary opening tag that was never closed, so any subsequent content that doesn't close that nesting is rendered inside of that tag, allowing the input markup to influence the appearance of content outside the sanitized area.
Updated self_closing tags to include a few more common void tags.
Full Changelog: https://github.com/leafo/web_sanitize/compare/v1.4.0...v1.5.0
This is a critical update if you are using a custom whitelist with iframe elements allowed. Due to their non-standard parsing within browsers, it may be possible to craft HTML that bypasses sanitization by using an element with an attribute value that contains a closing iframe tag. Those using the default whitelist are not affected.
https://luarocks.org/modules/leafo/web_sanitize
[...] < and > characters in the value replaced with &lt; and &gt;
[...] < or >)
[...] < >, &, etc.)
Full Changelog: https://github.com/leafo/web_sanitize/compare/v1.3.0...v1.4.0
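The replacement of < and > inside attribute values mentioned in these notes can be illustrated with a few lines of plain Lua. This is only a sketch: escape_angle_brackets is a hypothetical helper written for this example, not the function web_sanitize uses internally.

```lua
-- Hypothetical helper (not part of web_sanitize's API): replace the angle
-- brackets in an attribute value with their HTML entities.
local replacements = { ["<"] = "&lt;", [">"] = "&gt;" }

local function escape_angle_brackets(value)
  -- The extra parentheses drop gsub's second return value (the match count).
  return (value:gsub("[<>]", replacements))
end

local escaped = escape_angle_brackets("</iframe>")
```

With the brackets escaped, a closing iframe tag placed inside an attribute value can no longer be read by the browser as the end of the element.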
https://luarocks.org/modules/leafo/web_sanitize
This update includes a fix for the stack overflow (too many captures) error produced by LPeg when parsing too large of an input.
The structure of the parser has been modified slightly when handling inputs that are over 10kb.
stack overflow (too many captures) fix
Previously, the parser essentially boiled down to: html = Ct (open_tag + close_tag + html_entity + escaped_html_char + text)^0 * -1. Each thing that can be parsed (open_tag, etc.) is captured into a chunk of text that is placed into a table with Ct.
As Ct appends each parsed item into an array in a linear format, you would expect it to work for any size input, but it seems even a non-nested setup like this causes the stack overflow error to happen within LPeg. Each capture is left open until the entire string is finished parsing, so the capture limit is hit on larger inputs with a large number of tags.
In order to fix the error, I had to reduce the number of active captures. My thought was to chunk the parsing: read X items at a time, convert them to a string, then put that chunk into the buffer. (Ideally the chunking can happen directly as part of the LPeg pattern, with no Lua glue code needed.)
Initially I thought something like this would work:
html_chunk = open_tag + close_tag + html_entity + escaped_html_char + text
html = Ct (
Cs(html_chunk * html_chunk^-1000)^0 -- Read up to 1000 html chunks in a loop
)
The idea being that Cs will convert up to 1000 of the html chunks to a plain string, and close out all those captures. The error still happened though, likely because the finalization of captures only happens at the very end of parsing. Then I remembered Cmt, which has to resolve the captures at run time to provide your callback function with arguments.
So the final approach is to flatten a chunk of captures using Cmt as follows:
html_chunk = open_tag + close_tag + html_entity + escaped_html_char + text
flatten = (p) -> Cmt p, (s, p, c) -> true, table.concat c
html = Ct (
flatten(Ct(html_chunk * html_chunk^-1000))^0
)
This can be a useful trick if you have to parse very large inputs with LPeg. You can chunk the parsing using Cmt to avoid hitting the stack limit. This approach only works for things you can parse linearly, so if you have nested structures you're probably out of luck. In order to parse something nested (html) linearly, you can use a separate stack that's checked against via Cmt patterns to verify if the nesting is valid.
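For readers who prefer plain Lua over MoonScript, the chunking trick can be sketched as follows. The grammar here is a toy one invented for the example (tag, text, and stray are hypothetical stand-ins for the real open_tag, close_tag, and other patterns), and GROUP_SIZE is an assumed value; only the LPeg functions themselves are real.

```lua
local lpeg = require("lpeg")
local P, C, Ct, Cmt = lpeg.P, lpeg.C, lpeg.Ct, lpeg.Cmt

-- Hypothetical toy grammar (not web_sanitize's real one): a chunk is either
-- a complete <...> tag, a run of text, or a stray "<".
local tag = C(P("<") * (1 - P(">"))^0 * P(">"))
local text = C((1 - P("<"))^1)
local stray = C(P("<"))
local chunk = tag + text + stray

-- Cmt calls its function while matching, so the captures nested inside it
-- are resolved immediately and replaced by the single string returned here.
local function flatten(patt)
  return Cmt(Ct(patt), function(_subject, _pos, captures)
    return true, table.concat(captures)
  end)
end

local GROUP_SIZE = 1000 -- assumed group size, chosen for illustration

-- Each group holds between 1 and GROUP_SIZE chunks.
local group = flatten(chunk * chunk^-(GROUP_SIZE - 1))
local html = Ct(group^0) * -1

local function parse(input)
  return html:match(input)
end
```

Calling parse on a large input returns an array of strings, one per group, and concatenating them gives back the original input. The number of captures open at any moment is bounded by the group size rather than by the length of the input.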
https://luarocks.org/modules/leafo/web_sanitize
[...] cdata and type set to text_node)
[...] script and style. The content of these tags is read as text until the respective closing tag; no nested tags are parsed inside of them.
[...] tr, td tags when defining a table, or auto closing li tags when defining a list.
[...] p tags when an invalid block level tag is included inside
attr field on node now matches the format used on the HTML sanitizer. All attributes are included in tuple form ({ key, value }, including duplicates) in the array portion of the table, and then lowercase key, value pairs are stored as fields in the table, with the right-most value overwriting any duplicates
[...] update_attributes method on Node class for rewriting an element's attributes with specified ones in addition to existing ones
replace_attributes will write all attributes both in tuple and table form ({"hello", "world"} and { hello = "world" }; if you have multiple entries with the same name then they will all be written). The argument format is analogous to the attr field of a parsed node
[...] Cmt)
[...] scan_html/replace_html, including interface for the node and stack
[...] web_sanitize.patterns module with some common patterns for parsing HTML (although this module is undocumented, the interface should be relatively stable)
Full Changelog: https://github.com/leafo/web_sanitize/compare/v1.1.0...v1.2.0
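The attr format described in these notes (every attribute as a tuple in the array portion, plus lowercase keys as fields with the right-most value winning) can be pictured with a small plain Lua sketch. build_attr is a hypothetical helper written for this example and is not part of the library; keeping the original key spelling inside the tuples is also an assumption.

```lua
-- Hypothetical helper, not part of web_sanitize: build an attr table from an
-- ordered list of { key, value } pairs.
local function build_attr(pairs_list)
  local attr = {}
  for i, pair in ipairs(pairs_list) do
    local key, value = pair[1], pair[2]
    attr[i] = { key, value }    -- tuple form, duplicates kept in order
    attr[key:lower()] = value   -- field form, a later value overwrites an earlier one
  end
  return attr
end

local attr = build_attr({
  { "class", "a" },
  { "HREF", "/home" },
  { "class", "b" },
})
```

Iterating the array portion with ipairs gives every attribute in source order, while attr.class and attr.href give direct lookups by lowercase name.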