Web Sanitize Versions

Lua library for sanitizing, parsing, and editing untrusted HTML

v1.5.0

1 year ago
luarocks install web_sanitize

https://luarocks.org/modules/leafo/web_sanitize

Changes

  • The self-closing /> syntax for immediately closing an opening tag is now only accepted as valid if the tag type is listed in the self_closing object in the whitelist. Previously it was possible to write something like <div/> and have it pass through the sanitizer. Since <div/> does not actually close a div element in the browser, any subsequent content that doesn't close that nesting would be rendered inside that tag, allowing the input markup to influence the appearance of content outside the sanitized area (see the sketch below).
  • Update the default list of self_closing tags to include a few more common void tags.
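
As a rough illustration of the new behavior, here is a hedged Lua sketch. The sanitize_html entry point is part of the library, but the behavior described in the comments is an assumption based on these notes:

local web_sanitize = require("web_sanitize")

-- br is in the default self_closing list, so the /> form is honored:
print(web_sanitize.sanitize_html("one<br/>two"))

-- div is not, so per these notes /> no longer self-closes the tag; the
-- sanitizer is assumed to balance the still-open div itself:
print(web_sanitize.sanitize_html("<div/>after"))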

Full Changelog: https://github.com/leafo/web_sanitize/compare/v1.4.0...v1.5.0

v1.4.0

1 year ago

This is a critical update if you are using a custom whitelist that allows iframe elements. Due to the non-standard way browsers parse iframe content, it may be possible to craft HTML that bypasses sanitization by giving an element an attribute value containing a closing iframe tag. Those using the default whitelist are not affected.
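
A schematic illustration of the class of input involved (a hedged example of the parsing mismatch, not a verified exploit against this library):

<iframe><b title="</iframe><img src=x onerror=alert(1)>">text</b></iframe>

A sanitizer that parses this as an ordinary tag tree sees only a quoted title attribute on the b element, but browsers tokenize iframe content as raw text ending at the first closing iframe tag, so the remainder of the attribute value gets re-parsed as live markup.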

https://luarocks.org/modules/leafo/web_sanitize

  • Make attribute escaping more strict:
    • Approved attributes that are passed through will now always have < and > characters in the value replaced with &lt; and &gt; (see the sketch after this list)
    • Injected attributes (attributes that are added despite not previously being there): the value will be escaped as HTML text to ensure invalid markup can't be returned by the injection function/literal
    • Injected attribute names will throw an error if they use invalid characters (like < or >)
    • Note: Modified attributes will continue to function the same: the value provided is escaped as HTML text (escaping <, >, &, etc.)
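
As a minimal sketch of the effect of the stricter value escaping (illustrative pure Lua, not the library's actual implementation):

-- replace < and > in an approved attribute value, as described above
local function escape_attribute_value(value)
  return (value:gsub("<", "&lt;"):gsub(">", "&gt;"))
end

print(escape_attribute_value('</iframe><script>'))
--> &lt;/iframe&gt;&lt;script&gt;

With the closing iframe tag neutralized inside the attribute value, the raw-text parsing mismatch described above can no longer terminate the iframe early.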

Full Changelog: https://github.com/leafo/web_sanitize/compare/v1.3.0...v1.4.0

v1.3.0

2 years ago

https://luarocks.org/modules/leafo/web_sanitize

This update includes a fix for the stack overflow (too many captures) error produced by LPeg when parsing inputs that are too large.

The structure of the parser has been modified slightly when handling inputs that are over 10kb.

Changes

  • Add separate pattern for sanitizing HTML inputs that are over 10kb
  • Add an assertion when parsing query selectors with the scanning library

Summary of stack overflow (too many captures) fix

Previously, the parser essentially boiled down to: html = Ct (open_tag + close_tag + html_entity + escaped_html_char + text)^0 * -1. Each thing that can be parsed (an open_tag, etc.) is captured as a chunk of text that is placed into a table by Ct.

Since Ct appends each parsed item into an array in a linear fashion, you might expect it to work for inputs of any size, but even a non-nested setup like this causes the stack overflow error within LPeg. Each capture is left open until the entire string has finished parsing, so the capture limit is hit on larger inputs with a large number of tags.

In order to fix the error, I would have to reduce the number of active captures. My thought was to chunk the parsing: read X items at a time, convert them to a string, then put that chunk into the buffer. (Ideally the chunking can happen directly as part of the LPeg pattern, with no Lua glue code needed.)

Initially I thought something like this would work:

html_chunk = open_tag + close_tag + html_entity + escaped_html_char + text

html = Ct (
  Cs(html_chunk * html_chunk^-1000)^0 -- Read up to 1000 html chunks in a loop
)

The idea was that Cs would convert up to 1000 of the html chunks to a plain string, closing out all those captures. The error still happened, though, likely because the finalization of captures only happens at the very end of parsing. Then I remembered Cmt, which has to resolve its captures at run time in order to provide your callback function with arguments.

So the final approach is to flatten a chunk of captures using Cmt as follows:

html_chunk = open_tag + close_tag + html_entity + escaped_html_char + text

flatten = (p) -> Cmt p, (s, p, c) -> true, table.concat c

html = Ct (
  flatten(Ct(html_chunk * html_chunk^-1000))^0
)

This can be a useful trick if you have to parse very large inputs with LPeg: chunk the parsing using Cmt to avoid hitting the capture limit. This approach only works for things you can parse linearly, so if you have nested structures you're probably out of luck. To parse something nested (like HTML) linearly, you can use a separate stack that Cmt patterns check against to verify that the nesting is valid.
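
Here is a self-contained Lua version of the trick, using a trivial stand-in grammar instead of the real HTML patterns (the chunk pattern, input, and sizes are illustrative assumptions):

local lpeg = require("lpeg")
local C, Ct, Cmt, P, R = lpeg.C, lpeg.Ct, lpeg.Cmt, lpeg.P, lpeg.R

-- stand-in for open_tag + close_tag + html_entity + escaped_html_char + text
local chunk = C(R("az")^1) + C(P(" ")^1)

-- resolve a group of captures at match time and replace it with one string
local function flatten(p)
  return Cmt(p, function(_, _, caps)
    return true, table.concat(caps)
  end)
end

local html = Ct(flatten(Ct(chunk * chunk^-1000))^0) * -1

local input = string.rep("word ", 100000)
local parts = html:match(input)
print(#parts) --> ~200 flattened strings instead of 200000 open captures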

v1.2.0

2 years ago

https://luarocks.org/modules/leafo/web_sanitize

  • Substantial updates to the HTML scanner/updater
    • Support parsing HTML comments (they will no longer be part of text nodes; any markup inside a comment is completely passed over)
    • Support parsing CDATA sections; markup inside of them is not parsed. They are emitted as individual text nodes when text nodes are enabled (with tag set to cdata and type set to text_node)
    • Support parsing "raw text" tags like script and style. The content of these tags is read as text until the respective closing tag; no nested tags are parsed inside of them
    • Support for parsing auto-closing tags as defined by the HTML spec. This includes things like auto-closing tr and td tags when defining a table, or auto-closing li tags when defining a list
    • Support for auto-closing p tags when an invalid block-level tag is included inside
    • The format of the attr field on a node now matches the format used by the HTML sanitizer. All attributes are included in tuple form ({key, value}, including duplicates) in the array portion of the table, and lowercase key/value pairs are stored as fields in the hash portion, with the rightmost value overwriting any duplicates
    • Add update_attributes method on the Node class for rewriting an element's attributes with the specified ones in addition to the existing ones
    • replace_attributes will write all attributes in both tuple and table form ({"hello", "world"} and { hello = "world" }; if you have multiple entries with the same name then they will all be written). The argument format is analogous to the attr field of a parsed node (see the sketch after this list)
    • Fix a bug where tags automatically closed due to their parent closing were not handled as intended: the tag would instead be closed at the end of the document, with its contents taking up the rest of the document
    • Be more diligent about reusing LPeg patterns where possible to avoid extra allocations when running the scanner
    • Refactored parsing primitives to parse things more atomically (reducing use of Cmt)
  • The HTML entity decoder now correctly respects that HTML entities are case-sensitive
  • Add more documentation for scan_html/replace_html, including the interface for the node and stack
  • Add web_sanitize.patterns module with some common patterns for parsing HTML (although this module is undocumented, the interface should be relatively stable)
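
A hedged sketch of the attr format and the new update_attributes method; the replace_html entry point and stack:current() accessor are assumed from the scanner documentation mentioned above, and the printed values follow the format described in these notes:

local replace_html = require("web_sanitize.query.scan_html").replace_html

local out = replace_html([[<a href="/a" HREF="/b">link</a>]], function(stack)
  local node = stack:current()
  if node.tag == "a" then
    -- array portion: one tuple per attribute, duplicates included
    for _, tuple in ipairs(node.attr) do
      print(tuple[1], tuple[2]) --> two entries, e.g. href /a and HREF /b
    end
    -- hash portion: lowercased keys, rightmost duplicate wins
    print(node.attr.href) --> /b
    -- merge a new attribute in with the existing ones (per these notes)
    node:update_attributes({ rel = "nofollow" })
  end
end)

print(out)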

Full Changelog: https://github.com/leafo/web_sanitize/compare/v1.1.0...v1.2.0

v1.1.0

3 years ago
  • Update text extractor (see the sketch after this list)
    • Add option for extracting as HTML or as plain text
    • Add option for removing non-printable characters
    • Add HTML entity translation when extracting as plain text
    • Whitespace trimming and normalization is UTF-8 whitespace aware
  • Minor updates to the default CSS whitelist for border attributes
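
A minimal hedged example of the text extractor: extract_text is the library's documented entry point, though the exact names of the options added in this release are not shown here:

local web_sanitize = require("web_sanitize")

-- plain-text extraction; entity translation per these notes is assumed
print(web_sanitize.extract_text("<p>fish &amp; chips</p>"))
--> fish & chips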