
Wikitext includes control characters that are not allowed in HTML 5
Open, Low, Public

Description

view-source:http://parsoid-lb.eqiad.wikimedia.org/mediawikiwiki/Thread%3AExtension_talk%3ALinkedWiki%2FNotice%3A_Undefined_index%3A_Beschrijving_in_%2Fvar%2Fwww%2Fwikifarm-mw1.19%2Fextensions%2FLinkedWiki%2FLinkedWiki.php_on_line_283%2Freply_%282%29?oldid=777881

includes some control characters that are not allowed by HTML 5. Note that such characters are not allowed as character references (entities) either.

That post contains U+0008 (backspace), a control character that is not ASCII whitespace as defined by HTML 5.

This caused one of the issues with T92303: Convert LQT pages on MediaWiki.org to Flow (tracking). We'll probably work around it on our end, but it would be good for the parsing team to look at this issue.

UPDATE 2020-04-10: both the legacy wikitext parser and Parsoid let these characters through into the HTML; they trigger pedantic parse errors but are allowed into the DOM. See discussion below.

Event Timeline

Mattflaschen-WMF raised the priority of this task to Needs Triage.
Mattflaschen-WMF updated the task description. (Show Details)
Mattflaschen-WMF added a project: Parsoid.

Change 225260 had a related patch set uploaded (by Mattflaschen):
Don't abort on invalid characters when creating DOM

https://gerrit.wikimedia.org/r/225260

Change 225260 merged by jenkins-bot:
Don't abort on invalid characters when creating DOM

https://gerrit.wikimedia.org/r/225260

This broke XML parsing of the comment in Flow (Matt's patch works around that), but if such characters appear in article HTML as well, they would break editing those articles using VisualEditor in Internet Explorer (and probably end up truncating them).

Hm, well that's unfortunate -- that means we can't round-trip such content easily. We'll have to wrap it in an empty mw:Entity span and mark the actual content in a data-mw attribute, or something like that.

Out of curiosity -- what does mediawiki-core do with this character, with and without HTML Tidy turned on?

Hm, this is the code from Sanitizer.php:

	/**
	 * Returns true if a given Unicode codepoint is a valid character in XML.
	 * @param int $codepoint
	 * @return bool
	 */
	private static function validateCodepoint( $codepoint ) {
		return $codepoint == 0x09
			|| $codepoint == 0x0a
			|| $codepoint == 0x0d
			|| ( $codepoint >= 0x20 && $codepoint <= 0xd7ff )
			|| ( $codepoint >= 0xe000 && $codepoint <= 0xfffd )
			|| ( $codepoint >= 0x10000 && $codepoint <= 0x10ffff );
	}

This differs from the HTML5 spec; I've filed T106578 with the details.

This is related to, but distinct from, the requirements on text content; I'll look at how the PHP parser enforces those next. (Probably similar gaps are present.)
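
For comparison, here is a rough sketch, in the same style, of what an HTML5-aligned check might look like (hypothetical code, not anything in core, derived from the spec's text-content rules; note in particular that HTML5 treats U+000C FORM FEED as whitespace while the XML-oriented check above rejects it):

	/**
	 * Hypothetical sketch, not actual MediaWiki code: returns true if a
	 * codepoint may appear in HTML5 text content (a scalar value that is
	 * neither a noncharacter nor a control other than ASCII whitespace).
	 * @param int $codepoint
	 * @return bool
	 */
	private static function validateHtml5Codepoint( $codepoint ) {
		// ASCII whitespace controls: TAB, LF, FF, CR. FF (0x0c) is
		// allowed here but rejected by the XML-oriented check above.
		if ( $codepoint == 0x09 || $codepoint == 0x0a
			|| $codepoint == 0x0c || $codepoint == 0x0d
		) {
			return true;
		}
		// Other C0 controls and U+007F DELETE..U+009F APC are disallowed.
		if ( $codepoint <= 0x1f
			|| ( $codepoint >= 0x7f && $codepoint <= 0x9f )
		) {
			return false;
		}
		// Surrogates are not scalar values.
		if ( $codepoint >= 0xd800 && $codepoint <= 0xdfff ) {
			return false;
		}
		// Noncharacters: U+FDD0..U+FDEF plus the last two codepoints of
		// every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ...).
		if ( ( $codepoint >= 0xfdd0 && $codepoint <= 0xfdef )
			|| ( $codepoint & 0xfffe ) == 0xfffe
		) {
			return false;
		}
		// Anything above U+10FFFF is not a code point at all.
		return $codepoint <= 0x10ffff;
	}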

Change 226352 had a related patch set uploaded (by Cscott):
T106079: Update Sanitizer to match legal HTML5 character entities.

https://gerrit.wikimedia.org/r/226352

The issue here is that the "invalid" characters are actually permitted by every true HTML5 parser I could find, and by all browsers too. Flow is using an XML parser, and XML parsers are in the habit of complaining about things that don't matter. :)

Further, the characters shouldn't really be in wikitext anyway. Some bot is bypassing the PST somehow; that's the only way they could appear in the DB.

We'd like to fix this in Parsoid, but it's a bit complicated. First off, because the characters aren't representable in HTML, we can't round-trip them -- and Parsoid really prefers to round-trip things.

If we just strip them from the input, then we throw off the character offsets which we use for selective serialization. So that's problematic as well.

Stripping them during tokenization is problematic as well, because we'd need to add special-case code anywhere we currently have a permissive character class.

The solution is probably to strip them from the *output*, in a DOM post-processing pass or something like that, looking inside text nodes and attributes. That depends on being able to first create a DOM containing the bad characters -- but like I said, HTML5 parsers seem to be fine with this in general.
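
As a very rough illustration (using PHP's DOM extension; actual Parsoid code would look different), such a post-processing pass might walk the tree like this:

	// Illustrative sketch only: strip disallowed characters from text
	// nodes and attribute values after the DOM has been built.
	// $badChars is assumed to be a /u-flagged PCRE character class
	// covering the disallowed codepoints.
	function stripBadChars( DOMNode $node, $badChars ) {
		if ( $node instanceof DOMText ) {
			$node->nodeValue = preg_replace( $badChars, '', $node->nodeValue );
			return;
		}
		if ( $node instanceof DOMElement ) {
			foreach ( $node->attributes as $attr ) {
				$attr->value = preg_replace( $badChars, '', $attr->value );
			}
		}
		foreach ( $node->childNodes as $child ) {
			stripBadChars( $child, $badChars );
		}
	}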

> The issue here is that the "invalid" characters are actually permitted by every true HTML5 parser I could find, and by all browsers too. Flow is using an XML parser, and XML parsers are in the habit of complaining about things that don't matter. :)

That's true, but Flow is not the only software that uses an XML parser to parse Parsoid output. VisualEditor when running in Internet Explorer does this too. Parsoid claims that its output is polyglot, which AIUI means that it's both valid HTML5 and valid XML.

It probably is best to strip these in a post-processing pass on the DOM (as in my comment above), but we should think through the tokenization issues and verify consistency with the PHP parser. Still, we stripped NULLs from wikitext (T159174), so we should be able to strip other characters as well.

LGoto lowered the priority of this task from High to Low. Mar 20 2020, 4:23 PM
LGoto moved this task from Backlog to Needs Investigation on the Parsoid board.

The HTML spec has been reorganized; the latest definition of what is allowed in Text nodes is at https://html.spec.whatwg.org/multipage/dom.html#text-content and reads:

Text nodes and attribute values must consist of scalar values, excluding noncharacters, and controls other than ASCII whitespace.

with the following definitions:

A scalar value is a code point that is not a surrogate.
A surrogate is a code point that is in the range U+D800 to U+DFFF, inclusive.
A noncharacter is a code point that is in the range U+FDD0 to U+FDEF, inclusive, or U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, or U+10FFFF.
ASCII whitespace is U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE.
A control is a C0 control or a code point in the range U+007F DELETE to U+009F APPLICATION PROGRAM COMMAND, inclusive.
A C0 control is a code point in the range U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive.

If you put all that together, you need to strip the following from wikitext (expressed as a JavaScript character class; the astral code points need \u{...} notation and the u flag):

[\u0000-\u0008\u000B\u000E-\u001F\u007F-\u009F\uD800-\uDFFF\uFDD0-\uFDEF\uFFFE\uFFFF\u{1FFFE}\u{1FFFF}\u{2FFFE}\u{2FFFF}\u{3FFFE}\u{3FFFF}\u{4FFFE}\u{4FFFF}\u{5FFFE}\u{5FFFF}\u{6FFFE}\u{6FFFF}\u{7FFFE}\u{7FFFF}\u{8FFFE}\u{8FFFF}\u{9FFFE}\u{9FFFF}\u{AFFFE}\u{AFFFF}\u{BFFFE}\u{BFFFF}\u{CFFFE}\u{CFFFF}\u{DFFFE}\u{DFFFF}\u{EFFFE}\u{EFFFF}\u{FFFFE}\u{FFFFF}\u{10FFFE}\u{10FFFF}]
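
For reference, a sketch of the same class in PHP's PCRE syntax, which needs \x{...} escapes and the /u modifier; the surrogate range is omitted because PCRE rejects it in UTF-8 mode (well-formed UTF-8 can't encode surrogates anyway):

	$badChars = '/[\x{00}-\x{08}\x{0B}\x{0E}-\x{1F}\x{7F}-\x{9F}\x{FDD0}-\x{FDEF}'
		. '\x{FFFE}\x{FFFF}\x{1FFFE}\x{1FFFF}\x{2FFFE}\x{2FFFF}\x{3FFFE}\x{3FFFF}'
		. '\x{4FFFE}\x{4FFFF}\x{5FFFE}\x{5FFFF}\x{6FFFE}\x{6FFFF}\x{7FFFE}\x{7FFFF}'
		. '\x{8FFFE}\x{8FFFF}\x{9FFFE}\x{9FFFF}\x{AFFFE}\x{AFFFF}\x{BFFFE}\x{BFFFF}'
		. '\x{CFFFE}\x{CFFFF}\x{DFFFE}\x{DFFFF}\x{EFFFE}\x{EFFFF}\x{FFFFE}\x{FFFFF}'
		. '\x{10FFFE}\x{10FFFF}]/u';
	$clean = preg_replace( $badChars, '', $wikitext );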

Remex *scans for* these: https://github.com/wikimedia/remex-html/blob/master/RemexHtml/Tokenizer/Tokenizer.php#L290

But we don't actually replace/strip anything except for U+0000.

This matches the HTML spec, which raises a parse error but generally allows these characters into the DOM.

So "not allowed in HTML5" is a bit of a slippery slope. Yes, technically we raise an error, and it could interfere with parsing our output with a strict XML parser, but the official HTML5 spec frowns but allows these characters into the DOM.

I'm going to reclassify this as a "feature request" and remove "Parsoid" from the title, since the behavior of Parsoid and the legacy parser is identical (although I still need to verify T159174, i.e. that Parsoid is actually handling nulls the same way core does).

My suggestion is to strip the bad characters in the PST, but accept them in existing wikitext. The PST shouldn't be applied to binary data (although it is applied to JavaScript, CSS, etc.), so this should be relatively safe. Maybe we can emit a tracking category if the bad characters are found in wikitext during parsing.
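
A sketch of what the PST-side strip could look like (the helper name is hypothetical, not a core API; $badChars is the PCRE pattern sketched above):

	// Hypothetical helper, not actual core code: strip disallowed
	// characters during the pre-save transform, reporting whether any
	// were found so the caller can, e.g., add a tracking category.
	function stripDisallowedChars( $text, $badChars, &$found = null ) {
		$clean = preg_replace( $badChars, '', $text, -1, $count );
		$found = $count > 0;
		// preg_replace() returns null on error (e.g. invalid UTF-8);
		// in that case leave the text untouched.
		return $clean === null ? $text : $clean;
	}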

cscott renamed this task from "Parsoid includes control characters that are not allowed in HTML 5" to "Wikitext includes control characters that are not allowed in HTML 5". Apr 10 2020, 3:44 PM
cscott added a project: MediaWiki-Parser.
cscott updated the task description. (Show Details)
cscott updated the task description. (Show Details)
cscott moved this task from Needs Investigation to Feature requests on the Parsoid board.
cscott moved this task from Backlog to Wikitext new features on the MediaWiki-Parser board.

Just a note that on some browsers VisualEditor uses an XML parser to parse Parsoid output, so this is potentially an issue when editing those bad pages. In theory you could probably smuggle the bad characters past the XML parser just by encoding them as character references.

Another wrinkle is that naively stripping control characters from the input causes TSR/DSR (token/DOM source range) problems, because the text in the output nodes doesn't line up with the wikitext in the input. One solution is to treat the control characters as "virtual entities" and create <span typeof="mw:Entity"> nodes for them. The nodes can even be empty, but having a placeholder in the DOM ensures that stripping them from the output doesn't cause character-offset issues.
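
A sketch of the placeholder idea using PHP's DOM extension (the data-mw payload shape is purely illustrative; Parsoid's real representation may differ, and offset bookkeeping is simplified here):

	// Illustrative only: replace one control character in a text node
	// with an empty placeholder span so that nothing is silently lost.
	// $offset is the index of the bad character within the text node.
	function makeVirtualEntity( DOMText $textNode, $offset ) {
		$doc = $textNode->ownerDocument;
		$rest = $textNode->splitText( $offset ); // starts at the bad char
		$bad = mb_substr( $rest->data, 0, 1 );
		$rest->deleteData( 0, 1 );               // drop the bad char
		$span = $doc->createElement( 'span' );
		$span->setAttribute( 'typeof', 'mw:Entity' );
		// Record the original character so serialization can restore it.
		$span->setAttribute( 'data-mw', json_encode( [ 'src' => $bad ] ) );
		$textNode->parentNode->insertBefore( $span, $rest );
		return $span;
	}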

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)