
Parsoid includes control characters that are not allowed in HTML 5
Open, High · Public

Description

view-source:http://parsoid-lb.eqiad.wikimedia.org/mediawikiwiki/Thread%3AExtension_talk%3ALinkedWiki%2FNotice%3A_Undefined_index%3A_Beschrijving_in_%2Fvar%2Fwww%2Fwikifarm-mw1.19%2Fextensions%2FLinkedWiki%2FLinkedWiki.php_on_line_283%2Freply_%282%29?oldid=777881

includes control characters that are not allowed by HTML 5. Note that they are not allowed even as character references (entities).

That post contains 0x08 (backspace), a control character that HTML 5 does not treat as whitespace.
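
For reference, here is one quick way to confirm this (a sketch only; $parsoidUrl stands in for the revision URL linked above, and the byte-level regex deliberately stops at 0x7F so it stays safe on UTF-8 input):

	// Sketch: scan fetched Parsoid output for C0 control characters that
	// HTML 5 does not treat as whitespace (TAB, LF, FF and CR are excluded).
	// $parsoidUrl is a placeholder for the revision URL above.
	$html = file_get_contents( $parsoidUrl );

	if ( preg_match_all( '/[\x00-\x08\x0B\x0E-\x1F\x7F]/', $html, $matches ) ) {
		foreach ( $matches[0] as $ch ) {
			printf( "Found disallowed control character U+%04X\n", ord( $ch ) );
		}
	}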

This caused one of the issues with T92303: Convert LQT pages on MediaWiki.org to Flow (tracking). We'll probably work around it on our end, but it would be good for Parsoid to look at this issue.

Event Timeline

Mattflaschen-WMF raised the priority of this task to Needs Triage.
Mattflaschen-WMF updated the task description. (Show Details)
Mattflaschen-WMF added a project: Parsoid.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 16 2015, 7:44 PM
Mattflaschen-WMF set Security to None.

Change 225260 had a related patch set uploaded (by Mattflaschen):
Don't abort on invalid characters when creating DOM

https://gerrit.wikimedia.org/r/225260

Change 225260 merged by jenkins-bot:
Don't abort on invalid characters when creating DOM

https://gerrit.wikimedia.org/r/225260

This broke XML parsing of the comment in Flow (Matt's patch works around that), but if such characters appear in article HTML as well, they would break editing those articles with VisualEditor in Internet Explorer (and probably end up truncating them).
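
To illustrate why the XML side breaks while browsers cope, here is a minimal sketch using PHP's libxml-backed DOM classes (this is not Flow's actual code, and the exact diagnostics vary by libxml version):

	$fragment = "<p>reply text\x08more text</p>"; // contains U+0008 (backspace)

	$doc = new DOMDocument();

	// XML parsing: U+0008 is not a legal XML character, so this fails outright.
	$xmlOk = @$doc->loadXML( "<root>{$fragment}</root>" );
	var_dump( $xmlOk ); // bool(false)

	// HTML parsing is far more lenient; libxml generally still builds a DOM
	// (possibly after reporting a warning), much as HTML5 parsers and
	// browsers accept the character.
	$htmlOk = @$doc->loadHTML( $fragment );
	var_dump( $htmlOk ); // usually bool(true)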

cscott added a subscriber: cscott. · Jul 22 2015, 6:44 PM

Hm, well that's unfortunate -- that means that we can't round-trip such content easily. We'll have to wrap it in an empty mw:entity span and mark the actual content in a data-mw attribute or something like that....

Out of curiosity -- what does mediawiki-core do with this character, with and without HTML Tidy turned on?

cscott added a comment. (Edited) · Jul 22 2015, 7:13 PM

Hm, this is the code from Sanitizer.php:

	/**
	 * Returns true if a given Unicode codepoint is a valid character in XML.
	 * @param int $codepoint
	 * @return bool
	 */
	private static function validateCodepoint( $codepoint ) {
		return $codepoint == 0x09
			|| $codepoint == 0x0a
			|| $codepoint == 0x0d
			|| ( $codepoint >= 0x20 && $codepoint <= 0xd7ff )
			|| ( $codepoint >= 0xe000 && $codepoint <= 0xfffd )
			|| ( $codepoint >= 0x10000 && $codepoint <= 0x10ffff );
	}

This differs from the HTML5 spec; I've filed T106578 with the details.
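
For comparison, here is a hedged sketch of what an HTML5-oriented check could look like (an illustration only, not the patch in the Gerrit change below): HTML5 forbids the C1 controls and all noncharacters, which XML 1.0 permits, while allowing FF (U+000C) as ASCII whitespace, which XML 1.0 forbids.

	/**
	 * Sketch only: returns true if a given Unicode codepoint may appear
	 * literally in HTML5 text content.  HTML5 allows any scalar value
	 * except controls other than ASCII whitespace, surrogates, and
	 * noncharacters.
	 * @param int $codepoint
	 * @return bool
	 */
	function validateHtml5Codepoint( $codepoint ) {
		// ASCII whitespace: TAB, LF, FF, CR, SPACE.  Note FF (0x0C) is
		// allowed here although XML 1.0 forbids it.
		if ( in_array( $codepoint, [ 0x09, 0x0A, 0x0C, 0x0D, 0x20 ], true ) ) {
			return true;
		}
		// Controls (C0, DELETE and the C1 range) are forbidden, even
		// though XML 1.0 permits 0x7F-0x9F.
		if ( $codepoint <= 0x1F || ( $codepoint >= 0x7F && $codepoint <= 0x9F ) ) {
			return false;
		}
		// Surrogates are not scalar values.
		if ( $codepoint >= 0xD800 && $codepoint <= 0xDFFF ) {
			return false;
		}
		// Noncharacters: U+FDD0-U+FDEF plus the last two code points of
		// every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ...).
		if ( ( $codepoint >= 0xFDD0 && $codepoint <= 0xFDEF )
			|| ( $codepoint & 0xFFFE ) === 0xFFFE
		) {
			return false;
		}
		return $codepoint <= 0x10FFFF;
	}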

This is related to, but separate from, the requirements on text content; I'll look next at how the PHP parser enforces those. (Similar gaps are probably present.)

Change 226352 had a related patch set uploaded (by Cscott):
T106079: Update Sanitizer to match legal HTML5 character entities.

https://gerrit.wikimedia.org/r/226352

Catrope triaged this task as High priority. · Jul 22 2015, 11:35 PM
Catrope assigned this task to cscott. · Jul 23 2015, 12:05 AM

The issue here is that the "invalid" characters are actually permitted by all true HTML5 parsers I could find, and all browsers too. Flow is using an XML parser, and they are in the habit of complaining about things which don't matter. :)

Further, the characters shouldn't really be in wikitext anyway. Some bot is bypassing the PST (pre-save transform) somehow; that's the only way they could appear in the DB.

We'd like to fix this in Parsoid, but it's a bit complicated. First off, because the characters aren't representable in HTML, we can't round-trip them. Parsoid really prefers to round trip things.

If we just strip them from the input, then we throw off the character offsets which we use for selective serialization. So that's problematic as well.

Stripping them during tokenization is problematic as well, because we'd need to add special-case code to anyplace where we currently have a permissive character class.

The solution is probably to strip them from the *output*, in a DOM post-processing pass or something like that, looking inside text nodes and attributes. That depends on being able to first create a DOM containing the bad characters -- but like I said, HTML5 parsers seem to be fine with this in general.
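
As a rough illustration of that idea (a sketch only, using PHP's DOM API rather than Parsoid's actual implementation), such a pass could walk the tree and scrub text nodes and attribute values:

	/**
	 * Sketch only: strip characters that are invalid in both HTML5 and
	 * XML from text nodes and attribute values, as a DOM
	 * post-processing pass.
	 */
	function stripDisallowedControlChars( DOMNode $node ) {
		// Controls other than TAB, LF and CR; FF (0x0C) is stripped too
		// so the result stays valid XML as well as valid HTML5.
		$re = '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u';

		if ( $node instanceof DOMText ) {
			$node->data = preg_replace( $re, '', $node->data );
			return;
		}
		if ( $node instanceof DOMElement && $node->hasAttributes() ) {
			foreach ( $node->attributes as $attr ) {
				$attr->value = preg_replace( $re, '', $attr->value );
			}
		}
		foreach ( $node->childNodes as $child ) {
			stripDisallowedControlChars( $child );
		}
	}

	// Usage: call once on the document root before serializing.
	// stripDisallowedControlChars( $doc->documentElement );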

> The issue here is that the "invalid" characters are actually permitted by all true HTML5 parsers I could find, and all browsers too. Flow is using an XML parser, and they are in the habit of complaining about things which don't matter. :)

That's true, but Flow is not the only software that uses an XML parser to parse Parsoid output. VisualEditor when running in Internet Explorer does this too. Parsoid claims that its output is polyglot, which AIUI means that it's both valid HTML5 and valid XML.

It is probably best to strip these in a post-processing pass on the DOM (as in my comment above), but we should think through the tokenization issues, verify consistency with the PHP parser, and so on. We stripped NULLs from wikitext (T159174), so we should be able to strip other characters as well.