Page MenuHomePhabricator

Sanitizer entity restrictions should match HTML5
Closed, ResolvedPublic

Description

As pointed out in T106079, the code in Sanitizer.php:validateCodepoint doesn't exactly match up with the HTML5 spec for valid character entities. The spec says:

The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.

It then defines control characters as:

The control characters are those whose Unicode "General_Category" property has the value "Cc" in the Unicode UnicodeData.txt data file. [UNICODE]

which as of this writing is U+0000 - U+001F (inclusive) and U+007F - U+009F (inclusive).
The spec defines space characters as:

The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

(But note that U+000D is explicitly prohibited as a numeric entity, even though it is a space character).

The code from Sanitizer.php reads:

	/**
	 * Returns true if a given Unicode codepoint is a valid character in XML.
	 * @param int $codepoint
	 * @return bool
	 */
	private static function validateCodepoint( $codepoint ) {
		return $codepoint == 0x09
			|| $codepoint == 0x0a
			|| $codepoint == 0x0d
			|| ( $codepoint >= 0x20 && $codepoint <= 0xd7ff )
			|| ( $codepoint >= 0xe000 && $codepoint <= 0xfffd )
			|| ( $codepoint >= 0x10000 && $codepoint <= 0x10ffff );
	}

This differs from the HTML5 spec, in that it allows U+000D and does not allow U+000C. It also allows U+007F - U+009F, which are banned control characters.

The Sanitizer should be brought into compliance with the HTML5 spec.

Event Timeline

cscott claimed this task.
cscott raised the priority of this task from to Medium.
cscott updated the task description. (Show Details)
cscott added a project: MediaWiki-Parser.
cscott added a subscriber: cscott.

Change 226352 had a related patch set uploaded (by Cscott):
T106578: Update Sanitizer to match legal HTML5 character entities.

https://gerrit.wikimedia.org/r/226352

Change 226352 merged by jenkins-bot:
T106578: Update Sanitizer to match legal HTML5 character entities.

https://gerrit.wikimedia.org/r/226352

Arguably that is the wrong part of the spec to refer to. 12.2.4.69 Tokenizing character references does allow for U+0080 - U+009F.

@Tgr good point, although T113194 illustrates the difficulty of allowing those characters. This patch has been merged for a while, I'm going to close it.

Change 504621 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Ensure PHP and JS are consistent wrt allowed entities

https://gerrit.wikimedia.org/r/504621

Change 504621 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Ensure PHP and JS are consistent wrt allowed entities

https://gerrit.wikimedia.org/r/504621