Page MenuHomePhabricator

Sanitizer entity restrictions should match HTML5
Closed, ResolvedPublic

Description

As pointed out in T106079, the code in Sanitizer.php:validateCodepoint doesn't exactly match up with the HTML5 spec for valid character entities. The spec says:

The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.

It then defines control characters as:

The control characters are those whose Unicode "General_Category" property has the value "Cc" in the Unicode UnicodeData.txt data file. [UNICODE]

which as of this writing is U+0000 - U+001F (inclusive) and U+007F - U+009F (inclusive).
The spec defines space characters as:

The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

(But note that U+000D is explicitly prohibited as a numeric entity, even though it is a space character).

The code from Sanitizer.php reads:

	/**
	 * Returns true if a given Unicode codepoint is a valid character in XML.
	 * @param int $codepoint
	 * @return bool
	 */
	private static function validateCodepoint( $codepoint ) {
		return $codepoint == 0x09
			|| $codepoint == 0x0a
			|| $codepoint == 0x0d
			|| ( $codepoint >= 0x20 && $codepoint <= 0xd7ff )
			|| ( $codepoint >= 0xe000 && $codepoint <= 0xfffd )
			|| ( $codepoint >= 0x10000 && $codepoint <= 0x10ffff );
	}

This differs from the HTML5 spec, in that it allows U+000D and does not allow U+000C. It also allows U+007F - U+009F, which are banned control characters.

The Sanitizer should be brought into compliance with the HTML5 spec.

Event Timeline

cscott created this task.Jul 22 2015, 7:11 PM
cscott claimed this task.
cscott raised the priority of this task from to Normal.
cscott updated the task description. (Show Details)
cscott added a project: MediaWiki-Parser.
cscott added a subscriber: cscott.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 22 2015, 7:11 PM

Change 226352 had a related patch set uploaded (by Cscott):
T106578: Update Sanitizer to match legal HTML5 character entities.

https://gerrit.wikimedia.org/r/226352

Change 226352 merged by jenkins-bot:
T106578: Update Sanitizer to match legal HTML5 character entities.

https://gerrit.wikimedia.org/r/226352

Tgr added a subscriber: Tgr.Sep 22 2015, 11:26 PM

Arguably that is the wrong part of the spec to refer to. 12.2.4.69 Tokenizing character references does allow for U+0080 - U+009F.

cscott closed this task as Resolved.Dec 1 2017, 9:27 PM

@Tgr good point, although T113194 illustrates the difficulty of allowing those characters. This patch has been merged for a while, I'm going to close it.

Change 504621 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Ensure PHP and JS are consistent wrt allowed entities

https://gerrit.wikimedia.org/r/504621

Change 504621 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Ensure PHP and JS are consistent wrt allowed entities

https://gerrit.wikimedia.org/r/504621