As pointed out in T106079, the code in Sanitizer.php:validateCodepoint doesn't exactly match up with the HTML5 spec for valid character entities. The spec says:
The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.
It then defines control characters as:
The control characters are those whose Unicode "General_Category" property has the value "Cc" in the Unicode UnicodeData.txt data file. [UNICODE]
which as of this writing is U+0000 - U+001F (inclusive) and U+007F - U+009F (inclusive).
The spec defines space characters as:
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
(But note that U+000D is explicitly prohibited as a numeric entity, even though it is a space character).
The code from Sanitizer.php reads:
/** * Returns true if a given Unicode codepoint is a valid character in XML. * @param int $codepoint * @return bool */ private static function validateCodepoint( $codepoint ) { return $codepoint == 0x09 || $codepoint == 0x0a || $codepoint == 0x0d || ( $codepoint >= 0x20 && $codepoint <= 0xd7ff ) || ( $codepoint >= 0xe000 && $codepoint <= 0xfffd ) || ( $codepoint >= 0x10000 && $codepoint <= 0x10ffff ); }
This differs from the HTML5 spec, in that it allows U+000D and does not allow U+000C. It also allows U+007F - U+009F, which are banned control characters.
The Sanitizer should be brought into compliance with the HTML5 spec.