JS mw.Title does not strip Unicode bidi control characters from input, unlike PHP MediaWikiTitleCodec
Closed, Resolved (Public)

Description

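JS mw.Title does not strip Unicode bidi control characters from input, unlike PHP MediaWikiTitleCodec. I'm also not convinced that it handles whitespace characters correctly: the PHP code collapses an explicit list of Unicode space characters, while the JS code relies on \s.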

MediaWikiTitleCodec::splitTitleString:

		# Strip Unicode bidi override characters.
		# Sometimes they slip into cut-n-pasted page titles, where the
		# override chars get included in list displays.
		$dbkey = preg_replace( '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', '', $dbkey );

		# Clean up whitespace
		# Note: use of the /u option on preg_replace here will cause
		# input with invalid UTF-8 sequences to be nullified out in PHP 5.2.x,
		# conveniently disabling them.
		$dbkey = preg_replace(
			'/[ _\xA0\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]+/u',
			'_',
			$dbkey
		);
		$dbkey = trim( $dbkey, '_' );
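
For anyone comparing the two implementations: the PHP bidi pattern matches raw UTF-8 bytes, so it is not obvious which code points it covers. The sketch below decodes each three-byte sequence the pattern can match. The function name `bidiCodePoints` is made up for illustration and is not part of MediaWiki; it assumes an environment where `TextDecoder` is available as a global (current browsers and Node.js).

```javascript
// Hypothetical helper: lists the code points covered by the PHP byte pattern
// \xE2\x80[\x8E\x8F\xAA-\xAE] by decoding each possible byte sequence as UTF-8.
function bidiCodePoints() {
	const decoder = new TextDecoder( 'utf-8' );
	// Third bytes allowed by the character class [\x8E\x8F\xAA-\xAE]
	const trailing = [ 0x8E, 0x8F, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE ];
	return trailing.map( ( b ) => {
		const ch = decoder.decode( new Uint8Array( [ 0xE2, 0x80, b ] ) );
		return 'U+' + ch.codePointAt( 0 ).toString( 16 ).toUpperCase();
	} );
}

console.log( bidiCodePoints().join( ' ' ) );
```

These are LRM, RLM, and the five embedding and override controls (LRE, RLE, PDF, LRO, RLO), so a JS equivalent would likely need the class `[\u200E\u200F\u202A-\u202E]`.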

mediawiki.Title#parse:

		title = title
			// Normalise whitespace to underscores and remove duplicates
			.replace( /[ _\s]+/g, '_' )
			// Trim underscores
			.replace( rUnderscoreTrim, '' );
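
A minimal sketch of what matching the PHP behaviour could look like on the JS side, assuming the goal is to mirror splitTitleString exactly. This is not the actual fix. `normalizeTitleText` is a hypothetical name, and the trim regex stands in for rUnderscoreTrim, whose definition is not shown above.

```javascript
// Hypothetical helper, not the real mw.Title code: mirrors the two PHP steps
// quoted from MediaWikiTitleCodec::splitTitleString.
function normalizeTitleText( title ) {
	return title
		// Strip LRM, RLM and the LRE/RLE/PDF/LRO/RLO controls
		// (same code points as the PHP byte pattern)
		.replace( /[\u200E\u200F\u202A-\u202E]/g, '' )
		// Collapse the same explicit whitespace set as the PHP code, instead of \s
		.replace( /[ _\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]+/g, '_' )
		// Trim leading and trailing underscores (assumed equivalent of rUnderscoreTrim)
		.replace( /^_+|_+$/g, '' );
}

console.log( normalizeTitleText( '\u200EFoo  bar\u200F' ) );
console.log( normalizeTitleText( '_ Foo\u00A0\u3000Bar _' ) );
```

Unlike \s, the explicit class leaves tabs, newlines and U+FEFF untouched, which is what the PHP whitespace step does. Whether those characters should be rejected elsewhere is a separate question.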