
JS mw.Title does not strip Unicode bidi control characters from input, unlike PHP MediaWikiTitleCodec
Closed, Resolved (Public)

Description

JS mw.Title does not strip Unicode bidi control characters from input, unlike PHP MediaWikiTitleCodec. I'm also not convinced that it handles whitespace characters correctly, since it relies on `\s` rather than the explicit character list used in PHP.

MediaWikiTitleCodec::splitTitleString:

		# Strip Unicode bidi override characters.
		# Sometimes they slip into cut-n-pasted page titles, where the
		# override chars get included in list displays.
		$dbkey = preg_replace( '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', '', $dbkey );

		# Clean up whitespace
		# Note: use of the /u option on preg_replace here will cause
		# input with invalid UTF-8 sequences to be nullified out in PHP 5.2.x,
		# conveniently disabling them.
		$dbkey = preg_replace(
			'/[ _\xA0\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]+/u',
			'_',
			$dbkey
		);
		$dbkey = trim( $dbkey, '_' );
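For reference, the first PHP pattern works on raw UTF-8 bytes rather than code points. A minimal sketch (the function name `bidiCodePointsFromBytes` is made up for illustration, and a runtime with a global `TextDecoder` is assumed) that decodes each three-byte sequence matched by `\xE2\x80[\x8E\x8F\xAA-\xAE]` to show which characters are being stripped:

```javascript
// Hypothetical helper, for illustration only: decode the UTF-8 byte
// sequences matched by the PHP pattern '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S'
// into the code points they encode.
function bidiCodePointsFromBytes() {
	const decoder = new TextDecoder( 'utf-8' );

	// Third byte of each sequence: 0x8E, 0x8F, then the range 0xAA-0xAE.
	const thirdBytes = [ 0x8E, 0x8F ];
	for ( let b = 0xAA; b <= 0xAE; b++ ) {
		thirdBytes.push( b );
	}

	return thirdBytes.map( ( b ) => {
		const ch = decoder.decode( new Uint8Array( [ 0xE2, 0x80, b ] ) );
		return 'U+' + ch.codePointAt( 0 ).toString( 16 ).toUpperCase();
	} );
}

console.log( bidiCodePointsFromBytes().join( ' ' ) );
```

These are the left-to-right and right-to-left marks plus the bidi embedding and override controls, so a JS equivalent would need to match U+200E, U+200F and U+202A through U+202E.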

mediawiki.Title#parse:

		title = title
			// Normalise whitespace to underscores and remove duplicates
			.replace( /[ _\s]+/g, '_' )
			// Trim underscores
			.replace( rUnderscoreTrim, '' );
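One way the JS side could mirror the PHP behaviour is sketched below. This is an illustration under assumptions, not necessarily what the eventual patch does: `normalizeTitleText` is a hypothetical standalone helper, and `rUnderscoreTrim` is assumed to match leading and trailing underscores.

```javascript
// Hypothetical helper, not part of mw.Title's API.

// Same characters as PHP's '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', as code points.
const rBidi = /[\u200E\u200F\u202A-\u202E]/g;

// Same explicit list as the PHP whitespace pattern, instead of \s.
const rWhitespace = /[ _\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]+/g;

// Assumed definition: leading and trailing underscores.
const rUnderscoreTrim = /^_+|_+$/g;

function normalizeTitleText( title ) {
	return title
		// Strip Unicode bidi control characters first, as PHP does
		.replace( rBidi, '' )
		// Normalise whitespace to underscores and remove duplicates
		.replace( rWhitespace, '_' )
		// Trim underscores
		.replace( rUnderscoreTrim, '' );
}

console.log( normalizeTitleText( '\u200EFoo\u00A0 bar_\u202E' ) );
```

Unlike `\s`, the explicit list leaves tabs, newlines and U+FEFF alone, which matches what the PHP pattern above does with them.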

Event Timeline

Change 306493 had a related patch set uploaded (by Bartosz Dziewoński):
mw.Title: Correct handling of Unicode whitespace and bidi control characters

https://gerrit.wikimedia.org/r/306493

matmarex triaged this task as Medium priority.

Change 306493 merged by jenkins-bot:
mw.Title: Correct handling of Unicode whitespace and bidi control characters

https://gerrit.wikimedia.org/r/306493