
Incorrectly truncated multibyte UTF-8 char
Closed, Resolved, Public


Proposed change to the preg_match / preg_replace calls in checkTitleEncoding

Problem: some links in the Russian-language interface are very long. For example, a category page link like

looks like

The "from" parameter is often truncated in the middle of a multibyte character.

The getGPCVal function in WebRequest.php uses checkTitleEncoding.

The checkTitleEncoding function in Language.php uses

preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .

'[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );

to check whether the string is UTF-8.

But the leftover bytes of an incorrectly truncated multibyte UTF-8 character at the end of the string do not match this regexp.

So checkTitleEncoding wrongly concludes that the truncated string is not UTF-8 and applies the fallback8bitEncoding conversion to it.
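
The mismatch can be reproduced in isolation. The following is a minimal sketch, not the actual MediaWiki code: it wraps the pattern quoted above in a hypothetical helper named looksLikeUtf8 and feeds it a two-letter Cyrillic string, once complete and once cut byte-wise in the middle of the second letter. The sample bytes are an assumption chosen for illustration.

```php
<?php
// Hypothetical helper: only the pattern is taken from the report,
// the function name and return convention are illustrative.
function looksLikeUtf8( $s ) {
    return (bool)preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
        '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );
}

// "По" in UTF-8: two Cyrillic letters, two bytes each.
$full = "\xD0\x9F\xD0\xBE";
// Byte-wise truncation keeps the lead byte of the second letter
// but drops its continuation byte.
$cut = substr( $full, 0, 3 );

var_dump( looksLikeUtf8( $full ) ); // bool(true)
var_dump( looksLikeUtf8( $cut ) );  // bool(false)
```

The truncated string fails the check even though every complete character in it is valid UTF-8, which is the situation described above.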

As a result, the "next 200 pages" link on the following category page of Russian Wikisource works incorrectly: Категория:Поэзия_Максимилиана_Александровича_Волошина

Some articles in the category are visible neither on the first nor on the second category page.

I suggest changing the regular expression to allow for possible fragments of a multibyte UTF-8 character at the end of the string.
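
One way such a relaxed expression might look is sketched below. This is an assumption about the shape of the fix, not a patch that was actually applied: the helper name looksLikeUtf8AllowingCutTail and the split into $complete and $cutTail are hypothetical. It keeps the alternatives of the existing pattern and additionally accepts one incomplete sequence (a lead byte with too few continuation bytes) at the very end of the string.

```php
<?php
// Hypothetical relaxed check, for illustration only.
function looksLikeUtf8AllowingCutTail( $s ) {
    // Same alternatives as the existing pattern: complete 1- to 4-byte sequences.
    $complete = '[\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
        '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}';
    // A lead byte followed by fewer continuation bytes than it announces.
    $cutTail = '[\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]?|[\xf0-\xf7][\x80-\xbf]{0,2}';
    // The empty string is rejected, as it is by the existing "+" pattern.
    return $s !== '' &&
        (bool)preg_match( "/^(?:$complete)*(?:$cutTail)?\$/", $s );
}

var_dump( looksLikeUtf8AllowingCutTail( "\xD0\x9F\xD0\xBE" ) );  // bool(true), complete
var_dump( looksLikeUtf8AllowingCutTail( "\xD0\x9F\xD0" ) );      // bool(true), cut at the end
var_dump( looksLikeUtf8AllowingCutTail( "\xD0\x9F\xD0abc" ) );   // bool(false), broken in the middle
var_dump( looksLikeUtf8AllowingCutTail( "\xFF" ) );              // bool(false), never valid in UTF-8
```

The trade-off is that the check becomes more permissive: a legacy 8-bit string whose last byte happens to fall in the range \xc0-\xf7 would also pass, so a real fix would need to weigh that against the truncation case.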

Version: 1.12.x
Severity: minor




Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:57 PM
bzimport set Reference to bz12444.
bzimport added a subscriber: Unknown Object (MLST).

Why is the "from" parameter truncated? Is there some kind of limit? Wouldn't the link be broken anyway, even if the encoding were detected correctly?

Cannot reproduce anymore with the example category.