
Fallback.php doesn't detect the character boundaries for multi-codepoint characters
Open, Low, Public

Description

The following call does not return the combining diacritic mark that is part of the first letter.

print Fallback::mb_substr("i̇zmir", 0, 1);

The first letter 'i' carries the diacritic mark; its full UTF-8 byte sequence is 69 cc 87:

U+0069 LATIN SMALL LETTER I character
U+0307 COMBINING DOT ABOVE character (̇)

The Fallback::mb_substr call above returns only U+0069, without the diacritic mark. Note that the capitalized version of 'i̇' is a single code point (İ, U+0130, in precomposed form).

Since this function is used to capitalize the first letter of a wiki title when PHP doesn't have mbstring support, this string will be incorrectly capitalized when supplied as a wiki title.
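The difference between the first code point and the first user-perceived character can be sketched as follows. This is only an illustration, assuming PHP 7+ with the mbstring and intl extensions loaded; the helper names first_code_point and first_grapheme are hypothetical and are not part of Fallback.php.

```php
<?php
// Hypothetical helpers for illustration only (not part of Fallback.php).
// Assumes PHP 7+ with the mbstring and intl extensions.

// Returns the first code point, which is what a code-point-based substr yields.
function first_code_point(string $s): string {
    return mb_substr($s, 0, 1, 'UTF-8');
}

// Returns the first grapheme cluster: the base letter plus its combining marks.
function first_grapheme(string $s): string {
    return grapheme_substr($s, 0, 1);
}

$title = "i\u{0307}zmir"; // U+0069 U+0307 followed by "zmir"

echo bin2hex(first_code_point($title)), "\n"; // 69
echo bin2hex(first_grapheme($title)), "\n";   // 69cc87
```

Whether title capitalization should operate on grapheme clusters at all is a separate design question; the sketch only shows where the two notions of "first character" diverge.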

Event Timeline

Yuri271 raised the priority of this task to Needs Triage.
Yuri271 updated the task description.
Yuri271 changed Security from none to None.
Yuri271 subscribed.

Which codebase is this about? / In which module is the file Fallback.php included?

I get the same result with or without the mbstring and intl extensions enabled, so I don't think this particular problem is in the Fallback class. Rather, Title::capitalize() uses $wgContLang->ucfirst(), which extracts the first code point and uppercases it without applying Unicode normalization afterward.
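A rough sketch of the behavior described above, assuming PHP 7+ with mbstring and intl: uppercasing only the first code point leaves "I" followed by U+0307, which differs byte for byte from the precomposed "İ" (U+0130) unless the result is normalized to NFC afterward. The functions naive_ucfirst and normalized_ucfirst are hypothetical stand-ins and are not the actual $wgContLang->ucfirst() implementation.

```php
<?php
// Hypothetical stand-ins, not MediaWiki's actual ucfirst() code.
// Assumes PHP 7+ with the mbstring and intl extensions.

// Uppercases the first code point and appends the rest unchanged.
function naive_ucfirst(string $s): string {
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

// Same as above, but normalizes the result to NFC afterward.
function normalized_ucfirst(string $s): string {
    $result = Normalizer::normalize(naive_ucfirst($s), Normalizer::FORM_C);
    return $result === false ? naive_ucfirst($s) : $result;
}

$link  = "i\u{0307}zmir"; // link text: i + COMBINING DOT ABOVE
$title = "\u{0130}zmir";  // page title with precomposed U+0130

echo bin2hex(naive_ucfirst($link)), "\n";      // 49cc877a6d6972
echo bin2hex(normalized_ucfirst($link)), "\n"; // c4b07a6d6972
var_dump(naive_ucfirst($link) === $title);      // bool(false)
var_dump(normalized_ucfirst($link) === $title); // bool(true)
```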

To reproduce, create a page named "İzmir" with the wikitext "[[i̇zmir]]", though not in a completely case-sensitive namespace.

  • Expected result: "i̇zmir" appears in bold because it is the text of a self-link.
  • Actual result (when $wgLanguageCode is not 'tr', 'az', etc.): The link is a red link. When you click it, WebRequest applies Unicode normalization to the provided title string, so you do end up on the same page.

I couldn't verify whether this works with mbstring; I just assumed it did and only tested Fallback.php by running it.
So you are now confirming the same problem with mbstring.

Behavior actually shouldn't depend on $wgLanguageCode because Unicode transformations don't depend on the language.

For the record, this isn't a simple problem, and even libicu doesn't make it easy to title-case only the first letter.

> Behavior actually shouldn't depend on $wgLanguageCode because Unicode transformations don't depend on the language.

Some do have to be tailored for some languages. In Turkish and some others, this includes casing. See page 152 and pages 235–236 of the Unicode Standard, Version 7.0 – Core Specification.
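As a minimal illustration of language-tailored casing, assuming PHP 7+ with mbstring: in Turkish and Azerbaijani the uppercase of "i" is "İ" (U+0130), whereas the untailored mapping gives "I". The function ucfirst_tailored below is a hypothetical sketch that covers only this one case; it is not how MediaWiki's language classes are implemented and it ignores the other tailorings the standard describes.

```php
<?php
// Hypothetical sketch of one language-specific casing rule.
// Assumes PHP 7+ with the mbstring extension.

function ucfirst_tailored(string $s, string $lang): string {
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');

    // Turkish and Azerbaijani: dotted small i uppercases to dotted capital I.
    if (in_array($lang, ['tr', 'az'], true) && $first === 'i') {
        return "\u{0130}" . $rest;
    }

    // Untailored mapping for every other language.
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

echo ucfirst_tailored("izmir", "tr"), "\n"; // İzmir
echo ucfirst_tailored("izmir", "en"), "\n"; // Izmir
```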

> For the record, this isn't a simple problem, and even libicu doesn't make it easy to title-case only the first letter.

Failing to normalize the result can be a problem even for ordinary toUppercase()/toLowercase() transformations (see pages 239–240).

More generally, UAX #15, section 1.4 states, "In using normalization functions, it is important to realize that none of the Normalization Forms are closed under string concatenation. That is, even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized."
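The concatenation point applies directly to the string in this task. In the sketch below (assuming PHP 7+ with the intl extension), "I" and the remainder starting with U+0307 are each unchanged by NFC normalization, but their concatenation is not. The helper is_nfc is a hypothetical convenience wrapper around Normalizer::normalize.

```php
<?php
// Hypothetical helper for illustration. Assumes PHP 7+ with the intl extension.

// True if NFC normalization leaves the string unchanged.
function is_nfc(string $s): bool {
    return Normalizer::normalize($s, Normalizer::FORM_C) === $s;
}

$x = "I";               // uppercased first code point
$y = "\u{0307}zmir";    // rest of the string, starting with the combining mark

var_dump(is_nfc($x));      // bool(true)
var_dump(is_nfc($y));      // bool(true)
var_dump(is_nfc($x . $y)); // bool(false): NFC composes I + U+0307 into U+0130
```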