
Some letters (initial characters of page titles) not being correctly capitalised
Closed, Declined (Public)

Description

See this example provided by Gorobay@enwiki: http://3v4l.org/WkbBn

The capitalisation data for the character ɱ is missing under HHVM.
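As a point of comparison, here is a minimal sketch of what the expected behaviour looks like in a runtime with current Unicode data. Python is used purely for illustration (it is not what MediaWiki runs), and `upper_first` is a hypothetical helper name, not a MediaWiki function.

```python
def upper_first(title: str) -> str:
    """Capitalise only the first character of a title, leaving the rest alone."""
    # str.upper() uses the Unicode data bundled with the Python interpreter.
    return title[:1].upper() + title[1:]


if __name__ == "__main__":
    small = "\u0271"  # ɱ, LATIN SMALL LETTER M WITH HOOK
    result = upper_first(small)
    # Print code points rather than glyphs so the result is unambiguous.
    print(f"U+{ord(small):04X} -> U+{ord(result):04X}")
```

With up-to-date casing data, ɱ (U+0271) maps to Ɱ (U+2C6E); a runtime with a stale table returns the input unchanged.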

For more info, see the discussion: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Automatic_capitalization_of_title-initial_Unicode_characters

Event Timeline

TTO set the priority of this task to Needs Triage.
TTO updated the task description.
TTO added projects: WMF-General-or-Unknown, HHVM.

From https://php.net/ChangeLog-5.php#5.3.4:

Mbstring extension: [...] Fixed bug #52981 (Unicode casing table was out-of-date. Updated with UnicodeData-6.0.0d7.txt and included the source of the generator program with the distribution) (Gustavo).

It's likely that HHVM did not get this fix.

Note that there are 47 characters which the most recent versions of PHP and HHVM do not handle.
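One way to enumerate such gaps is to compare an implementation's output against a reference for every code point. The sketch below is only an illustration: `find_case_gaps` and `stale_upper` are hypothetical names, the "stale" function merely simulates a table missing one mapping, and Python's `str.upper()` stands in as the reference. Python applies full case mappings (for example ß becomes SS), so its results will not necessarily match a list produced against another reference.

```python
import sys


def find_case_gaps(candidate_upper):
    """Return every character where candidate_upper disagrees with the reference."""
    gaps = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters in their own right
        ch = chr(cp)
        if candidate_upper(ch) != ch.upper():
            gaps.append(ch)
    return gaps


def stale_upper(ch: str) -> str:
    # Hypothetical stand-in for an out-of-date casing table:
    # it knows every mapping except the one for U+0271.
    return ch if ch == "\u0271" else ch.upper()


if __name__ == "__main__":
    for ch in find_case_gaps(stale_upper):
        print(f"U+{ord(ch):04X}")
```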

MZMcBride added subscribers: tstarling, ori.
MZMcBride subscribed.

We ran into this issue when working on title normalization & redirects in JS. In contrast to mbstring, the JS .toUpperCase() function handles these characters correctly. As a consequence, redirects between differently-cased versions of a title become self-redirects (e.g. https://fr.wikipedia.org/wiki/%EA%9E%80?oldid=125284517).
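To make the mismatch concrete, here is a rough sketch of two normalizers disagreeing about the same pair of titles. Everything in it is an assumption for illustration: `STALE_GAPS` simulates a server-side table missing one mapping, and the function names are hypothetical rather than taken from MediaWiki or the JS code in question.

```python
# Hypothetical set of characters whose uppercase mapping the server's table lacks.
STALE_GAPS = {"\u0271"}  # ɱ


def server_normalize(title: str) -> str:
    """Capitalise the first character using the simulated stale table."""
    first = title[:1]
    if first in STALE_GAPS:
        return title  # the stale table leaves this character as-is
    return first.upper() + title[1:]


def client_normalize(title: str) -> str:
    """Capitalise the first character using current Unicode data."""
    return title[:1].upper() + title[1:]


def looks_like_self_redirect(source: str, target: str) -> bool:
    """True if the client cannot tell the redirect's source and target apart."""
    return client_normalize(source) == client_normalize(target)


if __name__ == "__main__":
    source, target = "\u0271", "\u2C6E"  # lowercase page redirecting to uppercase page
    print(server_normalize(source) != server_normalize(target))  # server: two pages
    print(looks_like_self_redirect(source, target))              # client: one page
```

The server stores two distinct pages, while the client collapses both titles to the same string and so sees a redirect pointing at itself.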

When fixing the mbstring issue, we'll need to keep in mind that lowercase articles permitted through this bug will become inaccessible. We might need to rename those articles, and move existing redirects out of the way.
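A pre-fix audit could look something like the sketch below. It assumes the list of existing titles is available as plain strings; `titles_affected_by_fix` is a hypothetical helper, not an existing maintenance script, and Python's casing data stands in for whatever the fixed runtime would use.

```python
def titles_affected_by_fix(titles):
    """List titles whose first character would be capitalised after the fix.

    Each entry is (current_title, title_after_fix, collides), where collides
    is True if a page already exists at the post-fix title (for example a
    redirect that would have to be moved out of the way).
    """
    existing = set(titles)
    report = []
    for title in titles:
        fixed = title[:1].upper() + title[1:]
        if fixed != title:
            report.append((title, fixed, fixed in existing))
    return report


if __name__ == "__main__":
    sample = ["\u0271", "\u2C6E", "Main Page"]  # illustrative data only
    for current, fixed, collides in titles_affected_by_fix(sample):
        note = "target exists" if collides else "target free"
        print(f"U+{ord(current[0]):04X} -> U+{ord(fixed[0]):04X} ({note})")
```

Entries flagged as colliding are the ones where an existing page or redirect would need to be dealt with before the rename.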

Krinkle subscribed.

Declining per T192166.

Note that while this task was about PHP5-to-HHVM, a similar issue arose during HHVM-to-PHP72 as well. That issue was eventually tackled at T219279. If we had remembered this report, we could've known it earlier, but oh well.