
mw.ustring.upper and mw.ustring.lower fail to transform certain characters
Closed, Duplicate · Public

Description

On English Wiktionary, the Scribunto functions mw.ustring.upper and mw.ustring.lower sometimes fail to transform certain code points that have a case mapping. Headword-line templates add categories for terms spelled with unusual characters through Module:headword. Such a category should show the character in uppercase when the uppercase version is not one of the standardChars for the language (see the language data modules for examples). Because of the bug, some of these categories have been alternating between uppercase and lowercase. I've just observed this for ꝑ: the category Category:Latin terms spelled with ꝑ would be Category:Latin terms spelled with Ꝑ if mw.ustring.upper("ꝑ") returned "Ꝑ" as it should. The same problem has been reported for the character ͷ in the discussion at Wiktionary:Grease pit/2019/July § ͷοῖκυ.

I'm guessing that this originates in PHP because in the same discussion page category headers have been reported as sometimes displaying the lowercase letters ꜣ, ꜥ instead of the uppercase Ꜣ, Ꜥ. That bug can be seen right now by paging to Ꜣ in Category:Egyptian lemmas, where there are headers for uppercase Ꜣ, lowercase ꜣ, uppercase Ꜥ, and lowercase ꜥ.

Here is a function that will output wikitext if the bug is present:

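-- Returns wikitext listing every case conversion that gives the wrong
-- result, or an empty string if all conversions succeed.
function test()
    local output = {}
    -- Record what mw.ustring[func](letter) actually returned, as a list item.
    local function show_casing(letter, func)
        table.insert(output, '* mw.ustring.' .. func .. '("' .. letter .. '") → "' .. mw.ustring[func](letter) .. '"')
    end
    -- Check one lowercase/uppercase pair in both directions.
    local function assert_casing(lower, upper)
        if mw.ustring.upper(lower) ~= upper then
            show_casing(lower, "upper")
        end
        if mw.ustring.lower(upper) ~= lower then
            show_casing(upper, "lower")
        end
    end

    -- Control cases that are expected to pass.
    assert_casing("a", "A")
    assert_casing("ç", "Ç")
    -- Characters reported as failing.
    assert_casing("ꝑ", "Ꝑ")
    assert_casing("ͷ", "Ͷ")
    assert_casing("ꜣ", "Ꜣ")
    assert_casing("ꜥ", "Ꜥ")

    return table.concat(output, "\n")
end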

Event Timeline

Erutuon created this task. Jul 27 2019, 3:42 PM

Is it a JavaScript error or a MediaWiki PHP/SQL internal error?


The two bug locations that I've reported here are 1. in Scribunto (the functions mw.ustring.upper and mw.ustring.lower) and 2. in whatever generates the headers in category pages. At least the Scribunto bug involves PHP because mw.ustring.upper and mw.ustring.lower seem to be implemented using the PHP functions mb_strtoupper and mb_strtolower, and categories probably involve PHP as well. If the title and tags need edits, I would appreciate some help as I don't have much energy right now and it is possible I am not doing Phabricator right.

Anomie added a subscriber: Anomie.

You're correct that the cause is that Scribunto's upper- and lowercasing is implemented in terms of PHP's mb_strtoupper and mb_strtolower. That's not likely to change.

However, we are in the process of upgrading to PHP7, which uses a newer version of Unicode that may include case mappings for the characters you're concerned about. See T176370: Migrate to PHP 7 in WMF production for progress on that. At the moment you may find that mw.ustring.upper("ꝑ") returns either "Ꝑ" or "ꝑ" depending on whether your request happens to be served with PHP7 or HHVM.
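
While the result depends on which backend serves the request, a module could patch the affected characters itself. The following is only a minimal sketch and not something proposed in this task: the tables fallback_upper and fallback_lower and the functions upper and lower are hypothetical names, and the table covers only the four pairs reported here. Outside Scribunto the sketch falls back to Lua's byte-oriented string library so that it can be tried in a plain interpreter.

```lua
-- Hypothetical fallback table (lowercase -> uppercase) for the characters
-- named in this task. It is not part of Scribunto.
local fallback_upper = {
    ["ꝑ"] = "Ꝑ",
    ["ͷ"] = "Ͷ",
    ["ꜣ"] = "Ꜣ",
    ["ꜥ"] = "Ꜥ",
}

-- Build the reverse table (uppercase -> lowercase).
local fallback_lower = {}
for lower_char, upper_char in pairs(fallback_upper) do
    fallback_lower[upper_char] = lower_char
end

-- Matches one UTF-8 encoded character: a lead byte followed by any
-- continuation bytes. NUL bytes are not matched and are left untouched.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Use mw.ustring inside Scribunto; elsewhere use the plain string library,
-- which only changes the case of ASCII letters.
local ustring = (mw and mw.ustring) or string

-- Let the backend convert what it can, then patch whatever it left alone.
-- gsub keeps the original character when the table has no entry for it.
local function upper(text)
    return (ustring.upper(text):gsub(UTF8_CHAR, fallback_upper))
end

local function lower(text)
    return (ustring.lower(text):gsub(UTF8_CHAR, fallback_lower))
end
```

With this sketch, upper("ꝑ") gives "Ꝑ" whether or not the backend knows the mapping, because the fallback table only acts on characters that are still in the wrong case after the backend has run.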

Erutuon added a comment. Edited Aug 15 2019, 7:47 PM

However, we are in the process of upgrading to PHP7, which uses a newer version of Unicode that may include case mappings for the characters you're concerned about.

If that's the issue, the version of PHP that is causing these errors must be using a very old version of the Unicode Character Database. All the characters concerned – ꝑ, Ꝑ, ͷ, Ͷ, ꜣ, Ꜣ, ꜥ, Ꜥ – were added in Unicode 5.1 in 2008.

Edit: I see from this comment that HHVM used Unicode 3.2 in February 2018 at least. That would explain it.
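
For anyone who wants to look these characters up in the Unicode charts, their code points can be printed with a short Lua sketch. The helper names codepoint and report are hypothetical, and the decoder assumes well-formed UTF-8 input. Inside Scribunto, mw.ustring.codepoint should do the same job as the hand-written decoder.

```lua
-- Decode the first UTF-8 encoded character of `char` into its code point.
-- Assumes well-formed UTF-8; no validation is done.
local function codepoint(char)
    local b1, b2, b3, b4 = char:byte(1, 4)
    if b1 < 0x80 then
        return b1
    elseif b1 < 0xE0 then
        return (b1 - 0xC0) * 0x40 + (b2 - 0x80)
    elseif b1 < 0xF0 then
        return (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
    else
        return (b1 - 0xF0) * 0x40000 + (b2 - 0x80) * 0x1000
            + (b3 - 0x80) * 0x40 + (b4 - 0x80)
    end
end

-- Return one "char = U+XXXX" line for each character discussed in this task.
local function report()
    local lines = {}
    for _, char in ipairs({ "ꝑ", "Ꝑ", "ͷ", "Ͷ", "ꜣ", "Ꜣ", "ꜥ", "Ꜥ" }) do
        table.insert(lines, string.format("%s = U+%04X", char, codepoint(char)))
    end
    return table.concat(lines, "\n")
end
```

The first line of the report is "ꝑ = U+A751", which places the character in the Latin Extended-D block.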

Anomie added a comment. Edited Aug 15 2019, 7:59 PM

That is indeed the case. The data file was added in September 2002, and wasn't updated again until October 2010 (in time for PHP 5.4). HHVM never picked up the latter patch, so it is still using the 2002 version.

(edits: copy-pasted the wrong revision, then much confusion)