Scribunto - mw.ustring.lower and mw.ustring.upper now automatically convert text to NFC, causing Module:grc-translit to fail on English Wiktionary
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Feeding decomposed text into the Scribunto functions mw.ustring.upper or mw.ustring.lower now returns composed (i.e. form NFC) output. For instance, decomposed a + combining acute (U+0061 U+0301) passed to mw.ustring.upper now returns the precomposed character Á (U+00C1) instead of A + combining acute (U+0041 U+0301).
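The difference can be reproduced outside MediaWiki. The following is only a rough sketch in Python (not Scribunto code), assuming the standard unicodedata module as a stand-in for the normalization step; the variable names are illustrative:

```python
import unicodedata

# 'a' followed by U+0301 COMBINING ACUTE ACCENT (decomposed input)
decomposed = "a\u0301"

# Plain case conversion: the combining mark is left as a separate codepoint.
plain_upper = decomposed.upper()

# Case conversion followed by NFC, which is the behaviour described in this report.
nfc_upper = unicodedata.normalize("NFC", plain_upper)

print([hex(ord(c)) for c in plain_upper])  # ['0x41', '0x301']
print([hex(ord(c)) for c in nfc_upper])    # ['0xc1']
```

A module that later looks for U+0301 as a separate codepoint finds it in the first result but not in the second.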

What happens?:

What should have happened instead?:

  • Module:grc-translit should return properly transliterated text.

Event Timeline

Theknightwho triaged this task as Unbreak Now! priority. Aug 27 2025, 6:19 PM

Wikitext content is supposed to always be in NFC form. Can you describe further what you are trying to do? It sounds to me like the Greek Transliteration module is perhaps broken? What else is relying on non-normalized text?

This comment was removed by cscott.
cscott renamed this task from Scribunto - mw.ustring.lower and mw.ustring.upper now automatically convert text to NFC, causing major breakage on English Wiktionary to Scribunto - mw.ustring.lower and mw.ustring.upper now automatically convert text to NFC, causing Module:grc-translit to fail on English Wiktionary. Aug 27 2025, 8:47 PM
cscott lowered the priority of this task from Unbreak Now! to Needs Triage.
cscott updated the task description.
cscott updated the task description.

It looks like the Scribunto library methods causing the breakage call lang:lc and lang:uc, so the relevant code is on lines 2952 and 2987 of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1171712/5/includes/language/Language.php.

Wikitext content is supposed to always be in NFC form.

That doesn't mean an internal string processing function that is acting entirely within Scribunto before even returning text to be rendered as wikitext should automatically normalize everything to NFC. Scribunto has separate functions specifically for that.

To be honest, I can't think of any language in which string processing functions do auto-normalization. Why should this be an exception?

Well, T400057: MW's Title uppercase first letter code can lead to non-NFC titles is an example of the problems caused by non-NFC wikitext. Scribunto is specifically designed for manipulating wikitext. Let's focus on how to fix Module:grc-translit. It appears that it was simply missing some table entries for dealing with the composed forms of certain Greek characters.

It is not only Module:grc-translit, but also any module that manipulates non-NFC text, such as Module:it-pronunciation, Module:amf-nominal, and many others. They are all broken right now.

Scribunto is not necessarily specifically designed for manipulating wikitext, but for generating it. There are many reasons why internal string processing would want to rely on non-normalized Unicode text.

A good example is combining diacritics. NFC forces diacritics to merge into precomposed characters, which makes tasks such as removing diacritics rather difficult. Without auto-normalization to NFC, you could normalize the text to NFD and remove all codepoints within certain ranges.
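As a rough illustration of that approach, here is a minimal sketch in Python rather than Lua. The function name strip_diacritics is hypothetical, and the range U+0300 to U+036F (the Combining Diacritical Marks block) is an assumption chosen for the example; a real module would pick its own ranges:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop codepoints in U+0300..U+036F."""
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every codepoint that is not a combining diacritical mark.
    kept = [c for c in decomposed if not (0x0300 <= ord(c) <= 0x036F)]
    return "".join(kept)

# U+1F04 (alpha with psili and oxia) decomposes to alpha + U+0313 + U+0301.
print(strip_diacritics("\u1f04") == "\u03b1")
# U+00E0 (a with grave) decomposes to 'a' + U+0300.
print(strip_diacritics("citt\u00e0"))
```

This only works as intended if nothing recomposes the text between the NFD step and the filtering step.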

Things like this become much harder with auto-normalization. It is fine to auto-normalize before displaying Scribunto outputs as wikitext. Doing so internally for text that stays entirely within Scribunto is not a good idea. Again, Scribunto has functions for normalizing text when that is something that one wants to do.

Note also that this is by all definitions a breaking change, which is why this should be a rather high priority to fix.

Well, Scribunto already provides mw.ustring.toNFD, so for cases where non-NFC text is needed, it is available. Language::uc() and friends are primarily meant for manipulating wikitext and titles, and NFC output in those cases seems appropriate.

(And I should note that the prior behavior of the PHP code was 'not even NFD form' -- it was taking the first *codepoint* and converting it to uppercase, which isn't guaranteed to be in any sort of normalized form, which is why I don't think reversion to prior behavior is appropriate here. We need the output to be in a specific documented form.)

Wikitext content is supposed to always be in NFC form. Can you describe further what you are trying to do? It sounds to me like the Greek Transliteration module is perhaps broken? What else is relying on non-normalized text?

This isn't related to wikitext. The issue is that these modules do case conversions part-way through their internal processing, which causes breakage if the module has converted the text to NFD for internal processing reasons (as the Ancient Greek module does). This is very normal for language processing modules that handle transliteration and pronunciation, because it greatly simplifies processing.

This is a massive, breaking change affecting well over 100,000 pages.

Language::uc() and friends are primarily meant for manipulating wikitext and titles, and NFC output in those cases seems appropriate.

Scribunto realistically shouldn't be using these functions in the first place. But it is, and existing modules (and probably not only on the English Wiktionary) rely on them actually doing what is documented, i.e. converting text to lowercase or uppercase, not applying a normalization on top of it.

Well, Scribunto already provides mw.ustring.toNFD, so for cases where non-NFC text is needed, it is available. Language::uc() and friends are primarily meant for manipulating wikitext and titles, and NFC output in those cases seems appropriate.

Yes it does, but it is extremely inconvenient to force the use of mw.ustring.toNFD afterwards, and more importantly, sometimes the text fed into mw.ustring.lower or mw.ustring.upper may not be in NFC or NFD at that point during internal processing, for various different reasons (e.g. sometimes we want to decompose all characters but keep a select few, such as the letter Й in Russian, which is a separate letter in its own right). This change now makes it much harder to process text in that way, which is extremely annoying.
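To make the partial-decomposition case concrete, here is a small sketch in Python (not Scribunto code). The helper decompose_except and the KEEP_COMPOSED set are hypothetical names for illustration, and the per-character NFD is a simplification that ignores reordering of combining marks across characters:

```python
import unicodedata

# Letters that should stay precomposed: Cyrillic й (U+0439) and Й (U+0419).
KEEP_COMPOSED = {"\u0439", "\u0419"}

def decompose_except(text: str, keep=KEEP_COMPOSED) -> str:
    """Decompose every character to NFD except those listed in `keep`."""
    return "".join(
        c if c in keep else unicodedata.normalize("NFD", c) for c in text
    )

# й (kept whole) followed by ё (U+0451, decomposed to е + U+0308)
partial = decompose_except("\u0439\u0451")

# Case conversion alone preserves the partially decomposed shape.
upper_plain = partial.upper()

# Case conversion with NFC, then a later NFD "repair": Й is now split too.
upper_nfc = unicodedata.normalize("NFC", partial.upper())
repaired = unicodedata.normalize("NFD", upper_nfc)

print([hex(ord(c)) for c in upper_plain])  # ['0x419', '0x415', '0x308']
print([hex(ord(c)) for c in repaired])     # ['0x418', '0x306', '0x415', '0x308']
```

Calling an NFD function afterwards does not restore the original state, because the information about which letters were deliberately kept composed is gone.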

I'm sorry, it may be annoying for specialist cases, but it makes Scribunto Do The Right Thing for the vast majority of cases where wikitext or titles are being input or output.

The fix for Greek transliteration is already in place: https://en.wiktionary.org/w/index.php?title=Module%3Agrc-translit&diff=86455217&oldid=85707340

I'm sorry, but I do not agree that these are "specialist cases". Text processing is a rather common goal, and Scribunto has a variety of functions specifically to support this goal.

Auto-normalization is not "The Right Thing". As I stated earlier, I cannot think of any other language that automatically applies Unicode normalization when performing string manipulation functions that are not inherently tied to normalization.

I would like to remind you that this is a breaking change. The number of pages affected is in at least the six digits. In my opinion, discussing whether and how normalization should apply when there is a breaking change verges on bikeshedding. The main priority should be on reverting this change and then discussing how to best solve the underlying issue.

I was unable to find any breakage in {{it-pr}}, and Module:grc-translit has already been fixed. https://en.wiktionary.org/wiki/Template:amf-ndecl looks fine, too. If you can provide further examples of incorrect output I am happy to work on fixes.

@cscott would it be possible to have two functions, which can be used depending on whether one wishes for the normalization step?

@loaxxere We already do: mw.ustring.toNFD() will give you the decomposed form if that's what you need.

As I already proposed, the fix should be to revert the original commit and then discuss how to proceed forward. I'm frankly puzzled that it is considered acceptable to make a breaking change of this sort and not an immediate priority to revert it when it is apparent that it causes significant breakage. I have made such breaking changes during my career and had to rather quickly come to terms with the fact that I had to revert them.

@Surjection At this point you haven't provided any examples of current Wiktionary pages which are actually broken. I believe all the breakage has now been fixed.

No, it definitely has not been fixed. There are bound to be more modules that relied on the old behavior. This change in behavior is also entirely undocumented, and even if it were to be documented, it is still counter-intuitive to anyone who has worked on text processing in other languages.

I was unable to find any breakage in {{it-pr}}, and Module:grc-translit has already been fixed. If you can provide further examples of incorrect output I am happy to work on fixes.

It's not displaying errors - it's now displaying the wrong output across tens of thousands of pages, and the fix is not simple at all because this affects the deep inner workings of numerous core modules in unpredictable ways, as none of them were designed with this behaviour in mind and case conversions are common.

The fix in PHP is as simple as adding a flag to the internal case conversion function to not normalise to NFC, and ensuring the Scribunto calls use that flag. Please implement.

Again, please list the pages with incorrect output so we can fix them. You're making pretty sweeping claims about broken pages.

@cscott I am one of the major developers of Wiktionary code, and I have been a software developer since the 1990s. Functions in libraries provide contracts, and you can't simply change a contract like this without expecting major breakage for all users of the code. Suddenly changing a contract because you think it's the "right thing to do" is not in fact the right thing to do; it shows a basic lack of understanding of how software development works. Please revert this change as soon as possible, thank you.

@loaxxere We already do: mw.ustring.toNFD() will give you the decomposed form if that's what you need.

As already pointed out, some modules rely on partial decomposition. Please just take responsibility for your breaking change and stop messing us around.

All I am asking for is a proper bug report here: what are the modules and pages that are affected?

Again, please list the pages with incorrect output so we can fix them. You're making pretty sweeping claims about broken pages.

I would say it has already been sufficiently demonstrated that the change causes significant breakage, and that by itself should be sufficient reason to revert it. Modifying a single module in a way that is not necessarily even correct (the text doesn't have to be NFD normalized either!) is hardly the correct approach here.

All I am asking for is a proper bug report here: what are the modules and pages that are affected?

The bug report is that there is a breaking change without any documentation or even forewarning that has caused breakage on numerous pages, and that by itself should be enough reason to revert the change and discuss later how to implement the fix.

Can you provide a single page that is currently broken?

All I am asking for is a proper bug report here: what are the modules and pages that are affected?

You've already been given some above, and jumping into modules that you don't understand because you're assuming NFD is a quick-fix is extremely irresponsible. Please revert your breaking change instead of forcing the rest of us to work around your half-baked solution.

All of the pages that use {{it-pr}}, for example, are generating incorrect output.

Can you provide a single page that is currently broken?

Literally every mainspace page in this list: https://en.wiktionary.org/wiki/Special:WhatLinksHere/Module:it-pronunciation

For example, [[facci]] displays /fàtt͡ʃi/ instead of /'fat.t͡ʃi/.

Pages using the Italian pronunciation module are also broken, e.g. https://en.wiktionary.org/wiki/Facci#Italian. Fixing all the individual modules is not the solution here, reverting the breaking change is.

There is already a somewhat broad feeling of disillusionment with MediaWiki developers in Wiktionary over their seeming lack of regard for the project. That a developer can make a breaking change like this and refuse to take responsibility, instead arguing that it is the right thing to do and that Wiktionary users have always been in the wrong for doing things the same way as they are done in other languages and environments, will certainly not help with this disillusionment.

You have to understand that mw.ustring.lower() and mw.ustring.upper() are used all over the place and none of the existing code expects the output to randomly get passed through NFC.

For example, [[facci]] displays /fàtt͡ʃi/ instead of /'fat.t͡ʃi/.

Thank you. On https://en.wiktionary.org/wiki/facci presumably?

I can guarantee you there is more breakage; we have not had a chance to look into all the places that use mw.ustring.lower() but I guarantee you it's quite large.

@cscott All instances of {{it-pr}} are displaying the wrong pronunciation. As a random example (you could pick any Italian entry with a pronunciation), https://en.wiktionary.org/wiki/faccela#Italian shows /fàtt͡ʃe.la/ while it should be showing something like /'fat.t͡ʃe.la/; this change in behaviour is preventing the module from further processing the string and getting the final pronunciation. Italian rhyme categories added based on {{it-pr}} pronunciation are also all broken now because the pronunciation is wrong: that word is now being added to [[Category:Rhymes:Italian/ela/2_syllables]] while it should be in [[Category:Rhymes:Italian/attʃela/3_syllables]].

Additionally there are two pages that use {{it-pr}} that are actually currently erroring:

But even if those are the only pages with an actual error, pretty much all pages that use {{it-pr}} are broken.

@cscott Yes, and many others. I don't understand though why you need to know this instead of just reverting.

For example, [[facci]] displays /fàtt͡ʃi/ instead of /'fat.t͡ʃi/.

Thank you. On https://en.wiktionary.org/wiki/facci presumably?

This also affects anything involving sorting, which involves various case conversions in Module:languages, which affects around 9 million pages. The solution is non-trivial, as it works with a special non-NFC normalisation that's designed to aid sorting.

Change #1182661 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182661

Change #1182661 merged by jenkins-bot:

[mediawiki/core@master] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182661

Scribunto has its own implementation of ustring.upper/ustring.lower, which looks like it wouldn't be affected, but "When the Ustring library is loaded, the mw.ustring.upper() function is implemented as a call to mw.language.getContentLanguage():uc( s )." (https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.language:uc), and that call is what was causing the issue. This override dates back over a decade, to 0a8757baba98f624f5f5081f64ac23e809689980.

I propose to decouple ustring.upper/lower from mw.language.uc/lc, so that the latter will eventually be NFC normalized again while the former will retain their existing behavior. The az, kaa and tr languages override mw.language.uc/lc with their own customizations (mostly dealing with dotless-i), which I expect your text processing applications won't want to deal with.
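A very rough sketch of what the decoupled behaviour could look like, written in Python purely for illustration. Both function names are hypothetical stand-ins (not MediaWiki or Scribunto APIs), and the dotted/dotless-i table is an assumed, simplified subset of what the az, kaa and tr customizations do:

```python
import unicodedata

# Illustrative subset of Turkish-style uppercase rules: i -> İ, ı -> I.
DOTTED_I_UPPER = {"i": "\u0130", "\u0131": "I"}

def language_uc(text: str, lang_code: str) -> str:
    """Hypothetical language-aware uppercase: locale rules, then NFC."""
    if lang_code in {"az", "kaa", "tr"}:
        text = "".join(DOTTED_I_UPPER.get(c, c) for c in text)
    return unicodedata.normalize("NFC", text.upper())

def ustring_upper(text: str) -> str:
    """Hypothetical decoupled string uppercase: case mapping only."""
    return text.upper()

print(language_uc("iki", "tr"))                           # İKİ
print(ustring_upper("iki"))                               # IKI
print([hex(ord(c)) for c in ustring_upper("a\u0301")])    # ['0x41', '0x301']
print([hex(ord(c)) for c in language_uc("a\u0301", "en")])  # ['0xc1']
```

With this split, text-processing modules would get neither the language-specific mappings nor the normalization, while wikitext- and title-oriented callers would get both.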

In principle, I agree with the decoupling. I have always found it somewhat strange that mw.ustring.upper and mw.ustring.lower are overwritten when the language module is loaded (which authors of module code have practically no control over). However, it might take some thought to figure out how to do this in a non-breaking way, or at least with appropriate warning for downstream users to update their code.

Change #1182664 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@wmf/1.45.0-wmf.16] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182664

ssastry triaged this task as Unbreak Now! priority.

Change #1182664 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.16] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182664

Mentioned in SAL (#wikimedia-operations) [2025-08-27T22:56:56Z] <arlolra@deploy1003> Started scap sync-world: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-27T23:01:19Z] <arlolra@deploy1003> arlolra, cscott: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-27T23:08:36Z] <arlolra@deploy1003> Finished scap sync-world: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] (duration: 11m 40s)

The revert is now deployed.

Additionally there are two pages that use {{it-pr}} that are actually currently erroring:

I purged these pages to verify that the errors are gone.

Thanks all for your patience as we worked through the issues. Now, there is enough information here for us to figure out how to fix the original bug without breaking Wiktionary.