
Bengali digits are shown on Meetei Wikipedia and translatewiki.net
Open, Needs Triage | Public

Description

If you go to Special:RecentChanges on the Meetei (Manipuri) Wikipedia, you'll see all the numbers in timestamps and added/removed bytes in the Meetei Mayek script: ꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹. This works as expected.

If I go to Special:RecentChanges with Meetei (mni) UI, I see timestamps in the Meetei Mayek script (꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹), but bytes added/removed in the Bengali script (০১২৩৪৫৬৭৮৯). This is not as expected: Everything is supposed to be in Meetei Mayek.

Something similar also happens on Special:Contributions, in search results, and on the Twn Main Page. On the latter, in particular, the statistics in the large top boxes appear in Bengali numerals, and the percentages in the small boxes at the bottom appear in Meetei numerals.

Another thing that may be related: in the translatewiki.net sidebar, two lines appear in a mix of Meetei and Bengali script:

  • মৈতৈলোন্ ꯒꯤ ꯊꯣꯡꯒꯥꯜ
  • মৈতৈলোন্ ꯗꯥ ꯍꯟꯗꯣꯛꯄ

"মৈতৈলোন্" is "Meiteilon", the name of the Meetei language in the Bengali script. The rest of the line is in the Meetei script. Perhaps the language name is taken from CLDR, and CLDR gives data in the Bengali script by default? The Bengali script is indeed used for the Manipuri language, but all the content in translatewiki and the Meetei Wikipedia is in the Meetei script.

These regular expressions may be useful in debugging:

  • Bengali digits: [০-৯]
  • Meetei digits: [꯰-꯹]
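As a rough sketch of how these two patterns could be used while debugging, the snippet below reports which digit scripts occur in a piece of rendered text. The function name `digitScripts` and the sample string are made up for illustration; only `preg_match` with the `/u` (UTF-8) modifier is standard PHP.

```php
<?php
// Report which digit scripts appear in a string.
// The /u modifier makes PCRE treat pattern and subject as UTF-8,
// so the character ranges are over code points rather than bytes.
function digitScripts( string $text ): array {
	$found = [];
	if ( preg_match( '/[০-৯]/u', $text ) ) {
		$found[] = 'Bengali';
	}
	if ( preg_match( '/[꯰-꯹]/u', $text ) ) {
		$found[] = 'Meetei';
	}
	if ( preg_match( '/[0-9]/', $text ) ) {
		$found[] = 'ASCII';
	}
	return $found;
}

// Hypothetical sample mixing Bengali byte counts with Meetei timestamps.
print_r( digitScripts( '+১২৩ at ꯱꯲:꯳꯰' ) );
```

A result containing both 'Bengali' and 'Meetei' for a single page is the symptom described in this task.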

Tagging MediaWiki-extensions-CLDR because I suspect it's related. Please remove it if it's not related.

Event Timeline

Nikerabbit added a subscriber: cscott.

@cscott Can you advise on this?

In my investigation, the issue is that digitTransformTable is ignored on translatewiki.net, because NumberFormatter has already changed the Arabic digits to Bengali digits, so the strtr fails to convert them. WMF production may have older code, which doesn't (yet?) produce Bengali digits, so the digitTransformTable is correctly applied.

A quick, ugly fix could be to add a second set of overrides from Bengali digits, but that could break number un-formatting.

Translatewiki.net:

PHP 7.4.33 (cli) (built: Nov  8 2022 11:40:37) ( NTS )
> (new NumberFormatter( 'mni', NumberFormatter::PATTERN_DECIMAL ))->format( 1234567890 );
= "১২৩৪৫৬৭৮৯০"

WMF production:

PHP 7.4.33 (cli) (built: Nov 18 2022 12:43:20) ( NTS )
>>> (new NumberFormatter( 'mni', NumberFormatter::PATTERN_DECIMAL ))->format( 1234567890 );
=> "1234567890"

This is now happening in WMF production as well. Likely caused by T345561: Upgrade the MediaWiki servers to ICU 67.
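To compare two environments side by side, something like the following sketch might help. It assumes the `intl` extension is loaded and uses its `INTL_ICU_VERSION` and `INTL_ICU_DATA_VERSION` constants; the function name `describeDigitEnvironment` is hypothetical, and it uses `NumberFormatter::DECIMAL` rather than the `PATTERN_DECIMAL` style from the shell sessions above. The formatted value depends on the ICU data installed, so no particular output is implied.

```php
<?php
// Collect the ICU versions and the formatter output for one locale.
// Returns [ 'intl' => false ] when the intl extension is unavailable.
function describeDigitEnvironment( string $locale, int $sample = 1234567890 ): array {
	if ( !extension_loaded( 'intl' ) ) {
		return [ 'intl' => false ];
	}
	$formatter = new NumberFormatter( $locale, NumberFormatter::DECIMAL );
	return [
		'intl' => true,
		'icu' => INTL_ICU_VERSION,
		'icuData' => INTL_ICU_DATA_VERSION,
		'formatted' => $formatter->format( $sample ),
	];
}

print_r( describeDigitEnvironment( 'mni' ) );
```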

Nikerabbit renamed this task from "some Bengali digits are shown on Special:Contributions on translatewiki with Meetei UI, but not on the Meetei Wikipedia" to "Bengali digits are shown on Meetei Wikipedia and translatewiki.net". Nov 6 2023, 3:30 PM

If I'm understanding correctly, PHP's built-in NumberFormatter is doing a conversion and that's breaking MW's own bespoke NumberFormatter?

From Language::formatNumInternal():

		if ( !$noTranslate ) {
			if ( $translateNumerals ) {
				// This is often unnecessary: PHP's NumberFormatter will often
				// do the digit transform itself (T267614)
				$s = $this->digitTransformTable();
				if ( $s ) {
					$number = strtr( $number, $s );
				}
			}

Again, if I'm understanding correctly, PHP "did the digit transform itself", but incorrectly? And so MW's own ::digitTransformTable() is having no effect?
(and that in turn is because PHP/CLDR has a different default script for 'mni' than MW does?)

Neither to be exact. Expected digits are in https://gerrit.wikimedia.org/g/mediawiki/core/+/ab7cbca00ee80f8eac84b266883bec7d710557c6/languages/messages/MessagesMni.php#27.

Basically $digitTransformTable (from MessagesMni.php in this case) gets ignored if NumberFormatter returns transformed digits. Usually this is not a problem, but when NumberFormatter and MediaWiki disagree, NumberFormatter currently wins, even though it would be preferable for MediaWiki to win.

It doesn't get ignored, from the code above it is still applied -- it is just having no effect. But if you add both sets of digits to the transform table it will work AFAICT.
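A minimal sketch of that behaviour, outside MediaWiki: the ASCII-to-Meetei table below is assumed to have the same shape as the mni `$digitTransformTable`, and the combined table is a hypothetical extension, not existing code.

```php
<?php
// Split the digit strings into characters without relying on mbstring.
$ascii = preg_split( '//u', '0123456789', -1, PREG_SPLIT_NO_EMPTY );
$bengali = preg_split( '//u', '০১২৩৪৫৬৭৮৯', -1, PREG_SPLIT_NO_EMPTY );
$meetei = preg_split( '//u', '꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹', -1, PREG_SPLIT_NO_EMPTY );

// ASCII -> Meetei only (assumed shape of the current mni table).
$asciiOnly = array_combine( $ascii, $meetei );
// Hypothetical extended table: Bengali digits are mapped to Meetei as well.
$both = array_combine( $bengali, $meetei ) + $asciiOnly;

// What NumberFormatter was reported to return for 'mni' on translatewiki.net.
$fromIcu = '১২৩৪৫৬৭৮৯০';

// The table is applied, but none of its keys occur in the input.
echo strtr( $fromIcu, $asciiOnly ), "\n"; // ১২৩৪৫৬৭৮৯০
// With both source sets in the table, the digits are converted.
echo strtr( $fromIcu, $both ), "\n"; // ꯱꯲꯳꯴꯵꯶꯷꯸꯹꯰
```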

@cscott Can you advise on this?
A quick, ugly fix could be to add a second set of overrides from Bengali digits, but that could break number un-formatting.

This seems like the best approach, since both sets of digits *are* valid for mni, we just have a different default variant.
Language::parseFormattedNumber uses array_flip, so if you are careful about the order in which the Bengali and Arabic digits appear in the table, you ought to be able to ensure that array_flip will use the Arabic digits when unformatting.
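The ordering point can be illustrated with a small standalone sketch. `array_flip()` keeps the last key it sees for a duplicated value, so the source digit set listed last wins in the reverse direction. The helper `digitPairs` is hypothetical, and the `strval` mapping is only there because PHP stores the keys '0' to '9' as integers.

```php
<?php
// Pair each source digit with the localized digit at the same position.
function digitPairs( string $sources, string $targets ): array {
	$split = static function ( string $s ): array {
		return preg_split( '//u', $s, -1, PREG_SPLIT_NO_EMPTY );
	};
	return array_combine( $split( $sources ), $split( $targets ) );
}

$meetei = '꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹';
$fromAscii = digitPairs( '0123456789', $meetei );
$fromBengali = digitPairs( '০১২৩৪৫৬৭৮৯', $meetei );

// Array union keeps the left operand's entries first, so the right
// operand's digits come last and survive array_flip().
$asciiLast = array_map( 'strval', array_flip( $fromBengali + $fromAscii ) );
$bengaliLast = array_map( 'strval', array_flip( $fromAscii + $fromBengali ) );

echo strtr( '꯱꯲꯳', $asciiLast ), "\n";   // 123
echo strtr( '꯱꯲꯳', $bengaliLast ), "\n"; // ১২৩
```

Under this assumption, only the table with the ASCII entries last would unformat Meetei digits back to something numeric code can parse.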

EDIT: The array_flip is the sketchiest part here. If we wanted a "real" fix, it might be to add an explicit reverse transform table to better support "unformatting" numbers on wikis with more than one set of localized digits. For the "formatting" direction, LanguageConverter ought to be able to handle conversion between the two sets of localized digits, although we could be more careful about taking the variant into account on the formatNumber call.

Another option is to add a new option to this chunk of code from Language::formatNumInternal():

		if ( !$noSeparators ) {
			$separatorTransformTable = $this->separatorTransformTable();
			$digitGroupingPattern = $this->digitGroupingPattern();
			$code = $this->getCode();
			if ( !( $translateNumerals && $this->langNameUtils->isValidCode( $code ) ) ) {
				$code = 'C'; // POSIX system default locale
			}

which also sets $code = 'C' if $this->bypassPHPlocalization() is set for the language, where the ::bypassPHPlocalization() name could be bikeshedded and improved, of course. This is similar to the $translateNumerals flag except in this case we want numerals translated, just *not by PHP*.
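A rough sketch of how that locale decision might look if pulled out into a standalone function. Everything here is illustrative: `pickFormatterLocale` and its parameters are hypothetical, `$isValidCode` stands in for the `isValidCode()` call, and `$bypassPhpLocalization` stands in for the proposed (not yet existing) per-language flag.

```php
<?php
// Decide which locale code to hand to PHP's NumberFormatter.
// 'C' means "let PHP emit plain ASCII digits"; the digit transform
// table would then be the only thing localizing the digits.
function pickFormatterLocale(
	string $code,
	bool $translateNumerals,
	bool $isValidCode,
	bool $bypassPhpLocalization
): string {
	if ( !$translateNumerals || !$isValidCode || $bypassPhpLocalization ) {
		return 'C'; // POSIX system default locale
	}
	return $code;
}

// Current behaviour for mni: PHP is asked to localize.
echo pickFormatterLocale( 'mni', true, true, false ), "\n"; // mni
// With the proposed flag set: PHP is bypassed.
echo pickFormatterLocale( 'mni', true, true, true ), "\n"; // C
```

The first two conditions reproduce the existing check; only the third is new.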

Also mentioning T268203: Set $digitTransformTable to use english-style 0123456789 digits on sdwiki, which used a different workaround. I think the approaches @cscott proposed above are better, as they don't affect other languages.