[Bug] Wikipedia article on the letter "ß" does not load properly.
Open, Normal · Public · 1 Story Points

Description

As reported by a user, the article on the German letter ß (in enwiki, "Eszett") does not load properly in the app. It looks like the app loads the lead section properly, but then the rest of the sections are from the article "SS" (which is a redirect to "Schutzstaffel"), as well as the title and description.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jul 31 2016, 2:37 PM

This works fine for me using the REST API. It redirects from https://en.wikipedia.org/api/rest_v1/page/mobile-sections/ß to https://en.wikipedia.org/api/rest_v1/page/mobile-sections/SS

I suspect this is an issue with the app not handling redirects, not with the content service. Should I remove the content service tag?

@Jdlrobson But if I'm not mistaken, that's the wrong behavior. The actual enwiki article [[ß]] does not redirect to [[SS]].

Yep you are right. Some kind of clever vandalism..?

This is actually an upstream issue with Parsoid as https://en.wikipedia.org/api/rest_v1/page/title/ß has the same problem.

@Jdlrobson The problem with the letter ß is quite deep, and we don't really know how to solve it. MediaWiki normally capitalises the first letter of an article title. In RESTBase/Parsoid we also normalise the title and capitalise the first letter, but since that is done in JavaScript rather than PHP, the Unicode versions are different, so JS might upper-case something that PHP would not, or upper-case it to a different character.

We don't know how often these edge cases happen in practice, and we're not aware of any generic way to fix them.

In this case, though, the JS Unicode mapping is actually buggy, because 'ß'.toUpperCase().toLowerCase() === 'ss', so it doesn't round-trip. We might consider adding a round-trip test to the title normalisation library, but again, we don't know whether it has to round-trip in all cases.
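To illustrate the round-trip failure described above, here is a minimal sketch. The helper name roundTrips is hypothetical and not part of any title library; it is only meaningful for lower-case input.

```javascript
// Does upper-casing and then lower-casing give back the original string?
// Intended for lower-case input only (an upper-case 'A' would trivially fail).
function roundTrips(ch) {
    return ch.toUpperCase().toLowerCase() === ch;
}

console.log('ß'.toUpperCase());               // 'SS'
console.log('ß'.toUpperCase().toLowerCase()); // 'ss'
console.log(roundTrips('ß'));                 // false
console.log(roundTrips('a'));                 // true
```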

Penma added a subscriber: Penma. Aug 2 2016, 12:01 PM

Thanks @Pchelolo for the background. Makes sense, so it's no longer a mystery to me. Could we not use charCodeAt and encode the UTF-8 hex code for anything greater than a certain value?

Tbayer updated the task description. Oct 7 2016, 2:26 PM

Screenshot for illustration:

This is fascinating. It seems we have a similar problem in client-side JavaScript, affecting only Chromium-based browsers so far (T147646).

MediaWiki uses PHP's mb_strtoupper function for capitalizing, which does not change 'ß' when uppercasing. (Looking at the source code, it only seems to be capable of case changes which do not change the length of the string, so no 'ß' → 'SS'.) If your method of uppercasing is smarter than that, it won't work correctly…

Perhaps comparing the length of str and str.toUpperCase() (where str is first letter of page title), and only using str.toUpperCase() if the lengths are the same, would be correct? I'd guess that 'ß' is not the only case (eheh) where this happens.

Sounds like a good idea, I've created a PR for the mediawiki-title library that's used on the backend: https://github.com/wikimedia/mediawiki-title/pull/20
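For illustration, the length heuristic proposed above could be sketched roughly as follows. This is not the code from the linked PR; capitalizeFirst is a hypothetical helper written under the assumption that only the first UTF-16 unit of the title is considered.

```javascript
// Upper-case the first character of a title, but only when doing so
// does not change its length (so 'ß' is left alone instead of becoming 'SS').
function capitalizeFirst(title) {
    if (title.length === 0) {
        return title;
    }
    const first = title[0];
    const upper = first.toUpperCase();
    // Fall back to the original character when the lengths differ.
    const head = upper.length === first.length ? upper : first;
    return head + title.slice(1);
}

console.log(capitalizeFirst('eszett')); // 'Eszett'
console.log(capitalizeFirst('ß'));      // 'ß'
```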

ssastry added a subscriber: ssastry. Oct 7 2016, 6:23 PM

Great! We'll upgrade our mediawiki-title version once that is merged.

@ssastry Could you maybe review that PR? There's no 'official reviewer' for that repo except me, and self-merging feels a bit wrong :)

matmarex added a comment. Edited Oct 7 2016, 7:15 PM

Here's a possibly non-exhaustive list of characters likely affected by this bug (I haven't tried to uppercase them in PHP and JS, just listed all characters where chr.uppercase().length != chr.length). Perhaps it'd make sense to add a test for them all?

#<U+00DF ß LATIN SMALL LETTER SHARP S utf8:c3,9f>
#<U+0149 ŉ LATIN SMALL LETTER N PRECEDED BY APOSTROPHE utf8:c5,89>
#<U+01F0 ǰ LATIN SMALL LETTER J WITH CARON utf8:c7,b0>
#<U+0390 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS utf8:ce,90>
#<U+03B0 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS utf8:ce,b0>
#<U+0587 և ARMENIAN SMALL LIGATURE ECH YIWN utf8:d6,87>
#<U+1E96 ẖ LATIN SMALL LETTER H WITH LINE BELOW utf8:e1,ba,96>
#<U+1E97 ẗ LATIN SMALL LETTER T WITH DIAERESIS utf8:e1,ba,97>
#<U+1E98 ẘ LATIN SMALL LETTER W WITH RING ABOVE utf8:e1,ba,98>
#<U+1E99 ẙ LATIN SMALL LETTER Y WITH RING ABOVE utf8:e1,ba,99>
#<U+1E9A ẚ LATIN SMALL LETTER A WITH RIGHT HALF RING utf8:e1,ba,9a>
#<U+1F50 ὐ GREEK SMALL LETTER UPSILON WITH PSILI utf8:e1,bd,90>
#<U+1F52 ὒ GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA utf8:e1,bd,92>
#<U+1F54 ὔ GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA utf8:e1,bd,94>
#<U+1F56 ὖ GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI utf8:e1,bd,96>
#<U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI utf8:e1,be,80>
#<U+1F81 ᾁ GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI utf8:e1,be,81>
#<U+1F82 ᾂ GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI utf8:e1,be,82>
#<U+1F83 ᾃ GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI utf8:e1,be,83>
#<U+1F84 ᾄ GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI utf8:e1,be,84>
#<U+1F85 ᾅ GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI utf8:e1,be,85>
#<U+1F86 ᾆ GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,86>
#<U+1F87 ᾇ GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,87>
#<U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI utf8:e1,be,88>
#<U+1F89 ᾉ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI utf8:e1,be,89>
#<U+1F8A ᾊ GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI utf8:e1,be,8a>
#<U+1F8B ᾋ GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI utf8:e1,be,8b>
#<U+1F8C ᾌ GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI utf8:e1,be,8c>
#<U+1F8D ᾍ GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI utf8:e1,be,8d>
#<U+1F8E ᾎ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,8e>
#<U+1F8F ᾏ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,8f>
#<U+1F90 ᾐ GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI utf8:e1,be,90>
#<U+1F91 ᾑ GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI utf8:e1,be,91>
#<U+1F92 ᾒ GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI utf8:e1,be,92>
#<U+1F93 ᾓ GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI utf8:e1,be,93>
#<U+1F94 ᾔ GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI utf8:e1,be,94>
#<U+1F95 ᾕ GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI utf8:e1,be,95>
#<U+1F96 ᾖ GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,96>
#<U+1F97 ᾗ GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,97>
#<U+1F98 ᾘ GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI utf8:e1,be,98>
#<U+1F99 ᾙ GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI utf8:e1,be,99>
#<U+1F9A ᾚ GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI utf8:e1,be,9a>
#<U+1F9B ᾛ GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI utf8:e1,be,9b>
#<U+1F9C ᾜ GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI utf8:e1,be,9c>
#<U+1F9D ᾝ GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI utf8:e1,be,9d>
#<U+1F9E ᾞ GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,9e>
#<U+1F9F ᾟ GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,9f>
#<U+1FA0 ᾠ GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI utf8:e1,be,a0>
#<U+1FA1 ᾡ GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI utf8:e1,be,a1>
#<U+1FA2 ᾢ GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI utf8:e1,be,a2>
#<U+1FA3 ᾣ GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI utf8:e1,be,a3>
#<U+1FA4 ᾤ GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI utf8:e1,be,a4>
#<U+1FA5 ᾥ GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI utf8:e1,be,a5>
#<U+1FA6 ᾦ GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,a6>
#<U+1FA7 ᾧ GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,a7>
#<U+1FA8 ᾨ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI utf8:e1,be,a8>
#<U+1FA9 ᾩ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI utf8:e1,be,a9>
#<U+1FAA ᾪ GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI utf8:e1,be,aa>
#<U+1FAB ᾫ GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI utf8:e1,be,ab>
#<U+1FAC ᾬ GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI utf8:e1,be,ac>
#<U+1FAD ᾭ GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI utf8:e1,be,ad>
#<U+1FAE ᾮ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,ae>
#<U+1FAF ᾯ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,af>
#<U+1FB2 ᾲ GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI utf8:e1,be,b2>
#<U+1FB3 ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI utf8:e1,be,b3>
#<U+1FB4 ᾴ GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI utf8:e1,be,b4>
#<U+1FB6 ᾶ GREEK SMALL LETTER ALPHA WITH PERISPOMENI utf8:e1,be,b6>
#<U+1FB7 ᾷ GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,b7>
#<U+1FBC ᾼ GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI utf8:e1,be,bc>
#<U+1FC2 ῂ GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI utf8:e1,bf,82>
#<U+1FC3 ῃ GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI utf8:e1,bf,83>
#<U+1FC4 ῄ GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI utf8:e1,bf,84>
#<U+1FC6 ῆ GREEK SMALL LETTER ETA WITH PERISPOMENI utf8:e1,bf,86>
#<U+1FC7 ῇ GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI utf8:e1,bf,87>
#<U+1FCC ῌ GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI utf8:e1,bf,8c>
#<U+1FD2 ῒ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA utf8:e1,bf,92>
#<U+1FD6 ῖ GREEK SMALL LETTER IOTA WITH PERISPOMENI utf8:e1,bf,96>
#<U+1FD7 ῗ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI utf8:e1,bf,97>
#<U+1FE2 ῢ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA utf8:e1,bf,a2>
#<U+1FE4 ῤ GREEK SMALL LETTER RHO WITH PSILI utf8:e1,bf,a4>
#<U+1FE6 ῦ GREEK SMALL LETTER UPSILON WITH PERISPOMENI utf8:e1,bf,a6>
#<U+1FE7 ῧ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI utf8:e1,bf,a7>
#<U+1FF2 ῲ GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI utf8:e1,bf,b2>
#<U+1FF3 ῳ GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI utf8:e1,bf,b3>
#<U+1FF4 ῴ GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI utf8:e1,bf,b4>
#<U+1FF6 ῶ GREEK SMALL LETTER OMEGA WITH PERISPOMENI utf8:e1,bf,b6>
#<U+1FF7 ῷ GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI utf8:e1,bf,b7>
#<U+1FFC ῼ GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI utf8:e1,bf,bc>
#<U+FB00 ﬀ LATIN SMALL LIGATURE FF utf8:ef,ac,80>
#<U+FB01 ﬁ LATIN SMALL LIGATURE FI utf8:ef,ac,81>
#<U+FB02 ﬂ LATIN SMALL LIGATURE FL utf8:ef,ac,82>
#<U+FB03 ﬃ LATIN SMALL LIGATURE FFI utf8:ef,ac,83>
#<U+FB04 ﬄ LATIN SMALL LIGATURE FFL utf8:ef,ac,84>
#<U+FB05 ﬅ LATIN SMALL LIGATURE LONG S T utf8:ef,ac,85>
#<U+FB06 ﬆ LATIN SMALL LIGATURE ST utf8:ef,ac,86>
#<U+FB13 ﬓ ARMENIAN SMALL LIGATURE MEN NOW utf8:ef,ac,93>
#<U+FB14 ﬔ ARMENIAN SMALL LIGATURE MEN ECH utf8:ef,ac,94>
#<U+FB15 ﬕ ARMENIAN SMALL LIGATURE MEN INI utf8:ef,ac,95>
#<U+FB16 ﬖ ARMENIAN SMALL LIGATURE VEW NOW utf8:ef,ac,96>
#<U+FB17 ﬗ ARMENIAN SMALL LIGATURE MEN XEH utf8:ef,ac,97>

Generated with the following Ruby script:

require 'unicode_utils'

UnicodeUtils::Codepoint::RANGE.each do |i|
	u = UnicodeUtils::Codepoint.new(i)
	begin
		s = u.to_s
	rescue RangeError
		# U+D800 etc. cause this
		next
	end
	next if UnicodeUtils.nfc(s) != s
	p u if UnicodeUtils.upcase(s).length != s.length
end

(Looks like the library I used only has data for Unicode 6.2, so that list is almost certainly non-exhaustive.)
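As a rough JavaScript counterpart to the Ruby script above, the following sketch lists the code points whose upper-case form has a different number of code points. The function name is hypothetical, and the result depends on the Unicode version of the JavaScript engine, so no count is given here.

```javascript
// Collect code points whose upper-case form differs in length (counted in
// code points), skipping surrogates and anything that is not NFC-normalised,
// mirroring the checks in the Ruby script.
function findLengthChangingUppercase() {
    const found = [];
    for (let cp = 0; cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            continue; // lone surrogates are not characters
        }
        const s = String.fromCodePoint(cp);
        if (s.normalize('NFC') !== s) {
            continue;
        }
        // Array.from splits by code point, so astral characters count as 1.
        if (Array.from(s.toUpperCase()).length !== 1) {
            found.push('U+' + cp.toString(16).toUpperCase().padStart(4, '0'));
        }
    }
    return found;
}

const affected = findLengthChangingUppercase();
console.log(affected.includes('U+00DF')); // true: 'ß' upper-cases to 'SS'
```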

Awesome @matmarex, but it seems like PHP mb_strtoupper doesn't actually change these (at least the ones I've tried) while JavaScript does. This proves that the length-check I've added in my PR doesn't fix the problem at all.. I'm starting to think that we need to add an explicit list of exceptional characters, but trying to come up with an exhaustive list of problematic characters is an almost impossible task since it would depend on the version of PHP and JS. Maybe just using the list you've made is good enough?

ssastry added a comment. Edited Oct 7 2016, 7:53 PM

@Pchelolo, for all those chars, the length check is sufficient to prevent javascript from changing the title then. So, I am confused by what you meant when you say the PR doesn't fix the problem.

@ssastry hm.. Actually you're right, it will help. But we need to modify the test a bit to account for characters in astral planes.
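A sketch of what accounting for astral planes might look like: take the whole first code point (which may be a surrogate pair) rather than the first UTF-16 unit before applying the length check. The helper name is hypothetical and this is not the code that was merged.

```javascript
// Upper-case the first code point of a title, keeping the original when
// upper-casing would change its length (e.g. 'ß' -> 'SS').
function capitalizeFirstCodePoint(title) {
    if (title.length === 0) {
        return title;
    }
    // codePointAt(0) combines a leading surrogate pair into one code point.
    const first = String.fromCodePoint(title.codePointAt(0));
    const upper = first.toUpperCase();
    const head = upper.length === first.length ? upper : first;
    return head + title.slice(first.length);
}

// U+10428 (DESERET SMALL LETTER LONG I) upper-cases to U+10400.
console.log(capitalizeFirstCodePoint('\u{10428}x') === '\u{10400}x'); // true
```

Indexing with title[0] would pick up only the high surrogate, which toUpperCase() leaves unchanged, so such titles would never be capitalised.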

Arlolra added a subscriber: Arlolra. Oct 7 2016, 8:11 PM

What's missing by just doing .replace(/^[a-z]/, (t) => t.toUpperCase());?

Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

> What's missing by just doing .replace(/^[a-z]/, (t) => t.toUpperCase());?

All of the non a-z lowercase characters in all non-english languages :)
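A small illustration of that point, using a hypothetical helper built from the regex quoted above: titles starting with a non-ASCII lower-case letter are left untouched.

```javascript
// ASCII-only capitalisation, as in the regex suggestion.
const asciiOnly = (title) => title.replace(/^[a-z]/, (t) => t.toUpperCase());

console.log(asciiOnly('eszett'));  // 'Eszett'
console.log(asciiOnly('москва'));  // 'москва' (unchanged: 'м' is not in a-z)
console.log('москва'[0].toUpperCase() + 'москва'.slice(1)); // 'Москва'
```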

> Awesome @matmarex, but it seems like PHP mb_strtoupper doesn't actually change these (at least the ones I've tried) while JavaScript does. This proves that the length-check I've added in my PR doesn't fix the problem at all..

How so? That sounds like the exact opposite, your patch would fix the issue.

> Maybe just using the list you've made is good enough?

It is almost certainly missing characters added after Unicode 6.2.

> Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

> How so? That sounds like the exact opposite, your patch would fix the issue.

Yeah, right, I was a bit misled. I've added your list to the tests in the lib, so we have explicit tests for all of these cases.

I've merged the PR and published the new version. Once all involved parties (Parsoid and RESTBase) are deployed, we'll need to retest this.

Change 314783 had a related patch set uploaded (by Arlolra):
Bump mediawiki-title for T141723

https://gerrit.wikimedia.org/r/314783

Change 314783 merged by jenkins-bot:
Bump mediawiki-title for T141723

https://gerrit.wikimedia.org/r/314783

Esanders added a subscriber: Esanders. Edited Oct 8 2016, 1:06 AM

Here's my comparison. It found 243 cases to Bartosz's 100. But we've both only searched single codepoint characters, so there are likely more.

script:

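<?php

// Compare PHP's mb_strtoupper() with JavaScript's String.prototype.toUpperCase()
// for every BMP code point and print the mismatches as CSV.
echo '"Codepoint", "Character", "mb_strtoupper", "String.toUpperCase"'."\n";

for ( $i = 0; $i < 65536; $i++ ) {
	// Build the UTF-8 character for code point $i from its numeric HTML entity.
	$char = mb_convert_encoding( '&#' . $i . ';', 'UTF-8', 'HTML-ENTITIES' );
	$php = mb_strtoupper( $char );
	// Ask Node.js to upper-case the same character; its error output is discarded.
	$js = exec( 'node -p \'' . json_encode($char) . '.toUpperCase();\' 2>/dev/null' );
	// Skip characters for which Node printed nothing, and report only differences.
	if ( $js !== '' && $js !== $php ) {
		echo '"\\u', str_pad( dechex( $i ), 4, '0', STR_PAD_LEFT), '", "' , $char, '", "', $php, '", "', $js, '"', "\n";
	}
}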

> Here's my comparison. It found 243 cases to Bartosz's 100. But we've both only searched single codepoint characters, so there are likely more.

MediaWiki only takes the first codepoint for uppercasing, so I think it's sufficient. See Title::capitalize() and Language::ucfirst(). Hardcoding the list would still be problematic, since new characters are introduced in new Unicode versions.

(I wonder if there are actually any cases where uppercase(character+combining) would differ from uppercase(character)+combining?)

Your list has some examples where the length is identical, but the character is different (e.g. ǋ → Ǌ), so it looks like the length heuristic I proposed earlier and which was implemented is wrong (or at least, doesn't solve all cases).

> Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

> Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

Yeah, the SS mapping would not be popular in Germany..

I almost wonder if it would be worth updating / fixing the PHP unicode support now, rather than painstakingly documenting current limitations. Sure, this would break things right now, but it seems inevitable that some future version of PHP will need to get updated unicode support anyway.

> Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

> Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

> Yeah, the SS mapping would not be popular in Germany..

Ah? I was taught in school that these two were equivalent (unlike the addition of an e for umlauts, which seems to be simply a pragmatism).

> I almost wonder if it would be worth updating / fixing the PHP unicode support now, rather than painstakingly documenting current limitations. Sure, this would break things right now, but it seems inevitable that some future version of PHP will need to get updated unicode support anyway.

+1 to that idea.

Change 314725 had a related patch set uploaded (by Esanders):
Add exception in mw.Title for 'ß'.toUpperCase() so it matches PHP

https://gerrit.wikimedia.org/r/314725

GWicke added a comment. Edited Oct 11 2016, 11:21 PM

> Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

> Yeah, the SS mapping would not be popular in Germany..

> Ah? I was taught in school that these two were equivalent (unlike the addition of an e for umlauts, which seems to be simply a pragmatism).

Lower-case 'ss' is used as a stand-in for situations where ß is not available, but capital SS by itself has very different connotations in Germany, and would not be considered a legitimate uppercase of ß (which doesn't exist in practice).

GWicke triaged this task as Normal priority. Oct 12 2016, 5:57 PM
Pchelolo moved this task from doing to done on the Services board. Oct 12 2016, 10:04 PM
Pchelolo edited projects, added Services (done); removed Services (doing).

Everything is done here on the Services side, so this can be closed once the desktop frontend side is also fixed.

I think this is working properly on Android in 7a04741abae79b23a9e590d969c1d559acf28132.

Change 314725 merged by jenkins-bot:
Add exceptions in mw.Title where mb_strtoupper doesn't match String.toUpperCase

https://gerrit.wikimedia.org/r/314725
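To illustrate the general idea of an exception table (not the actual mw.Title code), a sketch might look like this. The names phpUpperExceptions and phpLikeUcfirst are hypothetical; the two entries reflect behaviour reported in this task for the older PHP version ('ß' stays 'ß', and U+01C5 stays U+01C5 where JavaScript gives U+01C4).

```javascript
// Hypothetical override table: characters for which the server-side
// upper-casing result differs from String.prototype.toUpperCase().
const phpUpperExceptions = new Map([
    ['ß', 'ß'],           // JS would give 'SS'
    ['\u01C5', '\u01C5'], // JS would give '\u01C4' (same length, different character)
]);

// Upper-case the first code point, consulting the override table first.
function phpLikeUcfirst(title) {
    if (title.length === 0) {
        return title;
    }
    const first = String.fromCodePoint(title.codePointAt(0));
    const head = phpUpperExceptions.has(first)
        ? phpUpperExceptions.get(first)
        : first.toUpperCase();
    return head + title.slice(first.length);
}

console.log(phpLikeUcfirst('ßx'));  // 'ßx'
console.log(phpLikeUcfirst('abc')); // 'Abc'
```

Unlike the length check, a table like this also covers characters whose upper-case form has the same length but is a different character, at the cost of having to be regenerated when either side's Unicode data changes.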

matmarex closed this task as Resolved. Nov 23 2016, 6:03 PM
matmarex removed a project: Patch-For-Review.

Sounds like it's fixed everywhere now.

Esanders reopened this task as Open. Mar 26 2019, 1:00 PM
Esanders added a comment. Edited Mar 26 2019, 1:15 PM

It looks like mb_strtoupper behaves differently in PHP7, probably due to Unicode updates. I re-ran the build script (provided in the patch below) and got the following diff:

--- a/resources/src/mediawiki.Title/phpCharToUpper.js
+++ b/resources/src/mediawiki.Title/phpCharToUpper.js
@@ -6,15 +6,8 @@
 	var toUpperMapping = {
 		'ß': 'ß',
 		'ŉ': 'ŉ',
-		'ǅ': 'ǅ',
-		'ǆ': 'ǅ',
-		'ǈ': 'ǈ',
-		'ǉ': 'ǈ',
-		'ǋ': 'ǋ',
-		'ǌ': 'ǋ',
 		'ǰ': 'ǰ',
-		'ǲ': 'ǲ',
-		'ǳ': 'ǲ',
+		'ɪ': 'Ɪ',
 		'ʝ': 'Ʝ',
 		'ͅ': 'ͅ',
 		'ΐ': 'ΐ',
@@ -26,6 +19,15 @@
 		'ᏻ': 'Ᏻ',
 		'ᏼ': 'Ᏼ',
 		'ᏽ': 'Ᏽ',
+		'ᲀ': 'В',
+		'ᲁ': 'Д',
+		'ᲂ': 'О',
+		'ᲃ': 'С',
+		'ᲄ': 'Т',
+		'ᲅ': 'Т',
+		'ᲆ': 'Ъ',
+		'ᲇ': 'Ѣ',
+		'ᲈ': 'Ꙋ',
 		'ẖ': 'ẖ',
 		'ẗ': 'ẗ',
 		'ẘ': 'ẘ',

I verified that in a PHP 5.3 environment, mb_strtoupper("ǅ") returns ǅ, but in PHP7 it returns Ǆ (which matches the JS, hence the removal from the list).

This causes another issue, which is that the page https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no becomes unreachable if I enable the PHP7 beta feature. (edit: filed as T219279)

I'm not sure if it's going to be possible for us to have different versions of this script served depending on the host's PHP version?

Change 499196 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/core@master] Title: Add scripts fo generating/updating phpCharToUpper.js

https://gerrit.wikimedia.org/r/499196

There is a recent very similar issue with Georgian letters: T208139

Change 499196 merged by jenkins-bot:
[mediawiki/core@master] Title: Add scripts for generating/updating phpCharToUpper.js

https://gerrit.wikimedia.org/r/499196

Is this now resolved? Follow-up work also exists at T219279, but that has its own ticket.