[Bug] Wikipedia article on the letter "ß" does not load properly.
Closed, Resolved · Public · 1 Estimated Story Points

Authored By

	Dbrant
	Jul 31 2016, 2:37 PM

Description

As reported by a user, the article on the German letter ß (in enwiki, "Eszett") does not load properly in the app. It looks like the app loads the lead section properly, but the title, the description, and the remaining sections all come from the article "SS" (which is a redirect to "Schutzstaffel").

Details

	Subject	Repo	Branch	Lines +/-
	Title: Add scripts for generating/updating phpCharToUpper.js	mediawiki/core	master	+44 -0
	Add exceptions in mw.Title where mb_strtoupper doesn't match String.toUpperCase	mediawiki/core	master	+270 -3


Related Objects

Mentioned In: T297342: Expose phpCharToUpper map for title normalization via the API
T208139: Georgian words are automatically (incorrectly) capitalized when entered
T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes
T133320: Unified extension registration mechanism for core/VE/Parsoid
T48580: Create a VisualEditor plugin to integrate with ProofreadPage
T147742: Template generated data-mw include incomplete href
T141905: Parsoid crashes from logs
T119228: Switch Parsoid to use node > 0.10 (most likely node 4.x)
T149241: Unknown contentmodel wikibase-item
T147646: `( new mw.Title('ß') ).getPrefixedText()` does not round-trip in Chromium
Mentioned Here: T208139: Georgian words are automatically (incorrectly) capitalized when entered
T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes
rAPAW7a04741abae7: Fix PasswordTextInput's EditText RTL layout
rGPAR173d7e321717: Get rid of simple debug helpers
T48580: Create a VisualEditor plugin to integrate with ProofreadPage
T119228: Switch Parsoid to use node > 0.10 (most likely node 4.x)
T133320: Unified extension registration mechanism for core/VE/Parsoid
T141905: Parsoid crashes from logs
T147742: Template generated data-mw include incomplete href
T149241: Unknown contentmodel wikibase-item
T147646: `( new mw.Title('ß') ).getPrefixedText()` does not round-trip in Chromium

Event Timeline


This works fine for me using the REST API. It redirects from https://en.wikipedia.org/api/rest_v1/page/mobile-sections/ß to https://en.wikipedia.org/api/rest_v1/page/mobile-sections/SS

I suspect this is an issue with the app not handling redirects, not with the content service. Should I remove the content service tag?

@Jdlrobson But if I'm not mistaken, that's the wrong behavior. The actual enwiki article [[ß]] does not redirect to [[SS]].

Yep you are right. Some kind of clever vandalism..?

This is actually an upstream issue with Parsoid as https://en.wikipedia.org/api/rest_v1/page/title/ß has the same problem.

@Jdlrobson The problem with the letter ß runs quite deep, and we don't really know how to solve it. MediaWiki normally capitalises the first letter of an article title. In RESTBase/Parsoid we also normalise the title and capitalise the first letter, but since that is done in JavaScript rather than PHP, the Unicode versions differ, so JS might upper-case something that PHP would not, or upper-case it to a different character.

We're not aware of how often these edge-cases happen in reality and not aware of any generic way to fix it.

Although in this case the JS Unicode mapping is actually buggy, because 'ß'.toUpperCase().toLowerCase() === 'ss', so it doesn't round-trip. We might consider including a round-trip test in the title normalisation library, but again, we don't know whether it has to round-trip all the time.
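To make the round-trip failure concrete, here is a minimal sketch that can be pasted into Node.js. The helper name `roundTrips` is made up for illustration and is not part of mediawiki-title or any other library mentioned in this task.

```javascript
// JavaScript engines that follow the full Unicode case mappings turn a
// single 'ß' into the two-character string 'SS' when uppercasing.
const upper = 'ß'.toUpperCase();
const roundTrip = upper.toLowerCase();

// Hypothetical helper: true when uppercasing and then lowercasing
// gives back exactly the input string.
function roundTrips(ch) {
	return ch.toUpperCase().toLowerCase() === ch;
}

console.log(upper, roundTrip);                   // SS ss
console.log(roundTrips('ß'), roundTrips('a'));   // false true
```

A check like this only detects that the mapping is lossy; it says nothing about what PHP would have produced for the same character.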

Penma subscribed. Aug 2 2016, 12:01 PM

Thanks @Pchelolo for the background. Makes sense, so it's no longer a mystery to me. Could we not use charCodeAt and encode the UTF-8 hex code for anything greater than a certain value?

Dbrant moved this task from Needs Triage to Tracking on the Wikipedia-Android-App-Backlog board. Aug 10 2016, 7:00 PM

bearND moved this task from Incoming to Tracking on the Mobile-Content-Service board. Aug 11 2016, 4:48 PM

Dbrant merged a task: T147623: "ß" (Eszett) article mixed up with the "SS" (Schutzstaffel) article. Oct 7 2016, 2:00 PM

Dbrant added a subscriber: Tbayer.

Screenshot for illustration:

Android app SS bug.png (1×714 px, 129 KB)

matmarex mentioned this in T147646: `( new mw.Title('ß') ).getPrefixedText()` does not round-trip in Chromium. Oct 7 2016, 2:27 PM

This is fascinating. It seems we have a similar problem in client-side JavaScript, affecting only Chromium-based browsers so far (T147646).

MediaWiki uses PHP's mb_strtoupper function for capitalizing, which does not change 'ß' when uppercasing. (Looking at the source code, it only seems to be capable of case changes which do not change the length of the string, so no 'ß' → 'SS'.) If your method of uppercasing is smarter than that, it won't work correctly…

Perhaps comparing the length of str and str.toUpperCase() (where str is first letter of page title), and only using str.toUpperCase() if the lengths are the same, would be correct? I'd guess that 'ß' is not the only case (eheh) where this happens.
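A minimal sketch of the length heuristic proposed above. The function name `ucFirstIfSameLength` is hypothetical, and the sketch deliberately looks only at the first UTF-16 code unit of the title, which is a simplifying assumption that does not hold for characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical helper: uppercase the first character of a title, but only
// when the uppercased form has the same length as the original character.
function ucFirstIfSameLength(title) {
	if (title.length === 0) {
		return title;
	}
	const first = title[0];
	const upper = first.toUpperCase();
	// 'ß' -> 'SS' changes the length, so the title is left untouched.
	return upper.length === first.length ? upper + title.slice(1) : title;
}

console.log(ucFirstIfSameLength('eszett')); // Eszett
console.log(ucFirstIfSameLength('ß'));      // ß
```

The idea is that a mapping which changes the length is one that a length-preserving uppercasing function such as the one described for mb_strtoupper could not have applied, so skipping it keeps the title closer to what PHP would produce.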

Pchelolo added a project: Services. Oct 7 2016, 6:09 PM

Perhaps comparing the length of str and str.toUpperCase() (where str is first letter of page title), and only using str.toUpperCase() if the lengths are the same, would be correct? I'd guess that 'ß' is not the only case (eheh) where this happens.

Sounds like a good idea, I've created a PR for the mediawiki-title library that's used on the backend: https://github.com/wikimedia/mediawiki-title/pull/20

In T141723#2700039, @Pchelolo wrote:

Perhaps comparing the length of str and str.toUpperCase() (where str is first letter of page title), and only using str.toUpperCase() if the lengths are the same, would be correct? I'd guess that 'ß' is not the only case (eheh) where this happens.

Sounds like a good idea, I've created a PR for the mediawiki-title library that's used on the backend: https://github.com/wikimedia/mediawiki-title/pull/20

Great! We'll upgrade our mediawiki-title version once that is merged.

@ssastry Could you maybe review that PR? There's no 'official reviewer' for that repo except me, and self-merging feels a bit wrong :)

Here's a possibly non-exhaustive list of characters likely affected by this bug (I haven't tried to uppercase them in PHP and JS, just listed all characters where chr.uppercase().length != chr.length). Perhaps it'd make sense to add a test for them all?

#<U+00DF ß LATIN SMALL LETTER SHARP S utf8:c3,9f>
#<U+0149 ŉ LATIN SMALL LETTER N PRECEDED BY APOSTROPHE utf8:c5,89>
#<U+01F0 ǰ LATIN SMALL LETTER J WITH CARON utf8:c7,b0>
#<U+0390 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS utf8:ce,90>
#<U+03B0 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS utf8:ce,b0>
#<U+0587 և ARMENIAN SMALL LIGATURE ECH YIWN utf8:d6,87>
#<U+1E96 ẖ LATIN SMALL LETTER H WITH LINE BELOW utf8:e1,ba,96>
#<U+1E97 ẗ LATIN SMALL LETTER T WITH DIAERESIS utf8:e1,ba,97>
#<U+1E98 ẘ LATIN SMALL LETTER W WITH RING ABOVE utf8:e1,ba,98>
#<U+1E99 ẙ LATIN SMALL LETTER Y WITH RING ABOVE utf8:e1,ba,99>
#<U+1E9A ẚ LATIN SMALL LETTER A WITH RIGHT HALF RING utf8:e1,ba,9a>
#<U+1F50 ὐ GREEK SMALL LETTER UPSILON WITH PSILI utf8:e1,bd,90>
#<U+1F52 ὒ GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA utf8:e1,bd,92>
#<U+1F54 ὔ GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA utf8:e1,bd,94>
#<U+1F56 ὖ GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI utf8:e1,bd,96>
#<U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI utf8:e1,be,80>
#<U+1F81 ᾁ GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI utf8:e1,be,81>
#<U+1F82 ᾂ GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI utf8:e1,be,82>
#<U+1F83 ᾃ GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI utf8:e1,be,83>
#<U+1F84 ᾄ GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI utf8:e1,be,84>
#<U+1F85 ᾅ GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI utf8:e1,be,85>
#<U+1F86 ᾆ GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,86>
#<U+1F87 ᾇ GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,87>
#<U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI utf8:e1,be,88>
#<U+1F89 ᾉ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI utf8:e1,be,89>
#<U+1F8A ᾊ GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI utf8:e1,be,8a>
#<U+1F8B ᾋ GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI utf8:e1,be,8b>
#<U+1F8C ᾌ GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI utf8:e1,be,8c>
#<U+1F8D ᾍ GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI utf8:e1,be,8d>
#<U+1F8E ᾎ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,8e>
#<U+1F8F ᾏ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,8f>
#<U+1F90 ᾐ GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI utf8:e1,be,90>
#<U+1F91 ᾑ GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI utf8:e1,be,91>
#<U+1F92 ᾒ GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI utf8:e1,be,92>
#<U+1F93 ᾓ GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI utf8:e1,be,93>
#<U+1F94 ᾔ GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI utf8:e1,be,94>
#<U+1F95 ᾕ GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI utf8:e1,be,95>
#<U+1F96 ᾖ GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,96>
#<U+1F97 ᾗ GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,97>
#<U+1F98 ᾘ GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI utf8:e1,be,98>
#<U+1F99 ᾙ GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI utf8:e1,be,99>
#<U+1F9A ᾚ GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI utf8:e1,be,9a>
#<U+1F9B ᾛ GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI utf8:e1,be,9b>
#<U+1F9C ᾜ GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI utf8:e1,be,9c>
#<U+1F9D ᾝ GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI utf8:e1,be,9d>
#<U+1F9E ᾞ GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,9e>
#<U+1F9F ᾟ GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,9f>
#<U+1FA0 ᾠ GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI utf8:e1,be,a0>
#<U+1FA1 ᾡ GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI utf8:e1,be,a1>
#<U+1FA2 ᾢ GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI utf8:e1,be,a2>
#<U+1FA3 ᾣ GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI utf8:e1,be,a3>
#<U+1FA4 ᾤ GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI utf8:e1,be,a4>
#<U+1FA5 ᾥ GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI utf8:e1,be,a5>
#<U+1FA6 ᾦ GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,a6>
#<U+1FA7 ᾧ GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,a7>
#<U+1FA8 ᾨ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI utf8:e1,be,a8>
#<U+1FA9 ᾩ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI utf8:e1,be,a9>
#<U+1FAA ᾪ GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI utf8:e1,be,aa>
#<U+1FAB ᾫ GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI utf8:e1,be,ab>
#<U+1FAC ᾬ GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI utf8:e1,be,ac>
#<U+1FAD ᾭ GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI utf8:e1,be,ad>
#<U+1FAE ᾮ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,ae>
#<U+1FAF ᾯ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI utf8:e1,be,af>
#<U+1FB2 ᾲ GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI utf8:e1,be,b2>
#<U+1FB3 ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI utf8:e1,be,b3>
#<U+1FB4 ᾴ GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI utf8:e1,be,b4>
#<U+1FB6 ᾶ GREEK SMALL LETTER ALPHA WITH PERISPOMENI utf8:e1,be,b6>
#<U+1FB7 ᾷ GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI utf8:e1,be,b7>
#<U+1FBC ᾼ GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI utf8:e1,be,bc>
#<U+1FC2 ῂ GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI utf8:e1,bf,82>
#<U+1FC3 ῃ GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI utf8:e1,bf,83>
#<U+1FC4 ῄ GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI utf8:e1,bf,84>
#<U+1FC6 ῆ GREEK SMALL LETTER ETA WITH PERISPOMENI utf8:e1,bf,86>
#<U+1FC7 ῇ GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI utf8:e1,bf,87>
#<U+1FCC ῌ GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI utf8:e1,bf,8c>
#<U+1FD2 ῒ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA utf8:e1,bf,92>
#<U+1FD6 ῖ GREEK SMALL LETTER IOTA WITH PERISPOMENI utf8:e1,bf,96>
#<U+1FD7 ῗ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI utf8:e1,bf,97>
#<U+1FE2 ῢ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA utf8:e1,bf,a2>
#<U+1FE4 ῤ GREEK SMALL LETTER RHO WITH PSILI utf8:e1,bf,a4>
#<U+1FE6 ῦ GREEK SMALL LETTER UPSILON WITH PERISPOMENI utf8:e1,bf,a6>
#<U+1FE7 ῧ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI utf8:e1,bf,a7>
#<U+1FF2 ῲ GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI utf8:e1,bf,b2>
#<U+1FF3 ῳ GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI utf8:e1,bf,b3>
#<U+1FF4 ῴ GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI utf8:e1,bf,b4>
#<U+1FF6 ῶ GREEK SMALL LETTER OMEGA WITH PERISPOMENI utf8:e1,bf,b6>
#<U+1FF7 ῷ GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI utf8:e1,bf,b7>
#<U+1FFC ῼ GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI utf8:e1,bf,bc>
#<U+FB00 ﬀ LATIN SMALL LIGATURE FF utf8:ef,ac,80>
#<U+FB01 ﬁ LATIN SMALL LIGATURE FI utf8:ef,ac,81>
#<U+FB02 ﬂ LATIN SMALL LIGATURE FL utf8:ef,ac,82>
#<U+FB03 ﬃ LATIN SMALL LIGATURE FFI utf8:ef,ac,83>
#<U+FB04 ﬄ LATIN SMALL LIGATURE FFL utf8:ef,ac,84>
#<U+FB05 ﬅ LATIN SMALL LIGATURE LONG S T utf8:ef,ac,85>
#<U+FB06 ﬆ LATIN SMALL LIGATURE ST utf8:ef,ac,86>
#<U+FB13 ﬓ ARMENIAN SMALL LIGATURE MEN NOW utf8:ef,ac,93>
#<U+FB14 ﬔ ARMENIAN SMALL LIGATURE MEN ECH utf8:ef,ac,94>
#<U+FB15 ﬕ ARMENIAN SMALL LIGATURE MEN INI utf8:ef,ac,95>
#<U+FB16 ﬖ ARMENIAN SMALL LIGATURE VEW NOW utf8:ef,ac,96>
#<U+FB17 ﬗ ARMENIAN SMALL LIGATURE MEN XEH utf8:ef,ac,97>

Generated with the following Ruby script:

require 'unicode_utils'

UnicodeUtils::Codepoint::RANGE.each do |i|
	u = UnicodeUtils::Codepoint.new(i)
	begin
		s = u.to_s
	rescue RangeError
		# U+D800 etc. cause this
		next
	end
	next if UnicodeUtils.nfc(s) != s
	p u if UnicodeUtils.upcase(s).length != s.length
end

(Looks like the library I used only has data for Unicode 6.2, so that list is almost certainly non-exhaustive.)
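For comparison, a rough JavaScript counterpart of the Ruby script, which asks the JS runtime itself which characters change length when uppercased. This is only a sketch: the function name `lengthChangingChars` is made up, it scans only the Basic Multilingual Plane by default, and the result depends on the Unicode version built into the engine, so no particular count should be expected.

```javascript
// Hypothetical helper: list the code points below `limit` whose uppercase
// form is not exactly one code point long in this JavaScript runtime.
function lengthChangingChars(limit = 0x10000) {
	const found = [];
	for (let cp = 0; cp < limit; cp++) {
		// Lone surrogates are not characters; skip them.
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			continue;
		}
		const s = String.fromCodePoint(cp);
		// Same filter as the Ruby script: ignore strings that are not in NFC.
		if (s.normalize('NFC') !== s) {
			continue;
		}
		// Array.from counts code points rather than UTF-16 code units.
		if (Array.from(s.toUpperCase()).length !== 1) {
			found.push(s);
		}
	}
	return found;
}

const chars = lengthChangingChars();
console.log(chars.length, chars.slice(0, 5).join(' '));
```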

Awesome @matmarex, but it seems like PHP mb_strtoupper doesn't actually change these (at least the ones I've tried) while JavaScript does. This proves that the length-check I've added in my PR doesn't fix the problem at all.. I'm starting to think that we need to add an explicit list of exceptional characters, but trying to come up with an exhaustive list of problematic characters is an almost impossible task since it would depend on the version of PHP and JS. Maybe just using the list you've made is good enough?

@Pchelolo, for all those chars, the length check is sufficient to prevent JavaScript from changing the title. So I am confused by what you mean when you say the PR doesn't fix the problem.

@ssastry hm.. Actually you're right, it will help. But we need to modify the test a bit to account for characters in astral planes.
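One way the astral-plane concern could be handled is to work on the first code point rather than the first UTF-16 code unit. The sketch below is an assumption about what such a check might look like, not the code from the PR; `ucFirstCodePoint` is a hypothetical name.

```javascript
// Hypothetical helper: uppercase the first code point of a title, but only
// if the result is still a single code point.
function ucFirstCodePoint(title) {
	if (title === '') {
		return title;
	}
	// codePointAt(0) reads a full surrogate pair when there is one.
	const first = String.fromCodePoint(title.codePointAt(0));
	const upper = first.toUpperCase();
	// Count code points, so an astral character (2 code units) counts as 1.
	const isSingleCodePoint = Array.from(upper).length === 1;
	return isSingleCodePoint ? upper + title.slice(first.length) : title;
}

// U+10428 (Deseret small letter long i) uppercases to U+10400.
console.log(ucFirstCodePoint('\u{10428}x') === '\u{10400}x'); // true
console.log(ucFirstCodePoint('ß'));                           // ß
```

Indexing with `title[0]` would instead pick up a lone surrogate for such titles, which `toUpperCase()` leaves unchanged.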

What's missing by just doing .replace(/^[a-z]/, (t) => t.toUpperCase());?

Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

In T141723#2700410, @Arlolra wrote:

What's missing by just doing .replace(/^[a-z]/, (t) => t.toUpperCase());?

All of the non a-z lowercase characters in all non-English languages :)
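A quick illustration of that answer, wrapping the suggested replace call in a function. The name `asciiOnlyUcFirst` is made up for the example.

```javascript
// The ASCII-only approach from the question, wrapped in a function.
const asciiOnlyUcFirst = (title) =>
	title.replace(/^[a-z]/, (t) => t.toUpperCase());

console.log(asciiOnlyUcFirst('eszett'));     // Eszett
// '\u00E9' is a precomposed 'é'; it is not in [a-z], so nothing happens.
console.log(asciiOnlyUcFirst('\u00E9lan'));  // élan
console.log(asciiOnlyUcFirst('ß'));          // ß
```

It leaves 'ß' alone, which is what is wanted here, but it also leaves every other non-ASCII lowercase first letter uncapitalized, which MediaWiki would normally capitalize.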

In T141723#2700374, @Pchelolo wrote:

Awesome @matmarex, but it seems like PHP mb_strtoupper doesn't actually change these (at least the ones I've tried) while JavaScript does. This proves that the length-check I've added in my PR doesn't fix the problem at all..

How so? That sounds like the exact opposite, your patch would fix the issue.

Maybe just using the list you've made is good enough?

It is almost certainly missing characters added after Unicode 6.2.

In T141723#2700433, @GWicke wrote:

Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

How so? That sounds like the exact opposite, your patch would fix the issue.

Yeah, right, I was a bit misled. I've added your list to the tests in the lib, so we have explicit tests for all of these cases.

I've merged the PR and published the new version. Once all involved parties (Parsoid and RESTBase) are deployed, we'd need to retest this.

Change 314783 had a related patch set uploaded (by Arlolra):
Bump mediawiki-title for T141723

https://gerrit.wikimedia.org/r/314783

gerritbot added a project: Patch-For-Review. Oct 7 2016, 10:41 PM

Change 314783 merged by jenkins-bot:
Bump mediawiki-title for T141723

https://gerrit.wikimedia.org/r/314783

Here's my comparison. It found 243 cases to Bartosz's 100. But we've both only searched single codepoint characters, so there are likely more.

upper-differences.csv (7 KB)

script:

<?php

echo '"Codepoint", "Character", "mb_strtoupper", "String.toUpperCase"'."\n";

for ( $i = 0; $i < 65536; $i++ ) {
	$char = mb_convert_encoding( '&#' . $i . ';', 'UTF-8', 'HTML-ENTITIES' );
	$php = mb_strtoupper( $char );
	$js = exec( 'node -p \'' . json_encode($char) . '.toUpperCase();\' 2>/dev/null' );
	if ( $js !== '' && $js !== $php ) {
		echo '"\\u', str_pad( dechex( $i ), 4, '0', STR_PAD_LEFT), '", "' , $char, '", "', $php, '", "', $js, '"', "\n";
	}
}

MZMcBride subscribed. Oct 8 2016, 3:52 AM

In T141723#2701096, @Esanders wrote:

Here's my comparison. It found 243 cases to Bartosz's 100. But we've both only searched single codepoint characters, so there are likely more.

MediaWiki only takes the first codepoint for uppercasing, so I think it's sufficient. See Title::capitalize() and Language::ucfirst(). Hardcoding the list would still be problematic, since new characters are introduced in new Unicode versions.

(I wonder if there are actually any cases where uppercase(character+combining) would differ from uppercase(character)+combining?)

Your list has some examples where the length is identical, but the character is different (e.g. ǋ → Ǌ), so it looks like the length heuristic I proposed earlier and which was implemented is wrong (or at least, doesn't solve all cases).
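The ǋ example can be checked directly in Node.js. The snippet uses escape sequences so the exact code points are unambiguous: U+01CB is ǋ and U+01CA is Ǌ.

```javascript
const ch = '\u01CB';            // ǋ (LATIN CAPITAL LETTER N WITH SMALL LETTER J)
const upper = ch.toUpperCase(); // Ǌ (U+01CA)

// The length check passes, yet the character has changed.
console.log(upper.length === ch.length); // true
console.log(upper === ch);               // false
```

So a length comparison alone cannot tell this case apart from an ordinary 'a' → 'A' mapping; catching it needs knowledge of what PHP actually returns for the character.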

Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

Yeah, the SS mapping would not be popular in Germany..

I almost wonder if it would be worth updating / fixing the PHP unicode support now, rather than painstakingly documenting current limitations. Sure, this would break things right now, but it seems inevitable that some future version of PHP will need to get updated unicode support anyway.

In T141723#2701743, @GWicke wrote:

Just to add a bit more complexity.. are we sure that HHVM has the same single-codepoint restriction as PHP?

Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

Yeah, the SS mapping would not be popular in Germany..

Ah? I was taught in school that these two were equivalent (unlike the addition of an e for umlauts, which seems to be simply a pragmatism).

I almost wonder if it would be worth updating / fixing the PHP unicode support now, rather than painstakingly documenting current limitations. Sure, this would break things right now, but it seems inevitable that some future version of PHP will need to get updated unicode support anyway.

+1 to that idea.

Change 314725 had a related patch set uploaded (by Esanders):
Add exception in mw.Title for 'ß'.toUpperCase() so it matches PHP

https://gerrit.wikimedia.org/r/314725

In T141723#2701761, @mobrovac wrote:

In T141723#2701743, @GWicke wrote:

Yes, otherwise the German Wikipedians would have skinned us alive years ago when their ß would have become inaccessible ;)

Yeah, the SS mapping would not be popular in Germany..

Ah? I was taught in school that these two were equivalent (unlike the addition of an e for umlauts, which seems to be simply a pragmatism).

Lower-case 'ss' is used as a stand-in for situations where ß is not available, but capital SS by itself has very different connotations in Germany, and would not be considered a legitimate uppercase of ß (which doesn't exist in practice).

GWicke edited projects, added Services (doing); removed Services. Oct 12 2016, 4:18 PM

GWicke triaged this task as Medium priority. Oct 12 2016, 5:57 PM

Everything is done here on the Services side, so this can be closed once the desktop frontend side is also fixed.

Mentioned in SAL (#wikimedia-operations) [2016-11-02T20:27:45Z] <arlolra> updated Parsoid to version 173d7e32 (T149241, T119228, T141723, T141905, T147742, T48580, T133320)

Stashbot mentioned this in T141905: Parsoid crashes from logs. Nov 2 2016, 8:27 PM

Stashbot mentioned this in T147742: Template generated data-mw include incomplete href.

Stashbot mentioned this in T48580: Create a VisualEditor plugin to integrate with ProofreadPage.

Stashbot mentioned this in T133320: Unified extension registration mechanism for core/VE/Parsoid.

I think this is working properly on Android in 7a04741abae79b23a9e590d969c1d559acf28132.

Change 314725 merged by jenkins-bot:
Add exceptions in mw.Title where mb_strtoupper doesn't match String.toUpperCase

https://gerrit.wikimedia.org/r/314725

Sounds like it's fixed everywhere now.

Jdlrobson awarded a token. Nov 23 2016, 6:20 PM

ReleaseTaggerBot added projects: MW-1.29-release-notes, MW-1.29-release (WMF-deploy-2016-11-29_(1.29.0-wmf.4)). Nov 28 2016, 1:00 AM

Esanders reopened this task as Open. Mar 26 2019, 1:00 PM

Restricted Application added a project: Product-Infrastructure-Team-Backlog-Deprecated. Mar 26 2019, 1:00 PM

It looks like mb_strtoupper behaves differently in PHP7, probably due to Unicode updates. I re-ran the build script (provided in the patch below) and got the following diff:

--- a/resources/src/mediawiki.Title/phpCharToUpper.js
+++ b/resources/src/mediawiki.Title/phpCharToUpper.js
@@ -6,15 +6,8 @@
 	var toUpperMapping = {
 		'ß': 'ß',
 		'ŉ': 'ŉ',
-		'ǅ': 'ǅ',
-		'ǆ': 'ǅ',
-		'ǈ': 'ǈ',
-		'ǉ': 'ǈ',
-		'ǋ': 'ǋ',
-		'ǌ': 'ǋ',
 		'ǰ': 'ǰ',
-		'ǲ': 'ǲ',
-		'ǳ': 'ǲ',
+		'ɪ': 'Ɪ',
 		'ʝ': 'Ʝ',
 		'ͅ': 'ͅ',
 		'ΐ': 'ΐ',
@@ -26,6 +19,15 @@
 		'ᏻ': 'Ᏻ',
 		'ᏼ': 'Ᏼ',
 		'ᏽ': 'Ᏽ',
+		'ᲀ': 'В',
+		'ᲁ': 'Д',
+		'ᲂ': 'О',
+		'ᲃ': 'С',
+		'ᲄ': 'Т',
+		'ᲅ': 'Т',
+		'ᲆ': 'Ъ',
+		'ᲇ': 'Ѣ',
+		'ᲈ': 'Ꙋ',
 		'ẖ': 'ẖ',
 		'ẗ': 'ẗ',
 		'ẘ': 'ẘ',

I verified that in a PHP5.3 environment, mb_strtoupper("ǅ") returns ǅ, but in PHP7 it returns Ǆ (which matches the JS, hence the removal from the list).

This causes another issue, which is that the page https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no becomes unreachable if I enable the PHP7 beta feature. (edit: filed as T219279)

I'm not sure if it's going to be possible for us to have different versions of this script served depending on the host's PHP version?
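For readers unfamiliar with how a map like `toUpperMapping` would be consumed, here is a minimal sketch. It is an assumption about the general technique (look the first character up in an override table, fall back to `toUpperCase()` otherwise), not the actual mw.Title code; `phpLikeUcFirst` is a hypothetical name and the table holds only three entries copied from the unchanged context lines of the diff.

```javascript
// Tiny excerpt of an override table: characters that PHP leaves unchanged
// but that String.prototype.toUpperCase() would alter.
const toUpperMapping = {
	'\u00DF': '\u00DF', // ß
	'\u0149': '\u0149', // ŉ
	'\u01F0': '\u01F0'  // ǰ
};

// Hypothetical helper: capitalize the first code point, preferring the
// override table over the JavaScript engine's own mapping.
function phpLikeUcFirst(title, mapping = toUpperMapping) {
	if (title === '') {
		return title;
	}
	const first = String.fromCodePoint(title.codePointAt(0));
	const hasOverride = Object.prototype.hasOwnProperty.call(mapping, first);
	const upper = hasOverride ? mapping[first] : first.toUpperCase();
	return upper + title.slice(first.length);
}

console.log(phpLikeUcFirst('ß'));      // ß
console.log(phpLikeUcFirst('eszett')); // Eszett
```

Because the table is generated against one specific PHP version, it goes stale whenever the server's PHP changes its case mappings, which is the situation described above.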

Esanders mentioned this in T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes. Mar 26 2019, 1:27 PM

Change 499196 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/core@master] Title: Add scripts for generating/updating phpCharToUpper.js

https://gerrit.wikimedia.org/r/499196

gerritbot added a project: Patch-For-Review. Mar 26 2019, 1:43 PM

There is a recent very similar issue with Georgian letters: T208139

Esanders mentioned this in T208139: Georgian words are automatically (incorrectly) capitalized when entered. Mar 27 2019, 11:38 AM

LGoto moved this task from Needs triage to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board. Mar 27 2019, 3:39 PM

Change 499196 merged by jenkins-bot:
[mediawiki/core@master] Title: Add scripts for generating/updating phpCharToUpper.js

https://gerrit.wikimedia.org/r/499196

ReleaseTaggerBot added a project: MW-1.33-notes (1.33.0-wmf.24; 2019-04-02). Mar 28 2019, 8:02 PM

Is this now resolved? Follow-up work also exists at T219279, but that has its own ticket.

Krinkle removed projects: Patch-For-Review, MW-1.29-release (WMF-deploy-2016-11-29_(1.29.0-wmf.4)). Jul 10 2019, 7:19 PM

In T141723#5322225, @Krinkle wrote:

Is this now resolved? Follow-up work also exists at T219279, but that has its own ticket.

@Esanders, @Pchelolo, @matmarex: Could anyone answer / clarify, please? ^

LGoto moved this task from Needs Triage to Backlog on the Parsoid board. Feb 15 2020, 9:42 PM

I am going to untag Parsoid and add CPT for followup and status updates.

https://de.wikipedia.org/api/rest_v1/page/html/ß loads properly. This was fixed a long time ago.

Jack_who_built_the_house awarded a token. Jul 15 2020, 8:59 PM

Legoktm mentioned this in T297342: Expose phpCharToUpper map for title normalization via the API. Dec 9 2021, 6:39 AM
