
Searching with long s (ſ) character causes everything else to be interpreted as if capitalized
Closed, Resolved · Public · 5 Estimated Story Points · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Search "class" on en.wikipedia.org
  • Then search "claſs" on en.wikipedia.org

What happens?:
The first one takes you to https://en.wikipedia.org/wiki/Class as expected. The second one takes you to https://en.wikipedia.org/wiki/CLASS (all uppercased).

What should have happened instead?:
Switching the s for a long s should not make a difference, because the long s is just an archaic typographic variant of the standard s.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Firefox version 140.0.4

Event Timeline

In fact, the situation is even more complex. If you search on en.Wiktionary.org for "uſ" or "aſs", you are taken to "US" and "ASS" respectively, not to "us"/"ass" or "Us"/"Ass" (even though all of those pages exist). But if you search for "ſharp" you are taken to "Sharp", not to "SHARP" or to "sharp" (even though both of those pages exist). That is, in one case it capitalizes the whole search string, but in another case (where the s is the first letter) it capitalizes only the "S", not the whole string. And in a third case, there is a third behaviour: if you search for "Aſia" (or "aſia"), you are taken to neither "Asia" nor "asia"; you are dumped onto the Search results page. (The same thing happens if you search for "CamelCaſe" or "camelcaſe".)

(Edited to mention "uſ" alongside "aſs", just to show that the behaviour occurs even when a search string contains only long "ſ" and no regular "s".)

This is working more or less as designed, but you've run into some odd corner cases. (Excellent job finding aſia and ſharp, by the way, @-sche!)

The key fact that explains the behavior is that a small number of letters have a different letter as their designated uppercase version within the Mediawiki language-specific case manipulation functions. For English (and all the others I can think of off the top of my head), the uppercase mappings are ſ → S and ß → SS (and Greek ς → Σ)... there may be others, but I can't come up with any at the moment. (I am disappointed to see that it does not do the right thing for Turkish I/ı & İ/i when set to Turkish, so I'm not sure how language-specific it is, alas.)
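The asymmetry is easy to demonstrate. Here's a quick illustration in Python (not MediaWiki's PHP, but it follows the same Unicode default case mappings):

```python
# For most letters, upper() and lower() round-trip, but a handful of
# characters map to a *different* letter (or letters) when uppercased,
# so lowercasing again does not recover the original.
for ch in ["ſ", "ß", "ς"]:
    up = ch.upper()
    print(f"{ch} -> {up} -> {up.lower()}")
# "ſ" uppercases to "S", and "S" lowercases to plain "s": the long s is gone.
```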

First, the Mediawiki code tries to find an exact match in a title or redirect. So on English Wikipedia, Iſland goes to "Island", not "ISLAND" (another redirect), because there is an exact-match redirect. There is no article or redirect for claſs. (I found a few of these unexpected long-s variant redirects like "Iſland"—you can do a regex search for intitle:/ſ/... no direct link because it's an expensive search.)

When it's looking for a non-exact match, it tries a few things in a specific order. Code here if you want to follow along.

Next, it lowercases the word ($this->language->lc( $term )), makes a page title out of it, and looks for that. On English Wikipedia, the first letter gets capitalized because all articles must start with a capital letter (still no match for Claſs). On English Wiktionary, it remains all lowercase because articles are allowed to start with a lowercase letter over there (still no match for aſia or ſharp).

Next it tries uppercasing individual words (ucwords())... so mr rogers would become Mr Rogers. On Wikipedia, we get Claſs again. On Wiktionary, we get Aſia, which has no match, and Sharp—which does match an article title. So ſharp → Sharp.

Then it tries uppercasing the whole word (uc()). Wikipedia gets CLASS, which exists, so claſs → CLASS. Wiktionary tries ASIA and finds nothing.

Next it tries breaking words after hyphens and uppercasing the first letter of those words (ucwordbreaks()—I didn't know that was a thing before). So, for mr-rogers it would previously have tried Mr-rogers, and now it tries Mr-Rogers. For one-word queries with no hyphens, this is just Aſia again.
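Putting the steps above together, the fallback order can be sketched roughly like this. This is a Python sketch of the logic, not MediaWiki's actual code; `near_match`, `titles`, and `make_title` are hypothetical stand-ins, and the later normalization steps are omitted:

```python
def near_match(term, titles, first_letter_capitalized=True):
    """Rough sketch of the title-matching fallback order described above.
    `titles` stands in for the wiki's set of page titles."""
    def make_title(s):
        # On Wikipedia the first letter of a title is always capitalized;
        # on Wiktionary it is left alone.
        return (s[:1].upper() + s[1:]) if first_letter_capitalized else s

    candidates = [
        term,                       # exact match first
        term.lower(),               # lowercased (lc)
        term.lower().title(),       # uppercase each word (ucwords-ish)
        term.upper(),               # whole term uppercased (uc)
        "-".join(w.capitalize() for w in term.lower().split("-")),  # ucwordbreaks-ish
    ]
    for c in candidates:
        t = make_title(c)
        if t in titles:
            return t
    return None  # fall through to the normalization steps / fulltext search
```

With a Wikipedia-like title set {"Class", "CLASS"}, the sketch reproduces claſs going to "CLASS"; with a Wiktionary-like set and `first_letter_capitalized=False`, ſharp goes to "Sharp" while aſia falls through.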

Now it tries more aggressive normalization (onSearchGetNearMatch(), which goes off to different code). For all languages, it tries ICU Normalization, which converts ſ → s, ß → ss, and ς → σ; however, it also lowercases the whole word. There is an even more aggressive next step (ASCII folding) that removes most diacritics on Latin characters, and for some languages ASCII folding is upgraded to ICU folding, which strips diacritics off of almost all characters—but knows to leave native characters alone (so for English, A Ä Å Á Ã a ä å á ã all become a, but in Swedish, Ä & Å are just lowercased to ä & å, and ä & å are unchanged).
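The compatibility-normalization part of that can be seen with Python's unicodedata (standing in for ICU here; `casefold()` is a rough stand-in for the lowercasing/case folding the ICU filter also applies, not the actual filter):

```python
import unicodedata

# NFKC compatibility normalization maps archaic and typographic
# variants to their plain equivalents:
for s in ["ſ", "ﬁ", "ﬆ"]:
    print(s, "->", unicodedata.normalize("NFKC", s))

# Case folding is what handles ß and ς (and lowercases everything):
print("claſs".casefold(), "Straße".casefold())
```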

For queries like ėıńſțěĩņ on English Wikipedia, after all this aggressive normalization, there's only one potential match (the redirect "Einstein"). However, for a query like ćłáșś there are multiple matches ("Class" and "CLASS") so it rolls over to the fulltext search results. Same for aſia on Wiktionary (with "asia" and "Asia").

So, to summarize: unexpected uppercasing rules like ſ → S—which implicitly do some normalization and usually do the right thing—can interact with multiple article titles (and redirects) that differ only by case to give unintuitive results.

I'm loath to try to change the uppercasing rules, because they are probably used in places other than search and title matching, and it would be very bad to break correct behavior like uc("claſs") → "CLASS" or uc("Straße") → "STRASSE". Adding non-standard normalization to the lowercasing rules (so lc("ſ") → "s") is also not a good idea. And none of these would address problems like cláss ("Class" vs "CLASS") or máchine ("Machine" vs "MacHine", oddly) rolling over to fulltext search results.

So, if there are no objections, I'd like to close out this ticket, because there really isn't anything to do here, other than explain why it works the way it does in these cases.

I appreciate the explanation. I defer to the opener, @Ioaxxere, regarding whether this can be closed. (For context, the issue was noticed due to an en.Wiktionary discussion about whether to have "manual" redirects for long-s spellings, or whether the software can automatically "redirect" them; I will take this discussion as concluding that the software does not, and will not be made able to, automatically redirect them.)

I will take this discussion as concluding that the software does not, and will not be made able to, automatically redirect them.

I don't think the current situation warrants wholesale addition of long-s redirects for every possible word, since in many cases the software does automatically redirect them correctly. From the Long s page on enwiki, examples like ſinfulneſs, poſſeſs, and ſubſtitute work fine on Wiktionary. It's just when there is case ambiguity that it doesn't (like ſong/"Song" and claſs/"CLASS").

Wiktionary is of course a tougher situation because there are many more entries that differ only by case.

I did think of another approach, which would be to uppercase the search term, then lowercase it, effectively applying the uppercasing normalization to the lowercase version of the word. However, I'm hesitant because it would change the way matches are made elsewhere, and I'm not sure how big or how bad the effect could be.
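The idea in miniature (a Python illustration of the trick, not MediaWiki's code):

```python
def upper_then_lower(term):
    # Uppercasing applies multi-character mappings like ſ -> S and
    # ß -> SS; lowercasing the result yields a normalized lowercase form.
    return term.upper().lower()

print(upper_then_lower("claſs"))  # "class", as desired
print(upper_then_lower("Fuß"))    # "fuss", which shows the downside: the ß is lost
```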

On German Wiktionary, we have entries for Fuß, Fuss, and fuss, but no fuß. Uppercasing then lowercasing would make searching for fuß match fuss instead of Fuß, which could disappoint someone else.

And I'm not sure about introducing other errors, like Turkish İ gets lowercased to i with an extra combining dot on top, and then uppercasing it back could do something weird. (I checked and that specifically seems to work out okay, but there may be other weird examples of unexpected casing, and it would be awful to break a common letter in one of the many languages we support.)
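For the record, here is what the default (non-Turkish) Unicode case mapping does with İ, illustrated in Python:

```python
# İ (U+0130) lowercases to *two* code points under the default mapping:
# "i" plus a combining dot above (U+0307).
low = "İ".lower()
print(len(low), [f"U+{ord(c):04X}" for c in low])

# Uppercasing that gives "I" + combining dot, which renders like İ but
# is not the original single code point, so exact string comparisons
# could still be surprised.
print("İ" == low.upper())
```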

Another approach would be to introduce some exceptional specific normalizations like ſ → s, but that is exactly what the more generic ICU folding normalization (which runs after all the case-variant manipulation) is supposed to capture, and hard-coding it early feels like adding technical debt.

I'm sorry there's not a clean solution.

In my opinion, adding a special case for long s to lowercase s is worth it even in spite of the 'technical debt' it would incur, because claſs (which is all in lowercase) redirecting to CLASS instead of class is unacceptably counter-intuitive.

In my view, this should be an additional step before the lowercasing process where some normalization is done from a separate table.

TJones set the point value for this task to 5.

I don't want to downplay the long-term cost of technical debt. A random local table in the Title Matcher code, justified by one class of English examples, would be bad software engineering practice and bad language support practice. Trying to suss out all of the possible and plausible characters like this and doing something to them that differs from all the other normalization everywhere else in Mediawiki (including later in this same process) would be brittle and confusing.

I spent some time today digging into this more, and I've found more examples with ligatures and character variants that cause words to get uppercased on English Wiktionary: fa, fa, st, ip, b, ϐιπερ, µίδας. I've found more characters (like ᾳ) that could cause problems, but no case-ambiguous entries happen to exist (at the moment).

I see that PHP supports ICU normalization, but doing it too early in the process (or even at all) could be too expensive. We process many queries per second with this code, so we can't afford to slow it down too much, and I'm not familiar with the efficiency of the PHP normalizer. Also, the PHP normalizer is almost never used in our code base, while there is a lot of (complex) normalization code and config in our language modules... plus some scary comments about some normalization being "disabled by default to avoid negative performance impact", which increases my concerns about efficiency.

I'm also worried about screwing up text in other languages. Depending on which ICU normalization form you use, ß can get converted to ss; and while it isn't in current use, the Turkmen alphabet used ſ (with uppercase £) in living memory. (They also used $/¢ and ¥/ÿ as letters.) There's always the chance of some unexpected feature of a language or script that could interact oddly.

unacceptably counter-intuitive

I get where you are coming from, but realistically there aren't that many people who search with a long s—the typographic ligatures seem more likely to occur when people cut-n-paste text from the web or from typographically sophisticated sources—and having rare corner cases that are unintuitive is better than not having case-folding functionality at all.

It's now Friday evening and many of the Search team members have been out this week, but they will be back Monday. I'll try to discuss this with them next week and see what those with more PHP and Mediawiki experience think about trying to add more normalization before case folding... though it feels like we need two checks—lowercase normalized and capitalized normalized... though we only have to normalize once... but that's getting too far into implementation details for now.

So, PHP has a Normalizer class with a normalize() function that does something very similar to the icu_normalizer filter in OpenSearch, except that it doesn't also necessarily lowercase. It looks like we want "compatibility decomposition followed by canonical composition", or the "NFKC" form, without additional case folding ("..._CF").

Oddly, the Mediawiki code almost never uses Normalizer::normalize(), and it has a wrapper class UtfNormal\Validator that has functions for the various normalization forms, including Validator::toNFKC().

There are still a few characters that fall through the cracks with NFKC normalization. ß and ς do not get normalized to ss and σ respectively. From talking to speakers of German and Greek, that's not an awful result. (NFKC_CF normalization, with case folding, does map ß → ss and ς → σ. The UtfNormal\Validator class does not have NFKC_CF normalization as an option, though we could fall back to Normalizer if it becomes apparent that it is necessary.)
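The NFKC vs NFKC_CF difference is easy to check with Python's unicodedata (illustrative only; the PHP classes mentioned above are what we'd actually use):

```python
import unicodedata

# Plain NFKC leaves ß and ς alone...
print(unicodedata.normalize("NFKC", "ß"), unicodedata.normalize("NFKC", "ς"))

# ...while the case folding that the "_CF" adds maps them:
print("ß".casefold(), "ς".casefold())
```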

Since I am concerned about efficiency, I ran a bunch of tests and got average timing estimates for applying various mappings and normalizations to various strings. The estimates are the time (using PHP's high-resolution timer, in nanoseconds) to perform the relevant operation 1000 times on the given query string. Results were mostly in the 10⁶–10⁷ range, so I divided out 10⁶, giving units of milliseconds per 1000 ops, or mean microseconds per op.
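For anyone who wants to replicate the setup, the methodology looks roughly like this (a Python analogue for illustration; the actual tests were run in PHP):

```python
import time
import unicodedata

def mean_micros_per_op(fn, arg, n=1000):
    # Time n applications of fn(arg) with the high-resolution counter
    # and report mean microseconds per operation, as in the table below.
    start = time.perf_counter_ns()
    for _ in range(n):
        fn(arg)
    return (time.perf_counter_ns() - start) / (n * 1000)

for name, fn in [
    ("lc", str.lower),
    ("uc", str.upper),
    ("NFKC", lambda s: unicodedata.normalize("NFKC", s)),
]:
    print(f"{name}: {mean_micros_per_op(fn, 'claſs'):.2f} µs/op")
```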

In general, longer strings took longer to process—what a shocker! Uppercasing and queries with uppercase letters seem to take a smidge longer than analogous lowercase and lowercasing, which is unexpected. Non-ASCII letters (Cyrillic, Greek, Katakana, Canadian Syllabics, Devanagari, Chinese, and uncommon variant characters like long-s (ſ) and typographic ligatures (fi)) take a bit longer to process.

Uppercasing words (ucwords) is a bit slower because of the logic to find word boundaries; it is a lot slower on strings with non-ASCII letters, and much, much slower for strings with multibyte characters (see the last two example queries, which differ only by a multibyte em dash (—) vs two dashes (--)... PHP's mb_strtoupper can apparently be >10x slower than strtoupper).

Uppercasing words with additional word boundaries, like hyphens, (ucwordbreaks) is slower than ucwords on all strings, because there's some semi-serious regexing going on internally.

Normalizing consistently falls between uc and ucwordbreaks, usually fairly close to 2x uc and ½x ucwordbreaks. Using Validator::toNFKC has a consistent overhead versus Normalizer::normalize for the extra function call wrapper, but it's only 0.1 microsecond/op and there's some extra logic in there.

The big shocker is that all of this is chump change compared to the cost of the near_match search that runs when everything else fails, which is hundreds to thousands of times more expensive than any of the other case or normalization transformations. It looks like there is a lot of fixed overhead to set up the near_match query so the timings fluctuate a fair amount but generally cluster around the same values—though variants of class are slower than the others because my toy wiki has a couple pages named "Class" and "CLASS"—all the others get zero near_match results. I expect on big wikis it's quite a bit slower. Of course, we don't normally get to the near_match step, which is a last-ditch effort to match something.

So, we are talking about adding up to one extra call to each of Validator::toNFKC (no reason to normalize more than once), uc, and ucwords. Most of the time they won't get called—never when a user clicks on a suggestion—and sometimes they may prevent a call to near_match. And as @dcausse pointed out, all this only happens when the user hits return in the Go Box, not on every keystroke, so the cost seems reasonable. (Interesting note: if you click on a suggestion, it still submits the title of the suggestion as if you had typed it and hit return. It's fast, though, because it's an exact match for an existing title.)

| query | lc | uc | ucwords | ucwordbreaks | normalize | toNFKC | NFKC_CF | near_match | toNFKC/uc | toNFKC/ucwdbrks |
|---|---|---|---|---|---|---|---|---|---|---|
| prioria copaifera | 1.0 | 1.1 | 1.3 | 6.3 | 2.4 | 2.5 | 2.6 | 7298 | 227.27% | 39.68% |
| Prioria Copaifera | 1.1 | 1.2 | 1.4 | 6.1 | 2.4 | 2.5 | 2.9 | 7269 | 208.33% | 40.98% |
| ſẛſtfffiflffist և ϐςς | 1.9 | 1.9 | 9.8 | 10.1 | 7.0 | 7.1 | 6.9 | 7721 | 373.68% | 70.30% |
| class | 1.0 | 1.1 | 1.2 | 4.4 | 2.3 | 2.4 | 2.4 | 8266 | 218.18% | 54.55% |
| CLASS | 1.1 | 1.0 | 1.3 | 4.2 | 2.3 | 2.4 | 2.4 | 8229 | 240.00% | 57.14% |
| Class | 1.1 | 1.1 | 1.3 | 4.0 | 2.3 | 2.4 | 2.4 | 7927 | 218.18% | 60.00% |
| claſs | 1.3 | 1.3 | 5.3 | 5.5 | 2.5 | 2.6 | 2.7 | 8383 | 200.00% | 47.27% |
| ſharpe | 1.3 | 1.4 | 5.6 | 5.7 | 2.4 | 2.5 | 2.5 | 7362 | 178.57% | 43.86% |
| sharpe | 1.0 | 1.1 | 1.2 | 4.4 | 2.3 | 2.4 | 2.4 | 7474 | 218.18% | 54.55% |
| Sharpe | 1.1 | 1.0 | 1.3 | 4.3 | 2.3 | 2.4 | 2.4 | 7329 | 240.00% | 55.81% |
| SHARPE | 1.1 | 1.0 | 1.3 | 4.5 | 2.3 | 2.4 | 2.6 | 7549 | 240.00% | 53.33% |
| daß | 1.3 | 1.3 | 5.2 | 5.7 | 2.2 | 2.3 | 3.8 | 7512 | 176.92% | 40.35% |
| Вікіпедія | 1.6 | 1.7 | 6.0 | 6.2 | 2.4 | 2.5 | 2.5 | 7526 | 147.06% | 40.32% |
| Βικιπαίδεια | 1.7 | 1.7 | 6.2 | 6.4 | 2.5 | 2.6 | 2.6 | 7505 | 152.94% | 40.63% |
| ウィキペディア | 1.6 | 1.6 | 6.1 | 6.3 | 2.4 | 2.5 | 2.5 | 7408 | 156.25% | 39.68% |
| ᐅᐃᑭᐱᑎᐊ | 1.5 | 1.5 | 6.1 | 6.6 | 2.4 | 2.5 | 2.5 | 7578 | 166.67% | 37.88% |
| विकिपीडिया | 1.8 | 1.8 | 6.4 | 6.4 | 2.5 | 2.6 | 2.6 | 7626 | 144.44% | 40.63% |
| 维基百科 | 1.4 | 1.5 | 6.4 | 6.0 | 2.4 | 2.5 | 2.3 | 7642 | 166.67% | 41.67% |
| Конвой «Вевак-Голландія № 2» | 2.5 | 2.5 | 10.7 | 12.5 | 4.8 | 4.9 | 5.4 | 7723 | 196.00% | 39.20% |
| a long piece of text with just lowercase latin ascii letters | 1.2 | 1.4 | 1.7 | 22.0 | 3.0 | 3.1 | 3.4 | 7317 | 221.43% | 14.09% |
| Cycling at the 2010 South American Games – Women's individual pursuit | 3.1 | 3.1 | 21.7 | 21.7 | 3.2 | 3.3 | 4.2 | 8068 | 106.45% | 15.21% |
| Cycling at the 2010 South American Games -- Women's individual pursuit | 1.5 | 1.5 | 1.9 | 21.2 | 3.2 | 3.3 | 4.1 | 7522 | 220.00% | 15.54% |

I'll submit a patch with the normalization updates and see how it goes.

Thanks @-sche and @Surjection for sticking with the discussion while I worked to understand the issue and the code and come up with plausible approaches. Honestly, I'm surprised how relatively efficient this approach seems to be, but I'm glad it looks like it will work out.

Change #1176523 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/core@master] Add early normalization to TitleMatcher

https://gerrit.wikimedia.org/r/1176523

Thanks for looking into ways to address this!

Change #1176523 merged by jenkins-bot:

[mediawiki/core@master] Add early normalization to TitleMatcher

https://gerrit.wikimedia.org/r/1176523

@TJones Thanks for the writeup. So it seems like the paradoxical situation arises because, in response to ſ, the normalizer tries S before s. Given that ſ is always s and never S in actual usage, this seems suboptimal, so I support the early normalization idea. But I am really surprised that all of your benchmarks showed string operations taking at least a full microsecond (an eon by modern CPU standards), including when nothing is changed at all! Is it possible that the actual performance killer is just PHP overhead? It seems like PHP supports FFI, though I doubt it will come to that.

@Ioaxxere I wouldn't say that the "normalizer" tries S before s for ſ... rather, there is accidental normalization because ſ is one of the few characters where lowercase(uppercase(char)) != char. It was the lack of explicit normalization that caused the problem: every individual step makes sense in general, but their edge cases happened to line up just right in this instance.

Don't put too much stock in the specific numbers in the timings table. The test was wedged inside my local copy of the Cirrus code, running in a not particularly optimized virtual machine on my laptop, so it's not exactly the speediest or most resource-rich environment. Our production servers are much, much beefier!

Since this only needed a code deploy and not a re-index, it went out pretty quickly. claſs → class works on enwiki now, along with all the other test cases on enwiki and enwikt.