
Generic language fallbacks in MediaWiki should not be used for Elasticsearch language analyzers
Closed, Resolved · Public

Description

There are a number of language fallbacks specified in MediaWiki—in mediawiki/languages/messages/Messages&lt;Xxx&gt;.php—and while these are often appropriate for interface messages and banners, they are often woefully inappropriate for linguistic analysis.

Three illustrative examples: Guaraní / Spanish, Wolof / French, and Chechen / Russian. For each X / Y pair, it does make sense that a speaker of X is, for historical or geographical reasons, more likely to know Y than most other world languages. But X and Y are not linguistically related, and processing X as if it were Y makes little if any sense.

The Wolof Wikipedia config, for example, shows "french" as the "text" and "text_search" analyzers. My favorite current example of nonsensical language processing is searching for "lorsqu'ele" on the Wolof Wikipedia—if you know a little French and a few secrets about the French analysis chain, it's no surprise that it returns matches on element—but I don't think that's the expected result in Wolof (though Wolof speakers are more likely to know some French, and thus may not be entirely surprised).
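
For illustration, the relevant part of the analysis settings looks roughly like this (a sketch in the PHP-array form the config builder produces; the variable name and exact structure here are illustrative, not the production output):

    // Sketch of the Wolof Wikipedia analysis config: both the indexing
    // ("text") and query ("text_search") analyzers are simply the
    // built-in Elasticsearch French analyzer.
    $analysisConfig = [
        'analyzer' => [
            'text' => [ 'type' => 'french' ],
            'text_search' => [ 'type' => 'french' ],
        ],
    ];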

I've compiled a list of 180+ such fallbacks that are currently in use. We should systematically work our way through these, disable the inappropriate fallbacks, and re-index the relevant wikis. This should be done in consultation with the relevant communities for each language. While some fallbacks seem patently silly, others are harder to evaluate for someone not familiar with the languages involved. For example, given that Ukrainian and Russian are still somewhat mutually intelligible (and use the same writing system), maybe using the Russian language analyzer instead of the default analyzer is a net positive on Ukrainian Wikipedia.

We also need to refactor AnalysisConfigBuilder.php to allow us to move away from the current fallbacks and to specify linguistically reasonable fallbacks, if any. To start, we can decouple the current code from the Messages&lt;Xxx&gt;.php files while replicating the current behavior. The next step would be to remove the irrelevant mappings (i.e., those involving languages for which there are no wikis), and then begin the big effort of determining the correct action for each remaining mapping (keep, modify, delete) and re-indexing as needed.
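
As a rough sketch of that decoupling step (the names below are illustrative, not actual CirrusSearch code), the idea is to replace the runtime lookup into the Messages&lt;Xxx&gt;.php fallback chains with an explicit, reviewable map inside AnalysisConfigBuilder:

    // Hypothetical sketch: an explicit analyzer fallback map replacing the
    // implicit use of the Messages<Xxx>.php interface-language fallbacks.
    // A null entry means "use the language-agnostic default analyzer".
    private static $analyzerFallbacks = [
        'gn' => null, // Guaraní: don't fall back to Spanish
        'wo' => null, // Wolof: don't fall back to French
        'ce' => null, // Chechen: don't fall back to Russian
        'uk' => 'ru', // Ukrainian: possibly keep? needs evaluation
    ];

    private function getAnalyzerFallback( $langCode ) {
        if ( array_key_exists( $langCode, self::$analyzerFallbacks ) ) {
            return self::$analyzerFallbacks[$langCode];
        }
        // Unlisted languages get no cross-language fallback at all.
        return null;
    }

Each entry could then be reviewed, kept, or deleted independently, with no hidden dependency on the interface-language fallback chains.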

Event Timeline

FYI: https://commons.wikimedia.org/wiki/File:MediaWiki_fallback_chains.svg (might be a little outdated).

Indeed, the criteria for interface language fallbacks are not based solely on linguistic similarity, but also on other considerations.

@Nikerabbit Thanks for that image! I tried compiling one myself, but couldn't get GraphViz to do something that wasn't ridiculously wide (I didn't include English, which may have hurt my chances).

debt triaged this task as Medium priority. Oct 13 2016, 1:05 AM
debt added a project: Discovery-ARCHIVED.
debt moved this task from needs triage to Up Next on the Discovery-Search board.

I'm wondering if we can get some community help with this - linguistically speaking - to determine what languages should fall back to others. :) This won't be an easy task!

(Note, the task description still contains some mistakes.)

I just want to mention that http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html is our canonical source for identifying, e.g., whether inhabitants of a certain region understand a certain language. So from here you can tell whether a language fallback in MediaWiki *could* have been added for geographical reasons rather than for reasons of language similarity.

> I'm wondering if we can get some community help with this - linguistically speaking - to determine what languages should fall back to others. :) This won't be an easy task!

Generally, I don't think languages should fall back to others for language analysis unless they are extremely similar—and even then, it would require looking at the analysis results to say whether a fallback works or not. For the most part, the fallback should be the standard analyzer, which is language-agnostic, because there's no language-specific analyzer for most languages.

> (Note, the task description still contains some mistakes.)

I believe I have removed the portion of the task description that you object to.

> ... http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html is our canonical source for identifying, e.g., whether inhabitants of a certain region understand a certain language.

Thanks for the link.

I've finished my analysis of the current state of affairs, with details on which wikis are using which language analyzers, and which languages are related (a very weak chance of being cross-linguistically useful) and which are considered mutually intelligible (a small to moderate chance of being cross-linguistically useful).

In summary—there are 102 wikis with non-exact language analysis configurations:

  • 47 are obvious linguistic mismatches.
  • 43 are genetically related to the analyzer's language, but on average are unlikely to benefit much from a wrong-language analyzer.
  • 12 are configured with the analyzer for a reasonably mutually intelligible language, and so have a reasonable chance of doing more good than harm.

That's a lot to sort through. Doing a moderately detailed analysis of each one and getting feedback from the communities would take several months at least.

  • We could turn off the most obviously linguistically inappropriate ones and see if anyone complains.
  • We could invite comment (for all, or for only the less obviously bad ones) and see if anyone in the community wants to see some sort of analysis of what kind of difference it makes with and without the analyzer. If there is no objection or request for analysis, we could turn them off.
  • We could do some pre-emptive analysis for the ones most likely to be similar and start the conversation with the communities with that information.

Suggestions on how best to approach this huge undertaking are very welcome!

I've posted a detailed description of the plan and the reasoning behind it on MediaWiki.

I've posted to several mailing lists:

Next up: contacting specific language communities for the small number of possible exceptions to turning it off.

Change 382033 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Remove Messaging Fallback Languages as Language Analysis Fallbacks

https://gerrit.wikimedia.org/r/382033

I've uploaded a WIP patch because it may help @Smalyshev with T176903.

These are the fallbacks that I've asked for community feedback on, with links to the discussions:

I asked for the Livvi-Karelian Wikipedia to be configured to use uca-fi, as the language is very similar to Finnish and there's no uca-xx-olo. See T147064: Determine category collation for Livvi-Karelian Wikipedia (olo.wikipedia.org).

@MarcoAurelio, this isn't about the collation sequence, which should not be affected. This is about the processing of the actual text of queries and the text in the articles (and other pages) by the language analyzer, indexing it for search. The primary aspects for most languages (including Finnish) are stop words and stemming.

Stop words are common words that don't carry much meaning on their own. In English they include articles (the, a), prepositions (to, of, with), conjunctions (and, but), and other very common words, such as forms of be, do, and make. Stop words are not required for a text match, but they do give some additional weight to exact matches, and so we can still match phrases made up entirely of stop words, like "take that", "the the", and "to be or not to be".

Stemming attempts to reduce related words to a common base form, so in English hope, hopes, hoped, and hoping would all be indexed as hope and each could find the others.
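
To make those two pieces concrete, here's a minimal sketch of an English-style analysis chain, written as the kind of PHP array the config builder emits for Elasticsearch (the filter names and options are standard Elasticsearch ones, but this is illustrative rather than the actual production config):

    // Minimal sketch: a custom analyzer with a stop word filter and a stemmer.
    $settings['analysis'] = [
        'filter' => [
            'english_stop' => [
                'type' => 'stop',
                'stopwords' => '_english_', // drops "the", "a", "to", "of", ...
            ],
            'english_stemmer' => [
                'type' => 'stemmer',
                'language' => 'english', // hope, hopes, hoped, hoping -> hope
            ],
        ],
        'analyzer' => [
            'text' => [
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => [ 'lowercase', 'english_stop', 'english_stemmer' ],
            ],
        ],
    ];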

So, the question is, does applying Finnish grammar rules to Livvi-Karelian text help a lot more than it hurts?

There are several potential problems even when there are also some benefits. Coverage can be incomplete: some stop words are shared between the two languages and are handled correctly, while others differ and are not. There can be words that look like stop words to the analyzer but are not. If there are spelling differences in the inflections of words, some may be stemmed properly and others not. Parts of words that look like inflections but are not can be stripped by the stemmer, linking words in the index that are not actually related. The analyzer knows certain exceptions—irregular spellings or inflections—but if the corresponding words are spelled differently in the other language, they will not be treated properly.

Another problem we've run into is that changes that do the right thing in one language are wrong for another. I originally discovered the entire fallback situation when adjusting the analyzer for Russian to treat two letters as identical (T124592); this is good for Russian, but not for any other language using the Cyrillic alphabet, like Ukrainian. The added complexity makes mistakes like this more likely to happen and harder to catch. Breaking changes can also sit for months without any effect, because not only does the code have to be changed, but the wiki also has to be re-indexed for the changes to take effect. This can make tracking down problems much harder.
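
For reference, the Russian change amounts to a character-level mapping along these lines (a sketch: I'm assuming the ё/е folding from T124592, and the filter name here is made up):

    // Sketch: fold Cyrillic ё into е at the character level for Russian.
    // Good for Russian, where ё is routinely written as е, but wrong for
    // other Cyrillic-script languages where the two letters stay distinct.
    $settings['analysis']['char_filter']['russian_e_folding'] = [
        'type' => 'mapping',
        'mappings' => [
            'ё=>е',
            'Ё=>Е',
        ],
    ];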

The code is done, though still in WIP status. It also incorporates Stas's changes to refactor the analysis config builder for Wikibase. The code hasn't been merged yet, because I'm waiting on the final list of language pairs to retain—so at most a few lines of an array need to be deleted and we'll be set.

There's been some discussion about Mirandese and Limburgish; the rest have been silent so far. If we do get complaints afterward, it's easy enough to reverse course on any particular language pair.

A non-WIP patch is up. Even after some discussion, it seemed best to remove the wrong analyzer in every case.

Side note: I may pick up working on a very basic analyzer for Mirandese as a 10% project—just adding some stop words and handling elision, which looks very similar to French. If that works out, then doing more like that for other languages would be an interesting project or side project.

I wonder, shouldn't we use "cjk" for the "other" zh languages? Right now, with the config we have, not all of them are even using ICU.
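
Something along these lines, perhaps (a sketch; which language codes to include would need checking):

    // Sketch: assign the generic "cjk" analyzer to Chinese-variant wikis
    // that currently fall through to the plain default analyzer.
    foreach ( [ 'wuu', 'gan', 'hak', 'zh-yue', 'zh-classical' ] as $code ) {
        $analyzerFallbacks[$code] = 'cjk';
    }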

Change 382033 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Remove Messaging Fallback Languages as Language Analysis Fallbacks

https://gerrit.wikimedia.org/r/382033

I've created T177871 for re-indexing affected wikis and added it to T147505 (the recurring re-indexing ticket).

I've got some additional follow-up tasks. These first ones are more pressing:

  • briefly review the configs of all un-fallbacked languages
    • investigate using CJK (or ICU) for "other" zh languages
  • spot test known issues and examples (see notes on MediaWiki) as re-indexing progresses.

These seem quick, so I'll follow up in comments on T177871 and with any additional tasks needed.

These are less pressing, but good things to do, so I've created or linked to tasks for them:

  • investigate enabling ICU tokenization for all defaults (T177876)
  • investigate Elasticsearch Norwegian Nynorsk stemmer (T177877)
  • disable word break helper where it doesn’t/can’t do anything (and T170625 more generally)

This is potentially a longer-term project that has always been on my mind, but I'm not going to create a ticket/epic for it right now.

  • Fix language-specific folding in general (i.e., language by language)

Very nicely done, @TJones!