There are a number of language fallbacks specified in MediaWiki, particularly in `extensions/UniversalLanguageSelector/lib/jquery.i18n/jquery.i18n.fallbacks.js`, which seem to have been copied to `mediawiki/languages/messages/Messages<Xxx>.php`. While these are often appropriate for choosing the language of messages and banners, they are often woefully inappropriate for linguistic analysis.
Three illustrative examples: Guaraní / Spanish, Wolof / French, and Chechen / Russian. For each X / Y pair, it does make sense that a speaker of X is, for historical or geographical reasons, more likely to know Y than most other world languages, but X and Y are not linguistically related, and processing Y as if it were X makes little if any sense.
The [[ https://wo.wikipedia.org/w/api.php?action=cirrus-settings-dump | Wolof Wikipedia config ]], for example, shows `"french"` as the `"text"` and `"text_search"` analyzers. My favorite current example of nonsensical language processing is searching for [[ https://wo.wikipedia.org/w/index.php?title=Jagleel:Ceet&profile=default&fulltext=Search&search=lorsqu%27ele | "lorsqu'ele" on the Wolof Wikipedia ]]. If you know a little French and a [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Adding_Ascii-Folding_to_French_Wikipedia#Unexpected_Features_of_French_Analysis_Chain | few secrets ]] about the French analysis chain, it's no surprise that it returns matches on //element//, but I don't think that's the expected result in Wolof (though Wolof speakers are more likely to know some French, and thus may not be entirely surprised).
I've compiled [[ https://www.mediawiki.org/w/index.php?title=User:TJones_(WMF)/Notes/Fallback_Langauges | a list of 180+ such fallbacks ]] that are currently in use. We should systematically work our way through these and disable the inappropriate fallbacks and re-index the relevant wikis. This should be done in consultation with the relevant communities for each language. While some seem patently silly, others are harder to evaluate as someone not familiar with the languages involved. For example, given that Ukrainian and Russian are still [[ https://en.wikipedia.org/wiki/Ukrainian_language | somewhat mutually intelligible ]] (and use the same writing system), maybe using the Russian language analyzer instead of the `default` analyzer is a net positive on Ukrainian Wikipedia.
We also need to refactor `AnalysisConfigBuilder.php` to allow us to move away from using the current fallbacks, and to specify linguistically reasonable fallbacks, if any. To start, we can decouple the current code from the `Messages<Xxx>.php` files and replicate the current behavior. The next step would be to remove the irrelevant mappings (i.e., those involving languages for which there are no wikis), and then begin the big effort of determining the correct action for the remaining mappings (keep, modify, delete) and re-indexing as needed.
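To make the proposed steps concrete, here is a rough, language-neutral sketch in Python (the real change would be in PHP). It is not how `AnalysisConfigBuilder.php` works today. Every name in it (`Fallback`, `Action`, `ANALYSIS_FALLBACKS`, `prune_unused`, `resolve_analyzer`) is hypothetical, and the table contains only the example pairs mentioned above plus one made-up placeholder code. It only covers languages that currently get an analyzer through a fallback.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Set


class Action(Enum):
    KEEP = "keep"      # the inherited analyzer is judged reasonable
    MODIFY = "modify"  # use a different analyzer than the inherited one
    DELETE = "delete"  # drop the fallback and use the default analyzer


@dataclass(frozen=True)
class Fallback:
    analyzer: str                      # analyzer currently inherited via the message fallback
    action: Action = Action.KEEP       # KEEP everywhere replicates current behavior
    replacement: Optional[str] = None  # only consulted when action is MODIFY


# Hypothetical explicit table, decoupled from the Messages<Xxx>.php files.
# Illustrative entries only; "xx-nowiki" is a made-up code with no wiki.
ANALYSIS_FALLBACKS: Dict[str, Fallback] = {
    "gn": Fallback("spanish"),   # Guaraní / Spanish
    "wo": Fallback("french"),    # Wolof / French
    "ce": Fallback("russian"),   # Chechen / Russian
    "uk": Fallback("russian"),   # Ukrainian / Russian
    "xx-nowiki": Fallback("french"),
}


def prune_unused(table: Dict[str, Fallback], wiki_langs: Set[str]) -> Dict[str, Fallback]:
    """Remove mappings for languages that have no wiki."""
    return {lang: fb for lang, fb in table.items() if lang in wiki_langs}


def resolve_analyzer(lang: str, table: Dict[str, Fallback]) -> str:
    """Pick the analyzer name for a language that has no analyzer of its own."""
    fb = table.get(lang)
    if fb is None or fb.action is Action.DELETE:
        return "default"
    if fb.action is Action.MODIFY:
        return fb.replacement or "default"
    return fb.analyzer
```

With every entry left at `KEEP`, `resolve_analyzer("wo", ANALYSIS_FALLBACKS)` still returns `"french"`, matching current behavior. After review, changing a single entry's `action` changes the result for that language only, so each wiki can be re-indexed on its own schedule. Which action each entry should get is exactly the open question above and is not decided by this sketch.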