Various languages have additional features and filters enabled that are not particularly language-specific. Unpacked analyzers also get automatic ICU normalization upgrades. Unpacking can reveal inconsistencies in language analysis across wikis: for example, English Wikipedia splits CamelCase, but French Wikipedia does not (see T219108).
So, let's compare the top 10-20 analyzers and see how they treat a mixed-language sample of documents (say, those Wikipedias plus a handful of docs from each of the top 100 Wikipedias) and look for ways to increase consistency. (A full 10- or 20-way comparison may not be necessary; we may be able to survey the components, set aside the language-specific ones, and see what differences the remainder cause for different kinds of text.)
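One way to run that comparison (a sketch; the analyzer names and token lists below are illustrative, and in practice the tokens would come from each wiki's analysis chain, e.g. via the _analyze API):

```python
from collections import defaultdict

def group_by_token_output(results):
    """Group analyzers that emit identical token streams for one sample text.

    `results` maps an analyzer name to the token list it produced for the
    same input. More than one group means the analyzers disagree on that
    input, which flags a candidate for harmonization.
    """
    groups = defaultdict(list)
    for analyzer, tokens in results.items():
        groups[tuple(tokens)].append(analyzer)
    return dict(groups)

# Hypothetical output for the input "CamelCase": English splits it,
# French and Italian do not.
sample = {
    "english": ["camel", "case"],
    "french": ["camelcase"],
    "italian": ["camelcase"],
}
disagreements = group_by_token_output(sample)
```

Running this over many mixed-language samples and counting how often each pair of analyzers lands in different groups would show which non-language-specific components drive the differences.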
Likely necessary steps for eventual harmonization:
- Unpack all current monolithic analyzers. (This takes care; Greek, for example, was not trivial.)
- Investigate and file tickets or patches upstream for non-Elastic analyzers that cannot be unpacked.
- T170625 Figure out what to do with word_break_helper.
- Find or create a plugin that identifies acronyms (N.A.S.A.) and de-periodizes them (NASA); compare how wikipedia.org is tokenized, to make sure domain names are not mangled.
- T219108 Figure out whether aggressive_splitting makes sense everywhere.
- T180387 Look into enabling hiragana/katakana mapping everywhere.
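On the first step, an "unpacked" analyzer replaces the monolithic language analyzer with its documented component filters, so that extra steps (ICU normalization, acronym handling, etc.) can be spliced in. A sketch for Greek, modeled on Elasticsearch's documented reimplementation of its built-in greek analyzer (expressed here as a Python dict mirroring the settings JSON; a real unpacking would also handle details like keyword_marker exclusions):

```python
# Unpacked equivalent of the monolithic "greek" analyzer. Note the
# Greek-specific lowercase filter (needed for final sigma handling),
# which is one reason Greek was not trivial to unpack.
unpacked_greek_settings = {
    "analysis": {
        "filter": {
            "greek_lowercase": {"type": "lowercase", "language": "greek"},
            "greek_stop": {"type": "stop", "stopwords": "_greek_"},
            "greek_stemmer": {"type": "stemmer", "language": "greek"},
        },
        "analyzer": {
            "text": {
                "tokenizer": "standard",
                # Once the chain is explicit, non-language-specific
                # upgrades can be inserted at the right position here.
                "filter": ["greek_lowercase", "greek_stop", "greek_stemmer"],
            }
        },
    }
}
```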
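For the acronym item, the core rule a plugin would need can be sketched with a regular expression (an illustration of the intended behavior, not a real char_filter implementation): collapse runs of single letters each followed by a period, which leaves multi-letter segments like those in domain names untouched.

```python
import re

# Two or more single letters, each followed by a period, starting at a
# word boundary: matches "N.A.S.A." but not "wikipedia.org", where every
# dotted segment is longer than one letter.
ACRONYM = re.compile(r"\b(?:[A-Za-z]\.){2,}")

def fold_acronyms(text):
    # Caveat: when an acronym ends a sentence, the sentence-final period
    # is absorbed into the match -- one of the details a real plugin
    # would have to handle.
    return ACRONYM.sub(lambda m: m.group().replace(".", ""), text)
```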
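The hiragana/katakana mapping itself is mechanically simple, which is what makes enabling it everywhere tempting: the standard hiragana block (U+3041-U+3096) sits exactly 0x60 code points below its katakana counterparts (U+30A1-U+30F6). A sketch of the fold (in practice this would be a char_filter or ICU transform, not application code, and it deliberately ignores edge cases like archaic kana and iteration marks):

```python
# Map each standard hiragana code point to its katakana counterpart
# via a constant 0x60 code-point shift.
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}

def hira_to_kata(text):
    # Characters outside the hiragana range pass through unchanged.
    return text.translate(HIRA_TO_KATA)
```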
This may also require a clever way of configuring all of these options, since they may not make sense for all languages (for example, the hiragana/katakana mapping is probably undesirable on Japanese-language wikis) or may require custom ordering with respect to other analysis components. Hard-coding a config for each language with language-specific components is possible, but not desirable.