[EPIC] Harmonize language analysis across languages
Open, Medium, Public

Description

Various languages have additional features and filters enabled that are not particularly language-specific. Unpacked analyzers get automatic ICU normalization upgrades, too. This can reveal inconsistencies in language analysis across wikis. For example, CamelCase is split by English Wikipedia, but not French Wikipedia (see T219108).
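
To make the CamelCase inconsistency concrete, here is a minimal Python sketch of two analysis chains that differ only in whether case-change splitting is enabled. It is an illustration, not the Elasticsearch or CirrusSearch implementation: `split_camel_case` and `analyze` are hypothetical helpers, and the regex only recognizes ASCII letters.

```python
import re


def split_camel_case(token):
    """Split a token at each lowercase-to-uppercase transition (ASCII only)."""
    # Insert a space at every zero-width point between [a-z] and [A-Z],
    # then split on whitespace.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def analyze(text, split_case):
    """Toy analyzer: whitespace tokenize, optionally split CamelCase, lowercase."""
    tokens = []
    for raw in text.split():
        pieces = split_camel_case(raw) if split_case else [raw]
        tokens.extend(piece.lower() for piece in pieces)
    return tokens


# Same input, two chains: one splits CamelCase, one does not.
with_split = analyze("CamelCase search", split_case=True)
without_split = analyze("CamelCase search", split_case=False)
```

With splitting enabled, a query for "camel" can match the document; without it, the only indexed form of the first word is "camelcase", which is the kind of cross-wiki difference described above.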

So, let's compare the top 20+ analyzers and see how they treat a mixed-language sample of documents (from, say, those Wikipedias plus a handful of docs from each of the top 100 Wikipedias) and look for ways to increase consistency. (It may not be necessary to do a 20-way comparison; we may be able to take a survey of components, remove the language-specific ones, and see what differences the remainder cause for different kinds of text.)

Likely necessary steps for eventual harmonization:

  • T272606 Unpack all current monolithic analyzers. (With care; Greek wasn't trivial, for example.)
    • Investigate and file tickets or patches upstream for non-Elastic analyzers that cannot be unpacked.
  • T315118 Handle apostrophe-like characters better
  • T170625 Figure out what to do with word_break_helper.
    • Find or create a plugin to identify acronyms (N.A.S.A.) and de-periodize them (NASA); compare tokenizing wikipedia.org
  • T219108 Figure out whether aggressive_splitting makes sense everywhere.
  • T180387 Look into enabling hiragana/katakana mapping everywhere.
  • T332337 Put back together some multi-script tokens split by the icu_tokenizer (e.g., NGi, И, XNGiИX or Ko, Я, nKoЯn)
  • T332342 See if it makes sense to standardize ASCII folding/ICU folding; some languages have ASCII folding disabled, some have it enabled, some have it enabled with the option to preserve the unfolded original, some upgrade ASCII folding (with or without preserving the original) to ICU folding.
  • Refactor existing analysis configs to use AnalyzerBuilder where possible (some may happen incidentally as part of the above), possibly including for the default config.
  • T358495 Enable dotted_I_fix (almost) everywhere and investigate enabling Turkish lowercase for languages that distinguish I/ı and İ/i.
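
The acronym step in the list above can be sketched roughly as follows. This is a hypothetical illustration in Python, not an existing plugin: `deperiodize` and `tokenize` are made-up names, and the "split on periods" fallback is only an assumed stand-in for what a period-breaking filter such as word_break_helper might do to a token like wikipedia.org.

```python
import re

# Single letters each followed by a period, at least twice, with an optional
# final letter: matches "N.A.S.A." and "N.A.S.A", but not "wikipedia.org".
ACRONYM = re.compile(r"(?:[^\W\d_]\.){2,}[^\W\d_]?")


def deperiodize(token):
    """Return the token without periods if the whole token looks like an acronym."""
    if ACRONYM.fullmatch(token):
        return token.replace(".", "")
    return token


def tokenize(text):
    """Toy tokenizer: keep acronyms together, break other tokens on periods."""
    tokens = []
    for raw in text.split():
        fixed = deperiodize(raw)
        if fixed != raw:
            tokens.append(fixed)
        else:
            tokens.extend(part for part in raw.split(".") if part)
    return tokens


result = tokenize("N.A.S.A. runs wikipedia.org")
```

Note that this simple pattern would also collapse abbreviations such as "e.g." into "eg", so a real implementation would need to decide whether that is acceptable.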

This may also require coming up with a clever way of configuring all of these options, since they may not make sense for all languages—for example, the hiragana/katakana mapping is probably undesirable on Japanese-language wikis—or may require custom ordering with respect to other analysis components. Hard-coded config for each language with language-specific components is possible, but not desirable.

The new(ish) AnalyzerBuilder will make some of this much more orderly and understandable, while also making it easy to update the defaults for almost every language in one go.
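
One possible shape for "shared defaults with per-language opt-outs" is sketched below in Python. This is not the AnalyzerBuilder API; the filter names, the `EXCLUDED` table, and `build_chain` are all hypothetical and only illustrate the idea of changing defaults in one place while letting a language skip a component.

```python
# Hypothetical shared defaults applied to (almost) every language.
DEFAULT_FILTERS = ["camel_case_split", "kana_map", "icu_folding"]

# Hypothetical per-language opt-outs, e.g. no kana mapping on Japanese wikis.
EXCLUDED = {
    "ja": {"kana_map"},
}


def build_chain(lang, language_specific=()):
    """Return shared filters minus opt-outs, followed by language-specific ones."""
    skipped = EXCLUDED.get(lang, set())
    shared = [name for name in DEFAULT_FILTERS if name not in skipped]
    return shared + list(language_specific)


ja_chain = build_chain("ja")
fr_chain = build_chain("fr", language_specific=["french_stemmer"])
```

Adding a new default to `DEFAULT_FILTERS` would then reach every language that has not opted out. The sketch ignores the custom-ordering problem mentioned above, which a real configuration scheme would still have to solve.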

Event Timeline

TJones renamed this task from Harmonize language analysis across languages to [EPIC] Harmonize language analysis across languages. Aug 27 2020, 8:13 PM
TJones added a project: Epic.
TJones moved this task from Language Stuff to [epic] on the Discovery-Search board.

Is this something we should report in Tech News, in that it will have some small effect on search results? Or is the user-facing effect too minimal and the benefits will mainly be seen on the backend side?

I'm not sure if it is worthy of Tech News. There will be small improvements to search results in various languages, but, as with many search changes, the impact may be too minimal for anyone to notice in day-to-day use. There should be slightly fewer queries that get zero results for some languages, but plenty of queries will still get zero results. A few specific queries—particularly those where searchers write informally and omit certain "correct but not necessary" diacritics, or are trying to match non-native diacritics they can't easily type (as with T226812)—may get much better results. Other queries will get additional results, but no one will notice because they will (correctly) not be ranked very highly.

It's also hard to predict some of the improvements because it depends on how people type when they search—users who write queries more formally (i.e., with all the correct-but-not-necessary diacritics) will see fewer improvements as a group. Lazy typists (like me!) may see more benefit.
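
As a rough illustration of why diacritic folding helps informal typists, here is a small Python sketch. It approximates folding by stripping combining marks after Unicode decomposition; this is an assumption for illustration and much narrower than what ASCII folding or ICU folding actually do. `fold` and `fold_preserving_original` are hypothetical helper names.

```python
import unicodedata


def fold(token):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_preserving_original(token):
    """Emit the original token, plus the folded form when it differs."""
    folded = fold(token)
    return [token] if folded == token else [token, folded]


indexed = fold_preserving_original("caf\u00e9")  # "café"
query = fold("cafe")
```

A query typed without the accent now matches the folded form in the index, while a query typed with the accent can still match the preserved original.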

@Johan This appeared in TechNews and said

Searching on Wikipedia will find more results in some languages

my emphasis on Wikipedia.

Is this solely Wikipedias or will it be all WMF wikis? #AskingForAllTheNonWikipedias

All improvements to the language analysis will be for all wikis in that language. (Taking into account some fine gradations, such as Portuguese and Brazilian Portuguese counting as separate languages.)

The impact of those improvements will depend on the contents of the wiki and the behavior of searchers. I'm currently using a sample of articles from the relevant Wikipedia and Wiktionary to test the changes before they are deployed, and a sample of Wikipedia queries to assess the impact of the changes after they are deployed, so Wikipedia does/will have the most well-measured changes, but the changes will apply on all wikis set to a given language.

Thanks. One never knows whether the use of the term "Wikipedia" is purposeful or not. It is confusing when sometimes it is used interchangeably and sometimes not.

I would think that Wikidata would be one where this is quite desired, as it is a multilingual wiki where no one is proficient with all the different local character sets.

@Billinghurst Yeah, that was a mental slip from my side. My apologies.

@RoySmith I was unable to replicate these issues just now. If they are still causing problems, please feel free to file a ticket for the search team to look into. Thanks!

Change 941060 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Add remove_duplicates to Hebrew and refactor

https://gerrit.wikimedia.org/r/941060

While harmonizing, I noticed that the Hebrew analysis chain was creating a lot of duplicate tokens. Adding a remove_duplicates filter removed 19.7% (Wikipedia) to 22.7% (Wiktionary) of all tokens—all non-Hebrew and many Hebrew tokens were duplicated! Did a lot of refactoring (checked off the task above!), too.
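
For readers unfamiliar with the filter, the effect of removing duplicate tokens can be sketched in Python as below. This is a simplified model, assuming tokens are represented as (position, term) pairs and that a duplicate means the same term at the same position; it is not the Elasticsearch code, and the sample tokens are invented.

```python
def remove_duplicates(tokens):
    """Drop tokens that repeat an earlier (position, term) pair, keeping order."""
    seen = set()
    unique = []
    for position, term in tokens:
        if (position, term) not in seen:
            seen.add((position, term))
            unique.append((position, term))
    return unique


# Invented sample: the analysis chain emitted "wiki" twice at position 0.
stream = [(0, "wiki"), (0, "wiki"), (1, "search"), (2, "wiki")]
deduped = remove_duplicates(stream)
```

The token at position 2 is kept because it is a separate occurrence of the word; only the redundant copy at position 0 is dropped.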

Small write up on MediaWiki.

Change 941060 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add remove_duplicates to Hebrew and refactor

https://gerrit.wikimedia.org/r/941060