[EPIC] Harmonize language analysis across languages
Open, Medium, Public

Description

Various languages have additional features and filters enabled that are not particularly language-specific. Unpacked analyzers get automatic ICU normalization upgrades, too. This can reveal inconsistencies in language analysis across wikis. For example, CamelCase is split by English Wikipedia, but not French Wikipedia (see T219108).
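
To make the CamelCase inconsistency concrete, here is a minimal sketch of how the same text could produce different index terms depending on whether splitting is enabled. The helper names `split_camel_case` and `index_terms` are hypothetical, and the character-by-character split is only a stand-in for whatever the real analysis chain does.

```python
def split_camel_case(token: str) -> list:
    """Split a token at every lowercase-to-uppercase boundary."""
    parts = []
    start = 0
    for i in range(1, len(token)):
        if token[i - 1].islower() and token[i].isupper():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


def index_terms(text: str, split_camel: bool) -> set:
    """Whitespace-tokenize, optionally split CamelCase, then lowercase."""
    terms = set()
    for token in text.split():
        pieces = split_camel_case(token) if split_camel else [token]
        terms.update(piece.lower() for piece in pieces)
    return terms


# Illustration only: one "wiki" splits CamelCase, the other does not.
with_split = index_terms("CamelCase", split_camel=True)
without_split = index_terms("CamelCase", split_camel=False)
```

Under these assumptions a query for "case" would match the first set of terms but not the second, which is the kind of cross-wiki difference described above.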

So, let's compare the top 10-20 analyzers and see how they treat a mixed-language sample of documents (from, say, those Wikipedias plus a handful of docs from each of the top 100 Wikipedias) and look for ways to increase consistency. (It may not be necessary to do a 10- or 20-way comparison; we may be able to take a survey of components, remove the language-specific ones, and see what differences the remaining ones cause for different kinds of text.)

Likely necessary steps for eventual harmonization:

  • Unpack all current monolithic analyzers. (With care; Greek wasn't trivial, for example.)
    • Investigate and file tickets or patches upstream for non-Elastic analyzers that cannot be unpacked.
  • T170625 Figure out what to do with word_break_helper.
    • Find or create a plugin to identify acronyms (N.A.S.A.) and de-periodize them (NASA); compare tokenizing wikipedia.org
  • T219108 Figure out whether aggressive_splitting makes sense everywhere.
  • T180387 Look into enabling hiragana/katakana mapping everywhere.
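
As a rough illustration of the acronym step in the list above, the sketch below de-periodizes runs of single letters that are each followed by a period, while leaving something like wikipedia.org alone. It is not an existing plugin; the function name `deperiodize` and the regular expression are assumptions made for this example, and it deliberately ignores harder cases such as acronyms written without a trailing period.

```python
import re

# Two or more "letter + period" pairs, not embedded in a longer word.
# [^\W\d_] matches any Unicode letter (word character that is not a
# digit or underscore).
_ACRONYM = re.compile(r"(?<!\w)(?:[^\W\d_]\.){2,}(?!\w)")


def deperiodize(text: str) -> str:
    """Remove the periods from dotted acronyms such as N.A.S.A."""
    return _ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)


acronym = deperiodize("N.A.S.A.")
domain = deperiodize("wikipedia.org")
```

Here "N.A.S.A." becomes "NASA", while "wikipedia.org" passes through unchanged, so any later period handling still sees the domain name intact. Short abbreviations such as "e.g." would also be collapsed, which may or may not be desirable.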

This may also require coming up with a clever way of configuring all of these options, since they may not make sense for all languages—for example, the hiragana/katakana mapping is probably undesirable on Japanese-language wikis—or may require custom ordering with respect to other analysis components. Hard-coded config for each language with language-specific components is possible, but not desirable.
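
One possible shape for such a configuration, sketched under loud assumptions: a shared, ordered list of default steps plus a small per-language opt-out table, rather than a full hard-coded config per language. The step names (`kana_map`, `lowercase`), the opt-out table, and the hiragana-to-katakana direction of the mapping are all hypothetical and chosen only for illustration.

```python
# Hiragana U+3041..U+3096 line up with katakana U+30A1..U+30F6 at a fixed
# offset of 0x60, so the mapping can be a simple translation table.
HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def map_kana(text: str) -> str:
    """Map hiragana to katakana (direction is an assumption)."""
    return text.translate(HIRAGANA_TO_KATAKANA)


# Hypothetical step registry; list order is the order of application.
STEP_FUNCS = {"kana_map": map_kana, "lowercase": str.lower}
DEFAULT_STEPS = ["kana_map", "lowercase"]

# Languages opt out of individual steps instead of redefining everything.
OPT_OUT = {"ja": {"kana_map"}}


def steps_for(lang: str) -> list:
    """Return the default steps minus any this language opts out of."""
    skipped = OPT_OUT.get(lang, set())
    return [name for name in DEFAULT_STEPS if name not in skipped]


def analyze(text: str, lang: str) -> str:
    """Apply the configured steps for a language, in order."""
    for name in steps_for(lang):
        text = STEP_FUNCS[name](text)
    return text
```

With this layout, `analyze("ひらがな", "fr")` maps the kana while `analyze("ひらがな", "ja")` leaves it untouched. Custom ordering for a language would need something more than an opt-out set, for example a per-language override of the step list.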

Event Timeline

TJones renamed this task from Harmonize language analysis across languages to [EPIC] Harmonize language analysis across languages. Aug 27 2020, 8:13 PM
TJones added a project: Epic.
TJones moved this task from Language Stuff to [epic] on the Discovery-Search board.

Is this something we should report in Tech News, in that it will have some small effect on search results? Or is the user-facing effect too minimal and the benefits will mainly be seen on the backend side?

I'm not sure if it is worthy of Tech News. There will be small improvements to search results in various languages, but, as with many search changes, the impact may be too minimal for anyone to notice in day-to-day use. There should be slightly fewer queries that get zero results for some languages, but plenty of queries will still get zero results. A few specific queries—particularly those where searchers write informally and omit certain "correct but not necessary" diacritics, or are trying to match non-native diacritics they can't easily type (as with T226812)—may get much better results. Other queries will get additional results, but no one will notice because they will (correctly) not be ranked very highly.

It's also hard to predict some of the improvements because it depends on how people type when they search—users who write queries more formally (i.e., with all the correct-but-not-necessary diacritics) will see fewer improvements as a group. Lazy typists (like me!) may see more benefit.
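
A rough sketch of the diacritics case, for readers who want something concrete: if both the indexed text and the query are folded the same way, a query typed without diacritics can still match. This uses Python's `unicodedata` as a stand-in and is not the ICU-based folding used in production; `fold_diacritics` is a hypothetical helper.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFC", stripped)


# A "lazy typist" query and a fully accented title fold to the same string.
query = fold_diacritics("resume")
title = fold_diacritics("r\u00e9sum\u00e9")  # "résumé"
```

Both values end up as "resume", so the informal query matches. Letters that are not built from a base letter plus a combining mark (for example "ø") are not affected by this approach.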

@Johan This appeared in Tech News and said:

Searching on Wikipedia will find more results in some languages

my emphasis on Wikipedia.

Is this solely Wikipedias or will it be all WMF wikis? #AskingForAllTheNonWikipedias

All improvements to the language analysis will be for all wikis in that language. (Taking into account some fine gradations, such as Portuguese and Brazilian Portuguese counting as separate languages.)

The impact of those improvements will depend on the contents of the wiki and the behavior of searchers. I'm currently using a sample of articles from the relevant Wikipedia and Wiktionary to test the changes before they are deployed, and a sample of Wikipedia queries to assess the impact of the changes after they are deployed, so Wikipedia does/will have the most well-measured changes, but the changes will apply on all wikis set to a given language.

Thanks. One never knows whether the use of the term "Wikipedia" is purposeful or not. It is confusing when sometimes it is used interchangeably and sometimes not.

I would think that Wikidata would be one where this is quite desired, as it is a multilingual wiki where no one is proficient with all the different local character sets.

@Billinghurst Yeah, that was a mental slip from my side. My apologies.