Reindex German, Dutch, and Portuguese Wikis to Enable Unpacked Versions
Open, Needs Triage; Public; 3 Estimated Story Points

Description

Once T281379 is deployed in MediaWiki_1.37/wmf.8, we can reindex the relevant wikis to activate ICU normalization, ICU folding, and homoglyph normalization.
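
To illustrate why folding matters for search, here is a minimal Python sketch of diacritic folding. It is only an approximation built on the standard library's unicodedata module, not the ICU folding filter that the reindexed wikis actually use; ICU folding covers far more cases. The function names fold_diacritics and matches are hypothetical and exist only for this example.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Rough stand-in for diacritic folding (NOT the real ICU folding filter)."""
    # NFKD decomposition splits "ã" into "a" plus a combining tilde.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (Unicode category Mn), then lowercase.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()


def matches(query: str, indexed_term: str) -> bool:
    """True if the query and the indexed term are equal after folding."""
    return fold_diacritics(query) == fold_diacritics(indexed_term)


if __name__ == "__main__":
    # A query typed without diacritics can now match the accented form.
    for query, term in [("Sao Paulo", "São Paulo"),
                        ("coracao", "coração"),
                        ("Belgie", "België")]:
        print(query, "->", term, ":", matches(query, term))
```

Without folding, each of these pairs would be compared as different strings, which is one way a query can end up with zero results.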

Current counts are: German, 9 wikis; Dutch, 9 wikis; Portuguese, 9 wikis (excluding 1 pt-br wiki).

Acceptance Criteria

  • All wikis in the relevant languages are reindexed
  • A before-and-after analysis for each language's Wikipedia is provided

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-06-22T13:29:00Z] <Trey314159> reindexing German wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (T284185)

Mentioned in SAL (#wikimedia-operations) [2021-06-22T16:41:19Z] <Trey314159> reindexing Dutch wikis on elastic@eqiad, elastic@codfw, and cloudelastic (T284185)

Mentioned in SAL (#wikimedia-operations) [2021-06-22T20:12:37Z] <Trey314159> reindexing Dutch wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (T284185)

Mentioned in SAL (#wikimedia-operations) [2021-06-22T20:12:55Z] <Trey314159> reindexing Portuguese wikis on elastic@eqiad, elastic@codfw, and cloudelastic (T284185)

Mentioned in SAL (#wikimedia-operations) [2021-06-23T00:02:32Z] <Trey314159> reindexing Portuguese wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (T284185)

TJones renamed this task from Reindex German, Dutch, and Portugese Wikis to Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions. Jun 23 2021, 10:03 PM

More details on MediaWiki: DE/NL/PT Reindexing Impacts.

Summary:

  • All
    • ICU folding is the biggest driver of both the zero-results improvements and the increased number of results, usually because of missing diacritics.
    • Reindexing seems to decrease discrepancies between shards, which lowers the rate of queries with a different top result.
    • Improved junk-query filtering, because in German six consonants in a row is perfectly normal; also improved URL and email detection, and filtered out very long queries with zero results (another form of junk).
  • German (3K sample)
    • I also disabled the folding of ß to ss in the plain field, as per T87136.
    • Zero-results rate dropped by 0.3 percentage points (absolute), a 1.4% relative decrease
    • 11.5% of queries got more results in general
    • The query "was heisst s.w.a.t." (German for "what does S.W.A.T. mean?") dropped from 3369 results to 67 because ß and ss no longer match in the plain field, and s.w.a.t. is not broken up in the text field (word_break_helper is not enabled)
  • Dutch (3K sample)
    • Zero-results rate dropped by 0.1 percentage points (absolute), a 0.4% relative decrease
    • 8.0% of queries got more results in general
  • Portuguese (3K sample)
    • Zero-results rate dropped by 0.4 percentage points (absolute), a 2.1% relative decrease
    • 15.3% of queries got more results in general
    • Missing tildes (a instead of ã, or o instead of õ) are the biggest sources of changes
    • The query 1926~ got a lot more hits, but I don't know why. Weird.
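
The absolute and relative changes in the zero-results rate quoted above are related by simple arithmetic. The sketch below shows one way to compute both from per-query result counts. The function names and the sample counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the data from the 3K samples, and this is not necessarily how the original analysis was done.

```python
def zero_results_rate(result_counts):
    """Fraction of queries that returned zero results."""
    if not result_counts:
        raise ValueError("need at least one query")
    return sum(1 for n in result_counts if n == 0) / len(result_counts)


def rate_change(before, after):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = (after - before) * 100
    relative_pct = (after - before) / before * 100
    return absolute_pp, relative_pct


if __name__ == "__main__":
    # Hypothetical result counts for the same 10 queries, before and after reindexing.
    before_counts = [0, 0, 12, 5, 0, 40, 3, 0, 7, 1]  # 4 of 10 queries get zero results
    after_counts = [0, 2, 12, 5, 0, 44, 3, 0, 7, 1]   # 3 of 10 queries get zero results

    before = zero_results_rate(before_counts)
    after = zero_results_rate(after_counts)
    absolute_pp, relative_pct = rate_change(before, after)
    print(f"absolute: {absolute_pp:+.1f} points, relative: {relative_pct:+.1f}%")
```

The relative figure is always larger in magnitude than the absolute one when the starting rate is below 100%, which is why a drop of a few tenths of a percentage point can still be a change of one or two percent in relative terms.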