
Reindex all wikis to enable folding harmonization and new functionality
Closed, Resolved | Public | 3 Estimated Story Points

Description

Once T332342 is deployed (probably by the week of Sept 30, 2024), we can reindex everything to make sure all the folding harmonization is in effect.

Technically, 31 languages have new ICU folding enabled, 5 languages (all with big wikis, including English) have had text and plain harmonized, and all the previously ICU-enabled languages have had icu_folding added to lowercase_keyword.
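For readers unfamiliar with folding, here is a rough sketch of the idea in Python. It is only an approximation built on the standard library, not the ICU folding filter itself, and the function name `approx_fold` is made up for illustration. It decomposes characters, drops combining marks, and case-folds the result.

```python
import unicodedata


def approx_fold(text):
    """Approximate diacritic folding (hypothetical helper, NOT ICU folding)."""
    # Split precomposed characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, diaereses, rings, and so on).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that matching is also case-insensitive.
    return stripped.casefold()


# Characters with no decomposition (such as "ø") pass through unchanged here;
# a real folding filter covers many more cases than this sketch does.
print(approx_fold("Café"))
print(approx_fold("Ångström"))
```

With folding like this applied at both index time and query time, a query typed without diacritics can match text written with them, and vice versa.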

The previous custom languages cover most of the high-volume wikis, and the 31 new languages were partially sorted by volume, so it is probably easier and less (human) work, and only slightly more computational effort, to do a full reindex than to pick and choose the right subset without error. A full reindex will propagate the changes to multi-language indexes like Commons and Wikidata, too.

Also perform a before-and-after analysis for the targeted wikis that have had ICU folding added.

Event Timeline

dr0ptp4kt set the point value for this task to 3.

Full write-up and table of stats on MediaWiki.

  • 100% of the 25 non-small language samples (30+ queries) that had tokens affected by folding showed improvement in their zero-results rate, and almost all of the small samples did, too.
  • In the general sample (unweighted but filtered samples for each language Wikipedia), 84% showed improvements in ZRR, number of results, or top result, indicating that "foreign" diacritics and other variant characters are relatively common (≥ 0.1% of queries) in general search.

Side note on weighted vs unweighted ZRR for @EBernhardson and anyone else interested: The difference between weighted and unweighted query counts and ZRR was often 0 and usually less than 10% (for ZRR that's only a few percentage points, e.g., 30% vs 33% ZRR). Vietnamese is a wild outlier, with a social media scandal post title repeated more than 12K times (plus four other similar queries repeated over 100 times), leading to 1K unweighted vs 15K weighted queries, and 19% unweighted vs 82% weighted ZRR. Better filtering or bot detection or something would probably be helpful in such cases! (These are longish queries, too, but not long enough or junky enough to get caught in the "junky long query" filter I already use.) A few more details are in the write-up on-wiki.
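To make the weighted vs unweighted distinction concrete, here is a minimal sketch in Python. It assumes "unweighted" means each distinct query string counts once and "weighted" means every search event counts, which is one reading of the numbers above. The function name, the log format, and the sample data are all invented for illustration.

```python
def zero_results_rates(query_log):
    """Return (unweighted_zrr, weighted_zrr) for a list of (query, hit_count) events.

    Hypothetical helper: the (query, hit_count) log format is an assumption.
    """
    events = list(query_log)
    if not events:
        raise ValueError("empty query log")

    # Weighted: every search event counts, so a query repeated 8 times counts 8 times.
    weighted = sum(1 for _, hits in events if hits == 0) / len(events)

    # Unweighted: each distinct query string counts once (first occurrence wins).
    distinct = {}
    for query, hits in events:
        distinct.setdefault(query, hits)
    unweighted = sum(1 for hits in distinct.values() if hits == 0) / len(distinct)

    return unweighted, weighted


# Invented sample: one zero-result query repeated many times dominates the weighted rate.
sample_log = [("repeated post title", 0)] * 8 + [("paris", 5), ("zzzqx", 0)]
unweighted_zrr, weighted_zrr = zero_results_rates(sample_log)
print(f"unweighted ZRR: {unweighted_zrr:.0%}, weighted ZRR: {weighted_zrr:.0%}")
```

A single heavily repeated zero-result query barely moves the unweighted rate but can dominate the weighted one, which is the pattern seen in the Vietnamese sample.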