Page MenuHomePhabricator

Reindex Spanish-language wikis to enable unpacked version of Spanish analysis
Closed, ResolvedPublic3 Estimated Story Points

Description

Follow up to activate changes from T277699, which includes ICU normalization, ICU folding, and homoglyph normalization.

(12 wikis use Spanish at last check.)

Event Timeline

TJones renamed this task from Reindex Spanish-language wikis to Reindex Spanish-language wikis to enable unpacked version of Spanish analysis.May 13 2021, 6:21 PM
TJones updated the task description. (Show Details)

Eh, I'm moving this back to in progress. The reindex is done but there's still a little analysis to do.

Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia

Summary

  • While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.

Background

  • I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, and filtered 89 queries (porn, urls, and other junk) and randomly sampled 3000 queries from the remainder.
  • I used a brute-force strategy to attempt to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (the run took about 9 minutes to complete) 6 times. When the reindexing finished, I stopped the 7th iteration because it was mixed and had just started; it started about 11 minutes after the 6th instead of the usual 10. I ran an 8th iteration as another control.
  • I compared each iteration against the subsequent one, and compared the 1st to the 6th (50 minutes apart) to get insight into "trends" vs "noise" in the comparisons.
  • I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.

Expected Results

  • Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.

Control Results

  • The number of queries getting zero results held steady at 19.3%
  • The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10 minute intervals; 5.2% over 50 minutes)
  • The number of queries getting fewer results is noise (0.1% to 1.4% in 10 minute intervals; 1.4% over 50 minutes)
  • The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10 minute intervals; 3.8% over 50 minutes)
  • The number of queries changing their top result is noise (0.7% to 0.9% in 10 minute intervals; 0.7% over 50 minutes)
  • These results are also generally consistent with the control tests I ran in April and May.

Reindexing Results

  • The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
  • The zero results rate dropped to 18.9% (-0.4% absolute change; -2.1% relative change).
  • The number of queries getting a different number of results increased by 20.2% (vs. the 0.7%–2.4% range seen in control).
  • The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs 0.1%–1.4%). That's improbable but not impossible to still be random noise. I don't have any obvious explanation after looking at the queries in question.
  • The number of queries getting more results was 17.7% (vs the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not the former zero results queries.
  • The number of queries that changes their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.

Post-Reindex Control

  • The one control test I ran after reindexing showed changes approximately within the normal range, except for the changes in top result, which was 0 (vs 0.7–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.