|Open||None||T219550 [EPIC] Harmonize language analysis across languages|
|Open||None||T272606 [EPIC] Unpack all Elasticsearch analyzers|
|Resolved||TJones||T277699 Unpack Spanish Elasticsearch Analyzer|
|Resolved||TJones||T282808 Reindex Spanish-language wikis to enable unpacked version of Spanish analysis|
Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia
- While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.
- I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, filtered out 89 queries (porn, URLs, and other junk), and randomly sampled 3000 queries from the remainder.
- I used a brute-force strategy to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (each run took about 9 minutes to complete), 6 times in all. Reindexing finished just after the 7th iteration had started (it started about 11 minutes after the 6th instead of the usual 10), so I stopped that iteration because its results were mixed. I then ran an 8th iteration as another control.
- I compared each iteration against the subsequent one, and compared the 1st to the 6th (50 minutes apart) to get insight into "trends" vs "noise" in the comparisons.
- I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.
- Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.
- Before reindexing, the number of queries getting zero results held steady at 19.3%.
- The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10-minute intervals; 5.2% over 50 minutes).
- The number of queries getting fewer results is noise (0.1% to 1.4% in 10-minute intervals; 1.4% over 50 minutes).
- The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10-minute intervals; 3.8% over 50 minutes).
- The number of queries changing their top result is noise (0.7% to 0.9% in 10-minute intervals; 0.7% over 50 minutes).
- These results are also generally consistent with the control tests I ran in April and May.
- The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
- The zero results rate dropped to 18.9% (-0.4% absolute change; -2.1% relative change).
- The number of queries getting a different number of results rose to 20.2% (vs. the 0.7%–2.4% range seen in control).
- The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs. 0.1%–1.4%). It is improbable, but not impossible, that this is still random noise. I don't have any obvious explanation after looking at the queries in question.
- The number of queries getting more results was 17.7% (vs. the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not among the former zero-results queries.
- The number of queries that changed their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.
- The one control test I ran after reindexing showed changes approximately within the normal range, except for changes in top result, which were at 0% (vs. 0.7%–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.
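
The comparison between two iterations described above can be sketched roughly as follows. This is only an illustration of the metrics, not the tooling actually used: `QueryResult`, `compare_runs`, `zero_results_rate`, and `change` are hypothetical names, and it assumes each iteration is stored as a mapping from query string to its hit count and top-result title. It also assumes a query that goes from zero results to some results counts as having changed its top result.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class QueryResult:
    """Hypothetical record of what one query returned in one iteration."""
    total_hits: int           # number of results reported for the query
    top_title: Optional[str]  # title of the first result; None if zero results


Run = Dict[str, QueryResult]  # query string -> result for one iteration


def pct(count: int, total: int) -> float:
    """Percentage of `total`, rounded to one decimal place."""
    return round(100.0 * count / total, 1) if total else 0.0


def zero_results_rate(run: Run) -> float:
    """Share of queries in a single iteration that returned nothing."""
    return pct(sum(1 for r in run.values() if r.total_hits == 0), len(run))


def compare_runs(before: Run, after: Run) -> Dict[str, float]:
    """Compare two iterations over the queries present in both."""
    shared = [q for q in before if q in after]
    fewer = sum(1 for q in shared if after[q].total_hits < before[q].total_hits)
    more = sum(1 for q in shared if after[q].total_hits > before[q].total_hits)
    # Assumption: None -> title (or title -> None) counts as a changed top result.
    top_changed = sum(1 for q in shared if after[q].top_title != before[q].top_title)
    n = len(shared)
    return {
        "different_count": pct(fewer + more, n),
        "fewer_results": pct(fewer, n),
        "more_results": pct(more, n),
        "top_result_changed": pct(top_changed, n),
    }


def change(old_rate: float, new_rate: float) -> Tuple[float, float]:
    """Absolute change (percentage points) and relative change (percent)."""
    absolute = new_rate - old_rate
    return round(absolute, 1), round(100.0 * absolute / old_rate, 1)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the 3000-query sample.
    before = {"dia": QueryResult(10, "Día"), "jose": QueryResult(0, None)}
    after = {"dia": QueryResult(25, "Día"), "jose": QueryResult(7, "José")}
    print(compare_runs(before, after))
    # Zero results rate before vs. after reindexing, as reported above.
    print(change(19.3, 18.9))
```

Applying `compare_runs` to consecutive pre-reindex iterations gives the kind of "noise" range the control figures describe, and applying it to the last pre-reindex and first post-reindex iterations gives the figures to hold against that range.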