Reindex Spanish-language wikis to enable unpacked version of Spanish analysis
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To

Authored By

	TJones
	May 13 2021, 6:18 PM

Description

Follow up to activate changes from T277699, which includes ICU normalization, ICU folding, and homoglyph normalization.

(12 wikis use Spanish at last check.)

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T277699 Unpack Spanish Elasticsearch Analyzer
Resolved	TJones	T282808 Reindex Spanish-language wikis to enable unpacked version of Spanish analysis

Event Timeline

TJones created this task. · May 13 2021, 6:18 PM

Restricted Application added a subscriber: Aklapper. · May 13 2021, 6:18 PM

TJones added a parent task: T277699: Unpack Spanish Elasticsearch Analyzer. · May 13 2021, 6:19 PM

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing.

TJones renamed this task from "Reindex Spanish-language wikis" to "Reindex Spanish-language wikis to enable unpacked version of Spanish analysis". · May 13 2021, 6:21 PM

TJones updated the task description.

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. · May 17 2021, 3:16 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

MPhamWMF set the point value for this task to 3. · May 17 2021, 3:25 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones updated the task description. · May 17 2021, 4:52 PM

TJones claimed this task. · May 21 2021, 11:31 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Eh, I'm moving this back to in progress. The reindex is done but there's still a little analysis to do.

Unpacking + ICU Norm + ICU Folding Impact on Spanish Wikipedia

Summary

While unpacking an analyzer should have no impact on results, adding ICU folding definitely did for Spanish Wikipedia. The informal writing of queries often omits accents, which decreases recall. Folding those accents had a noticeable impact on the zero results rate, the total number of results returned, and the top result returned for many queries.

Background

I pulled a 10K sample of Spanish Wikipedia queries from February of 2021, filtered out 89 queries (porn, URLs, and other junk), and randomly sampled 3000 queries from the remainder.
I used a brute-force strategy to attempt to detect the impact of reindexing on Spanish Wikipedia. I ran the 3000 queries against the live Wikipedia index every ten minutes (each run took about 9 minutes to complete), for 6 complete iterations. When the reindexing finished, I stopped the in-progress 7th iteration, which had just started and was mixed (partly against the old index, partly against the new one); the 7th iteration therefore started about 11 minutes after the 6th instead of the usual 10. I ran an 8th iteration as another control.
I compared each iteration against the following one, and also compared the 1st to the 6th (50 minutes apart), to get insight into "trends" vs "noise" in the comparisons.
I also ran some additional similar control tests in April and May to build and test my tools and to get a better sense of the expected variation.
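
The comparison between two iterations can be sketched roughly as follows. This is only an illustration of the metrics reported below, not the tooling actually used: the `QueryResult` record, the `compare_runs` function, and the choice to treat a change between "no top result" and "some top result" as a top-result change are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class QueryResult:
    """Hypothetical per-query record for one iteration."""
    total_hits: int
    top_title: Optional[str]  # None when the query returned nothing


def compare_runs(before: Dict[str, QueryResult],
                 after: Dict[str, QueryResult]) -> Dict[str, float]:
    """Return percentages (0-100) over the queries present in both runs."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two runs have no queries in common")

    def pct(count: int) -> float:
        return 100.0 * count / len(shared)

    fewer = sum(1 for q in shared if after[q].total_hits < before[q].total_hits)
    more = sum(1 for q in shared if after[q].total_hits > before[q].total_hits)
    return {
        "zero_results_before": pct(sum(1 for q in shared if before[q].total_hits == 0)),
        "zero_results_after": pct(sum(1 for q in shared if after[q].total_hits == 0)),
        "different_count": pct(fewer + more),
        "fewer_results": pct(fewer),
        "more_results": pct(more),
        # Assumption: any difference in top title counts, including None -> title.
        "top_result_changed": pct(sum(1 for q in shared
                                      if after[q].top_title != before[q].top_title)),
    }
```

Running this on consecutive iterations gives the 10-minute figures, and running it on the 1st and 6th gives the 50-minute figures.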

Expected Results

Unpacking should have no impact on anything, but our automatic upgrades (currently homoglyph processing and ICU Normalization) can. I also enabled ICU folding. All of these can increase recall, though I did not expect a very noticeable impact.

Control Results

The number of queries getting zero results held steady at 19.3%
The number of queries getting a different number of results increases slightly over time (0.7% to 2.3% in 10 minute intervals; 5.2% over 50 minutes)
The number of queries getting fewer results is noise (0.1% to 1.4% in 10 minute intervals; 1.4% over 50 minutes)
The number of queries getting more results increases slightly over time (0.5% to 2.2% in 10 minute intervals; 3.8% over 50 minutes)
The number of queries changing their top result is noise (0.7% to 0.9% in 10 minute intervals; 0.7% over 50 minutes)
These results are also generally consistent with the control tests I ran in April and May.

Reindexing Results

The impact was much bigger than I expected, and seems to be driven largely by ICU folding. Acute accents in Spanish usually indicate unpredictable stress; some differentiate words that would otherwise be homographs. As such, they are less commonly used in informal writing (e.g., queries) than in formal writing (e.g., Wikipedia articles). Also, some names are commonly written with an accent, but the accent may be dropped by certain people in their own name. (On English Wikipedia, for example, Michelle Gomez and Michelle Gómez are different people.) Example new matches include cual/cuál, jose/josé, dia/día, gomez/gómez, peru/perú.
The zero results rate dropped to 18.9% (-0.4 percentage points absolute change; -2.1% relative change).
The number of queries getting a different number of results was 20.2% (vs. the 0.7%–2.4% range seen in control).
The number of queries getting fewer results was about 1½ times the max of the control range (2.1% vs 0.1%–1.4%). It is improbable, but not impossible, that this is still random noise. I don't have any obvious explanation after looking at the queries in question.
The number of queries getting more results was 17.7% (vs the control range of 0.5%–2.2%). These are largely due to folding (with dia/día especially being a recurring theme). The biggest increases are not the former zero results queries.
The number of queries that changed their top result was 6.4% (vs. the control range of 0.7%–0.9%; that's at least a ~7x increase!). I looked at some of these, and some are definitely the result of folding allowing for matching words in the title of the top result. Others are less obvious, though I wonder if changed word stats (either within an article or across articles) may play a part.
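
To illustrate why pairs like dia/día start matching, here is a minimal sketch of accent folding. It is an approximation using Python's standard `unicodedata` module, not the ICU folding filter itself: it only lowercases and strips combining marks, whereas ICU folding covers many more cases. The function name `fold_accents` is made up for the example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Rough stand-in for folding: lowercase and drop combining marks.

    Note: this also turns "ñ" into "n"; a production configuration
    may choose to exempt such letters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


# Query-side and article-side forms taken from the examples above.
pairs = [("cual", "cuál"), ("jose", "josé"), ("dia", "día"),
         ("gomez", "gómez"), ("peru", "perú")]
for plain, accented in pairs:
    # After folding, both forms reduce to the same token and can match.
    print(plain, accented, fold_accents(plain) == fold_accents(accented))
```

Because the same folding is applied at index time and at query time, an unaccented query can match the accented form in an article, and the reverse.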

Post-Reindex Control

The one control test I ran after reindexing showed changes approximately within the normal range, except for the number of queries changing their top result, which was 0% (vs 0.7%–0.9%). This could be a statistical fluke, or a change in word stats from folding, or something else.
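
The derived figures in the reindexing results can be rechecked with a few lines of arithmetic. The sketch below only uses percentages already quoted in this comment; the helper name `relative_change` is made up for the example.

```python
def relative_change(before_pct: float, after_pct: float) -> float:
    """Relative change, in percent, from before_pct to after_pct."""
    return 100.0 * (after_pct - before_pct) / before_pct


# Zero results rate: 19.3% in control, 18.9% after reindexing.
zero_abs = 18.9 - 19.3                   # about -0.4 percentage points
zero_rel = relative_change(19.3, 18.9)   # about -2.1 percent

# Fewer results: 2.1% after reindexing vs a control maximum of 1.4%.
fewer_ratio = 2.1 / 1.4                  # about 1.5

# Top result changed: 6.4% after reindexing vs a control maximum of 0.9%.
top_ratio = 6.4 / 0.9                    # a bit over 7

print(round(zero_abs, 1), round(zero_rel, 1),
      round(fewer_ratio, 1), round(top_ratio, 1))
```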

Gehel closed this task as Resolved. · Jun 2 2021, 11:50 AM
