
Analyze results of harmonization
Closed, Resolved · Public · 3 Estimated Story Points

Description

After reindexing (T342444) is complete, analyze the impact of the changes on various samples of wiki queries.

I have general samples from over 100 wikis, many with task-specific sub-samples with examples of queries that should be affected by apostrophe normalization, camelCase handling, acronym handling, updating word_break_helper, and enabling the icu_tokenizer with icu_tokenizer_repair.
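To make the scope of these changes concrete, here is a minimal sketch of what an Elasticsearch analysis configuration exercising some of these features could look like, written as a Python dict. The filter names, mappings, and analyzer name are placeholders for illustration and are not the actual CirrusSearch settings; camelCase handling, acronym handling, and icu_tokenizer_repair live in other components and are not shown.

```python
# Illustrative sketch only -- not the actual CirrusSearch/Elasticsearch config.
# Shows the general shape of the analysis-chain changes being tested:
# apostrophe normalization, a word_break_helper-style character mapping,
# and ICU tokenization. Names and mappings are placeholders.

analysis_settings = {
    "analysis": {
        "char_filter": {
            # Normalize curly/modifier apostrophes to a plain ASCII apostrophe.
            "apostrophe_norm": {
                "type": "mapping",
                "mappings": ["\\u2019=>'", "\\u02BC=>'"],
            },
            # word_break_helper-style mapping: treat underscores and periods
            # as word breaks so queries like "word_break_helper" can match.
            "word_break_helper_sketch": {
                "type": "mapping",
                "mappings": ["_=>\\u0020", ".=>\\u0020"],
            },
        },
        "analyzer": {
            "text_sketch": {
                "type": "custom",
                "char_filter": ["apostrophe_norm", "word_break_helper_sketch"],
                # icu_tokenizer comes from the analysis-icu plugin; the
                # icu_tokenizer_repair step mentioned above is a separate
                # plugin filter and is omitted here.
                "tokenizer": "icu_tokenizer",
                "filter": ["lowercase"],
            },
        },
    }
}

if __name__ == "__main__":
    import json
    print(json.dumps(analysis_settings, indent=2))
```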

I've been running daily regression tests with these samples since before reindexing began, so I should be able to detect changes from the day of reindexing, and compare that to typical day-to-day changes.

We are generally looking for increased recall in the task-specific sub-samples, to see how many languages with examples of a given phenomenon see an improvement. I will also take a quick look at changes in the general sample to get a sense of the overall impact of these harmonization efforts.
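For a sense of what the day-to-day comparison measures, the sketch below computes the zero-results rate and total results for a sample of queries on two days and reports both sides. The QueryResult structure and field names are invented for the example and do not reflect the actual regression-testing tooling.

```python
# Sketch of the day-over-day regression comparison described above.
# Data layout and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class QueryResult:
    query: str
    num_results: int  # total hits returned for the query

def zero_results_rate(results: list[QueryResult]) -> float:
    """Fraction of queries in a sample that returned zero results."""
    return sum(1 for r in results if r.num_results == 0) / len(results) if results else 0.0

def compare_days(before: list[QueryResult], after: list[QueryResult]) -> dict:
    """Compare a sample's recall-oriented metrics before and after reindexing."""
    return {
        "zrr_before": zero_results_rate(before),
        "zrr_after": zero_results_rate(after),
        "total_results_before": sum(r.num_results for r in before),
        "total_results_after": sum(r.num_results for r in after),
    }

if __name__ == "__main__":
    before = [QueryResult("ka'awaloa", 0), QueryResult("CamelCase", 3)]
    after = [QueryResult("ka'awaloa", 2), QueryResult("CamelCase", 5)]
    print(compare_days(before, after))
```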

Event Timeline

Full write-up (and it's a lot!) is on MediaWiki.

Summary Results

ZRR↓  Res↑  topΔ  topΔ+  ZRR↑  ZRR↑+  noΔ  net+
general: 66%  24%  7%  6%  7%  93%
acronym: 68%  18%  5%  3%  10%  10%  98%
apostrophe: 63%  25%  6%  6%  31%  25%  94%
camelCase: 100%  100%
ICU tokens: 33%  67%  67%  100%
wbh: 80%  12%  6%  2%  2%  94%
  • ZRR↓: zero-results rate decreased (good)
  • Res↑: number of queries with more results increases (good)
  • topΔ: top result changes (unclear)
  • topΔ+: top result changes, checked (good)
  • ZRR↑: zero-results rate increased (possibly bad)
  • ZRR↑+: zero-results rate increased due to improved precision (good)
  • noΔ: nothing changed (not good)
  • net+: net percent of samples with a good outcome (a rough classification sketch follows this list)
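To make the legend concrete, here is a rough sketch of how a single query's before/after outcome could be bucketed into the categories above. The function and its inputs are illustrative, not the actual analysis code, and the "checked" variants (topΔ+, ZRR↑+) come from manual review, so they are not assigned here.

```python
# Illustrative bucketing of a query's before/after outcome into the
# categories used in the summary table. The "+" variants (topΔ+, ZRR↑+)
# require manual review, so this sketch only assigns the automatic labels.

from typing import Optional

def classify(before_count: int, after_count: int,
             before_top: Optional[str], after_top: Optional[str]) -> str:
    if before_count == 0 and after_count > 0:
        return "ZRR↓"   # query no longer gets zero results (good)
    if before_count > 0 and after_count == 0:
        return "ZRR↑"   # query now gets zero results (possibly bad)
    if after_count > before_count:
        return "Res↑"   # more results than before (good)
    if before_top != after_top:
        return "topΔ"   # top result changed (needs checking)
    if after_count == before_count:
        return "noΔ"    # nothing changed (not good)
    return "Res↓"       # fewer results; not broken out in the table

if __name__ == "__main__":
    print(classify(0, 4, None, "Some Article"))    # ZRR↓
    print(classify(3, 3, "Same Top", "Same Top"))  # noΔ
```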

Our target of a 75% improvement in either the zero-results rate (ZRR↓, preferred) or total results returned (Res↑) held true for everything but the introduction of ICU tokenization, where we expected (and found) improvements in both recall and precision, depending on which "foreign" script a query and the on-wiki text are in.

For each collection of queries targeting a given feature, we saw an improvement across more than 90% of query samples.

For the general random sample of Wikipedia queries, we also saw improvement across more than 90% of language samples, mostly in direct improvements to the zero-results rate—indicating that the features we are introducing to all (or almost all) wikis are useful upgrades!

TJones triaged this task as High priority. Mar 29 2024, 2:47 PM
TJones set the point value for this task to 3.