
Analyze results of harmonization
Closed, Resolved · Public · 3 Estimated Story Points

Description

After reindexing (T342444) is complete, analyze the impact of the changes on various samples of wiki queries.

I have general samples from over 100 wikis, many with task-specific sub-samples with examples of queries that should be affected by apostrophe normalization, camelCase handling, acronym handling, updating word_break_helper, and enabling the icu_tokenizer with icu_tokenizer_repair.
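To make the scope of these changes concrete, here is a minimal sketch of what an Elasticsearch analysis configuration exercising some of these features could look like, written as a Python dict. The filter names, mappings, and analyzer name are placeholders for illustration and are not the actual CirrusSearch settings; camelCase handling, acronym handling, and icu_tokenizer_repair live in other components and are not shown.

```python
# Illustrative sketch only -- not the actual CirrusSearch/Elasticsearch config.
# Shows the general shape of the analysis-chain changes being tested:
# apostrophe normalization, a word_break_helper-style character mapping,
# and ICU tokenization. Names and mappings are placeholders.

analysis_settings = {
    "analysis": {
        "char_filter": {
            # Normalize curly/modifier apostrophes to a plain ASCII apostrophe.
            "apostrophe_norm": {
                "type": "mapping",
                "mappings": ["\\u2019=>'", "\\u02BC=>'"],
            },
            # word_break_helper-style mapping: treat underscores and periods
            # as word breaks so queries like "word_break_helper" can match.
            "word_break_helper_sketch": {
                "type": "mapping",
                "mappings": ["_=>\\u0020", ".=>\\u0020"],
            },
        },
        "analyzer": {
            "text_sketch": {
                "type": "custom",
                "char_filter": ["apostrophe_norm", "word_break_helper_sketch"],
                # icu_tokenizer comes from the analysis-icu plugin; the
                # icu_tokenizer_repair step mentioned above is a separate
                # plugin filter and is omitted here.
                "tokenizer": "icu_tokenizer",
                "filter": ["lowercase"],
            },
        },
    }
}

if __name__ == "__main__":
    import json
    print(json.dumps(analysis_settings, indent=2))
```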

I've been running daily regression tests with these samples since before reindexing began, so I should be able to detect changes from the day of reindexing, and compare that to typical day-to-day changes.

We are generally looking for increased recall in the task-specific sub-samples, to see how many languages with examples of a given phenomenon see an improvement. I will also take a quick look at changes in the general sample to get a sense of the overall impact of these harmonization efforts.
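For a sense of what the day-to-day comparison measures, the sketch below computes the zero-results rate and total results for a sample of queries on two days and reports both sides. The QueryResult structure and field names are invented for the example and do not reflect the actual regression-testing tooling.

```python
# Sketch of the day-over-day regression comparison described above.
# Data layout and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class QueryResult:
    query: str
    num_results: int  # total hits returned for the query

def zero_results_rate(results: list[QueryResult]) -> float:
    """Fraction of queries in a sample that returned zero results."""
    return sum(1 for r in results if r.num_results == 0) / len(results) if results else 0.0

def compare_days(before: list[QueryResult], after: list[QueryResult]) -> dict:
    """Compare a sample's recall-oriented metrics before and after reindexing."""
    return {
        "zrr_before": zero_results_rate(before),
        "zrr_after": zero_results_rate(after),
        "total_results_before": sum(r.num_results for r in before),
        "total_results_after": sum(r.num_results for r in after),
    }

if __name__ == "__main__":
    before = [QueryResult("ka'awaloa", 0), QueryResult("CamelCase", 3)]
    after = [QueryResult("ka'awaloa", 2), QueryResult("CamelCase", 5)]
    print(compare_days(before, after))
```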

Event Timeline

Full write-up (and it's a lot!) is on MediaWiki.

Summary Results

ZRR↓  Res↑  topΔ  topΔ+  ZRR↑  ZRR↑+  noΔ  net+
general: 66%  24%  7%  6%  7%  93%
acronym: 68%  18%  5%  3%  10%  10%  98%
apostrophe: 63%  25%  6%  6%  31%  25%  94%
camelCase: 100%  100%
ICU tokens: 33%  67%  67%  100%
wbh: 80%  12%  6%  2%  2%  94%
  • ZRR↓: zero-results rate decreased (good)
  • Res↑: number of queries with more results increases (good)
  • topΔ: top result changes (unclear)
  • topΔ+: top result changes, checked (good)
  • ZRR↑: zero-results rate increased (possibly bad)
  • ZRR↑+: zero-results rate increased due to improved precision (good)
  • noΔ: nothing changed (not good)
  • net+: net percent of samples with a good outcome (a rough classification sketch follows this list)
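To make the legend concrete, here is a rough sketch of how a single query's before/after outcome could be bucketed into the categories above. The function and its inputs are illustrative, not the actual analysis code, and the "checked" variants (topΔ+, ZRR↑+) come from manual review, so they are not assigned here.

```python
# Illustrative bucketing of a query's before/after outcome into the
# categories used in the summary table. The "+" variants (topΔ+, ZRR↑+)
# require manual review, so this sketch only assigns the automatic labels.

from typing import Optional

def classify(before_count: int, after_count: int,
             before_top: Optional[str], after_top: Optional[str]) -> str:
    if before_count == 0 and after_count > 0:
        return "ZRR↓"   # query no longer gets zero results (good)
    if before_count > 0 and after_count == 0:
        return "ZRR↑"   # query now gets zero results (possibly bad)
    if after_count > before_count:
        return "Res↑"   # more results than before (good)
    if before_top != after_top:
        return "topΔ"   # top result changed (needs checking)
    if after_count == before_count:
        return "noΔ"    # nothing changed (not good)
    return "Res↓"       # fewer results; not broken out in the table

if __name__ == "__main__":
    print(classify(0, 4, None, "Some Article"))    # ZRR↓
    print(classify(3, 3, "Same Top", "Same Top"))  # noΔ
```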

Our target of a 75% improvement in either the zero-results rate (ZRR↓, preferred) or total results returned (Res↑) held true for everything but the introduction of ICU tokenization, where we expected (and found) improvements in both recall and precision, depending on which "foreign" script a query and the on-wiki text are in.

For each collection of queries targeting a given feature, we saw an improvement across more than 90% of query samples.

For the general random sample of Wikipedia queries, we also saw improvement across more than 90% of language samples, mostly in direct improvements to the zero-results rate—indicating that the features we are introducing to all (or almost all) wikis are useful upgrades!

TJones triaged this task as High priority. Mar 29 2024, 2:47 PM
TJones set the point value for this task to 3.