Once all of the parent tasks listed below are complete and deployed, we need to reindex the Khmer-language wikis to enable the new Khmer reordering plugin, which should improve search on those wikis.
Description
| Status | Assigned | Task |
|---|---|---|
| Invalid | None | T218613 {EPIC} Search: Local Impact. Making making bigger improvements for smaller (or underrepresented) communities |
| Resolved | TJones | T185721 Null or inconsistent search results using Khmer script |
| Resolved | RKemper | T274203 Build Extra Plugin with extra-analysis-khmer and deploy to Maven Central |
| Resolved | RKemper | T274204 Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster |
| Resolved | TJones | T274205 Reindex Khmer wikis to enable Khmer syllable reordering |
Event Timeline
I'm picking this up so I can try out some impact-measuring tools, and this is going to be much faster to reindex than English!
Khmer Reordering Before and After Reindexing Report
Data: kmwiki 1K sample from 2021-02-01 to 2021-03-01
- 319 (31.9%) queries originally got zero results
- 29 (2.9%) went from 0 results to some results
- from 1 to 209 new results
- 280 (28%) got a different number of hits
- from 1 to 678 more hits
- 256 (25.6%) increased from non-zero to more results
- from 0.10% (1994 to 1996) to 11400.00% (1 to 115)
- 95 (9.5%) changed their top result (including ZRR changes)
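Figures like the ones above can be derived from paired before/after hit counts for each query. The following is a minimal sketch in Python, assuming a hypothetical list of `(hits_before, hits_after)` pairs; the function name `summarize` and the data layout are illustrative assumptions, not the actual tool used for this report.

```python
from typing import Dict, Iterable, Tuple


def summarize(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize before/after hit counts for a sample of queries.

    `pairs` holds one (hits_before, hits_after) tuple per query.
    Hypothetical helper for illustration only.
    """
    pairs = list(pairs)
    total = len(pairs)

    def share(n: int) -> float:
        # Percentage of the whole sample, rounded to one decimal place.
        return round(100.0 * n / total, 1) if total else 0.0

    zero_before = sum(1 for b, _ in pairs if b == 0)
    rescued = sum(1 for b, a in pairs if b == 0 and a > 0)
    changed = sum(1 for b, a in pairs if a != b)
    # Percentage change is only defined when the "before" count is non-zero.
    pct_increases = [100.0 * (a - b) / b for b, a in pairs if 0 < b < a]

    return {
        "zero_before": zero_before,
        "zero_before_pct": share(zero_before),
        "rescued": rescued,
        "rescued_pct": share(rescued),
        "changed": changed,
        "changed_pct": share(changed),
        "increased_nonzero": len(pct_increases),
        "max_pct_increase": max(pct_increases, default=0.0),
    }
```

For example, a query going from 1 hit to 115 hits is a (115 - 1) / 1 = 11400% increase, which matches the upper end of the range reported above.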
Observations:
- The largest change in results was for a numeric query (8), which went from 677 hits to 1355 hits because we are mapping Khmer numerals to Arabic numerals.
- I checked the 8 zero-results queries that had the biggest numbers of new results, and 7 of them had the kinds of problems the Khmer syllable reordering was intended to correct:
- split vowels
- repeat diacritics
- deprecated characters
English: As a control, I ran the same analysis on a 1K sample of enwiki from 2021-02-01 to 2021-02-08, with roughly the same time between before and after. English Wikipedia is probably as active as or more active than most Wikipedias, so it gives a likely upper bound on random "natural" changes.
- 173 (17.3%) queries originally got zero results
- 0 (0%) went from 0 results to some results
- 133 (13.3%) got a different number of hits
- from 16 fewer to 15 more hits
- 33 (3.3%) decreased from non-zero to fewer results
- from -1.89% (53 to 52) to -0% (122,708 to 122,707)
- 100 (10%) increased from non-zero to more results
- from +0% (1,640,797 to 1,640,802) to +1.44% (277 to 281)
- 33 (3.3%) changed their top result
Observations:
- I checked a handful of the queries that changed their top result, and their top result changed randomly as I reloaded the search results page.
Khmer vs English Analysis:
- Improvements to Khmer zero-results queries are likely a direct result of the Khmer plugin!
- The range of changes in the number of hits for Khmer (1 to 678) is very different from the "random" changes in English (-16 to 15), and so is also likely a direct result of the Khmer plugin!
Looks like the Khmer plugin is going to have a pretty big impact on Khmer searches!
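One way to make the Khmer vs English comparison explicit is to treat the enwiki control range (-16 to +15 hits) as a noise band and keep only the changes that fall outside it. This is a rough sketch, assuming per-query hit-count deltas are available; the function name `outside_noise_band` is hypothetical, and the default bounds are simply the enwiki range reported above.

```python
from typing import Iterable, List


def outside_noise_band(deltas: Iterable[int], low: int = -16, high: int = 15) -> List[int]:
    """Return the hit-count deltas that fall outside the control range [low, high].

    The default bounds are the enwiki control range from the report;
    this is an illustrative helper, not part of the original analysis.
    """
    return [d for d in deltas if d < low or d > high]
```

A delta of 678 (the numeric query) or 209 would be kept, while a delta of 1 or 15 would be indistinguishable from the "natural" churn seen in the control.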
This report is awesome! And it's great to see some evidence that the Khmer plugin is working as intended. Thanks for doing this!
The report, with more detail on how the queries were filtered to get the final 1K sample, is now on MediaWiki.