
Reindex Khmer wikis to enable Khmer syllable reordering
Closed, Resolved | Public | 1 Estimated Story Points

Description

Once all of the parent tasks listed below are complete/deployed, we need to reindex the Khmer-language wikis to enable the new Khmer reordering plugin to improve search on those wikis.

Event Timeline

TJones renamed this task from "Reindex Khmer wikis to enable" to "Reindex Khmer wikis to enable Khmer syllable reordering". Feb 8 2021, 10:12 PM
TJones updated the task description.
Gehel set the point value for this task to 1. Feb 15 2021, 4:25 PM

I'm picking this up so I can try out some impact-measuring tools, and the Khmer wikis are going to be much faster to reindex than English!

Khmer Reordering Before and After Reindexing Report

Data: kmwiki 1K sample from 2021-02-01 to 2021-03-01

  • 319 (31.9%) queries originally got zero results
    • 29 (2.9%) went from 0 results to some results
      • from 1 to 209 new results
  • 280 (28%) got a different number of hits
    • from 1 to 678 more hits
    • 256 (25.6%) increased from non-zero to more results
      • from 0.10% (1994 to 1996) to 11400.00% (1 to 115)
  • 95 (9.5%) changed their top result (including ZRR changes)
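
For anyone who wants to reproduce the counts and percentages above, here is a minimal sketch of the arithmetic. It is not the tooling actually used for this report; the `QueryHits` record, `percent_change`, and `summarize` are hypothetical names, and it assumes you already have a before and after hit count for each sampled query.

```python
from dataclasses import dataclass


@dataclass
class QueryHits:
    """Hit counts for one sampled query (hypothetical record, not the real tool's format)."""
    query: str
    before: int  # hits before reindexing
    after: int   # hits after reindexing


def percent_change(before: int, after: int) -> float:
    """Relative change in hits as a percentage; only defined when before > 0."""
    if before <= 0:
        raise ValueError("percent change is undefined when 'before' is zero")
    return (after - before) / before * 100.0


def summarize(rows: list[QueryHits]) -> dict[str, int]:
    """Count the same categories that the bullet list reports."""
    return {
        "total": len(rows),
        "zero_before": sum(1 for r in rows if r.before == 0),
        "zero_to_some": sum(1 for r in rows if r.before == 0 and r.after > 0),
        "changed": sum(1 for r in rows if r.before != r.after),
        "nonzero_increase": sum(1 for r in rows if 0 < r.before < r.after),
    }
```

With the extremes quoted in the list, `percent_change(1994, 1996)` rounds to 0.10 and `percent_change(1, 115)` gives 11400.0. Queries that started at zero results are counted separately because a percentage change from zero is undefined.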

Observations:

  • The largest change in results was for a numeric query (8), which went from 677 hits to 1355 hits because we are mapping Khmer numerals to Arabic numerals.
  • I checked the 8 zero-results queries that had the biggest numbers of new results, and 7 of them had the kinds of problems the Khmer syllable reordering was intended to correct:
    • split vowels
    • repeat diacritics
    • deprecated characters

English: I ran the same analysis on a 1K sample of enwiki from 2021-02-01 to 2021-02-08 as a control, with roughly the same time between the before and after runs. English Wikipedia is probably at least as active as most Wikipedias, so it gives a likely upper bound on random "natural" changes.

  • 173 (17.3%) queries originally got zero results
    • 0 (0%) went from 0 results to some results
  • 133 (13.3%) got a different number of hits
    • from 16 fewer to 15 more hits
    • 33 (3.3%) decreased from non-zero to fewer results
      • from -1.89% (53 to 52) to -0% (122,708 to 122,707)
    • 100 (10%) increased from non-zero to more results
      • from +0% (1,640,797 to 1,640,802) to 1.44% (277 to 281)
  • 33 (3.3%) changed their top result

Observations:

  • I checked a handful of the queries that changed their top result, and their top result changed randomly as I reloaded the search results page.

Khmer vs English Analysis:

  • Improvements to Khmer zero-results queries are likely a direct result of the Khmer plugin!
  • The range of changes in the number of hits for Khmer (1 to 678 more) is very different from the "random" changes in English (16 fewer to 15 more), so it is also likely a direct result of the Khmer plugin!

Looks like the Khmer plugin is going to have a pretty big impact on Khmer searches!

This report is awesome! And it's great to see some evidence that the Khmer plugin is working as intended. Thanks for doing this!

The report, with more detail on how the queries were filtered to get the final 1K sample, is now on MediaWiki.