
Refresh RT-testing test pages to change the mix of pages and add a small set of pages from Wiktionary and other projects
Closed, Resolved (Public)

Assigned To
Authored By
ssastry
Jun 10 2015, 5:43 AM
Referenced Files
F191820: shuffle.sh
Jul 13 2015, 5:08 PM
F191821: jsonify.js
Jul 13 2015, 5:08 PM
F191819: download.sh
Jul 13 2015, 5:08 PM

Description

Early in Parsoid's development, we started off RT-testing with 100K pages from enwiki. Sometime in 2013, we switched to 160K pages, with 10K pages each from 16 different wikis. At this time, we are close to being done with fixing the most important semantic failures in this set -- all that is left now are some edge cases and diffs resulting from wikitext errors that we aren't going to support.

In light of the plan to deploy VE to enwiki, we should do another refresh of the RT-testing pages, introducing a bigger pool of pages from enwiki and proportionately reducing the set of pages from the smaller wikis (instead of 10K from every wiki). We should likewise introduce a small set of pages (1K each?) from a few different Wiktionaries and other non-Wikipedia wikis to uncover use cases specific to those wikis (as in T101599).

Event Timeline

ssastry raised the priority of this task to High.
ssastry updated the task description. (Show Details)
ssastry added projects: Parsoid, Parsoid-Tests.
ssastry added a subscriber: ssastry.
ssastry set Security to None.

To generate the original list I used:

  1. download.sh to download the title lists of the various wikis
  2. shuffle.sh to shuffle the lists of titles and extract the top N
  3. jsonify.js to turn the one-title-per-line files into a JSON file suitable for importing with Parsoid/tests/server/importJson.js

I still have the original shuffled files, so if you wanted to tweak our title set without losing all consistency with previous results I could extract the top M (for M != the original N) lines from those files.
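
For reference, a minimal sketch of those three steps (the attached download.sh, shuffle.sh, and jsonify.js are what was actually used; the dump URL pattern and the JSON shape importJson.js expects are assumptions here):

  # Sketch only: assumes the standard dumps.wikimedia.org all-titles-in-ns0 dump
  # and that importJson.js accepts an array of {prefix, title} objects.
  WIKI=enwiki
  N=10000
  # 1. download the ns0 title list for the wiki
  curl -sO "https://dumps.wikimedia.org/${WIKI}/latest/${WIKI}-latest-all-titles-in-ns0.gz"
  # 2. shuffle the titles and keep the top N
  gunzip -c "${WIKI}-latest-all-titles-in-ns0.gz" | shuf | head -n "$N" > "${WIKI}.titles.txt"
  # 3. turn the one-title-per-line file into JSON for tests/server/importJson.js
  jq -R -s --arg prefix "$WIKI" \
    'split("\n") | map(select(length > 0) | {prefix: $prefix, title: .})' \
    "${WIKI}.titles.txt" > "${WIKI}.titles.json"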

Alternatively, we could use the "top 1000" articles, based on hit count. https://meta.wikimedia.org/wiki/Datasets lists a number of different sources for traffic data, but so far every one I've tried has been a dead link. :( I was pretty sure there was a WMF-generated page hit statistics data source somewhere, though.

gwicke suggested that, as an alternative to picking a random sample, we could pick the N most recently edited pages for each wiki, and that we can get those from /a/mw-log/runJobs.log on fluorine.

But, so as not to skew the list towards pages that are popular because of current events, we should probably pick half the pages from the recent-edit list and the other half using random sampling.

Looking at the job logs on fluorine is not going to work since they are skewed by bot edits. It is better to get the data from the recent changes (RC) stream via the API.
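
For example (a hypothetical query, not necessarily the exact parameters that were used), the recentchanges API module can exclude bot edits directly:

  # pull recently edited mainspace (ns0) titles, excluding bot edits
  curl -s 'https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rcnamespace=0&rctype=edit&rcshow=!bot&rcprop=title&rclimit=500&format=json' \
    | jq -r '.query.recentchanges[].title' | sort -u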

I updated the rt-testing database with a new set of titles.

  • 28 wikis and 2 wiktionaries are now represented
  • 30K titles from enwiki, 10K each from dewiki nlwiki frwiki itwiki ruwiki eswiki ... and progressively smaller counts from the rest -- 160K total (see full list below)
  • 70% randomly chosen from the latest dump of titles; 30% from the recent changes stream. Gabriel made a pertinent observation that we should use pages from the RC stream since they are more representative of pages being edited. However, I didn't want to pick everything from the RC stream since we still want to support the full mix of pages on the wikis -- not just those being actively edited.
  • I've restarted rt-testing to init the base data before we merge any new patches.
 30000 enwiki.all_titles.txt
 10000 dewiki.all_titles.txt
 10000 nlwiki.all_titles.txt
 10000 ruwiki.all_titles.txt
 10000 frwiki.all_titles.txt
 10000 eswiki.all_titles.txt
 10000 itwiki.all_titles.txt
  8000 svwiki.all_titles.txt
  8000 plwiki.all_titles.txt
  8000 jawiki.all_titles.txt
  7000 arwiki.all_titles.txt
  7000 kowiki.all_titles.txt
  7000 hiwiki.all_titles.txt
  7000 hewiki.all_titles.txt
  5000 zhwiki.all_titles.txt
  1000 enwiktionary.all_titles.txt
  1000 frwiktionary.all_titles.txt
  1000 mznwiki.all_titles.txt
  1000 uzwiki.all_titles.txt
  1000 iswiki.all_titles.txt
  1000 hywiki.all_titles.txt
  1000 kawiki.all_titles.txt
  1000 ukwiki.all_titles.txt
   956 pnbwiki.all_titles.txt
   948 ckbwiki.all_titles.txt
   859 cvwiki.all_titles.txt
   808 kaawiki.all_titles.txt
   715 lnwiki.all_titles.txt
   707 cuwiki.all_titles.txt
   705 lbewiki.all_titles.txt
160698 total

https://gerrit.wikimedia.org/r/#/c/225328/ has the scripts used to generate the new db -- useful for the next time we need to do this.
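
The gerrit change has the real scripts; purely as an illustration of the 70/30 mix described above, the per-wiki combination could look roughly like this (dump_titles.txt and rc_titles.txt are hypothetical, pre-shuffled one-title-per-line inputs):

  # hypothetical sketch of the 70/30 mix for a single wiki
  TOTAL=10000
  N_DUMP=$(( TOTAL * 70 / 100 ))   # e.g. 7000 random titles from the dump
  N_RC=$(( TOTAL - N_DUMP ))       # e.g. 3000 titles from the RC stream
  { head -n "$N_DUMP" dump_titles.txt; head -n "$N_RC" rc_titles.txt; } \
    | sort -u > mixed_titles.txt   # dedupe; final count may fall slightly under TOTAL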

The new base rt-testing accuracy with the new db is 99.81% semantic, 74.2% syntactic. With the old db, the corresponding numbers were 99.95% and 85.2%. One reason for the bigger drop in syntactic accuracy is that, over 2 years, the old db seems to have accumulated a lot of redirect pages, which are trivial pages that round-trip perfectly. I expect the same thing to happen with this new db over time.