
Find out why wikidata dumps are slowing down
Closed, Resolved · Public

Description

Wikidatawiki dumps have been getting much slower than expected. This needs to be resolved asap, as soon we will not have the capacity to deal with the slowdown.

Event Timeline

ArielGlenn triaged this task as High priority.

I've done some preliminary investigation. First findings:

  • About 2 million revisions per job (of 4 jobs running concurrently) are added each month. Not enough to account for the slowdown.
  • Previous dumps are not missing pages or revisions so it's not a matter of missing content.
  • The slowdown occurs for both old and more recent revisions.
  • In March, 1.5 million revs were handled in the same length of time as 1 million revs in June.
  • File size doesn't seem to play a role; some files from June are bigger and some smaller than the March runs.

And the most interesting finding, though it needs in-depth investigation to see what was happening with earlier runs:

  • It takes between 25 and 35 seconds to dump 1k revisions, and most of them are not prefetched.

Given that I've checked that most pages are indeed in the previous month's run, that's really problematic.

Cause found. Fix not so clear.

In a nutshell: virtually the same pages and revisions are included in the same order in the old and new files, and the content is the same (except for new revisions and an occasional change in revid order). However, whereas revision ids used to be recorded in more or less increasing order per page, they are now recorded in a much more random order. Our prefetch code has expected sequentially increasing revids since the beginning of time.
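To illustrate why that matters, here is a simplified sketch of the prefetch idea (not the actual dumps code): the scan through the previous dump's revisions only ever moves forward, so anything sitting behind the cursor counts as a miss.

# Simplified sketch (not the actual dumps code) of a forward-only prefetch
# scan that assumes revision ids appear in increasing order within each page.
# 'prefetch_revs' iterates over (rev_id, text) pairs for the current page as
# stored in the previous dump; 'needed_ids' are the rev ids wanted this month.

def prefetch_texts(prefetch_revs, needed_ids):
    """Yield (rev_id, text) pairs; text is None when we must hit the database."""
    current = next(prefetch_revs, None)
    for rev_id in needed_ids:                # assumed to be increasing
        # advance past old revisions we no longer need
        while current is not None and current[0] < rev_id:
            current = next(prefetch_revs, None)
        if current is not None and current[0] == rev_id:
            yield rev_id, current[1]         # prefetch hit
        else:
            # a rev id that sits behind the cursor in an out-of-order file
            # is never found again, so almost everything becomes a miss
            yield rev_id, None               # miss: fetch the text from the db

With a mostly unordered prefetch file, nearly every wanted revision ends up behind the cursor, so almost everything misses and we fall back to fetching text from the database, which lines up with the 25-35 seconds per 1k revisions noted above.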

The changed order manifests itself first in the stubs files; we have a longstanding bug open about this (T29112), without resolution. We need to at least address this issue for the standard stub dumps, even if it requires a special option on the command line.

See also T138290 about rapid growth of Wikidata in comparison to all other wikis. (Data forthcoming)

Change 296742 had a related patch set uploaded (by ArielGlenn):
do xml stubs dump pieces based on revs per page range

https://gerrit.wikimedia.org/r/296742
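The idea behind that change, boiled down to a simplified sketch (the real logic in the dump scripts handles more cases): split the pages into ranges so that each stub piece covers roughly the same number of revisions rather than the same number of pages. The per-page revision counts here are assumed to come from the revision table grouped by rev_page.

# Simplified, illustrative-only sketch of splitting pages into dump pieces
# with roughly equal revision counts. 'revs_per_page' is assumed to be a
# list of (page_id, rev_count) tuples in page id order, e.g. built from
# "SELECT rev_page, COUNT(*) FROM revision GROUP BY rev_page".

def split_by_rev_count(revs_per_page, num_pieces):
    """Return (start_page_id, end_page_id) ranges carrying ~equal rev counts."""
    total_revs = sum(count for _, count in revs_per_page)
    target = total_revs / float(num_pieces)
    ranges = []
    start = revs_per_page[0][0]
    accumulated = 0
    for page_id, count in revs_per_page:
        accumulated += count
        if accumulated >= target and len(ranges) < num_pieces - 1:
            ranges.append((start, page_id))
            start = page_id + 1
            accumulated = 0
    ranges.append((start, revs_per_page[-1][0]))
    return ranges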

The changeset runs well, BUT: due to T132416 I can't run it across all wikis; some hosts have the rev_page_id index for a given wiki and some do not. For wikidata in particular we are almost good; all the eqiad servers but db1026 have the index, so even if the pool gets changed around we will be fine. Thus I have embedded in the changeset the hardcoded name of the wiki to be checked against (ewww!) for testing, and it works properly.
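For reference, the kind of check involved, as a standalone sketch assuming direct access via pymysql (the actual changeset goes through the dump code's own database layer, so this is illustrative only):

# Hedged sketch: does the revision table on this host have the rev_page_id
# index? Assumes direct access with pymysql rather than the dumps' db layer.
import pymysql

def has_rev_page_id_index(host, dbname, user, password):
    """Return True if the revision table on this host has rev_page_id."""
    conn = pymysql.connect(host=host, db=dbname, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW INDEX FROM revision")
            # Key_name is the third column of SHOW INDEX output
            return any(row[2] == "rev_page_id" for row in cur.fetchall())
    finally:
        conn.close()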

A sample run of stubs of the first 50k pages (about 35 or 40 million revisions out of the total 350 million) took about 100 minutes. We run four parallel jobs, so we're looking at roughly 5 hours for the whole job to complete. On the content dumps we should be saving several days plus load on the dbs, so it's definitely worth it.
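The back-of-envelope arithmetic behind that estimate, using rounded figures (slower later pages and general overhead push it from roughly 4 toward 5 hours):

# Back-of-envelope arithmetic only, using the rounded figures above.
revs_in_sample = 37.5e6      # ~35-40 million revisions in the 50k-page sample
minutes_for_sample = 100.0   # the sample stub run took about 100 minutes
total_revs = 350e6           # total revisions on wikidatawiki
parallel_jobs = 4

revs_per_minute = revs_in_sample / minutes_for_sample   # ~375k revs/min per job
revs_per_job = total_revs / parallel_jobs                # ~87.5 million revs
minutes_per_job = revs_per_job / revs_per_minute         # ~233 minutes
print(round(minutes_per_job / 60.0, 1), "hours")         # ~3.9, call it 4-5 hours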

Full dumps would normally run tomorrow but I've set the cron job to delay until I move that hardcoded eww into the wiki dump config files. Probably I will be able to re-enable the cron job so that the full dumps start tomorrow evening.

Change 296742 merged by ArielGlenn:
add ability to do xml stubs dump pieces based on revs per page range

https://gerrit.wikimedia.org/r/296742

It's now tomorrow. I have updated the dump config files to add settings for if and how we order by revision id in the stubs dumps (see https://gerrit.wikimedia.org/r/#/c/296892/ https://gerrit.wikimedia.org/r/#/c/296897/ https://gerrit.wikimedia.org/r/#/c/296907/ )
I have fixed up and merged the changes to the dump scripts after full testing on wikidata and another wiki, against the new config files.
And last but not least I have re-enabled the full dump cron job for today so it should kick off tonight.

Next, after the stubs run OK, will be a manual run of pages-meta-history using the March wikidata dumps for prefetch, since those are ordered mostly correctly afaict.

Well, even the createdir jobs are not running properly. That's no good, and it's a bit late in the day for me to coherently debug anything, so I've shot the cron scripts and will have a look tomorrow. Hopefully it's something quick and easy in the config that I overlooked.

https://gerrit.wikimedia.org/r/#/c/297104/ fixed it; it was unrelated to stubs or anything else, just a silly typo from when I moved the flow history jobs into the script. Running the createdir jobs now; the stubs will kick off tomorrow morning.

Stubs have completed; I am checking them to be sure the ordering is correct. Next up will be the manual page history content dumps with manual setting of the prefetch files to use those from the March run.

Ordering is as it should be, increasing by rev_id for each page.
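For anyone who wants to spot-check a stubs file themselves, something like this works (a simplified sketch, not necessarily how the dumps tooling does it; namespace handling is kept minimal): stream the gzipped stubs file and make sure revision ids never decrease within a page.

# Simplified ordering check for a gzipped stubs file; illustrative only.
import gzip
import xml.etree.ElementTree as ET

def localname(elem):
    """Element tag with any XML namespace prefix stripped."""
    return elem.tag.rsplit("}", 1)[-1]

def check_rev_order(path):
    ok = True
    with gzip.open(path, "rb") as stream:
        for _, page in ET.iterparse(stream, events=("end",)):
            if localname(page) != "page":
                continue
            kids = list(page)
            page_id = int(next(k for k in kids if localname(k) == "id").text)
            rev_ids = [int(next(c for c in rev if localname(c) == "id").text)
                       for rev in kids if localname(rev) == "revision"]
            if rev_ids != sorted(rev_ids):
                print("revisions out of order for page", page_id)
                ok = False
            page.clear()    # keep memory use flat while streaming
    return ok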

Change 298254 had a related patch set uploaded (by ArielGlenn):
add argument for specifying date of dump to use for prefetch

https://gerrit.wikimedia.org/r/298254

Change 298254 merged by ArielGlenn:
add argument for specifying date of dump to use for prefetch

https://gerrit.wikimedia.org/r/298254

A manual run in a root screen session on snapshot1007 has been started: after shooting the worker bash script, the worker.py script, and its children for the wikidatawiki dump, I did

su - datasets
cd /srv/deployment/dumps/dumps/xmldumps-backup
python ./worker.py --configfile /etc/dumps/confs/wikidump.conf.bigwikis --date last --skipdone \
        --exclusive --prefetchdate 20160305 --log --job metahistorybz2dump wikidatawiki

This has acquired the dumps lock for the wikidata wiki so no other jobs will run. When it completes, I will need to run the metahistory7z step by hand and that will complete this run.

Future runs will use the regular means to find a prefetch file; this manual step was only necessary so that we use a prefetch file where the revs are ordered mostly by rev_id within pages, something which is not true for the dumps from April through June. If we used an out-of-order prefetch file we would wind up requesting most revision content from the database, which is slower for us and too much of a burden on the db server.