
Wikidata dumps full run takes too long with current config
Closed, Resolved · Public

Description

The revision content full history step started a few days later this month than last month. These minor shuffles in ordering happen all the time and should not result in the job running over time, which it certainly will this month if I don't intervene manually.

Event Timeline

There's already a patchset in the works, https://gerrit.wikimedia.org/r/#/c/355100/, which will start the wikidata dumps run first, dedicating many more processes to completing it. This means we will have one host running enwiki, a second running wikidatawiki, and the third snapshot host running all of the small and big wikis on its own until the first two hosts complete their dedicated tasks.

This will mean a delay of a few days before the stub (metadata) runs for all wikis complete at the beginning of the month. This is probably ok for Erik/analytics, who initially requested that the stubs be run before anything else, but we should look into getting another snapshot host in any case.

The delay, however, means that mediawikiwiki will start later, which almost certainly means that the flow history job for mediawikiwiki will not complete in time for the second run of the month. So T164262: Make flow dumps run faster must be resolved, or we must have a definite timetable for its resolution, before the wikidata config change can go live.

In the meantime I will start 7z compression on existing wikidatawiki bz2 revision history content files manually, while the history content job runs to completion.

Running this out of a root screen session on snapshot1005 (otherwise idle) as the datasets user, recompressing 10 files at a time. I'm doing the pages-meta-history{1,2,3}*bz2 files right now; when those are complete I'll do the pages-meta-history4*bz2 files that are ready.
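For illustration, here is a minimal sketch of that kind of batched recompression, assuming bzcat and 7za are on the path and that ten files are processed concurrently. The directory, file pattern, and function names are hypothetical; the actual tooling is the script in the gerrit change below.

```python
#!/usr/bin/env python3
"""Minimal sketch: recompress bz2 revision history files to 7z in batches.

Assumptions (not taken from the actual gerrit change): bzcat and 7za are
installed, output files simply swap .bz2 for .7z, and ten recompressions
run concurrently.
"""
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dump directory and file pattern; adjust for the real run.
DUMP_DIR = "/mnt/data/xmldatadumps/public/wikidatawiki/20170601"
PATTERN = "wikidatawiki-*-pages-meta-history[123]*.bz2"
BATCH_SIZE = 10  # number of files recompressed at the same time


def recompress(bz2_path):
    """Stream one bz2 file through 7za, writing a .7z file alongside it."""
    sevenz_path = bz2_path[:-len(".bz2")] + ".7z"
    # Equivalent of: bzcat file.bz2 | 7za a -si file.7z
    bzcat = subprocess.Popen(["bzcat", bz2_path], stdout=subprocess.PIPE)
    result = subprocess.run(["7za", "a", "-si", sevenz_path],
                            stdin=bzcat.stdout, stdout=subprocess.DEVNULL)
    bzcat.stdout.close()
    bzcat.wait()
    return bz2_path, result.returncode


if __name__ == "__main__":
    files = sorted(glob.glob(DUMP_DIR + "/" + PATTERN))
    # ThreadPoolExecutor caps us at BATCH_SIZE concurrent recompressions;
    # the heavy lifting happens in the external processes, so threads suffice.
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        for path, returncode in pool.map(recompress, files):
            status = "ok" if returncode == 0 else "failed (rc=%d)" % returncode
            print("%s: %s" % (path, status))
```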

Change 359907 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] script to batch 7z recompress revision content history files manually

https://gerrit.wikimedia.org/r/359907

Wikidata 7z files are done; running a noop job to clean up, generate hashes, and produce the RSS feed files. Squeaking in just before the deadline.

Change 359907 merged by ArielGlenn:
[operations/dumps@master] script to batch 7z recompress revision content history files manually

https://gerrit.wikimedia.org/r/359907

All the changesets for this are merged, so we should see a dedicated run on the first of the month. I'll keep an eye on it for the first few days to make sure everything's working properly.

The run seems fine; stubs and abstracts ran in batches as expected, and the number of jobs for all steps is as desired. Closing.