Some Wikidata XML dumps are failing
Closed, ResolvedPublic


As reported by Attila-da in #wikimedia just now, some Wikidata XML dumps are failing:

Maybe Wikidata needs to have split out dumps similar to the enwiki? I hear the Wikidata database has grown to be quite large.

Resolving this task will presumably require someone with access to the dumps server logging in and poking around at logs (or checking disk space or whatever else).

I couldn't figure out which Phabricator task to use for this.

Event Timeline

MZMcBride raised the priority of this task from to Needs Triage.
MZMcBride updated the task description.

Here are recent dump times and outcomes:
wiki,date,run time in hms,run time in secs,,result,

Is wikidatawiki in the same queue as the small wikis?
For many of those, the data are still from December.

Wikidata needs to be moved to the 'big wikis' queue at some point, and there are other not-so-small wikis that should be moved over as well. A question for Wikidata dump users: is once a month often enough for the run, or do people need two complete runs? Once a month could be set up now, to run in the second half of each month after the en wiki dumps complete.

I had a look at the previous failed runs to get a sense of what was going on. The causes vary: the dataset1001 host or the snapshot host being rebooted for security updates; the db server being either hung or depooled (I didn't check which); a fatal error somewhere in the wikidatabase code. The lesson is that 20 days is simply too long for a run to complete cleanly without something else going wrong in the interim. This is another reason I think wikidata runs should be parallelized as we do for other large wikis, and moved in the short term to run after en wiki every month, and in the medium term to a new server along with other not-so-small wikis, if we need more than one run a month.
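
To make the "20 days is too long" argument concrete, here is a rough sketch in Python. It assumes, purely for illustration, that each day of a run carries the same independent chance of a disruption (a reboot, a hung db server, a fatal). The function name and the 3% daily rate are hypothetical and not measured from the dump hosts.

```python
def clean_run_probability(daily_disruption_rate: float, run_days: int) -> float:
    """Chance that a run of `run_days` days sees no disruption at all.

    Assumes (hypothetically) that disruptions on different days are
    independent and equally likely, so the run survives each day with
    probability (1 - daily_disruption_rate).
    """
    if not 0.0 <= daily_disruption_rate <= 1.0:
        raise ValueError("daily_disruption_rate must be between 0 and 1")
    if run_days < 0:
        raise ValueError("run_days must be non-negative")
    return (1.0 - daily_disruption_rate) ** run_days


if __name__ == "__main__":
    rate = 0.03  # hypothetical daily disruption rate, for illustration only
    for days in (20, 10, 5):
        chance = clean_run_probability(rate, days)
        print(f"{days:2d}-day run: {chance:.0%} chance of finishing cleanly")
```

Under this toy model the chance of a clean run falls off geometrically with run length, which is why shortening the run through parallel jobs helps more than a linear estimate would suggest.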

We are talking about moving from how often to once a month?

I don't think it is ok for our users to do it less often than it is at the moment.

@Lydia_Pintscher are you referring to wikidata? For all practical purposes the current rate is once a month for wikidata anyway. One exception since June 2014: two runs completed in Aug.

We have to distinguish here: Our json dumps will keep running on a weekly schedule, but the other dumps are apparently monthly (and we need those rather more often than less often).

If budget allows, let's run dumps more often. But one monthly cycle starting on the first day of each month is better than a 3-week continuous cycle (which grows in length every month anyway). The current scheme frustrates all those users who want monthly stats with only a reasonable delay (instead of 4 weeks after a month closes), as well as users who require updates per full month.

See also why synchronized dumps can help us to get early monthly stats.

Just an update on the dump progress for the last few wikidatawiki dumps:

mysql> SELECT subject,dumpdate,progress FROM archive WHERE subject="wikidatawiki";
| subject      | dumpdate   | progress |
| wikidatawiki | 2014-12-05 | error    |
| wikidatawiki | 2014-12-08 | error    |
| wikidatawiki | 2015-01-13 | progress |
| wikidatawiki | 2015-02-04 | progress |
| wikidatawiki | 2015-02-07 | done     |
| wikidatawiki | 2015-03-07 | done     |
| wikidatawiki | 2015-03-30 | done     |
| wikidatawiki | 2015-04-23 | error    |
| wikidatawiki | 2015-05-26 | progress |
| wikidatawiki | 2015-06-03 | done     |
| wikidatawiki | 2015-07-04 | done     |
| wikidatawiki | 2015-08-06 | done     |
| wikidatawiki | 2015-08-26 | progress |
13 rows in set (0.01 sec)

The 20150603 dump was the first successful parallel dump; hopefully we are on the right track in reducing the failures.
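
For anyone who wants to tally the table above, here is a small sketch in Python. The rows are copied from the query output; the names `RUNS`, `tally`, and `PARALLEL_START` are made up for this example, and the `progress` status is counted as-is, since its exact meaning is not spelled out in this thread.

```python
from collections import Counter

# (dumpdate, progress) rows copied from the mysql output above.
RUNS = [
    ("2014-12-05", "error"),
    ("2014-12-08", "error"),
    ("2015-01-13", "progress"),
    ("2015-02-04", "progress"),
    ("2015-02-07", "done"),
    ("2015-03-07", "done"),
    ("2015-03-30", "done"),
    ("2015-04-23", "error"),
    ("2015-05-26", "progress"),
    ("2015-06-03", "done"),
    ("2015-07-04", "done"),
    ("2015-08-06", "done"),
    ("2015-08-26", "progress"),
]

# Date of the first successful parallel dump, per the comment above.
PARALLEL_START = "2015-06-03"


def tally(runs):
    """Count how many runs ended in each status."""
    return Counter(status for _, status in runs)


if __name__ == "__main__":
    # ISO dates sort correctly as plain strings, so string comparison works.
    before = [r for r in RUNS if r[0] < PARALLEL_START]
    since = [r for r in RUNS if r[0] >= PARALLEL_START]
    print("all runs:      ", dict(tally(RUNS)))
    print("before 0603:   ", dict(tally(before)))
    print("0603 and later:", dict(tally(since)))
```

This only summarizes the 13 rows shown; it says nothing about why any individual run failed.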

Wikidata has been moved to the list of "big" wikis, which means jobs run in parallel now, cutting down on processing time. It truly is growing by leaps and bounds.

We should be able to do two runs a month as we just did in August, one full run including revision history which will start at the beginning of the month, and one which will start probably around the 20th of the month without revision history. Stubs and current articles should be available by the end of the first week of the month for the first run.

Will this cover most folks' needs?

ArielGlenn claimed this task.

Please discount the incomplete wikidata multistream job on the last run; that was me having shot it and then forgotten to rerun it. I'll be taking care of that.

We have consistent working runs for wikidata now (and for all wikis actually), so I'm closing this. If there are other issues about frequency of the dumps etc, please open a new ticket for that.