Explore the possibility of splitting dewiki and frwiki into smaller chunks
Closed, Resolved (Public)

Description

enwiki has provided its full history dumps as many small files since 2011/2012. It was proposed on Xmldatadumps-l that this feature be extended to dewiki, but that didn't happen (presumably due to my initial opposition, as it was incompatible with the archiving scripts back then).

I propose that we implement this feature for both dewiki and frwiki, as they are the next two largest wikis. While frwiki's files are not as big as dewiki's, both pose problems during the rsync to Labs, and for the last few dumps dewiki has never had a successful complete copy to Labs without manual intervention. Splitting the dumps will allow the whole dump to be copied over to Labs without much trouble, as evidenced by the enwiki dumps.

This will need discussion from users of these dumps.

The impact

  1. File sizes will become smaller for certain dump types
  2. The pageids will be available in the file name, making it easier to obtain just what you need
  3. The overall reliability of the dump production process will increase

Event Timeline

Hydriz raised the priority of this task from to Low.
Hydriz updated the task description.
Hydriz added a project: Dumps-Generation.
Hydriz added subscribers: Hydriz, ArielGlenn.

This would entail turning on the 'checkpoint' feature for these wikis. The result would be one file generated every twelve hours (for each job running; there are four jobs that run at once to produce the full history dumps). The file names would have the starting and ending page ids embedded in them, just like the English Wikipedia dumps.
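For anyone who wants to pick out only the files covering a given page, here is a rough Python sketch. It assumes the checkpoint files end in a `-p<start>p<end>` suffix followed by `.7z` or `.bz2`, as the enwiki files do; the helper names (`parse_page_range`, `files_containing`) and the sample dewiki file names are made up for illustration and are not taken from an actual dump run.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed naming pattern, modelled on the enwiki checkpoint files:
#   <wiki>-<date>-pages-meta-history<N>.xml-p<start>p<end>.7z
CHECKPOINT_RE = re.compile(r"-p(?P<start>\d+)p(?P<end>\d+)\.(?:7z|bz2)$")


class PageRange(NamedTuple):
    start: int
    end: int


def parse_page_range(filename: str) -> Optional[PageRange]:
    """Return the page id range embedded in a file name, or None if absent."""
    match = CHECKPOINT_RE.search(filename)
    if match is None:
        return None
    return PageRange(int(match.group("start")), int(match.group("end")))


def files_containing(page_id: int, filenames: List[str]) -> List[str]:
    """Keep only the files whose embedded range includes page_id."""
    hits = []
    for name in filenames:
        page_range = parse_page_range(name)
        if page_range is not None and page_range.start <= page_id <= page_range.end:
            hits.append(name)
    return hits


# Hypothetical file names, for illustration only.
sample_names = [
    "dewiki-20160101-pages-meta-history1.xml-p000000001p000004500.7z",
    "dewiki-20160101-pages-meta-history1.xml-p000004501p000009000.7z",
    "dewiki-20160101-pages-articles.xml.bz2",
]

if __name__ == "__main__":
    for name in files_containing(5000, sample_names):
        print(name)
```

Files without an embedded range (such as the single-file dumps) are simply skipped, so the same function can be run over a whole directory listing.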

Yep, basically the same format as the existing English Wikipedia dumps. The Wikidata dumps seem to have similar file size issues, and it would be great if this change were applied to that wiki as well.

Is this proposal feasible? If it is, shall I send a mail to xmldatadumps-l and maybe the communities involved about this proposed change?

Yes, it's 100% feasible. As I said it would just take flipping a configuration setting for those two (three) wikis. So please do carry the discussion forward onto the email lists.

Hydriz set Security to None.

Any complaints? No one has commented over here yet, but I don't know if silence=consent.

I suggest we leave this open for about a month, until mid-December, for any possible comments; that seems to be the general practice when asking for comments on dumps-related tasks.

All right; in practice this means leaving it for Jan 1's run, since by mid-December the next run will already be in progress, and the second run of the month doesn't generate content history.

I have sent an explanatory email to the xmldatadumps list, taking the decision as a given. If we hear strong reasoned objections we can move the deadline back from Jan 1, though I would prefer not to. Please feel free to forward the mail onto other fora as appropriate.

Change 263411 had a related patch set uploaded (by ArielGlenn):
dumps: checkpointing for de and fr wikipedia

https://gerrit.wikimedia.org/r/263411

Change 263411 merged by ArielGlenn:
dumps: checkpointing for de and fr wikipedia

https://gerrit.wikimedia.org/r/263411

ArielGlenn claimed this task.

This is now live and deployed; the next dump run (which will be starting shortly and is overdue by several days) will reflect the change.