
Split svwiki history files into chunks
Closed, ResolvedPublic

Description

Hello folks at Wikimedia,
some Wikipedia meta-history dumps would be better off split into multiple files; for example, the latest Swedish meta-history bzip2 dump is 17.2 GB.

What's your thought on the matter?

Enrico

Event Timeline

ArielGlenn triaged this task as Medium priority.Jun 20 2019, 5:20 PM

Svwiki's history run is indeed starting to take long enough that splitting it up makes sense. I'll add this to my todo list.

Thanks!! 😁 If I may push my luck a little bit more: would you consider splitting Cebuano?

The problem here is not so much the size itself (6.6 GB) as the number of articles (5.4 million) in one single file: in our use case this hinders parallelism, slowing down the whole process (see the sketch after this message).

Have a nice day,
Enrico
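
To illustrate the parallelism point: once a dump is split into several chunk files, each file can be handed to its own worker, whereas a single large archive forces one sequential decompression pass. A minimal sketch, assuming a set of chunk files on disk; the filename pattern and the per-file work (just counting <page> elements) are placeholders, not the real processing pipeline:

```python
import bz2
import glob
from multiprocessing import Pool

def count_pages(path):
    """Stand-in for real per-file work: count <page> elements in one chunk."""
    n = 0
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                n += 1
    return path, n

if __name__ == "__main__":
    # One worker per chunk file; not possible when everything sits in a single archive.
    chunks = sorted(glob.glob("cebwiki-*-pages-meta-history*.xml*.bz2"))
    with Pool() as pool:
        for path, n in pool.imap_unordered(count_pages, chunks):
            print(f"{path}: {n} pages")
```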

You can ask, but this time I'll say "not yet" :-D

The runtime is not long enough to justify it at this point:

Wiki: cebwiki              Duration: 30h, 55m, Start: 1560139080 (2019-06-10 03:58), End: 1560250398 (2019-06-11 10:53)

I get a nice little chart of these every month for the revision history content jobs, among other things, showing which wikis are the slowest. 30 hours is really pretty decent.

In the future (the near-to-mid future, not the long-term future) I expect to have tools that will let you write the pages within a given offset range of the file to an output stream for processing. This will involve some bzip2 block trickiness, which is why we don't have the tools just yet.
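
For reference, a bzip2 file can consist of many independently compressed streams, and a decompressor can be started at any stream boundary; that is the kind of trickiness such tools would rely on. A minimal sketch, assuming you already have a byte offset that points at the start of a bzip2 stream (for example from an index like the one shipped with the multistream articles dumps); the function name and arguments are illustrative and not part of any existing dumps tool:

```python
import bz2

def read_stream_at(path, offset, chunk_size=1 << 20):
    """Decompress the single bzip2 stream that begins at byte `offset`.

    `offset` must point at a stream boundary (the 'BZh' magic bytes),
    e.g. a value taken from an index of stream offsets.  Reading stops
    as soon as that one stream ends, so the rest of the file is never
    touched.
    """
    decomp = bz2.BZ2Decompressor()
    out = []
    with open(path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            raw = f.read(chunk_size)
            if not raw:
                break
            out.append(decomp.decompress(raw))
    return b"".join(out)

# Illustrative usage: pull the XML stored in one stream and feed it to
# whatever per-page processing runs downstream.
# xml_bytes = read_stream_at("somewiki-pages-meta-history.xml.bz2", some_offset)
```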

Well played, well played... we'll keep our fingers crossed for your tools then ;)

Thanks again,
Enrico

Reedy renamed this task from Unsplitted BIG metahistory files to Split svwiki history files into chunks.Jun 20 2019, 8:11 PM

Change 518189 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] svwiki officially 'big', 6 dumps jobs in parallel like the others

https://gerrit.wikimedia.org/r/518189

Change 518189 merged by ArielGlenn:
[operations/puppet@production] svwiki officially 'big', 6 dumps jobs in parallel like the others

https://gerrit.wikimedia.org/r/518189

This is now live and should take effect with today's run, which will start later in the day than usual, as we are waiting for wikidata to finish up.

ArielGlenn claimed this task.

Svwiki dumps are running now and I see that separate stub files are being produced, so this task can be closed.