
do we really need to recombine stub and page file chunks into single huge files?
Closed, ResolvedPublic

Description

We run the English Wikipedia dumps by producing multiple stub and page text files, instead of one huge stub file and one huge page/meta/history file.

Recombining these into one file takes a long time; for the stubs it's not horrible, as those files are smaller, but for the history files it is extremely time-intensive (about two weeks). We could shorten that for the bz2 files by working on dbzip2, Brion's parallel bzip2 project from 2008, but we probably can't do anything to speed up the recombine of the 7z files.

Do we really need to provide one huge file for these things? Example: the combined bz2 history file is around 300 GB, and the combined 7z file is around 32 GB. And it will only get worse. Are several small files OK? Maybe we can just skip this step.

This needs community discussion: are the whole files useful? What happens if we wind up running 50 jobs and producing 50 pieces? Is this just too annoying? Or is it better, because people can process those 50 files in parallel at home? Would it be better if we served up, say, no more than 20 separate pieces? Do people care at all, as long as they get the data on a regular basis?


Version: unspecified
Severity: enhancement

Details

Reference
bz27114

Event Timeline

bzimport raised the priority of this task to Medium. · Nov 21 2014, 11:23 PM
bzimport set Reference to bz27114.

See my comment on bug #26499. You can simply "cat" bzip2 files together.
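
As a minimal sketch (using Python's bz2 module, with hypothetical file names), this is what plain concatenation buys: the byte-level result is a multi-stream bzip2 file, which multi-stream-aware readers such as bzcat or Python's bz2.open decompress end to end with no recompression involved.

```python
# Sketch: concatenated bzip2 files form a valid multi-stream archive,
# so "cat chunk1.bz2 chunk2.bz2 > all.bz2" already works. File names
# here are hypothetical.
import bz2
import shutil

chunks = ["pages-meta-history1.xml.bz2", "pages-meta-history2.xml.bz2"]

# Byte-level concatenation -- no decompression or recompression.
with open("pages-meta-history-all.xml.bz2", "wb") as out:
    for name in chunks:
        with open(name, "rb") as piece:
            shutil.copyfileobj(piece, out)

# Multi-stream-aware readers decompress the result straight through.
with bz2.open("pages-meta-history-all.xml.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        pass  # stream the concatenated XML
```

Note that each piece still carries its own XML header and footer, which is exactly what the next comment is about.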

You can, but if you want the resulting file to have only one header and one footer, then you need to strip the headers and footers from the pieces, which means decompressing and recompressing everything.
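
For comparison, a rough sketch (Python, hypothetical file names) of what a true single-header recombine involves, assuming each chunk repeats the same header (the XML declaration, the <mediawiki> root tag and the <siteinfo> block) and that those tags sit on their own lines, roughly as in the dump files. Every chunk has to be decompressed and the whole result recompressed:

```python
# Sketch of why a single-header recombine is expensive: every chunk is
# decompressed, its repeated wrapper trimmed, and everything recompressed.
import bz2

chunks = ["pages-meta-history1.xml.bz2", "pages-meta-history2.xml.bz2"]

with bz2.open("pages-meta-history-all.xml.bz2", "wt", encoding="utf-8") as out:
    for i, name in enumerate(chunks):
        in_header = True  # everything up to </siteinfo> counts as header
        with bz2.open(name, "rt", encoding="utf-8") as piece:
            for line in piece:
                tag = line.strip()
                if in_header:
                    # Emit the header (XML declaration, <mediawiki ...>,
                    # <siteinfo> block) from the first chunk only.
                    if i == 0:
                        out.write(line)
                    if tag == "</siteinfo>":
                        in_header = False
                    continue
                if tag == "</mediawiki>":
                    continue  # re-added once, at the very end
                out.write(line)
    out.write("</mediawiki>\n")
```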

If all we want is to provide the pieces (with all their headers and footers) in a single package for easy download, I'd rather give users a simple means of downloading all the pieces than keep essentially two copies of the data and therefore use twice the storage.

Well, I happen to agree with you that multiple files are easier to deal with, but the trend seems to be towards the single, huge file. Modern file transfer and storage make the two approaches close to equivalent. I am in neither camp.

The header and footer could be created as isolated bz2 chunks at a cost of only a few bytes. Then they would be easy to verify and strip back off without running the codec. Unfortunately, PHP's bzflush() is a no-op and does not call the underlying bzlib flush, but you could close and reopen the file...
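
A minimal sketch of that idea (in Python rather than PHP, since bzflush() doesn't help there; the content and offsets below are illustrative only): if the header, body and footer are each written as a complete bzip2 stream, stripping the wrapper later is byte slicing rather than recompression.

```python
# Sketch: isolated bz2 streams for the header/footer, so they can be
# stripped back off by offset without running the codec.
import bz2

header = b'<?xml version="1.0"?>\n<mediawiki>\n'
body = b"  <page>...</page>\n"        # placeholder for the real page data
footer = b"</mediawiki>\n"

# Each bz2.compress() call produces a complete, self-contained stream.
h, b, f = bz2.compress(header), bz2.compress(body), bz2.compress(footer)

with open("chunk1.xml.bz2", "wb") as out:
    out.write(h + b + f)              # three concatenated streams

# Stripping the wrapper is slicing, not recompression -- provided the
# stream lengths (or offsets) were recorded when the file was written.
with open("chunk1.xml.bz2", "rb") as fin:
    data = fin.read()
body_only = data[len(h):len(data) - len(f)]
assert bz2.decompress(body_only) == body
```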

It seems valuable to preserve the metadata of each job output (see bug #26499), so assuming the pages are organized under a root job-segment element, there is really no header to strip off but the "<?xml version" cruft.

Here's an interesting, if irrelevant, recommendation for a new "xml fragment" representation,

http://www.w3.org/TR/xml-fragment

Note also section C.3, where they discuss how fragments could be used to index into a huge document in order to minimize parsing. (Yes, I am axe-grinding for bug #27618!)

My 2 cents:
I cannot think of a use case where a single file is preferred over multiple smaller files. If the argument is "I don't want to download x number of chunks", then I suggest writing a simple script that gets all the chunks, as sketched below. We could even provide such a script if that's really important.
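
Something along these lines would do (a sketch only; the base URL and file names are hypothetical, and a real version would read the piece list from the dump's index or checksum files):

```python
#!/usr/bin/env python3
# Sketch of the "just give people a script" option: fetch every chunk
# listed for a dump run.
import urllib.request

BASE = "https://dumps.example.org/enwiki/20110201/"
CHUNKS = [
    "enwiki-20110201-pages-meta-history1.xml.bz2",
    "enwiki-20110201-pages-meta-history2.xml.bz2",
    # ... one entry per piece
]

for name in CHUNKS:
    print("fetching", name)
    urllib.request.urlretrieve(BASE + name, name)
```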

Well, I've been producing pieces without recombining for a while now without complaints... silence = consent! Closing.