We run the en wikipedia dumps by producing multiple stub and page text files, instead of one huge stub file and one huge page/meta/history file.
Recombining these into one file takes a long time. For the stubs it's not horrible, as these files are smaller, but for the history files it is extremely time-intensive (2 weeks). We could shorten that for the bz2 files by working on dbzip2, Brion's parallel bzip2 project from 2008, but we probably can't do anything to speed up the recombine of the 7z files.
Do we really need to provide one huge file for these things? For example, the combined bz2 history file is around 300 GB and the combined 7z file is around 32 GB, and it will only get worse. Are several smaller files OK? Maybe we can just skip the recombine step.
This needs community discussion. Are the whole files useful? What happens if we wind up running 50 jobs and producing 50 pieces? Is that just too annoying, or is it actually better because people can process these 50 files in parallel at home? Would it be better if we served up, say, no more than 20 separate pieces? Do people care at all as long as they get the data on a regular basis?
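To make the "process the pieces in parallel at home" option concrete, here is a minimal sketch of what a consumer of the dumps might do with several bz2 pieces. It is only an illustration under stated assumptions: the function names and the `workers` parameter are hypothetical, and it assumes each piece is a bz2-compressed XML file in which every `<page>` opening tag sits on its own line. Threads are used to keep the sketch short; a process pool could be swapped in if the per-page work is heavier than simple counting.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def count_pages(path):
    """Stream one bz2-compressed piece and count its <page> opening tags."""
    count = 0
    # bz2.open in text mode decompresses incrementally, so the whole
    # piece never has to fit in memory.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # Assumption: each <page> tag appears on its own line.
            if "<page>" in line:
                count += 1
    return count


def count_pages_in_pieces(paths, workers=4):
    """Run count_pages over all pieces concurrently; return {path: count}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path
        # with the count for that same path.
        return dict(zip(paths, pool.map(count_pages, paths)))
```

With this kind of workflow, the number of pieces mostly sets how much parallelism is available, and the consumer never needs the recombined file at all.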