Wiki data dump bzip2 -> 7zip conversion doesn't report failure on corrupt input
Closed, Resolved, Public

Description

When the history .xml.7z file is generated, sometimes the bzip2 decompression fails. This may be due to a corrupt file in the first place...

But the failure here is hidden. Bzip2 spews an error and exits, but 7zip happily considers it the end of the file and wraps up "successfully".

The error condition in the input should be detected; at a minimum, this would allow the corrupted output file to be marked as failed.
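
For illustration, here is a minimal sketch of checking both ends of the pipe rather than only the 7zip side. It is not the dump scripts' actual code: the helper names `run_pipeline` and `convert_bz2_to_7z` are made up for this example, and it assumes `bzip2` and `7za` are on the PATH, with `7za a -si` reading the archive content from stdin.

```python
import subprocess


def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer | consumer` and return (producer_rc, consumer_rc)."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the consumer sees EOF once the
    # producer exits, and the producer gets SIGPIPE if the consumer dies.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc, consumer_rc


def convert_bz2_to_7z(bz2_path, out_path):
    """Hypothetical wrapper: recompress a .bz2 dump as .7z, failing loudly."""
    producer_rc, consumer_rc = run_pipeline(
        ["bzip2", "-dc", bz2_path],      # decompress to stdout
        ["7za", "a", "-si", out_path],   # assumed invocation: archive stdin
    )
    # 7za alone exiting 0 is not enough: a failed bzip2 just looks like EOF.
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            "conversion failed: bzip2 exit %d, 7za exit %d"
            % (producer_rc, consumer_rc)
        )
```

The point is that the consumer's exit status says nothing about the producer: a reader that gets a short stream followed by EOF will still exit 0, so the producer's status has to be collected and checked separately.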


Version: unspecified
Severity: enhancement
URL: http://download.wikimedia.org/enwiki/20080312/

Details

Reference
bz13637

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:06 PM)
bzimport set Reference to bz13637.

More dump generation bits...

Currently the .7z files are generated by decompressing the .bz2 and piping into p7zip... this is kinda slow and also won't report errors properly at present.

We could refuse to generate the 7z file if the bz2 input file is truncated. Would that be sufficient? (We have a fast way to detect that now.)
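
As a rough sketch of what "refuse if truncated" could look like, the check below decompresses the whole file with Python's standard `bz2` module and reports whether every stream reached its end-of-stream marker. This is the slow, straightforward way and is not the fast check mentioned above; the function name `bz2_is_complete` is an assumption for this example.

```python
import bz2


def bz2_is_complete(path, chunk_size=1 << 20):
    """Return True if `path` holds one or more complete bzip2 streams."""
    decomp = bz2.BZ2Decompressor()
    saw_stream = False   # at least one stream ended cleanly
    mid_stream = False   # data has been fed to a stream that has not ended
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            while chunk:
                try:
                    decomp.decompress(chunk)  # output is discarded
                except OSError:
                    return False              # corrupt data
                if decomp.eof:
                    # Stream ended; leftover bytes may start another stream.
                    saw_stream = True
                    mid_stream = False
                    chunk = decomp.unused_data
                    decomp = bz2.BZ2Decompressor()
                else:
                    mid_stream = True
                    chunk = b""
    return saw_stream and not mid_stream
```

A truncated file never reaches the end-of-stream marker, so the function returns False for it; the 7z step could then be skipped and marked as failed instead of producing a short archive.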

If the bz2 file is truncated, it is now moved out of the way at the end of the step that produces it. This means it is not available as input for the 7z file, so the 7z step will fail and be marked as such. Closing.
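
A minimal sketch of that "move it out of the way" behaviour, for readers who want the shape of it. This is not the deployed code: the function name, the `.truncated` suffix, and the `is_complete` callback are all assumptions made for this example.

```python
import os


def set_aside_if_truncated(path, is_complete, suffix=".truncated"):
    """Keep `path` if is_complete(path) is true, otherwise rename it.

    Returns True when the file was kept, False when it was moved aside.
    `is_complete` is any callable that takes a path and returns a bool.
    """
    if is_complete(path):
        return True
    # Rename rather than delete, so the bad file can still be inspected.
    os.replace(path, path + suffix)
    return False
```

Once the file has been renamed, a later step that opens the original path fails with a missing-input error, which is what lets the 7z step be marked as failed.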

I should say that code isn't deployed for anything but en wiki dumps yet. I'd better make it live on the other servers.