On several occasions we've had corrupt .xml.bz2 files come out of the data dump process.
There are several possible causes:
- dbzip2 might be corrupting data
- NFS filesystem transfers might be corrupting data
- gremlins!
As a first step, dbzip2 could be taken out of the pipeline as a precaution, to see whether the corruption stops.
Independently of that, further checks for corrupt files would be wise. Running 'bzip2 -t' on each file after generation (or even as a simultaneous side process?) would help detect bad files so they can be marked appropriately.
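A minimal sketch of such a post-generation check, written with Python's standard bz2 module rather than the bzip2 binary. It streams through the whole file and reports whether it decompresses cleanly, which is roughly what 'bzip2 -t' tests. The function name is_valid_bz2 and the chunk size are illustrative assumptions, not part of the existing dump scripts.

```python
import bz2


def is_valid_bz2(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses cleanly, else False.

    Hypothetical helper: reads the archive in chunks so that multi-gigabyte
    dumps never have to fit in memory. A missing or unreadable file is also
    reported as invalid, since opening it raises an OSError subclass.
    """
    try:
        with bz2.open(path, "rb") as f:
            # Discard the decompressed data; only errors matter here.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid or damaged compressed data (or an I/O failure).
        # EOFError: the file ended before the end-of-stream marker.
        return False
    return True
```

A caller could run this right after each .xml.bz2 file is written and flag any file for which it returns False.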
So far, manually re-running the dump has produced a correct file; this could be automated if required.
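One way the automatic re-run might look, as a rough sketch only. The names generate_with_retry, bzip2_test_ok and max_attempts are hypothetical, and the generate callable stands in for whatever actually produces the dump file; the only real external command assumed is 'bzip2 -t', which exits non-zero when the integrity test fails.

```python
import subprocess


def bzip2_test_ok(path):
    """Run 'bzip2 -t' on `path`; True if it exits with status 0.

    Assumes the bzip2 binary is on PATH.
    """
    result = subprocess.run(["bzip2", "-t", path], capture_output=True)
    return result.returncode == 0


def generate_with_retry(generate, path, verify, max_attempts=3):
    """Call generate(path) until verify(path) passes.

    Returns the number of attempts used, or raises RuntimeError if the
    file is still bad after max_attempts runs.
    """
    for attempt in range(1, max_attempts + 1):
        generate(path)      # hypothetical: (re)writes the dump file
        if verify(path):    # e.g. bzip2_test_ok
            return attempt
    raise RuntimeError(
        f"{path} still corrupt after {max_attempts} attempts"
    )
```

Capping the number of attempts keeps a persistent fault (for example a bad disk) from looping forever, and the returned attempt count could be logged to see how often corruption actually occurs.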
Version: unspecified
Severity: enhancement
URL: http://download.wikimedia.org/enwiki/20080312/