
Wiki data dump intermittently produces corrupt .xml.bz2 files
Closed, Resolved, Public

Description

On several occasions we've had corrupt .xml.bz2 files come out of the data dump process.

There are several possible causes:

  • dbzip2 might be corrupting data
  • NFS filesystem transfers might be corrupting data
  • gremlins!

Eliminating dbzip2 as a precaution, to see if this improves matters, would be a good start.

Further checks for corrupt files would also be wise, however. Running a 'bzip2 -t' after generation (or even as a simultaneous side process?) may help to detect bad files and mark them appropriately.
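The post-generation check suggested above could look roughly like the following Python sketch. It rests on some assumptions: `is_valid_bz2` is a hypothetical helper name, not part of the dump scripts, and it uses Python's standard `bz2` module to decompress the whole file instead of shelling out to `bzip2 -t`. The intent is the same: read every compressed block so that truncated or damaged data gets reported.

```python
import bz2


def is_valid_bz2(path, chunk_size=1024 * 1024):
    """Return True if the file at `path` decompresses cleanly, else False.

    Hypothetical helper: it reads the whole compressed stream and discards
    the output, so memory use stays bounded by `chunk_size`.
    """
    try:
        with bz2.open(path, "rb") as stream:
            # Keep reading until read() returns b"" at end of stream.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid compressed data (or an unreadable/missing file).
        # EOFError: the file ended before the end-of-stream marker.
        return False
    return True
```

A dump job could call this right after writing each file and mark the file as bad when it returns False. Full decompression of a large dump takes a while, so running it as a separate step after generation is probably the simplest arrangement.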

So far, manually re-running the dump produces a correct file; this could be automated if required.
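Automating the re-run could be as simple as a bounded retry loop. The sketch below is illustrative only: `dump_with_retry`, `generate`, and `verify` are hypothetical names. `generate` stands in for whatever produces the dump file and returns its path, and `verify` stands in for an integrity check such as `bzip2 -t`.

```python
from typing import Callable


def dump_with_retry(generate: Callable[[], str],
                    verify: Callable[[str], bool],
                    max_attempts: int = 3) -> str:
    """Run `generate` until `verify` accepts its output, up to `max_attempts`.

    Hypothetical sketch: `generate` writes a dump and returns its path;
    `verify` returns True when the file at that path is intact.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        path = generate()
        if verify(path):
            return path
        # The file failed verification; try generating it again.
    raise RuntimeError(
        "dump still corrupt after %d attempts" % max_attempts
    )
```

Capping the number of attempts matters here: if the corruption turns out to be systematic and not intermittent, the job should fail loudly instead of looping forever.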


Version: unspecified
Severity: enhancement
URL: http://download.wikimedia.org/enwiki/20080312/

Details

Reference
bz13638

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:06 PM)
bzimport set Reference to bz13638.

In r33005, adjusted worker.py so that it passes dbzip2 mode to dumpTextPass.php only if it is configured to use dbzip2. The next dumps should use regular bzip2 mode.

Just marking this fixed for now...