
data dump has CRC error
Closed, Declined (Public)

Description

Author: zhangxiaoquan

In an attempt to study the incentives to contribute to Wikipedia, Feng Zhu from
Harvard Business School and I (MIT Sloan School of Management) wanted to examine
the modification history of Wikipedia entries. We downloaded the following
data dump file:
http://download.wikipedia.com/enwiki/20060518/enwiki-20060518-pages-meta-history.xml.bz2
and found that it contains CRC errors. We then followed the link and downloaded
a few previous versions of the file, but they all had problems.
Here is the error message returned by bzip2:

---- error message ----

bzip2 -t enwiki-20060518-pages-meta-history.xml.bz2
bzip2: enwiki-20060518-pages-meta-history.xml.bz2: data integrity (CRC) error in
data

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

---- end of error message ----
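
(For reference, a rough sketch of the recovery route that bzip2 suggests:
bzip2recover splits the archive into one small .bz2 file per compressed block,
and each piece can then be tested on its own to locate the damage. The file
names below are illustrative.)

  bzip2recover enwiki-20060518-pages-meta-history.xml.bz2
  # writes rec00001<original name>, rec00002<original name>, ...
  for f in rec*enwiki-20060518-pages-meta-history.xml.bz2; do
      bzip2 -t "$f" || echo "damaged block: $f"
  done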

Version: unspecified
Severity: major
URL: http://download.wikipedia.com/enwiki/20060518/

Details

Reference
bz6172

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 9:20 PM
bzimport set Reference to bz6172.
bzimport added a subscriber: Unknown Object (MLST).

Please provide the following information:

  1. md5 checksum of the file (md5sum 2006*.bz2)
  2. your version of bzip2 (bzip2 --version)
  3. your operating system and version
  4. your cpu architecture
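
(For reference, example commands that produce this information on a typical
Linux system; the Red Hat release file path is an assumption about the
reporter's setup.)

  md5sum enwiki-20060518-pages-meta-history.xml.bz2
  bzip2 --version
  cat /etc/redhat-release   # operating system and version
  uname -m                  # cpu architecture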

zhangxiaoquan wrote:

  1. md5 checksum of the file (md5sum 2006*.bz2)

d521c59d94852920648565a80f5e1b90
It is different from the one on the website, but I had no problem while
downloading this file, and I tried it several times.

  2. your version of bzip2 (bzip2 --version)

1.0.3  15-Feb-2005

  3. your operating system and version

RedHat Enterprise Server 3

  4. your cpu architecture

Intel Pentium III (Coppermine) 1GHz

Looks like your download is corrupt then.

Check in particular that your download program handles
large files (this is over 33 gigabytes; if your download
is less than 4 gigabytes then you have a buggy download
tool or a buggy HTTP proxy).

Note that the 7zip version of this file is smaller and
faster to download, but it is still over 5 gigabytes, so you
still need to confirm that your download was correct.
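
(A rough way to re-check the download; the expected md5 is the one published
on http://download.wikipedia.com/enwiki/20060518/.)

  ls -l enwiki-20060518-pages-meta-history.xml.bz2    # size should be well over 4 GB
  md5sum enwiki-20060518-pages-meta-history.xml.bz2   # must match the published checksum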

zhangxiaoquan wrote:

Thanks for the tip. I used wget 1.10.2, and the file size is correct. I have never
had a problem with wget, but I admit 33 GB might be too much for any download program.

I wonder if you could create a special version for us with the content in <text>
tags removed. For our research purposes, we only need the modification history
(who modified what and when, etc.). Thanks, please let us know.

The reason we want to go with bzip2 (instead of 7zip) is that I wrote a Perl
parser that reads directly from bzip2 files and writes a new file without the
information in the <text> tags. I'm not sure whether that is feasible with the
7zip format.

Have you tried reading from a pipe on stdin? (7zip also decompresses
about 10 times faster than bzip2.)
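
(A sketch of the pipe approach, assuming the Perl parser reads the XML stream
from stdin; the script name strip_text.pl and the exact .7z file name are
placeholders.)

  # 7za (from p7zip) writes the extracted stream to stdout with -so
  7za e -so enwiki-20060518-pages-meta-history.xml.7z | perl strip_text.pl > history-no-text.xml
  # the same parser would work against the bzip2 dump as well
  bzip2 -dc enwiki-20060518-pages-meta-history.xml.bz2 | perl strip_text.pl > history-no-text.xml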

zhangxiaoquan wrote:

I'll give it a try. Can you leave this open until I download the file and verify
the md5 checksum?
Thanks a lot for the help!

Really old, so gonna go ahead and close.