
Include checksums in https://dumps.wikimedia.org/wikidatawiki/entities/
Closed, Resolved, Public

Description

Please include text files with the hash values of the future entity dumps in https://dumps.wikimedia.org/wikidatawiki/entities/ in order to check data integrity. These files could be similar to the *sums.txt ones in https://dumps.wikimedia.org/wikidatawiki/latest/.

Event Timeline

@hoo, can you fold this into the bash script without too much work?

Piece of cake (I guess)… so yes, will schedule this for the week.

Would we want one hash sum file per (dated) folder, or one for everything? Or both?

If one for everything, should it contain just the base file names (like wikidata-20180323-truthy-BETA.nt.bz2), or the relative paths (like 20180323/wikidata-20180323-truthy-BETA.nt.bz2)?

One per folder I suppose, so that as a particular run finishes up, the hash info is available.

That's also the easiest option for users, I think.
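
To make the "one per folder" option concrete, here is a minimal sketch of how a dump script might write the two checksum files at the end of a run. This is not the code from the actual patch: the function name write_checksums and its arguments are hypothetical, and it assumes GNU md5sum and sha1sum are available. It lists base file names only, so the files can be checked from inside the dated folder.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not the production script).
# Writes <prefix>-md5sums.txt and <prefix>-sha1sums.txt inside one dated folder.
write_checksums() {
    local dir="$1"      # dated folder, e.g. .../wikidatawiki/20180323
    local prefix="$2"   # file name prefix, e.g. wikidata-20180323
    (
        cd "$dir" || exit 1
        # Remove old checksum files first so they are not hashed themselves.
        rm -f "$prefix-md5sums.txt" "$prefix-sha1sums.txt"
        files=()
        for f in "$prefix"-*; do
            [ -f "$f" ] || continue   # skips the unexpanded pattern if nothing matches
            files+=("$f")
        done
        [ "${#files[@]}" -gt 0 ] || exit 0
        # Base names only, so "md5sum -c" works from within the folder.
        md5sum "${files[@]}" > "$prefix-md5sums.txt"
        sha1sum "${files[@]}" > "$prefix-sha1sums.txt"
    )
}
```

Called as write_checksums /path/to/20180323 wikidata-20180323 once all dump files for that date are complete, this would make the hash info available as soon as the run finishes.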

Change 423353 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Add checksums for Wikidata entity dumps

https://gerrit.wikimedia.org/r/423353

Change 423353 merged by ArielGlenn:
[operations/puppet@production] Add checksums for Wikidata entity dumps

https://gerrit.wikimedia.org/r/423353

First checksums are available: https://dumps.wikimedia.org/wikidatawiki/entities/20180402/wikidata-20180402-md5sums.txt and https://dumps.wikimedia.org/wikidatawiki/entities/20180402/wikidata-20180402-sha1sums.txt for https://dumps.wikimedia.org/wikidatawiki/entities/20180402/.

I added the JSON checksums manually, but the RDF ones were added automatically. I'll check next week to make sure this also works correctly for the JSON checksums, but I don't expect any surprises there.

JSON checksums look fine as well:

hoo@snapshot1007:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20180409$ md5sum -c wikidata-20180409-md5sums.txt 
wikidata-20180409-all.json.gz: OK
hoo@snapshot1007:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20180409$ sha1sum -c wikidata-20180409-sha1sums.txt
wikidata-20180409-all.json.gz: OK
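
For users who download only one file from a dated folder, running the check against the whole checksum file would report the files they did not fetch as missing. A minimal sketch of checking a single file is below. The function name verify_one is hypothetical, GNU awk-compatible awk and sha1sum are assumed, and it must be run from the folder that holds the downloaded file.

```shell
#!/bin/bash
# Hypothetical helper: verify one downloaded file against a sha1sums file.
verify_one() {
    local sums="$1"   # e.g. wikidata-20180409-sha1sums.txt
    local file="$2"   # e.g. wikidata-20180409-all.json.gz
    local line
    # Pick the checksum line for this file (text mode "name" or binary mode "*name").
    line=$(awk -v f="$file" '$2 == f || $2 == "*" f' "$sums")
    if [ -z "$line" ]; then
        echo "$file: not listed in $sums" >&2
        return 1
    fi
    # "-" makes sha1sum read the checksum line from standard input.
    printf '%s\n' "$line" | sha1sum -c -
}
```

For example, verify_one wikidata-20180409-sha1sums.txt wikidata-20180409-all.json.gz returns a non-zero status if the file is not listed or its hash does not match. Recent GNU coreutils versions also offer sha1sum -c --ignore-missing, which may be simpler where it is available.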