Include checksums in https://dumps.wikimedia.org/wikidatawiki/entities/
Closed, Resolved (Public)

Description

Please publish text files with the hash values of future entity dumps in https://dumps.wikimedia.org/wikidatawiki/entities/ so that users can verify data integrity. These files could be similar to the *sums.txt files in https://dumps.wikimedia.org/wikidatawiki/latest/.

Details

Related Gerrit Patches:
[operations/puppet@production] Add checksums for Wikidata entity dumps

Event Timeline

abian created this task. Mar 22 2018, 9:16 PM

@hoo, can you fold this into the bash script without too much work?

hoo added a comment. Mar 27 2018, 2:47 AM

@hoo, can you fold this into the bash script without too much work?

Piece of cake (I guess)… so yes, will schedule this for the week.

hoo added a comment. Mar 28 2018, 2:25 PM

Would we want one hash sum file per (dated) folder, or one for everything? Or both?

If one for everything, should it contain just the base file names (like wikidata-20180323-truthy-BETA.nt.bz2), or the relative paths (like 20180323/wikidata-20180323-truthy-BETA.nt.bz2)?

One per folder I suppose, so that as a particular run finishes up, the hash info is available.

abian added a comment. Mar 28 2018, 2:47 PM

That's also the easiest option for users, I think.
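
The layout discussed above (one checksum file per dated folder, listing base file names only) can be sketched as follows. This is a minimal Python illustration, not the bash script touched by the Gerrit change. The function names `file_digest` and `write_sums_file` are hypothetical, and the two-space `md5sum`-style line format and the `wikidata-<date>-<algorithm>sums.txt` name pattern are assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large dump never has to fit in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sums_file(folder: Path, date: str, algorithm: str) -> Path:
    """Write one checksum file for a dated folder (hypothetical helper).

    Each line is "<hex digest>  <base file name>", which is the text-mode
    format that `md5sum -c` and `sha1sum -c` accept. The output name
    pattern is an assumption.
    """
    out = folder / f"wikidata-{date}-{algorithm}sums.txt"
    # Skip existing checksum files so they are not hashed themselves.
    dumps = sorted(
        p for p in folder.iterdir()
        if p.is_file() and not p.name.endswith("sums.txt")
    )
    lines = [f"{file_digest(p, algorithm)}  {p.name}\n" for p in dumps]
    out.write_text("".join(lines))
    return out
```

Because the lines hold base file names rather than relative paths, the resulting file can be checked from inside the dated folder itself, for example after calling `write_sums_file(Path("20180323"), "20180323", "md5")` (a hypothetical invocation).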

Change 423353 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Add checksums for Wikidata entity dumps

https://gerrit.wikimedia.org/r/423353

hoo claimed this task. Apr 2 2018, 1:14 AM
hoo moved this task from Tasks to Needs Review on the Wikidata-Ministry-Of-Magic board.

Change 423353 merged by ArielGlenn:
[operations/puppet@production] Add checksums for Wikidata entity dumps

https://gerrit.wikimedia.org/r/423353

hoo closed this task as Resolved. Apr 5 2018, 12:07 PM

First checksums are available: https://dumps.wikimedia.org/wikidatawiki/entities/20180402/wikidata-20180402-md5sums.txt and https://dumps.wikimedia.org/wikidatawiki/entities/20180402/wikidata-20180402-sha1sums.txt for https://dumps.wikimedia.org/wikidatawiki/entities/20180402/.

I added the JSON checksums manually, while the RDF ones were added automatically. I'll check next week that this also works correctly for the JSON checksums, but I don't expect any surprises there.

abian added a comment. Apr 5 2018, 1:45 PM

Thank you!

hoo added a comment. Apr 11 2018, 9:48 AM

JSON checksums look fine as well:

hoo@snapshot1007:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20180409$ md5sum -c wikidata-20180409-md5sums.txt 
wikidata-20180409-all.json.gz: OK
hoo@snapshot1007:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20180409$ sha1sum -c wikidata-20180409-sha1sums.txt
wikidata-20180409-all.json.gz: OK
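
For users without `md5sum` or `sha1sum` at hand, the same check can be approximated in Python. This is a rough sketch under stated assumptions, not an official tool: `verify_sums_file` is a hypothetical helper, and it assumes each line of the checksum file has the form `<hex digest>  <file name>` with the listed files sitting next to the checksum file.

```python
import hashlib
from pathlib import Path


def verify_sums_file(sums_file: Path, algorithm: str) -> dict:
    """Map each listed file name to True if its digest matches (hypothetical helper)."""
    results = {}
    for line in sums_file.read_text().splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        h = hashlib.new(algorithm)
        # Read in 1 MiB chunks so large dumps are not loaded into memory at once.
        with (sums_file.parent / name).open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        results[name] = h.hexdigest() == expected.lower()
    return results
```

A call such as `verify_sums_file(Path("wikidata-20180409-md5sums.txt"), "md5")` would then return a dictionary keyed by file name, with `True` for every file whose digest matches.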

Envlh awarded a token. Apr 16 2018, 3:03 PM
ArielGlenn moved this task from Backlog to Done on the Dumps-Generation board. May 8 2018, 7:49 AM