Some dumps do not have checksums
Open, Needs Triage | Public | Feature

Description

This feature was voted #134 in the 2023 Community Wishlist Survey.

Problem: Some files on dumps.wikimedia.org do not have checksums.

Proposed solution: Add checksums to the dumps to verify data integrity.


Supplied examples include:

Some other examples are:

Event Timeline

One way to do this would be to add checksums for the dumps created this month via a script, then run that script daily to pick up new dump files in some directory trees. Most dumps in https://dumps.wikimedia.org/other belong to other teams and are not maintained by us (Platform Engineering). They are also produced by different scripts, some of them on different hosts.

That could work! I have a PoC Python script for iterating through a directory, hashing files, and writing hashes to *-md5sums.txt/*-sha1sums.txt files — who do I need to poke to get access to where the dumps are stored? :-)
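For readers following along, here is a minimal sketch of the approach described above: walk one directory, hash each file, and write the results to *-md5sums.txt and *-sha1sums.txt files. This is not the PoC linked in this task. The function names, the `prefix` argument, and the `<hash>  <filename>` line layout (the one md5sum/sha1sum emit) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Read in 1 MiB chunks so large dump files are never loaded fully into memory.
CHUNK_SIZE = 1024 * 1024

# Suffixes of the checksum files this sketch writes (and must not hash itself).
CHECKSUM_SUFFIXES = ("-md5sums.txt", "-sha1sums.txt")


def hash_file(path: Path, algorithm: str) -> str:
    """Return the hex digest of one file, read chunk by chunk."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_files(directory: Path, prefix: str) -> list[Path]:
    """Write <prefix>-md5sums.txt and <prefix>-sha1sums.txt for files in directory.

    `prefix` is a hypothetical naming choice; real dump directories may need
    a different scheme.
    """
    # Collect targets once, before writing, and leave out checksum files.
    targets = sorted(
        p for p in directory.iterdir()
        if p.is_file() and not p.name.endswith(CHECKSUM_SUFFIXES)
    )
    written = []
    for algorithm, suffix in (("md5", "-md5sums.txt"), ("sha1", "-sha1sums.txt")):
        out_path = directory / f"{prefix}{suffix}"
        lines = [f"{hash_file(p, algorithm)}  {p.name}\n" for p in targets]
        out_path.write_text("".join(lines), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `write_checksum_files(Path("some/dump/dir"), "example")` would leave `example-md5sums.txt` and `example-sha1sums.txt` next to the files; the path and prefix here are placeholders.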

Pending a repo to commit this to, I've placed the PoC script at https://gitlab.wikimedia.org/-/snippets/70

Summarizing from a conversation on IRC:

  • access would be to the clouddumps servers. WMCS owns them, so an access request task should be filed describing why you want the access and what level of access you need. I would start with just a regular account and see if that gets the job done. As I understand it, access would be useful for looking at the directory tree under public/xmldatadumps/other and (eventually) for testing a script with a dry run.
  • subdirectories have various layouts and filenames don't all follow the same format, so any script will need to account for this. Some subdirs might already provide checksum files, so they would need to be skipped.
  • presumably the script would run on one clouddumps host and rsync afterwards to the other (since the other can sometimes be pressed into service to take on all duties while the first server is undergoing maintenance, let's say); this should all be discussed with WMCS folks though.
  • there's no simple way to just adjust all the various dumps to add this feature; we are talking about output files generated by a gamut of MediaWiki maintenance scripts, or in some cases fetched from some other host entirely (e.g. Phabricator dumps).
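The points about skipping subdirectories that already ship checksums, and about testing with a dry run first, could look roughly like the sketch below. The filename patterns used to detect existing checksum files are guesses, since the thread notes that layouts and names vary; `plan_directories` and `dry_run` are hypothetical helper names.

```python
import os
import re
from pathlib import Path

# Assumed patterns for existing checksum files (e.g. "x-md5sums.txt",
# "SHA1SUMS", "file.sha256"). Real subdirectories may use other names.
CHECKSUM_NAME = re.compile(
    r"(md5|sha1|sha256)sums?(\.txt)?$|\.(md5|sha1|sha256)$",
    re.IGNORECASE,
)


def is_checksum_file(name: str) -> bool:
    """Guess from the filename alone whether this is a checksum file."""
    return CHECKSUM_NAME.search(name) is not None


def plan_directories(root: Path) -> tuple[list[Path], list[Path]]:
    """Return (to_hash, skipped) directories under root.

    A directory is skipped when it already contains something that looks
    like a checksum file; directories without any files are ignored.
    """
    to_hash, skipped = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        if not filenames:
            continue
        if any(is_checksum_file(name) for name in filenames):
            skipped.append(Path(dirpath))
        else:
            to_hash.append(Path(dirpath))
    return sorted(to_hash), sorted(skipped)


def dry_run(root: Path) -> None:
    """Print what would happen without reading or writing any dump file."""
    to_hash, skipped = plan_directories(root)
    for directory in skipped:
        print(f"skip (checksums present): {directory}")
    for directory in to_hash:
        print(f"would hash: {directory}")
```

Keeping the planning step separate from any hashing means a dry run only needs read access to the directory listing, which fits starting with a regular account.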

To me it feels like any checksums should be generated by the code generating the dumps and not some generic script afterwards, or otherwise errors created during the generation->storage transfer phase will go unnoticed.
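As an illustration of that alternative, the sketch below hashes the bytes as the generating code writes them, so the recorded checksum describes what was produced rather than what ended up on disk after a transfer. It is a generic pattern, not taken from any existing dump script; `HashingWriter`, `write_dump_with_checksum`, `verify`, and the `.sha1` sidecar filename are all hypothetical.

```python
import hashlib
from pathlib import Path


class HashingWriter:
    """Wrap a binary file object and hash every byte written through it."""

    def __init__(self, raw, algorithm: str = "sha1"):
        self._raw = raw
        self._digest = hashlib.new(algorithm)

    def write(self, data: bytes) -> int:
        # Update the digest with exactly the bytes handed to the file.
        self._digest.update(data)
        return self._raw.write(data)

    def hexdigest(self) -> str:
        return self._digest.hexdigest()


def write_dump_with_checksum(path: Path, chunks) -> str:
    """Write chunks to path and record the SHA-1 in a sidecar file."""
    with path.open("wb") as raw:
        writer = HashingWriter(raw)
        for chunk in chunks:
            writer.write(chunk)
    checksum = writer.hexdigest()
    # Hypothetical sidecar naming: "<dump filename>.sha1".
    Path(f"{path}.sha1").write_text(f"{checksum}  {path.name}\n", encoding="utf-8")
    return checksum


def verify(path: Path) -> bool:
    """Re-hash the stored file and compare with the sidecar checksum."""
    expected = Path(f"{path}.sha1").read_text(encoding="utf-8").split()[0]
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest() == expected
```

Running `verify` on the storage host after the transfer would then flag a file that was damaged between generation and storage, which a checksum computed only after the fact cannot do.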