checksum file incorrectly formatted for incremental XML data dumps
Closed, Resolved · Public

Description

0) Problem

Checksum files for incremental XML data dumps are not formatted correctly: each line contains only the digest, with no filename. This causes `md5sum --check' to fail with an error.

1) Test case

(shell)$ wget -nH -np -nv -N -r -l 2 http://dumps.wikimedia.org/other/incr/simplewiki

(shell)$ cd other/incr/simplewiki/

(shell)$ ls
simplewiki-20140703-md5sums.txt
simplewiki-20140703-pages-meta-hist-incr.xml.bz2
simplewiki-20140703-stubs-meta-hist-incr.xml.gz

(shell)$ cat simplewiki-20140703-md5sums.txt
d03f3a91ef0273eb814f39a1d13788cb
c51f2bd5ef6bd42ce65cf4a7fca72400

(shell)$ md5sum --check simplewiki-20140703-md5sums.txt
md5sum: simplewiki-20140703-md5sums.txt: no properly formatted MD5 checksum lines found

2) Recommendation

The correct format is:

<checksum><two spaces><filename><newline>
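
A minimal sketch of writing lines in that format, assuming the checksum file is produced from Python. The helper names `md5_hex`, `checksum_line` and `write_md5sums` are hypothetical and are not taken from the dumps code:

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(path):
    """Format one line as <checksum><two spaces><filename><newline>."""
    # Only the base name is written, so the file can be checked from
    # inside the directory that holds the dump files.
    return "{}  {}\n".format(md5_hex(path), Path(path).name)


def write_md5sums(paths, out_path):
    """Write one checksum line per input file to out_path."""
    with open(out_path, "w") as out:
        for path in paths:
            out.write(checksum_line(path))
```

A file written this way should be accepted by `md5sum --check` when it is run from the directory containing the listed files.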

Sincerely Yours,
Kent


Version: unspecified
Severity: normal
See also: T34130

Details

Reference
bz67886
Related Gerrit Patches:

Event Timeline

bzimport raised the priority of this task from to Needs Triage. · Nov 22 2014, 3:40 AM
bzimport set Reference to bz67886.
Aklapper triaged this task as Low priority. · Mar 23 2015, 5:42 PM
Aklapper added a subscriber: Aklapper.
Nemo_bis updated the task description. · Apr 9 2015, 8:14 AM
Nemo_bis set Security to None.

Can someone point me to the code that generates the md5sums file for the incremental dumps?
This bug is too easy to leave unfixed for a year.

In the ariel branch of operations-dumps: dumps/xmldumps-backup/incrementals/generateincrementals.py, function md5sums().

Restricted Application added a subscriber: TerraCodes. · Jul 14 2016, 7:18 PM
awight claimed this task. · Dec 19 2016, 7:44 PM
awight moved this task from Backlog to Active on the Dumps-Generation board.
awight added a subscriber: ArielGlenn.

I found this interesting bit in the Wikipedia article on md5sum:

Note: There must be two spaces or a space and an asterisk between each md5sum value and filename to be compared (the second space indicates text mode, the asterisk binary mode). Otherwise, the following error will result: "no properly formatted MD5 checksum lines found". Many programs don't distinguish between the two modes, but some utils do.

We should check which mode we're using when calculating the digests.
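
The rule quoted above can be sketched as a small line checker. This is a simplified illustration, not the parser that md5sum itself uses; the name `parse_checksum_line` is hypothetical, and other line styles that md5sum may accept are not handled:

```python
import re

# 32 hex digits, one space, then a mode marker (a second space for text
# mode or an asterisk for binary mode), then the file name.
CHECKSUM_LINE = re.compile(r"^([0-9a-fA-F]{32}) ([ *])(.+)$")


def parse_checksum_line(line):
    """Return (digest, mode, filename), or None if the line is malformed."""
    match = CHECKSUM_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    digest, marker, filename = match.groups()
    mode = "binary" if marker == "*" else "text"
    return digest.lower(), mode, filename
```

Under this check, a bare digest such as the lines in simplewiki-20140703-md5sums.txt returns None, which corresponds to the "no properly formatted MD5 checksum lines found" error. Digests computed in Python over bytes read with `open(path, "rb")` correspond to the raw file contents.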

@ArielGlenn
I think the directory structure has changed since your comment above. I see the file on the ariel branch under xmldumps-backup/see_master_branch/generateincrementals.py, but I don't see it on the master branch. When you get the chance, could you clarify why the directory has that name?

Ah, because that script is now called generatemiscdumps.py, I think, and there's a little class for "incrementals" (i.e. adds/changes) and one for HTML dumps. The idea is that if we want other similar dumps of some new form across all wikis, we just use the same misc library and calling wrapper, which handles locks, dates, cleanup, etc. If you look at the recent git log you'll see it.

Change 328219 had a related patch set uploaded (by Awight):
Make md5sums.txt files compatible with md5sum --check

https://gerrit.wikimedia.org/r/328219

Change 328219 merged by ArielGlenn:
Make md5sums.txt files compatible with md5sum --check

https://gerrit.wikimedia.org/r/328219

ArielGlenn closed this task as Resolved. · Jan 30 2017, 9:37 AM

Merged and deployed, thanks for the patch and your patience.

ArielGlenn moved this task from Active to Done on the Dumps-Generation board. · Jan 30 2017, 9:37 AM