
missing checksums for many files in backups
Closed, Resolved (Public)

Description

md5sums.txt and sha1sums.txt do not contain checksums for at least pages-meta-current in almost all projects on dumps.wikimedia.org.

Event Timeline

The checksums are back now, but was it a script bug?

Aklapper subscribed.

Hi @Xoristzatziki, thanks for taking the time to report this!

As the issue described in this task cannot be reproduced anymore and as no code was changed in the codebase, I'm closing this task as declined.

The problem still exists. https://dumps.wikimedia.org/elwiktionary/latest/elwiktionary-latest-sha1sums.txt does not contain pages-meta-current. I can provide a screenshot from today, in case you check it some days from now.

Sorry, I just realised how to upload the screenshot.

Στιγμιότυπο από 2017-07-02 09-47-30.png (1×1 px, 176 KB)

Looking at the SHA-1 listings for the elwiktionary dumps: elwiktionary-20170620-sha1sums.txt contains pages-meta-current, but elwiktionary-20170701-sha1sums.txt does not. And 20170701 is the run that the latest directory points to.
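The comparison above can be scripted. The following is only a minimal sketch: it assumes the listings use the usual `sha1sum` layout (a hex digest, whitespace, then a filename on each line), and the helper names `listed_files` and `has_job` are hypothetical, not part of any dumps tooling.

```python
def listed_files(checksum_text):
    """Map filename -> hex digest, assuming sha1sum-style 'digest  filename' lines."""
    entries = {}
    for line in checksum_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        digest, name = parts
        entries[name.lstrip("*")] = digest.lower()
    return entries


def has_job(checksum_text, job):
    """True if any listed filename mentions the given dump job, e.g. 'pages-meta-current'."""
    return any(job in name for name in listed_files(checksum_text))
```

Running `has_job` on the text of a listing fetched early in a run and again after the run finishes would show whether the entry for a given job has been added yet.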

The screenshot you included is from before the page dumps were run. If you look at the SHA-1 and MD5 sum listings now, they have all files.

-rw-rw-r-- 1 datasets datasets 3043 Jul  6 03:25 20170701/elwiktionary-20170701-sha1sums.txt

shows that the complete listing, written after the final dump step ran, was finished only a few days later.

-rw-rw-r-- 1 datasets datasets 38941266 Jul  3 13:16 20170701/elwiktionary-20170701-pages-meta-current.xml.bz2

The step you wanted ran a day after you looked at the listings.

My objection is that I download a file from XXXXX/latest/ which I cannot verify by downloading the sha1.txt from the same XXXXX/latest/ page.
Either the file should not be there (so that I can look for it somewhere else, or download a previous one), or it must be listed in the sha1.txt on the same page.

OK. Since the folder is named "latest", all files must always be there. But we could keep the old SHA-1 entry (hash and filename) in sha1sums.txt until a newer file appears.

The sha1.txt file is generated as the dump runs; new hashes are added to it as each file is produced. It doesn't contain a mix of old and new information, only the hashes for the given run. We could avoid updating any latest links until the run completes, but some folks want the files right away, which is understandable. In theory, downloaders would check the date in all filenames to make sure they correspond to the desired dump run. I'm open to suggestions for how we can make this better for users.
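A downloader-side check along these lines might look like the sketch below. It is an illustration under assumptions, not the dumps project's own tooling: it assumes dated filenames of the form `<wiki>-<YYYYMMDD>-<job>...` and `sha1sum`-style listing lines, and the function names `run_date`, `sha1_of` and `verify` are hypothetical.

```python
import hashlib
import re

# Assumed filename shape: <wiki>-<YYYYMMDD>-<rest>
DATE_RE = re.compile(r"-(\d{8})-")


def run_date(filename):
    """Return the 8-digit run date embedded in a dump filename, or None if absent."""
    match = DATE_RE.search(filename)
    return match.group(1) if match else None


def sha1_of(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, listed_name, checksum_text):
    """Return 'ok', 'mismatch', or 'not-listed' for a local file against a listing."""
    for line in checksum_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == listed_name:
            return "ok" if sha1_of(path) == parts[0].lower() else "mismatch"
    # The run may simply not have produced this file's hash yet.
    return "not-listed"
```

A caller could first compare `run_date` of the content file with `run_date` of the listing it downloaded, and treat a `not-listed` result as "the run has not reached this step yet" rather than as a corrupted download.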

Any thoughts? The hash file lists are per dump run, just like all run status files and content files, and we need them to remain that way so that the information on any given run is consistent. What else could be provided that would address your concerns?

I only use pages-meta-current for about 10 projects and am not very strict about the dates, so my concerns are not that important. I work around it by skipping the hash check when it is too close to the days that the dumps run. I am really not that bothered by it. Please close this if no one else is worried.

ArielGlenn claimed this task.

OK, I will close this for now.