
Generate a list of files that are supposed to exist but 404
Open, Normal, Public

Description

For the parent task, which is that specific revisions are 404-ing on swift, it has been reported on #wikimedia-commons that not one but many files, in some seemingly random content categories, are gone. Assuming the root cause is not easy to find, a list of affected files will help determine how widespread the bug is, and could also feed a list of files for bots to reupload from their original sources (e.g. Flickr).

Right now Commons contains 43639434 files and 47459938 file revisions:

MariaDB [commonswiki_p]> SELECT "image" AS `table`, COUNT(1) FROM image UNION SELECT "oldimage" AS `table`, COUNT(1) FROM oldimage\G
*************************** 1. row ***************************
   table: image
COUNT(1): 43639434
*************************** 2. row ***************************
   table: oldimage
COUNT(1): 3820504
2 rows in set (1 min 12.09 sec)

Processing one file per second, to put the least amount of stress on the servers, would take one to two years (47,459,938 seconds is roughly 549 days, and the number of images keeps growing); however, measuring the time taken for each HEAD request via timeit shows that each request is really fast:

08:35:37 0 ✓ zhuyifei1999@tools-bastion-05: ~$ python -m timeit -s 'import requests as r; s = r.Session(); s.head("https://upload.wikimedia.org/")' 's.head("https://upload.wikimedia.org/wikipedia/commons/2/2e/Burbuja_%281496994920%29.jpg")'
100 loops, best of 3: 3.05 msec per loop
08:36:00 0 ✓ zhuyifei1999@tools-bastion-05: ~$ python -m timeit -s 'import requests as r; s = r.Session(); s.head("https://upload.wikimedia.org/")' 's.head("https://upload.wikimedia.org/wikipedia/commons/5/5c/Mig-29s_intercepeted_by_F-15s_-_DF-ST-90-05759.jpg")'
100 loops, best of 3: 3.77 msec per loop

Even with a single thread and one request in flight at a time, running unthrottled would cut the total run time to only a few days (47.5 million requests at ~3 ms each is about 40 hours), but I'm unsure whether this would put too much load on the servers. How fast should the generator run? How many concurrent requests and/or how much throttling per request should be used?
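For concreteness, here is a minimal sketch of what the generator could look like (Python 3 with requests). The URL construction assumes the standard MediaWiki hashed upload layout (first one and two hex digits of the md5 of img_name); the one-second delay and the input file are placeholders until a rate is agreed on:

import hashlib
import time
import urllib.parse

import requests

DELAY = 1.0  # placeholder throttle, in seconds, until a rate is agreed on

def original_url(img_name):
    # img_name as stored in the image table: underscores, no namespace
    digest = hashlib.md5(img_name.encode('utf-8')).hexdigest()
    return 'https://upload.wikimedia.org/wikipedia/commons/%s/%s/%s' % (
        digest[0], digest[:2], urllib.parse.quote(img_name))

def scan(names, session=None):
    # HEAD each file's original and yield the names that 404
    session = session or requests.Session()
    for name in names:
        if session.head(original_url(name)).status_code == 404:
            yield name
        time.sleep(DELAY)

# hypothetical input: one img_name per line, dumped from the replicas
with open('img_names.txt') as f:
    for missing in scan(line.strip() for line in f):
        print(missing)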

Event Timeline

zhuyifei1999 triaged this task as Normal priority.Dec 13 2017, 8:44 PM
zhuyifei1999 created this task.
zhuyifei1999 raised the priority of this task from Normal to Needs Triage.Dec 13 2017, 9:04 PM

Maybe it would be possible to extract from swift the list of files stored there? Then no HTTP requests would be needed (unless this shows that the problem lies on a different layer, of course).
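If someone with cluster access wants to try that, a sketch with python-swiftclient might look like the following; the container naming (wikipedia-commons-local-public sharded by two hex digits) and the auth details are assumptions on my part:

import swiftclient

# placeholders: whoever runs this on the cluster knows the real values
conn = swiftclient.Connection(authurl='https://AUTH_URL/auth/v1.0',
                              user='USER', key='KEY')

def list_shard(shard):
    # page through one sharded container, yielding every object name
    container = 'wikipedia-commons-local-public.%s' % shard
    marker = ''
    while True:
        _headers, objects = conn.get_container(container, marker=marker,
                                               limit=10000)
        if not objects:
            return
        for obj in objects:
            yield obj['name']
        marker = objects[-1]['name']

swift_names = set()
for i in range(256):
    swift_names.update(list_shard('%02x' % i))
# a set difference against the image/oldimage name lists would then find
# the misses without a single HTTP request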

If querying files, it would be interesting to spread the fetches across different periods. For instance, if after scanning 0.02% most misses are from uploads in June 2016, we could suspect that "something" happened around that time and look further into that period. OTOH, we may find that misses are evenly distributed, and that these files having been uploaded then is just a consequence of being in that category.

(this is obviously a simplification; there have been multiple reasons for data loss in the past, some of which were already identified)
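As a sketch of that bucketing (the connection details and the misses file are placeholders; img_timestamp is the MediaWiki YYYYMMDDHHMMSS timestamp):

import os
from collections import Counter

import pymysql

conn = pymysql.connect(
    host='commonswiki.analytics.db.svc.eqiad.wmflabs',  # placeholder
    db='commonswiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))

by_month = Counter()
with conn.cursor() as cur, open('missing.txt') as f:
    for name in (line.strip() for line in f):
        cur.execute('SELECT img_timestamp FROM image WHERE img_name = %s',
                    (name,))
        row = cur.fetchone()
        if row:
            ts = row[0].decode() if isinstance(row[0], bytes) else str(row[0])
            by_month[ts[:6]] += 1  # bucket by YYYYMM

for month, misses in sorted(by_month.items()):
    print(month, misses)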

The images in the category page linked above are from a single user and a single bot; do you have more examples?

As to finding files that exist in the db but not in swift, we'd likely do that via a MediaWiki maintenance script. There's one to find files in swift but not in the db (see https://phabricator.wikimedia.org/T111838#1763274); I don't know about the other way around, though.

I did something similar years ago to pre-generate thumbnails for WikiMiniAtlas at its unusual sizes (48x48), storing the content-length and response code. This was useful in multiple ways: we were able to identify corrupt files, missing files, and rendering problems, and had a list of thumbnails that were somehow 2 MB in size; those ended up being very large ICC profiles.
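That bookkeeping is easy to reproduce for this task; here is a sketch that records status code and Content-Length per URL in sqlite (the URL list is assumed):

import sqlite3

import requests

db = sqlite3.connect('heads.db')
db.execute('CREATE TABLE IF NOT EXISTS head '
           '(url TEXT PRIMARY KEY, status INTEGER, length INTEGER)')

session = requests.Session()
with open('urls.txt') as f:  # hypothetical: one upload.wikimedia.org URL per line
    for url in (line.strip() for line in f):
        resp = session.head(url)
        db.execute('INSERT OR REPLACE INTO head VALUES (?, ?, ?)',
                   (url, resp.status_code,
                    int(resp.headers.get('Content-Length', -1))))
db.commit()

# the odd cases then fall out of simple queries, e.g. oversized thumbnails:
for url, length in db.execute(
        'SELECT url, length FROM head WHERE length > 2 * 1024 * 1024'):
    print(url, length)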

Ottomata triaged this task as Normal priority.Jan 16 2018, 7:33 PM
Jojr149 claimed this task.Feb 6 2018, 1:43 PM
Aklapper removed Jojr149 as the assignee of this task.Feb 6 2018, 2:45 PM
Aklapper added a subscriber: Jojr149.

@Jojr149: Do you plan to work on this task? If not then please don't set yourself as assignee. Thanks!

@Aklapper I'm currently analyzing SVG files for problems: 404ing files, missing xmlns=, font issues, etc. In a month or two, I will download all thumbnails on Commons again for several projects of mine.
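For what it's worth, the missing-xmlns part of such a check fits in a few lines; a sketch that flags local SVG copies whose root element is not in the SVG namespace:

import sys
import xml.etree.ElementTree as ET

SVG_ROOT = '{http://www.w3.org/2000/svg}svg'

for path in sys.argv[1:]:
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as err:
        print('%s: unparseable (%s)' % (path, err))
        continue
    if root.tag != SVG_ROOT:
        # a bare 'svg' tag means the xmlns declaration is missing
        print('%s: root is %r, xmlns missing or wrong' % (path, root.tag))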


What resolutions? We might want those as downloadable bundles. Can you weigh in on T184744, please?

In the meantime, @Aklapper, what scripts are you using to find these problem files? I'd like to reuse/extend them if possible.


I assume you meant to ask @Dispenser. I don't work on this.

@Aklapper woops yes, indeed.