
Combined latest revision article dumps are 404s in some languages
Closed, Resolved (Public)

Description

The combined latest revision article dumps broke in some languages starting on July 7, 2015. Example:

$ curl -I "http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
HTTP/1.1 404 Not Found
Server: nginx/1.1.19
Date: Wed, 15 Jul 2015 00:27:55 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 169
Connection: keep-alive

I strongly suspect the latest directory is getting updated before the dump has completed. http://dumps.wikimedia.org/jawiki/20150703/ shows the files in the "latest" directory but also shows that the combined latest revision articles dump hasn't been created yet.
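To see which wikis are affected, the check above can be repeated for several of them. This is a minimal sketch, assuming curl is available; the function names `latest_url` and `check_latest` are illustrative and not part of any dumps tooling.

```shell
# Build the URL of the combined latest-revision articles dump for one wiki.
# The URL pattern is taken from the jawiki example in this report.
latest_url() {
    printf 'http://dumps.wikimedia.org/%s/latest/%s-latest-pages-articles.xml.bz2\n' "$1" "$1"
}

# Print "<wiki> <HTTP status code>" for each wiki name given as an argument.
# -I sends a HEAD request, -o /dev/null discards the headers,
# and -w '%{http_code}' prints only the numeric status.
check_latest() {
    for wiki in "$@"; do
        status=$(curl -s -o /dev/null -I -w '%{http_code}' "$(latest_url "$wiki")")
        printf '%s %s\n' "$wiki" "$status"
    done
}
```

For example, `check_latest jawiki wikidatawiki` prints one line per wiki; a 404 in the second column indicates the missing link described here.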

Event Timeline

Deveshnetflix raised the priority of this task to Needs Triage.
Deveshnetflix updated the task description.
Deveshnetflix subscribed.
Aklapper renamed this task from Data Dumps to Combined latest revision article dumps are 404s in some languages. (Jul 15 2015, 11:17 PM)
Aklapper set Security to None.

This time I was in time to see the broken link. We should keep the link to the last file around rather than removing it; I need to see why that doesn't happen.

At http://dumps.wikimedia.org/wikidatawiki/latest/ the link itself is missing for some reason. In the meantime, other links to previous dump steps are still there, for example http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-abstract.xml

This turns out to be a problem with dumps that are split into pieces (stubs, pages, etc.). The bug has been around for a while but was never triggered, since we usually only clean up links from the previous run after the dump is complete. With staged dumps, we do it after every stage. I'm looking at how to fix it.
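One way to avoid a dangling or missing "latest" link is to replace it only once the new file actually exists. The following is a rough sketch of that idea, not the actual dumps code or the merged fix; `update_latest_link` and its arguments are hypothetical.

```shell
# Hypothetical helper: point a "latest" symlink at the file from the current
# run only if that file exists; otherwise leave the previous link in place.
update_latest_link() {
    target=$1   # file produced by the current run (may not exist yet)
    link=$2     # path of the "latest" symlink

    if [ -e "$target" ]; then
        # Create the symlink, replacing any existing one.
        ln -sf "$target" "$link"
    else
        # The stage that produces the file has not finished: keep the old link.
        echo "keeping existing link: $link" >&2
    fi
}
```

Called after every stage, a helper like this would leave the link from the previous run untouched until the corresponding file of the new run has been written.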

https://gerrit.wikimedia.org/r/#/c/230507/ merged. I'll leave this open until the next full clean run.

Thanks for leaving it open for now. I hope it can be closed soon based on demonstrable success, even if it subsequently blows up again.

I assume the next full clean run has happened in the last 35 days. If it has not, or if it was not successful, please reopen this task.

For a user, a problem isn't resolved until the operation in question has
been run with the desired tools, automation and integration somewhat
predictably. I keep noticing that scripts have to be modified, schedules
smoothed out, etc. Thus the user's problem doesn't really seem resolved at
the time all associated tasks have been performed and marked as resolved.

These criteria for success for a user may not be relevant for managing the
week-to-week activities that lead to resolving users' problems, so perhaps a
user-side system for tracking a problem and a programmer-side system for
tracking the steps to resolve the problem cannot be the same system.