
Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files
Closed, Resolved, Public

Event Timeline

There were known issues serving images around the time this was reported. Things should be recovering now.

Can this still be reproduced? I cannot reproduce it.

jcrespo renamed this task from Could not load an image in zh.wikipedia.org to Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files. Jan 6 2017, 6:06 PM
jcrespo edited projects, added SRE, SRE-swift-storage, Traffic; removed MediaViewer.
jcrespo subscribed.

@Aklapper The problems have been mitigated, but this is still a work in progress.

Elisfkc claimed this task.

I got UploadWizard and Flickr2Commons to work. Judging by Special:NewFiles, others can now upload as well. Thanks, @Aklapper.

Special:Upload is back up for me after being down for a while (T154790). I recommend updating the error message text to give users a heads-up: link to a technical noticeboard, a downtime indicator, this thread, or something similar.

@czar There will be a post-mortem, as usual, at https://wikitech.wikimedia.org/wiki/Incident_documentation, and I will make sure to link it from the header.

That's great, but I meant that non-technical content users won't know what to do with the error message 'An unknown error occurred in storage backend "local-swift-eqiad"'. That may not be something that can be helped, but if it is possible to tell users where to report unknown errors, or otherwise advise them on what to do next, it would save them from searching or giving up. Just a thought.

As promised, here is the incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170106-Cache-upload

Sadly, unlike other times, there was not much that users could have done differently, such as "avoid doing X" or "use feature Y". For error reporting, here (Phabricator) is the right place. If something needs real-time attention, #wikimedia-tech is normally the place to go. The Communications team is normally kept updated, and it reports on Twitter and through other means if an outage lasts long. Specialized channels also exist, depending on whether it is a software error, a server problem, a network issue, etc.

The problem with showing those channels more prominently or directly is that, with 400K requests per second, all the extra human requests during an active error could also bring down the fixing/reporting channel or even the bug-reporting infrastructure :-(. The only thing to keep in mind is to check the IRC topics and other tickets first to avoid duplicate reports (which is not easy, because errors that look unrelated sometimes have the same root cause). I just want to note that receiving that error message is a report in itself, because the number of HTTP errors is constantly monitored: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=1483629029900&to=1483825823893 and when there are many errors or many users are affected, this is detected and an alarm is raised immediately.

But like anything else, it is open to suggestions for improvement. I'd suggest opening a separate ticket so we do not bother other subscribers here.
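For readers unfamiliar with the kind of HTTP error monitoring mentioned above, here is a minimal sketch of a sliding-window error-rate check. It is not how the Wikimedia dashboards or alerts are actually implemented; the class name `ErrorRateMonitor`, its parameters, and the threshold values are all hypothetical and chosen only for illustration. It assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window monitor for the share of HTTP 5xx responses."""

    def __init__(self, window_seconds=60.0, threshold=0.05, min_requests=100):
        self.window_seconds = window_seconds  # how far back to look
        self.threshold = threshold            # alert when the error share exceeds this
        self.min_requests = min_requests      # ignore windows with too little traffic
        self._events = deque()                # (timestamp, is_error) pairs, oldest first
        self._errors = 0                      # number of errors currently in the window

    def record(self, timestamp, status):
        """Record one response and drop events that fell out of the window."""
        is_error = 500 <= status <= 599
        self._events.append((timestamp, is_error))
        if is_error:
            self._errors += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Remove every event at or before the cutoff (now - window_seconds).
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            if was_error:
                self._errors -= 1

    def error_rate(self):
        """Fraction of responses in the window that were 5xx (0.0 if empty)."""
        total = len(self._events)
        return self._errors / total if total else 0.0

    def should_alert(self):
        """True when there is enough traffic and the error share is too high."""
        return (
            len(self._events) >= self.min_requests
            and self.error_rate() > self.threshold
        )
```

The `min_requests` guard is a design choice in this sketch: it keeps a handful of failed requests during a quiet period from raising an alarm on its own.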

Hello! The same issue happened to us when uploading these files to Commons:

We have now requested deletion of these files, but it seems Commons admins cannot delete them.
cc: @Ladsgroup