Page MenuHomePhabricator

Displaying the stashed file as a thumbnail in UploadWizard fails with an 503
Closed, ResolvedPublic

Description

I tried to do a batch upload of 27 files from Flickr. The UW worked properly until the Describe step, where the thumbnails would not load. Eventually, the blank thumbnails became the broken file icon.

I tried to upload the same batch directly, and 100% of thumbnails also failed in that case.

@MarkTraceur says Flickr uploads have had inconsistent thumbnail problems for a while now.

My browser console looked something like this:

The Web Console logging API (console.log, console.info, console.warn, console.error) has been disabled by a script on this page.
GET https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/12ywuf3o87y0.2umkum.25308.jpg/100px-12ywuf3o87y0.2umkum.25308.jpg [HTTP/1.1 503 Service Unavailable 10113ms]
GET https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/12ywuep9rbiw.ixensu.25308.jpg/100px-12ywuep9rbiw.ixensu.25308.jpg [HTTP/1.1 503 Service Unavailable 9569ms]
GET https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/12ywuf56jry0.9mpzjg.25308.jpg/100px-12ywuf56jry0.9mpzjg.25308.jpg [HTTP/1.1 503 Service Unavailable 9538ms]
GET https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/12ywufk5y070.9iylwe.25308.jpg/100px-12ywufk5y070.9iylwe.25308.jpg [HTTP/1.1 503 Service Unavailable 7062ms]

[truncated]

Event Timeline

Ragesoss raised the priority of this task from to Needs Triage.
Ragesoss updated the task description. (Show Details)
Ragesoss added a project: UploadWizard.
Ragesoss added subscribers: Ragesoss, MarkTraceur.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Ragesoss set Security to None.

Here are details from an UploadStash request:

Request URL: https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/12yxqfmyh91g.2i8jay.25308.jpg/100px-12yxqfmyh91g.2i8jay.25308.jpg
Request Method: GET
Status Code: HTTP/1.1 503 Service Unavailable
Request Headers 15:58:57.000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.4.0
Referer: https://commons.wikimedia.org/wiki/Special:UploadWizard
Host: commons.wikimedia.org
Connection: keep-alive
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Sent Cookie
wikiEditor-0-booklet-Edittools-page: Edittools1
uls-previous-languages: ["en"]
StewardVoteEligible_r8: 1
GeoIP: [censored]
forceHTTPS: 1
commonswikiSession: [censored]
centralnotice_buckets_by_campaign: {"Stewnoms":{"val":1,"start":1421280000,"end":1425081600},"wm2015_scholar":{"val":1,"start":1421625600,"end":1426723140},"Stewvote":{"val":0,"start":1423418400,"end":1427738340},"POTY_2014_R2":{"val":0,"start":1424620680,"end":1427759400},"2015 WMF Strategy Consultation":{"val":1,"start":1424721600,"end":1428264000}}
centralauth_User: Ragesoss
centralauth_Token: [censored]
Response Headers Δ5984ms
X-Varnish: 3286572040, 3463850087, 1404308667
X-Cache: cp1055 miss (0), cp4008 miss (0), cp4010 frontend miss (0)
X-Analytics: https=1
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Server: nginx/1.1.19
Date: Tue, 24 Feb 2015 23:59:18 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 2670
Connection: keep-alive
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Age: 6
Accept-Ranges: bytes

Request: GET http://commons.wikimedia.org/wiki/Special:UploadStash/thumb/12yxqi1bbxyg.vxczzv.25308.jpg/100px-12yxqi1bbxyg.vxczzv.25308.jpg, from 10.128.0.110 via cp1055 cp1055 ([10.64.32.107]:3128), Varnish XID 3287488792
Forwarded for: 67.161.126.166, 10.128.0.110, 10.128.0.110, 10.128.0.110
Error: 503, Service Unavailable at Wed, 25 Feb 2015 00:11:34 GMT

Ragesoss renamed this task from Thumbnails do not appear at Describe step during batch upload from Flickr to Thumbnails do not appear during batch upload.Feb 25 2015, 12:56 AM
Ragesoss updated the task description. (Show Details)

I've found these exceptions:

2015-02-24 23:49:49 mw1182 commonswiki: [55234211] /wiki/Special:UploadStash/thumb/12ywuf3o87y0.2umkum.25308.jpg/100px-12ywuf3o87y0.2umkum.25308.jpg   HttpError from line 111 of /srv/mediawiki/php-1.25wmf18/includes/specials/SpecialUploadStash.php: This file (12ywuf3o87y0.2umkum.25308.jpg) doesn't belong to the current user.
2015-02-24 23:50:06 mw1163 commonswiki: [7868f7e8] /wiki/Special:UploadStash/thumb/12ywuf56jry0.9mpzjg.25308.jpg/100px-12ywuf56jry0.9mpzjg.25308.jpg   HttpError from line 111 of /srv/mediawiki/php-1.25wmf18/includes/specials/SpecialUploadStash.php: This file (12ywuf56jry0.9mpzjg.25308.jpg) doesn't belong to the current user.
2015-02-24 23:50:17 mw1042 commonswiki: [be57b22a] /wiki/Special:UploadStash/thumb/12ywufk5y070.9iylwe.25308.jpg/100px-12ywufk5y070.9iylwe.25308.jpg   HttpError from line 111 of /srv/mediawiki/php-1.25wmf18/includes/specials/SpecialUploadStash.php: This file (12ywufk5y070.9iylwe.25308.jpg) doesn't belong to the current user.
2015-02-24 23:50:25 mw1107 commonswiki: [74d61a81] /wiki/Special:UploadStash/thumb/12ywuf3o87y0.2umkum.25308.jpg/100px-12ywuf3o87y0.2umkum.25308.jpg   HttpError from line 111 of /srv/mediawiki/php-1.25wmf18/includes/specials/SpecialUploadStash.php: This file (12ywuf3o87y0.2umkum.25308.jpg) doesn't belong to the current user.
2015-02-25 00:11:13 mw1089 commonswiki: [984ab362] /wiki/Special:UploadStash/thumb/12yxqfmyh91g.2i8jay.25308.jpg/100px-12yxqfmyh91g.2i8jay.25308.jpg   HttpError from line 111 of /srv/mediawiki/php-1.25wmf18/includes/specials/SpecialUploadStash.php: This file (12yxqfmyh91g.2i8jay.25308.jpg) doesn't belong to the current user.
2015-02-25 00:15:56 mw1149 commonswiki: [b0e9bc18] /wiki/Special:UploadStash/thumb/12yxqi1bbxyg.vxczzv.25308.jpg/100px-12yxqi1bbxyg.vxczzv.25308.jpg   HttpError from line 111 of /srv/mediawiki/php-1.25wmf18/includes/specials/SpecialUploadStash.php: This file (12yxqi1bbxyg.vxczzv.25308.jpg) doesn't belong to the current user.

Is it possible that you got logged out, switched accounts or something like that during your UploadWizard session? The paths seen here show userid 25308 which suggests that they were stored for your "Ragesoss" user account. And according to the code, at the time they were accessed, your userid was different.

Unfortunately rows don't seem to live for very long in the uploadstash table, so I can't see the data for those files right now (which would allow me to check that the us_user value in the DB did correspond to what's seen in the stash url).

Looking at the logs, I see that you seem to be the only person affected on that day, and it hasn't happened at all on other days I picked at random.

Change 194350 had a related patch set uploaded (by Gilles):
Log more information about userid mismatch in upload stash

https://gerrit.wikimedia.org/r/194350

@Gilles I did try pasting those urls into another browser (logged in with a different account, or none at all) while I was testing this bug. But I believe that's the expected behavior, unrelated to the thumbnails not appearing (which I was able to replicate on every one 3 or 4 batch uploads using about 20 jpgs of ~10-20MB each, both from Flickr and from my machine).

Ah ok, so those exceptions were when you were troubleshooting this, but presumably not the issue itself.

Change 194350 abandoned by Gilles:
Log more information about userid mismatch in upload stash

Reason:
Probably not useful, Sage most likely hit those urls when trying to troubleshoot the bug he was encountering.

https://gerrit.wikimedia.org/r/194350

BTW, I just tested this again with a different set of images in the range of 7 to 20MB, and I still get zero thumbnails showing up.

I tried some small files, and these work. I then added a 7 MB file that was one that wouldn't show up in the batch, and it worked. But then a single 20MB file wouldn't work. So this seems to be related to file size, but not in a simple hard-cutoff way.

Given the lack of exception for the actual requests, it's unlikely to be an issue with the request to fetch the thumbnail, or the thumbnail being over the size limit. The duration of your original requests suggests some sort of timeout, maybe at the load balancer level?

It certainly doesn't seem like a good idea in the first place to have a request on the web server holding until the thumbnail is generated and funneled through. It's pretty much commons acting like a private wiki, but with the added handicap of making an internal request to other servers to get the thumbnail generated.

I guess that a simple solution would be to do retries on the JS side and have a shorter timeout on the request to the scalers. Currently it's using the default of 25 seconds. There will still be performance issues with thumbnail taking a long time to show up, but at least they will be less likely to turn into broken thumb images.

It makes sense that there would be a correlation to file size, because thumbnails for those take longer to generate. The longer the generation takes, the more likely the request will be killed because it's taking too long.

The duration of your original requests suggests some sort of timeout, maybe at the load balancer level?

Good point, I didn't notice they have ~10s response times.

As far as I understand how Varnish works, if it receives a HTTP 500 from the backend it would forward that; an 503 means it did not receive any response. So either the backend dies so badly that it just drops the connection (segfault? Linux OOM killer? neither seems likely), or a timeout happens somewhere mid-way (Varnish or the appserver LB).

Maybe @Ragesoss could try to reproduce with the testwiki debug extension, that skips Varnish/LB so if there is a backend error/timeout issue that should become more visible.

There might be additional information in the right logs. (Varnish/LB log? PHP fatal log? apache/hhvm log? syslog if it is indeed an OOM issue?) No clue what and where those are, though.

I guess that a simple solution would be to do retries on the JS side and have a shorter timeout on the request to the scalers.

Note that this does not go to the scalers (which is probably part of the problem). There is not thumbnailing for stashed files, this is a normal web request to the Special:UploadStash wiki page which replies with the full file. That will break for >10M files (T89908), the error message for that issue is less obscure though so this one is at least partially different.

Tgr renamed this task from Thumbnails do not appear during batch upload to Displaying the stashed file as a thumbnail in UploadWizard fails with an 503.Mar 4 2015, 9:48 PM

It does go to the scalers internally through the regular web request, see the outputRemoteScaledThumb function. That internal request has a 25 second timeout (the default in production for curl requests on the internal network). That's why I suspect that something earlier in the chain (varnish, LB, etc.) gives up before the curl request initiated by PHP in the special page request times out cleanly. By lowering the timeout of the request to the image scalers happening in outputRemoteScaledThumb to say, 5 seconds, we should be able to verify that theory.

I tried it again with that testwiki debug extension yesterday.

Give me a ping if there's anything else I can help with, testing-wise.

Change 194823 had a related patch set uploaded (by Gilles):
Lower timeout of upload stash -> image scaler requests

https://gerrit.wikimedia.org/r/194823

Maybe related to T91468 (HHVM buffering "streamed" files into memory before actually sending them), although 10M seems too small to trigger OOM through that bug.

Change 194823 merged by jenkins-bot:
Lower timeout of upload stash -> image scaler requests

https://gerrit.wikimedia.org/r/194823

Change 197284 had a related patch set uploaded (by Gilles):
Lower timeout of upload stash -> image scaler requests

https://gerrit.wikimedia.org/r/197284

Change 197287 had a related patch set uploaded (by Gilles):
Lower timeout of upload stash -> image scaler requests

https://gerrit.wikimedia.org/r/197287

Change 197284 merged by jenkins-bot:
Lower timeout of upload stash -> image scaler requests

https://gerrit.wikimedia.org/r/197284

Change 197287 merged by jenkins-bot:
Lower timeout of upload stash -> image scaler requests

https://gerrit.wikimedia.org/r/197287

Here's an easier set of reproduction/diagnostics steps:

  • upload (but don't publish) some images in UploadWizard
  • go to Special:UploadStash
  • copy the URL of one of the links (which will look like https://commons.wikimedia.org/wiki/Special:UploadStash/file/<stash key>.png)
  • visit https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/<stash key>.png/100px-<stash key>.png

After some delay, you get an 503 with the standard WMF error page. For me, the delay is a little less than 5 sec.

The delay does not depend on the size of the original file nor on the size of the thumbnail. Interestingly, if I request a thumbnail that's >1MB, I get a proper error page after roughly the same delay, which suggests that the breakage is after calling the scalers.

This also exposed some unrelated bugs:

  • the stashed file is rescaled every time, even if the thumbnail exists already from a previous request.
  • thumbload generation works completely differently on wikis where UW is not enabled (see the definition of $wgUploadStashScalerBaseUrl in CommonSettings.php)
  • sometimes, there are two stashed images per uploaded file
In T90599#1130482, @Tgr wrote:
  • the stashed file is rescaled every time, even if the thumbnail exists already from a previous request.

That part works fine on testwiki actually, as demonstrated below. Testwiki is still plagued by the 503, but the waiting time isn't always around 5 seconds. It's actually reasonably fast for the example below.

I've put together this code that I ran on terbium using testwiki as the wiki:

$class = new ReflectionCLass('SpecialUploadStash');
$property = $class->getProperty('stash');
$property->setAccessible(true);
$user = User::newFromName( 'GDubuc (WMF)' );
$uploadstash = new SpecialUploadStash();
$property->setValue($uploadstash, RepoGroup::singleton()->getLocalRepo()->getUploadStash( $user ));
$uploadstash->showUpload('thumb/130uc04gr570.a7hs2y.24671.svg/70px-130uc04gr570.a7hs2y.24671.svg.png');

It's an image that belongs to me in the stash. I've already generated the 70px and the 71px thumbnails that way. You can try running this code and you'll see that re-requesting a size that's already been generated is instantaneous whereas requesting a new one takes a bit of time.

And the bad news is that the code works just fine in eval.php... Either things behave differently in CLI and web requests, or the issue isn't at the PHP level.

The 503 for that very URL which the manual code above reproduces: https://test.wikipedia.org/wiki/Special:UploadStash/thumb/130uc04gr570.a7hs2y.24671.svg/70px-130uc04gr570.a7hs2y.24671.svg.png

As a side note, loading the original works fine in the stash: https://test.wikipedia.org/wiki/Special:UploadStash/file/130uc04gr570.a7hs2y.24671.svg

Obviously, if my stash item expires by the time you read this, all you have to do is to make a new one with an svg or something and adapt the code and urls.

Waiting for the HHVM update, as it's a possible cause. The rest of my investigations hit dead-ends so far.

MarkTraceur lowered the priority of this task from Medium to Low.Dec 3 2015, 3:52 PM

I don't know if this got solved by the HHVM update, but I haven't heard anything for a while, so I'll lower the priority and if anyone comes along who knows anything, we can reevaluate.

So I managed to reproduce this bug, and it looks like a simple timeout. We could raise the timeout (which would only mean some of the thumbnails would be fixed, not all of them) or we could accept this as a fact of life.

The other solution, getting the thumbnail from Flickr, would take a bit more refactoring than I think I'm comfortable dealing with currently. Perhaps in the next round, we can accomplish it.

Krinkle claimed this task.
In T90599#1130449, @Tgr wrote:

Here's an easier set of reproduction/diagnostics steps:

  • upload (but don't publish) some images in UploadWizard
  • go to Special:UploadStash
  • copy the URL of one of the links (which will look like https://commons.wikimedia.org/wiki/Special:UploadStash/file/<stash key>.png)
  • visit https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/<stash key>.png/100px-<stash key>.png

After some delay, you get an 503 with the standard WMF error page. For me, the delay is a little less than 5 sec.

These steps now work for me, with a 500K PNG. Closing for now per inactivity and many changes. Including the change from Imagescalers to HHVM, and then to Thumbor, and now PHP 7. If this is still an issue, it should be filed as a new task.