
ThumbnailRender job fails with 429 errors
Closed, Resolved (Public)

Description

ThumbnailRenderJob occasionally fails with 429 errors, as seen in Logstash.

The absolute maximum concurrency for the job execution is 20, which could probably be lowered to 10, but that's on the edge of not being able to clear any backlog that might occur.

I'm not particularly sure where the 429s come from exactly, but my understanding is that they originate from too many requests for thumbnails coming from a single job runner host. Since all the job runners are now behind an LVS, we have no control over which hosts the jobs will end up on, so any concurrency higher than the number of Thumbor workers (4) is not safe.

Is my assumption correct that PoolCounter throttling is the reason for the 429s? Do we need PoolCounter throttling for job queue requests at all? Should we try decreasing concurrency and see what happens, or remove/increase the throttling for job queue requests?

Revisions and Commits

Event Timeline

There can be multiple reasons for throttling 429s, detailed here: https://wikitech.wikimedia.org/wiki/Thumbor#Throttling

It could be hitting the per-IP limit indeed.

It's true that we could make an exception for job queue requests, with a shared secret key passed as a header to bypass all throttling checks, for example.
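
Something along these lines, just to sketch the idea (the header name, config key and wiring into Thumbor are made up for illustration):

```
# Hypothetical sketch of the proposed bypass: if the job runner sends a
# pre-shared secret in a request header, skip all throttling checks.
import hmac

THROTTLE_BYPASS_SECRET = "change-me"  # would come from private config

def should_bypass_throttling(headers):
    """Return True when the request carries the shared secret."""
    provided = headers.get("X-Throttle-Bypass", "")
    # Constant-time comparison to avoid leaking the secret via timing.
    return hmac.compare_digest(provided, THROTTLE_BYPASS_SECRET)

# In the throttling code path (pseudo-wiring):
# if should_bypass_throttling(request.headers):
#     skip per-IP, per-original and failure throttles for this request
```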

However, the current situation ensures that, should this job see a spike, it's regulated. At the end of the day, ThumbnailRenderJob is an optimization: it pre-renders thumbnails that users will likely want to view soon. If it fails because it's creating too many requests, that's a good thing, as it prevents a cascading failure if the job runners start acting up.

If it is indeed the per-IP limit, it also means that the end user will get those same thumbnails fine when they try to access them, as the request will come from the user's IP.

So maybe we should just have a higher per-IP limit for the jobrunner IPs, if their list or their range is available in Puppet variables.

Gilles triaged this task as Medium priority.
Gilles added a project: Thumbor.

Hmmm, actually I'm not sure anymore that it's the per-IP throttle kicking in, since Thumbor only throttles by IP if the X-Forwarded-For header is present in the request. Unless the job runner sets it somehow?

In the throttling messages found in Logstash, I'm seeing a lot of this IPv6 address (it accounts for roughly half of the per-IP throttles):

2620:0:861:1:208:80:154:22

Which is... webproxy.eqiad.wmnet

Basically, the MediaWiki web proxy is adding itself to the X-Forwarded-For header for the requests it proxies, which is expected of proxies. Thumbor sees that, assumes it's the client IP (it's the only thing in the header anyway) and throttles by it.

The most future-proof thing is probably to filter out IP prefixes that belong to us (10.*, 2620:0:860:* etc.).
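
Roughly something like this on the Thumbor side (a minimal sketch; the prefix list is illustrative and in practice the ranges would come from Puppet):

```
# Filter our own infrastructure out of X-Forwarded-For before picking an
# IP to throttle on. Prefix list is illustrative, not exhaustive.
import ipaddress

INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("2620:0:860::/46"),  # assumed production IPv6 range
]

def client_ip_for_throttling(x_forwarded_for):
    """Return the first non-internal IP in the header, or None."""
    for candidate in (p.strip() for p in x_forwarded_for.split(",")):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # ignore garbage entries
        if not any(addr in net for net in INTERNAL_NETWORKS):
            return str(addr)
    return None  # only internal hops in the header: don't throttle by IP

# e.g. client_ip_for_throttling("2620:0:861:1:208:80:154:22") -> None,
# so the webproxy address above would no longer be throttled on.
```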

Gilles added a revision: Restricted Differential Revision.Aug 30 2018, 9:54 AM

Which is... webproxy.eqiad.wmnet

Out of curiosity - why is it going through the web proxy?

All HTTP requests coming out of Mediawiki in production go through the web proxy by default, afaik.

Maybe we can make that particular job skip the proxy, since it's pointing to an internal domain. Is that what you're getting at? The proxy might be necessary even in that case, I'm not sure.

All HTTP requests coming out of Mediawiki in production go through the web proxy by default, afaik.

Hm... I thought that was true only for requests to the outside world, not within wmnet?

Apparently it does in that case, maybe something's not acting the way it should?

Hm, maybe that's not the case. I went into the logs again and compared the Thumbor logs and the JobQueue logs, and all of the 429s have a PoolCounter key like thumbor-render-<some-hash>. I guess that's the throttling for POOLCOUNTER_CONFIG_PER_ORIGINAL? Why are we hitting limits there? Is the job requesting too many thumbnails?

Indeed, I've also cross-referenced job failures that hit POOLCOUNTER_CONFIG_PER_ORIGINAL.

In the case I saw, the thumbnail was an expensive one to render. What I think is happening is that the per-original concurrency limits are too low when thumbnails are slow to render.

The configuration is indeed very conservative: https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/thumbor/templates/server.conf.erb#L168-L172

This means that the queue can grow very large, but it doesn't get a chance to in this case, because a request only waits 8 seconds for an active slot to free up. In other words, when a thumbnail takes more than 8 seconds to render for a given file, the prerender job requests 6 sizes concurrently and only 2 of those will get a chance to succeed. That's the extreme case; in others only a couple will fail.

We can increase the waiting time and/or the concurrency. I think the most conservative thing to do here is to increase the concurrency to match the number of thumbnails the prerender job requests. Those bursts are expected and should work, even if individual thumbnails take a while to render.
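
For reference, the knob in question looks roughly like this (the exact current values live in the linked server.conf.erb; the key names follow PoolCounter's acquire parameters and the numbers here are illustrative):

```
# Illustrative sketch of the per-original throttle. Keys mirror PoolCounter's
# acquire parameters: concurrent workers, maximum queued waiters, and how long
# a request waits for a slot before giving up with a 429.

# Before: 2 concurrent renders per original, 8s wait for a free slot.
POOLCOUNTER_CONFIG_PER_ORIGINAL = {
    'workers': 2,
    'maxqueue': 100,  # assumed value, queue size isn't the limiting factor here
    'timeout': 8,
}

# After: allow as many concurrent renders as the prerender job requests sizes
# (6 in the scenario above), so an expected burst of prerender requests for
# one file no longer trips the throttle even when renders are slow.
POOLCOUNTER_CONFIG_PER_ORIGINAL = {
    'workers': 6,
    'maxqueue': 100,
    'timeout': 8,
}
```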

Change 456604 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Increase per-original thumbnail throttle for prerender

https://gerrit.wikimedia.org/r/456604

I grepped the logs on webproxy.eqiad.wmnet and found only a single IP requesting thumbs from upload.wikimedia.org over HTTP, 2620:0:861:102:1a03:73ff:fef1:5882, which is ruthenium.eqiad.wmnet, aka the Parsoid test host. For HTTPS, the hosts using CONNECT to upload.wikimedia.org are analytics hosts.

Change 456604 merged by Filippo Giunchedi:
[operations/puppet@production] Increase per-original thumbnail throttle for prerender

https://gerrit.wikimedia.org/r/456604

Mentioned in SAL (#wikimedia-operations) [2018-09-03T07:55:46Z] <godog> roll restart thumbor to apply latest config changes - T203135 T20871

I see a lot of request timeouts in the job failures. It's likely that MediaWiki's HTTP client doesn't wait as long as thumbnails can take to render in practice. It currently seems to default to wgHTTPTimeout, which is 25 seconds. Definitely too short, as the various thumbnail timeouts combined can add up to minutes.

Now, it does seem wasteful for the job to wait so long for thumbnails to render when it could just send the requests blindly, but at the same time waiting for the outcome is what let us discover the throttling issue. I'm a bit unsure what to do here: is it worth making the job wait for thumbnails that take a very long time to render, in order to catch potential issues, or should we go back to the original intent of this job, which is to warm up thumbnails in a fire-and-forget fashion? In that model the outcome of the thumbnail generation doesn't matter, as it will be retried later via organic traffic if it failed at upload time.

I'm actually tempted to reduce the timeout to 1s and stop making an HTTP timeout an error return for the job, treating those as successful job runs instead.
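
To illustrate the idea (just a sketch of the proposed behaviour, not the actual MediaWiki change, which lives in ThumbnailRenderJob.php):

```
# Illustrative sketch of the fire-and-forget behaviour: use a very short
# HTTP timeout and treat a timeout as success, since the render keeps going
# server-side and organic traffic will retry if it failed.
import requests

def prerender_thumbnail(thumb_url, timeout=1.0):
    """Return True unless Thumbor reports an actual error."""
    try:
        response = requests.get(thumb_url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Slow render, not a failure: the job is only warming the cache.
        return True
    # 429s and real render errors still surface as failures, for visibility.
    return response.status_code == 200
```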

We're still getting 429s. This time I cross-referenced them and it's the per-IP throttle (IPv4 and not IPv6 this time):

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.09.03/logback?id=AWWfHtfGN6mdcdu0WpEu&_g=h@66534ad

Which means that my earlier patch ignoring local IPs is needed after all.

Re: timeouts - for videoscaler jobs we wait, and it's super easy to increase the timeout. On the other hand, videoscaling is different in the sense that it's not pre-caching: if the job fails, the video at that particular scale is simply not available.

You're a domain expert, I will leave the decision to you. Either approach is possible and easy to make happen.

Change 457964 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Don't wait for thumbnails that take long to render

https://gerrit.wikimedia.org/r/457964

Change 457964 merged by jenkins-bot:
[mediawiki/core@master] Don't wait for thumbnails that take long to render

https://gerrit.wikimedia.org/r/457964

The volume of errors for this job has gone way down. I've looked at what's left and they're legitimate errors where the files can't render. I think the job might be hitting the 429 for failed renders either because the person is uploading the same file multiple times, or because they've accessed the thumbnails from their client before the job reaches that particular file. The client requests make the failure counter go up, and then the job is the one-too-many attempt to render it.
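
To spell out that scenario, a rough sketch of how I understand the failure throttle (the real mechanism is described on the wikitech page linked earlier; the backend, key format and threshold here are assumptions):

```
# Rough sketch: each failed render of an original bumps a counter, and once
# the counter passes a threshold, further attempts get a 429.
import time

FAILURE_THRESHOLD = 3   # assumed value
FAILURE_WINDOW = 3600   # seconds, assumed value

_failures = {}  # original file -> (failure count, first failure timestamp)

def record_failure(original):
    count, since = _failures.get(original, (0, time.time()))
    if time.time() - since > FAILURE_WINDOW:
        count, since = 0, time.time()  # window expired, start over
    _failures[original] = (count + 1, since)

def should_reject(original):
    """True when this original has failed too often recently (-> 429)."""
    count, since = _failures.get(original, (0, 0))
    return count >= FAILURE_THRESHOLD and time.time() - since <= FAILURE_WINDOW

# In the scenario above: the uploader's own browser requests rack up the
# failures first, and the prerender job's request is the one that crosses
# FAILURE_THRESHOLD and gets the 429.
```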

@Pchelolo is this acceptable? Or do you want the job to not produce errors even in that edge case?

@Gilles these sound like legit errors. Should we ignore 429 errors and not retry the job then?

Can this job be configured to not retry at all? I think that would be ideal. At the end of the day it's a warmup script and it's fine for it to fail.

It's still interesting to see the remaining 429s even if we do that (and there will be some, based on the scenario I've described): if the amount grows unexpectedly, it will be a sign of a problem. If we ignore them completely, we lose that visibility.

Can this job be configured to not retry at all? I think that would be ideal. At the end of the day it's a warmup script and it's fine for it to fail.

Sure thing. Will make a patch shortly.

Change 466663 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Do not retry the ThumbnailRenderJob.

https://gerrit.wikimedia.org/r/466663

The patch above will make the retries stop, but will preserve all the logging.

Change 466663 merged by jenkins-bot:
[mediawiki/core@master] Do not retry the ThumbnailRenderJob.

https://gerrit.wikimedia.org/r/466663

Change 466802 had a related patch set uploaded (by Krinkle; owner: Ppchelko):
[mediawiki/core@wmf/1.32.0-wmf.24] Do not retry the ThumbnailRenderJob.

https://gerrit.wikimedia.org/r/466802

Change 466802 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.24] Do not retry the ThumbnailRenderJob.

https://gerrit.wikimedia.org/r/466802

Mentioned in SAL (#wikimedia-operations) [2018-10-11T23:05:40Z] <krinkle@deploy1001> Synchronized php-1.32.0-wmf.24/includes/jobqueue/jobs/ThumbnailRenderJob.php: T203135 - Ib4640eb13ca93f (duration: 00m 49s)