Long running thumbnail requests locking up Thumbor instances
Closed, ResolvedPublic

Description

There have been pages from thumbor.svc (and recoveries shortly afterwards) lately, digging further this seems to be caused by an influx of long-running requests to thumbor locking up all available instances, recapping:

  • Influx of "heavy" requests hits thumbor
  • All thumbor instances are busy processing these requests
  • nginx can't find a free thumbor instance to send requests to and emits "502 bad gateway"
  • pybal depools the affected thumbor machine; if too many machines are busy at once, the depool threshold is hit
  • At the same time icinga checks thumbor.svc and gets a 502 back from the remaining machines, which can't be depooled anymore
  • Things eventually recover, either because the long-running requests succeed and the thumbs are now in swift, or because they fail and the failed-request rate limit kicks in
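
For the record, a quick way to see the 502 spike on an affected host is to count them per minute from nginx's access log; the log path and field positions below are assumptions about the log format, adjust as needed:

# count 502 responses per minute on a thumbor host during the window
# (assumes a combined-style log: status in field 9, timestamp in field 4)
awk '$9 == 502 {print substr($4, 2, 17)}' /var/log/nginx/access.log | uniq -c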

Event Timeline

AFAICT from the thumbor dashboard, at the time of the outage it was the ghostscript engine (and thus PDF processing) whose request time spiked.

Screen Shot 2017-08-09 at 14.59.35.png (268×977 px, 55 KB)

Below is a list of the top 20 files that failed to get converted today; unsurprisingly there are lots of PDFs.
Checking the first file, it seems ghostscript hangs even though the PDF is only 45MB. Next I'll try the conversion with stretch's ghostscript and see if the behaviour is the same.

grep -i "get_image failed" /srv/log/thumbor/thumbor.error.log | awk '{print $15}' | sort | uniq -c | sort -nr | head -20

53 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.58/5/58/Italiani_Illustri.pdf`.
52 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.f1/f/f1/Map_of_Texas_highlighting_Ellis_County.svg`.
52 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.44/4/44/Wilmettesuburban00mulf.pdf`.
51 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.2e/2/2e/Map_of_Alabama_highlighting_Jefferson_County.svg`.
51 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.16/1/16/Map_of_Texas_highlighting_Armstrong_County.svg`.
50 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.5f/5/5f/Map_of_Texas_highlighting_Kerr_County.svg`.
49 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.fd/f/fd/Map_of_Texas_highlighting_Henderson_County.svg`.
49 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.bc/b/bc/Map_of_Texas_highlighting_Delta_County.svg`.
49 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.ba/b/ba/Map_of_Texas_highlighting_Guadalupe_County.svg`.
49 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.2b/2/2b/Map_of_Texas_highlighting_Lipscomb_County.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-en-local-public.d3/d/d3/Dishonored_2_Corvo_vs_Emily_Gameplay.webm`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.fc/f/fc/Ladies%27_Home_Journal_Vol.14_No.12_%28November%2C_1897%29.pdf`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.e8/e/e8/Map_of_Texas_highlighting_Orange_County.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.e4/e/e4/Map_of_North_Carolina_highlighting_Onslow_County.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.bd/b/bd/Map_of_Michigan_highlighting_Jackson_County.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.a0/a/a0/Map_of_Virginia_highlighting_Prince_William_County.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.8e/8/8e/Map_of_Oregon_highlighting_Multnomah_County.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.57/5/57/Tree_edges.svg`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4f/4/4f/Ladies%27_Home_Journal_Vol.14_No.04_%28March%2C_1897%29.pdf`.
48 `https%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.2c/2/2c/Map_of_Virginia_highlighting_Arlington_County.svg`.

I did some load testing yesterday, about 2 hours before you filed this task, that caused 502s: I purposefully hit Thumbor with a lot of concurrent requests for uncached thumbnails of a large JPG. I was seeing a mix of 502s (the majority) and 429s in the responses. I thought that was an acceptable outcome, but clearly it isn't. We probably want to lower our per-IP throttling values, or revisit the concept of throttling files we expect to be expensive (which is something MediaWiki did). That would have expensive files return 429s in cases like this, in favor of letting "normal" files go through.

Do you have a list of pages with times? How often does it happen organically?

> Do you have a list of pages with times? How often does it happen organically?

Good question. I don't have a list of files and timings offhand, but I bet it can be extracted from the nginx access logs.
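
Something along these lines should pull a rough list out of the access logs, assuming $request_time is logged as the last field and the request URL as the 7th (adjust both to the actual log format):

# list the 20 slowest requests with their timings from the nginx access log
awk '{print $NF, $7}' /var/log/nginx/access.log | sort -rn | head -20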

FTR this happened again last night (UTC). I'm currently working on having thumbor run on stretch in T170817, which will also bring a newer gs; in my quick experiments I couldn't reproduce the lockup we've seen with the files above. That, together with the per-filetype throttling @Gilles mentioned, should help with mitigation.
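
For the record, by "quick experiments" I mean rendering the first page of one of the offending PDFs straight through gs, along these lines (the flags approximate what a thumbnailing pipeline passes, not thumbor's exact invocation):

# render page 1 of the suspect PDF to a JPEG; the wall-clock timeout makes a hang obvious,
# so the same command can be compared between jessie's and stretch's gs
timeout 60 gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 \
  -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/page1.jpg Italiani_Illustri.pdf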

I think that the current per-IP PoolCounter limits are just too generous. A single user can hog up to 32 workers right now. IMHO what matters for a given user is the size of their queue; they can afford to wait if there are a lot of thumbnails on the page that need rendering.

Expensive file types are already capped at 16 workers max altogether (I forgot that we had that already). The outage I caused was when rendering large JPGs, which aren't capped in that way. I assume that rendering 32 of those concurrently was just too much. After all there are 160 thumbor processes, so that's potentially a single IP hogging 20% of our capacity!

I'm going to drastically reduce the per-IP capacity to 4 workers. That's 2.5% of our current concurrency capacity, which sounds a lot more reasonable. After the config change has gone through, I'll run a stress test again to see if it still triggers 502s or not.
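
For context, every render first has to acquire the per-IP PoolCounter lock; the exchange looks roughly like this (the key name is a placeholder, and the maxqueue/timeout values are only illustrative, not the exact production settings):

ACQ4ME thumbor-ip-<client-ip> 4 500 8    # key, workers, maxqueue, timeout (seconds)
LOCKED                                   # granted: at most 4 concurrent renders for this IP
... render the thumbnail ...
RELEASE thumbor-ip-<client-ip>
RELEASED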

Gilles triaged this task as Medium priority.
Gilles added a project: Performance-Team.

Change 372054 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Reduce the per-IP concurrency limit in Thumbor

https://gerrit.wikimedia.org/r/372054

Change 372054 merged by Filippo Giunchedi:
[operations/puppet@production] Reduce the per-IP concurrency limit in Thumbor

https://gerrit.wikimedia.org/r/372054

I still triggered 502s; that wasn't sufficient.

The script used (file is purged before proceeding): P5889
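
Roughly the shape of the test, for anyone following along (this is a sketch, not the actual P5889 paste; the URL is a placeholder and the purge step is elided):

# after purging the file's thumbnails, request ~2000 distinct uncached widths of the
# same original with 200 concurrent clients and tally the response codes
THUMB="https://upload.wikimedia.org/wikipedia/commons/thumb/<hash>/<file>"   # placeholder
seq 100 2100 | xargs -P 200 -I{} \
  curl -s -o /dev/null -w "%{http_code}\n" "$THUMB/{}px-<file>.jpg" \
  | sort | uniq -c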

I wonder whether issuing 429s isn't blocking for too long (if the timeout config value is in seconds, it suggests that with a queue of 500, all of these 200 concurrent requests will wait 8 seconds before 429ing?).

What I need to verify is whether a thumbor process is truly blocking while waiting on a poolcounter lock for timeout seconds and unable to process other requests (queueing them, I guess). If that's the case, then the thumbor processes waiting for their turn are hitting the worst case scenario where they're waiting more than 8 seconds for that lock, since the actual processes rendering those thumbnails for the given IP-based lock take a while.

In this particular scenario I think what might make things worse is that it's the same IP and the same original, meaning that the 2 specialized locks are hit one after the other. If you were lucky enough to grab a per-IP lock, you need to be even more lucky to grab a per-original lock (there are only 2 of those at any given time). Meaning that if thumbor is indeed blocked on the poolcounter timeout value, you have a decent likelihood of waiting the worst case (16 seconds) before being given a 429.

What I also need to figure out is under what conditions nginx 502s. Is it when the thumbor backends won't accept the request? Won't start sending a response fast enough? Because from Thumbor's perspective I'm pretty sure that other requests happening at the same time should complete eventually, they might just be delayed by 16 seconds if all the available processes hit the worst case scenario above.

Some reassuring news: I've tested a specific lock (per-original) and the event-based, async-like behavior of thumbor works perfectly:

  • I telnet to poolcounter directly and lock a given original's thumbor per-original lock with a worker count of 1 and a timeout of 60 seconds. I leave that connection open, blocking the lock.
  • I request a thumbnail for the same original with the same worker/timeout settings. The client hangs, waiting for a response.
  • From the Vagrant machine (not going through nginx, since I wouldn't control which thumbor instance I hit), I hit that thumbor instance directly and ask it for another thumbnail (different original); it renders it just fine.
  • Once the 60 seconds have elapsed, the client that was waiting on the lock gets a 429 (actually a 503 for me, but thumbor reported sending a 429).
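
For reference, the raw exchange looks roughly like this (PoolCounter's line protocol is ACQ4ME <key> <workers> <maxqueue> <timeout>; the key name, maxqueue value and poolcounter host/port here are placeholders):

# terminal 1: grab and hold the per-original lock (1 worker, 60s timeout), then keep
# the telnet connection open so the lock stays held
$ telnet <poolcounter-host> 7531
ACQ4ME thumbor-original-<key> 1 100 60
LOCKED

# terminal 2: ask for the same lock; it queues behind terminal 1 and eventually times out
$ telnet <poolcounter-host> 7531
ACQ4ME thumbor-original-<key> 1 100 60
TIMEOUT    # after ~60s; thumbor turns this into a 429 for the waiting client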

I'll continue the investigation focusing on the relationship between nginx and thumbor.

I've filed a revert of the 1-connection-per-backend change: https://gerrit.wikimedia.org/r/#/c/372199/. On Vagrant, while reproducing the conditions above, adding the max_conns setting causes 502s instantly; without it, a given thumbor instance can switch between several concurrent requests just fine via yielding. In essence, a given thumbor process is capable of serving concurrent requests and therefore concurrent connections.
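
Concretely, the setting in question is the max_conns parameter on the upstream servers; a minimal sketch of the problematic variant (ports and instance count are placeholders, not the production config):

upstream thumbor {
    # with max_conns=1, nginx won't hand a second in-flight request to a backend, so once
    # every instance holds one (possibly parked) request, new requests get a 502
    server 127.0.0.1:8800 max_conns=1;
    server 127.0.0.1:8801 max_conns=1;
}
# the revert goes back to plain "server 127.0.0.1:8800;" entries, letting one thumbor
# process hold several connections, which works because it yields while waiting on poolcounter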

I expect that once the max_conns change is reverted, my abuse script shouldn't cause 502s anymore, since yielding on poolcounter is very fast.

I think that lowering the per-IP worker concurrency should have been enough to fix the original organic occurrences. And if not, we can reduce the worker count for expensive files as well.

Lastly, if users complain of a higher rate of 429s with the lower worker count we've introduced, we might actually want to increase the timeout value in the per-IP poolcounter settings. This will give "parked" requests more time to get a slot before 429ing, the only overhead being connections kept open.

Running the stress test again, requesting about 2000 uncached thumbnails of the same image with a concurrency of 200 requests, I got:

  • 25 200s
  • 1960 429s
  • 6 503s
  • 0 502s

That's a lot better than before, when 502s were the most common response.

The 503s came quite late, they were among the last responses; possibly they were issued when hitting proxy_read_timeout?

> That's a lot better than before, when 502s were the most common response.

Indeed, thanks for the investigation!

> The 503s came quite late, they were among the last responses; possibly they were issued when hitting proxy_read_timeout?

I believe in that case the 503 might be varnish hitting its first_byte_timeout (35s for varnish upload now)

Change 372517 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: bump ProxyFetch timeout for thumbor

https://gerrit.wikimedia.org/r/372517

Change 374204 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: increase check timeout

https://gerrit.wikimedia.org/r/374204

Change 374204 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: increase check timeout

https://gerrit.wikimedia.org/r/374204

Change 372517 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: bump ProxyFetch timeout for thumbor

https://gerrit.wikimedia.org/r/372517