Now that we have the more detailed logs, I'm going to review the hits for the thumbnail stack to check whether there are unknown issues surfaced by the log.
| Resolved | Gilles | T185466 Review Varnish slow log for upload.wikimedia.org |
| Resolved | Gilles | T185478 Measure time spent by Thumbor connecting to and reading from Swift |
| Resolved | Gilles | T185479 Lower Thumbor SUBPROCESS_TIMEOUT to 59 seconds to avoid tripping Varnish slow log |
| Resolved | Gilles | T185483 Measure time spent by Thumbor connecting to and reading from Memcache |
| Resolved | Gilles | T185485 Measure time spent by Thumbor connecting to and reading from Poolcounter |
| Resolved | Gilles | T186594 Log time when request makes it to Thumbor |
Starting with the 404s, it's absolutely dominated by a specific thumbnail URL which consistently takes almost exactly 95 seconds:
Opening that thumb URL right now, however, it 404s immediately. It's possible that it's just a 404 thumb URL with very high traffic, and it sometimes hits an edge case that makes it take a lot of time.
Something very peculiar is that the user agent requesting it is very consistent as well on the slow log hits:
Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G930F Build/MMB29K)
Which is an Amazon Fire TV Stick...
Based on the name, it seems like it's this logo. I'm going to venture a guess that this thumbnail is hotlinked from an app for the Fire TV (probably this one), which would explain the very high traffic to that particular 404. This probably confirms that it shows up a lot in the slow log because it gets a ton of traffic, not because anything special makes it occasionally very slow.
Why always around 95 seconds, though? That feels like the sum of different timeouts at first glance.
95 seconds is below the 180-second proxy_read_timeout set for nginx, which suggests that it's not nginx timing out, but something behind it.
Looking at our Thumbor code, we have SWIFT_CONNECTION_TIMEOUT to establish the connection to Swift. In production that's currently set to 20 seconds. Then we call get_object, which, if I'm following the whole trail of code correctly, doesn't have any timeout (the underlying requests library won't set a default timeout). This might suggest that the timeout comes from Swift (proxy? object server?). I'm going to defer to @fgiunchedi on figuring out what timeouts in Swift might kick in for such a situation.
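Since the read side of get_object carries no timeout in our case, a hang there is unbounded even though the connect side is capped at 20 seconds. A minimal stdlib sketch of that distinction (function names and the toy server are illustrative, not Thumbor's actual code):

```python
# Illustrative sketch (not Thumbor's code): a connect timeout alone doesn't
# protect against a server that accepts the connection and then stalls;
# only a read timeout bounds the wait for the response.
import socket
import threading
import time

def stalled_server(port_holder, ready):
    # Simulates a Swift proxy that accepts the connection but never
    # sends a byte of response.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    time.sleep(5)  # hold the connection open, silently
    conn.close()
    srv.close()

def fetch_with_read_timeout(port, read_timeout):
    # The connect timeout bounds connection establishment (analogous to
    # SWIFT_CONNECTION_TIMEOUT); settimeout() then bounds each read.
    c = socket.create_connection(("127.0.0.1", port), timeout=1)
    c.settimeout(read_timeout)
    try:
        c.recv(1)
        return "got data"
    except socket.timeout:
        return "read timed out"
    finally:
        c.close()
```

Without the settimeout() call, recv() would block indefinitely, which matches the hypothesis that a get_object read can hang until something on the Swift side gives up.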
Next, the 500s. PDFs dominate these, always slightly over 60 seconds, with a response-gs-time just over 60 seconds as well. I saw some SVG examples too, this time with response-rsvg-convert-time just over 60 seconds.
This is simply our command timeout (SUBPROCESS_TIMEOUT) kicking in and killing the underlying command on the minute mark. Expected behavior. Maybe we should lower it to 59 seconds to avoid hitting the Varnish slowlog?
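The behavior is the same as Python's subprocess timeout handling: kill the child once the deadline passes. A hedged sketch (the function name and structure are mine, not Thumbor's engine code):

```python
import subprocess
import sys

# Hypothetical value: just under Varnish's 60s slow-log threshold.
SUBPROCESS_TIMEOUT = 59

def run_converter(cmd, timeout=SUBPROCESS_TIMEOUT):
    # Run a conversion command (gs, rsvg-convert, ...) and kill it if it
    # exceeds the timeout, so a stuck render can't hold a worker forever.
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # surfaces to the client as a 500
```

subprocess.run() kills and reaps the child itself when the timeout expires, so nothing is left behind.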
200s. The majority of the examples I can find aren't rendered by Thumbor, but merely fetched from Swift. And since, as described before, there is no timeout in Thumbor for Swift reads, those complete eventually. It's possible that they failed at the Swift proxy level because they hit some timeout there, then proceeded to be processed by Thumbor because they were "missing" from Swift.
It would be interesting to know why Swift can take more than a minute to respond with a thumbnail it has in storage.
When looking for log entries that were actually rendered by Thumbor, none of them is explained by the time captured by response-Processing-Time, which is always less than 2 seconds. This suggests that the time is spent elsewhere (it could still be the Swift client, looking for a long time before 404ing and letting Thumbor render the thumbnail).
429s. This one can't be attributed to Swift, since the attempt to fetch from Swift happens after the throttling checks. The Poolcounter timeouts are set to 8 seconds. Time spent connecting to and reading from Memcache and Poolcounter aren't currently measured and should be.
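Adding those measurements could be as simple as wrapping each connect/read in a timer. A sketch (the metric names and the dict sink are placeholders for whatever response headers or statsd counters we'd actually emit):

```python
import time
from contextlib import contextmanager

# Placeholder sink; in production this would feed response headers
# (like response-Processing-Time) or statsd, not a module-level dict.
timings = {}

@contextmanager
def timed(metric):
    # Record wall-clock time for a named stage, e.g. "memcache-connect",
    # "memcache-read", "poolcounter-connect", "poolcounter-read".
    start = time.monotonic()
    try:
        yield
    finally:
        timings[metric] = time.monotonic() - start
```

Usage would look like `with timed("poolcounter-read"): response = pc.read()`, leaving the instrumented code path otherwise untouched.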
For the record: It was uploaded to enwiki in July 2010 at https://en.wikipedia.org/wiki/File:BYUtv_2010_logo.svg, and deleted October 2017 in favour of a copy on Commons at https://commons.wikimedia.org/wiki/File:BYUtv_logo.svg.
Going back to this now that we have more logging information: on an example of a 404 from the Varnish backend's perspective, the time isn't spent in Memcache, Poolcounter, or Swift. response-request-date is from nginx's perspective; it might be interesting to log (if possible) when Thumbor becomes aware of the request, to see whether this is an issue of nginx or Thumbor request queueing.
Example of a request that thumbor "receives" a minute after nginx records its request time: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.02.12/logback?id=AWGKqY2xWMYqG9Ui5XTw&_g=() Oddly, this seems to happen only with 404 requests. One would think that if this were a general issue, it would appear in 500s as well, but it doesn't. And it's always almost exactly a one-minute difference for 404s. My first instinct would be that nginx retries 404s. The nginx config is such that 404s shouldn't be retried, though:
proxy_next_upstream error timeout invalid_header http_502 http_504;
proxy_next_upstream_tries 2;
I'm tempted to bump the only 2 nginx timeouts that currently default to 60s (we override proxy_read_timeout to 180s):
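Assuming those two are proxy_connect_timeout and proxy_send_timeout (the proxy_* timeouts whose nginx default is 60s once proxy_read_timeout is overridden), the bump might look something like:

```nginx
# Hypothetical values, chosen only to move away from the 60s mark so any
# resulting shift in delay becomes recognizable in the slow log.
proxy_connect_timeout 75s;
proxy_send_timeout    75s;
```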
If the delay seen in 404s changes to match the new value of either of these, it would be confirmation that they're related to the issue.
While 404s aren't meant to be retried, a request that hits one of those timeouts would be retried: it isn't a 404 yet if it times out before a response status code comes back.
Now that 200s log the Thumbor-Request-Date, I've found an example of a 200 with 40s of delay between nginx and thumbor: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.02.13/logback?id=AWGOuA7cWMYqG9UiQxzA&_g=()
OK, it's definitely not the nginx timeouts. 404s and 200s are affected, now that I can see 200s. And I'm seeing now that it's consistently 59 seconds, not 60 seconds.
I think this is pretty clearly Thumbor concurrency. Those are requests that are "taken hostage" by a slow one on the Thumbor instance they're assigned. We're not seeing values lower than 59 seconds, because those combined with an average request won't go over 60 seconds and show up in the varnish slow log.
The ideal solution would be for the main processing performed by Thumbor (done by subprocesses in our case) to be truly async, letting the Thumbor instance serve other requests while long-running commands run as separate processes. Our existing PoolCounter limits should be enough to keep concurrency from going crazy, since reaching that level of concurrency would merely replicate what MediaWiki was doing.
Now, last time I tried to make the upstream changes to Thumbor to achieve async execution inside the engine, it proved too hard, because it requires making synchronous code asynchronous across many layers of code that weren't designed for it. It's a high cost/low reward situation.
A mitigation strategy that comes to mind would be to segregate expensive thumbnails onto their own Thumbor instances. This might be possible with some Lua in nginx. But it's a trade-off: fast thumbnails would no longer be affected, but it would make the average expensive thumbnail more expensive, because PDF-only instances would run into the problem of requests inheriting another request's slowness very often.
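If the routing decision only depends on the URL, a plain nginx map might even suffice without Lua. A hypothetical sketch (upstream names, ports, and the URL pattern are all illustrative, not our actual config):

```nginx
# Route heavy formats to a dedicated Thumbor pool so their slowness
# can't delay fast thumbnails.
upstream thumbor_fast  { server 127.0.0.1:8800; }
upstream thumbor_heavy { server 127.0.0.1:8900; }

map $uri $thumbor_pool {
    default               thumbor_fast;
    ~*\.(pdf|djvu|tiff?)/ thumbor_heavy;
}

server {
    listen 80;
    location / {
        proxy_pass http://$thumbor_pool;
    }
}
```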
Increasing capacity is also an option, or maybe just increasing the Thumbor process count. The fewer concurrent requests a given Thumbor instance deals with, the less likely it is that this scenario will be encountered.