~30k/day, which is 0.5% of all requests to thumbor.
Let's see if we can get that number down.
Status | Assigned | Task
---|---|---
Resolved | Gilles | T121388 Service-based thumbnailing re-architecture in production with Thumbor
Resolved | Gilles | T150746 Nginx timeouts on Thumbor
Resolved | Gilles | T161535 Track nginx request time in Thumbor debug headers
Resolved | Gilles | T161531 Track time spent in Thumbor subprocesses with header(s)
My guess would be that Thumbor fails or takes a long time and nginx 504s for those. MW likely doesn't 504 because the Apache in front of it never does (or just 503s in similar cases).
Makes sense. Do we care about those, then? It sounds like nginx/Thumbor behaves better than Apache with regard to timing out at that level.
We definitely don't. Did you have a chance to see whether the MW 200 vs. Thumbor 504 mismatch was real or intermittent? I spot-checked the URLs above and they seemed to render fine in MW.
I've confirmed that the 504 is coming from nginx's 60s limit. And I see that mediawiki on the first example takes 87 seconds to render the thumbnail (for a size not currently in swift, that's important). Either mediawiki doesn't have a limit, or it's higher than the 60s we've picked for Thumbor.
Now, I don't know what Thumbor does when nginx hangs up, but I just tried hitting Thumbor directly for that thumbnail and it took 164 seconds to render successfully. That was a PDF. This is confirmation that ImageMagick can take its sweet time when dealing with a huge resize (the PDF extraction command is subject to a 60s timeout, but the IM resize currently isn't). On another attempt it took 40 seconds, so there seems to be a lot of performance variability depending on what else Thumbor is doing at the same time.
I'll keep trying some of those on the image scalers to see if some requests take even longer, to get a sense of whether MediaWiki has unbounded rendering time here, just like the IM portion of Thumbor currently does. The difference is that nginx hangs up early with a 504 while Thumbor probably keeps working on the thumbnail, which means that the next time the user requests it, it might be in Swift.
A lot of the files in that list turn out to be very fast to render on both MediaWiki and Thumbor. This would suggest intermittent issues with Thumbor, but with no explanation as to why it would time out on trivial files. Could it be request queuing under load, where a request received early by Thumbor isn't actually rendered for a while because Thumbor is busy rendering other requests it received around the same time? If that's the case, it would be an argument for Thumbor having per-request rendering time limits and nginx having a very high just-in-case time limit.
One way to verify this would be to hammer Thumbor with a bunch of requests in rapid succession, with a mix of fast and slow-to-render files, and see whether the fast files are delayed. I'll try that locally. If that's the case, though, then the need to time-limit IM processing is more important than ever...
At this point I've seen various reasons why Thumbor might hang, I think this scenario probably contains distinct individual bugs that need to be identified.
Looking at all these examples and recent occurrences, the issue is indeed intermittent, which would support the load theory that I need to look into. I imagine that since we don't have an absolute limit on Thumbor processing a given image, it might be spending a lot of time rendering a thumbnail in IM, nginx hangs up on it for that particular request, but in the meantime the Thumbor instance at hand keeps accepting requests without processing them, and nginx will 504 on those as well before they get a chance to run on the Thumbor process' single thread.
I wonder if this could be mitigated by nginx not sending requests to Thumbor instances that are "busy". Because even if we fix the timeout issue at the Thumbor level, there are still scenarios where piled-up requests will 504 after the expensive one has timed out. And even if they don't, it means someone would wait over a minute for a very trivial thumbnail to be rendered, if they were unlucky enough to have their request land on a Thumbor instance that was already busy rendering an expensive thumbnail.
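One nginx-side knob that could express "don't pile more work onto a busy instance" is max_conns on the upstream servers; as far as I can tell it's only in the open-source version from 1.11.5 onwards (paid-only before that). A sketch of the idea, not our actual config, with illustrative values:

```
upstream thumbor {
    # cap concurrent requests per instance; when an instance is at its
    # cap, nginx picks another one, and if every instance is saturated
    # the request fails rather than queueing at nginx (the "queue"
    # directive that would hold it is a paid-only feature)
    server 127.0.0.1:8810 max_conns=2;
    server 127.0.0.1:8811 max_conns=2;
    # ... one entry per Thumbor instance
}
```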
The outcome of the investigation is that things work as expected: small requests are still processed quickly on a given instance even when it's also busy with an expensive one. I'm still at a loss to explain those non-recurring 504s.
They happen a fair bit:
gilles@ms-fe1001:/var/log/swift$ cat server.log.1 | grep "Mediawiki: 200 Thumbor: 504" | wc -l
995
The only theory I can think of, since this count is lower than the number of OOM deaths, is that the two might somehow be related. I'll check whether there's any time-wise correlation between them.
There definitely seems to be a time correlation between 502s and 504s; this sort of thing happens a lot:
Nov 23 06:05:35 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 429 Thumbor: 502 URL: http://upload.wikimedia.org/wikipedia/id/thumb/8/8f/Tingkatan.jpg/180px-Tingkatan.jpg
Nov 23 06:05:35 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 429 Thumbor: 502 URL: http://upload.wikimedia.org/wikipedia/id/thumb/8/8f/Tingkatan.jpg/180px-Tingkatan.jpg
Nov 23 06:05:53 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 200 Thumbor: 504 URL: http://upload.wikimedia.org/wikipedia/commons/thumb/5/59/Usa_counties_large.svg/12800px-Usa_counties_large.svg.png
Nov 23 06:05:55 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 404 Thumbor: 504 URL: http://upload.wikimedia.org/wikipedia/en/thumb/e/e2/Small_bunyapine.jpg/240px-Small_bunyapine.jpg
Nov 23 06:06:07 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 404 Thumbor: 504 URL: http://upload.wikimedia.org/wikipedia/he/thumb/c/c5/ECO_99_FM.jpg/200px-ECO_99_FM.jpg
Nov 23 06:06:33 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 404 Thumbor: 504 URL: http://upload.wikimedia.org/wikipedia/he/thumb/c/c5/ECO_99_FM.jpg/200px-ECO_99_FM.jpg
Nov 23 06:06:37 ms-fe1001 proxy-server: HTTP status code mismatch. Mediawiki: 404 Thumbor: 504 URL: http://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Sky_News.svg/250px-Sky_News.svg.png
One thing that comes to mind is that when a thumbor instance starts it's not ready to serve requests immediately. Maybe the 504s are requests that were sent to the instance during that startup phase? In which case nginx would need to wait a bit before putting a freshly (re)started thumbor instance into rotation.
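If part of this is instances being restarted (or dying), one thing open-source nginx does offer is passive failure marking on the upstream servers; it won't make nginx wait for a warm-up (there's no slow_start or active health check outside the paid version), but it takes an instance that refuses or times out connections out of rotation for a while. A sketch with assumed ports and illustrative values, not our actual config:

```
upstream thumbor {
    # after 3 failed attempts within 30s, stop sending this instance
    # traffic for 30s; only kicks in once failures have been observed
    server 127.0.0.1:8810 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8811 max_fails=3 fail_timeout=30s;
    # ... one entry per Thumbor instance
}
```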
Or maybe nginx doesn't keep track of the other in-flight requests when an instance dies due to OOM, and the requests that were "taken down with the rest" ultimately 504.
It seems like the number of Mediawiki 200 / Thumbor 504 mismatches has about halved:
gilles@ms-fe1005:/var/log/swift$ cat server.log.1 | grep "Mediawiki: 200 Thumbor: 504" | wc -l
549
Now that we've migrated IM to a subprocess, all engines should be timeboxed. But the various timeouts we have for Thumbor are chained (loader timeout for swift, subprocess timeout, etc.). As far as I remember the subprocess timeout itself is 60 seconds. With nginx configured the same, it's no wonder that nginx still 504s.
I think maybe we should increase the nginx timeout significantly (to 2 or 3 minutes), which should take into account the worst-case scenario of the request waiting a while to be picked up, the loader being slow, and the subprocess hitting the 60s limit, as well as the non-timeboxed request processing by Thumbor (e.g. filters).
Actually, nginx was already at 90s, but I think we can double that and see the effect, because as-is it might not be enough to accommodate the worst-case scenarios.
Change 344360 had a related patch set uploaded (by Gilles):
[operations/puppet@production] Increase nginx timeout on Thumbor machines
Change 344360 merged by Filippo Giunchedi:
[operations/puppet@production] Increase nginx timeout on Thumbor machines
Things don't seem to have improved :(
gilles@ms-fe1005:/var/log/swift$ cat server.log.1 | grep "Mediawiki: 200 Thumbor: 504" | wc -l
745
Looking at the nginx logs on thumbor1001, I notice that some of the timeouts are for files that don't exist. Example: https://upload.wikimedia.org/wikipedia/en/thumb/5/5a/Premier_League.svg/125px-Premier_League.svg.png
It's a deleted file, however. It might be one of those cases where the original is still lingering in Swift and Thumbor picks it up and attempts to render it. However, in access.log I see the same file 404ing correctly (a lot, in fact), which suggests that's not the case.
I saw a lot of SVGs in recent timeout errors, so I checked whether SVGs are over-represented. It's inconclusive: 35.8% of timeout errors are for SVGs, versus 30.8% of all requests.
While that Premier_League_Logo 404s a lot, it doesn't run into timeouts that often:
gilles@thumbor1001:/var/log/nginx$ cat access.log | grep Premier_League_Logo | wc -l
3631
gilles@thumbor1001:/var/log/nginx$ cat error.log | grep Premier_League_Logo | wc -l
16
Which suggests to me that it's probably timing out as a side effect. I wonder whether it's poor request scheduling: a single Thumbor instance gets several requests queued up, including one like that; the requests that take a while get processed, and the simple request that was waiting gets handled long after it was queued, by which time nginx has given up. What would be interesting is to know the exact order and times at which nginx sends requests to the Thumbor instances; access.log and error.log record requests once they're done, not when they're handed to Thumbor for processing.
The weakness of that theory, though, is that increasing the nginx timeout should have improved the situation, and it doesn't seem to have done that.
Previously I thought that the timeouts were probably due to Thumbor instances dying, but looking at nginx's error.log I see that they happen for Thumbor instances that keep running. While timeouts might happen when a Thumbor process dies, that's probably not the main explanation for this phenomenon.
Watching the error.log live, the scheduling theory looks plausible, because timeouts seem to happen on the same Thumbor instance, in small bursts. Here's an example:
2017/03/27 16:20:37 [error] 105390#105390: *22350034 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.64.32.220, server: , request: "GET /wikipedia/en/thumb/0/0c/Red_pog.svg/6px-Red_pog.svg.png HTTP/1.1", upstream: "http://127.0.0.1:8810/wikipedia/en/thumb/0/0c/Red_pog.svg/6px-Red_pog.svg.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/27 16:20:38 [error] 105376#105376: *22350128 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.64.32.221, server: , request: "GET /wikipedia/commons/thumb/9/90/Ladies%27_Home_Journal_Vol.14_No.06_%28May%2C_1897%29.pdf/page1-82px-Ladies%27_Home_Journal_Vol.14_No.06_%28May%2C_1897%29.pdf.jpg HTTP/1.1", upstream: "http://127.0.0.1:8810/wikipedia/commons/thumb/9/90/Ladies%27_Home_Journal_Vol.14_No.06_%28May%2C_1897%29.pdf/page1-82px-Ladies%27_Home_Journal_Vol.14_No.06_%28May%2C_1897%29.pdf.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/27 16:20:39 [error] 105383#105383: *22350162 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.64.32.221, server: , request: "GET /wikipedia/commons/thumb/5/5f/LE-0628-10-15N.gif/240px-LE-0628-10-15N.gif HTTP/1.1", upstream: "http://127.0.0.1:8810/wikipedia/commons/thumb/5/5f/LE-0628-10-15N.gif/240px-LE-0628-10-15N.gif", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/27 16:20:43 [error] 105389#105389: *22350421 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/en/thumb/0/0c/Red_pog.svg/6px-Red_pog.svg.png HTTP/1.1", upstream: "http://127.0.0.1:8810/wikipedia/en/thumb/0/0c/Red_pog.svg/6px-Red_pog.svg.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/27 16:20:46 [error] 105386#105386: *22350661 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/en/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png HTTP/1.1", upstream: "http://127.0.0.1:8810/wikipedia/en/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png", host: "thumbor.svc.eqiad.wmnet:8800"

gilles@thumbor1001:/var/log/nginx$ ps aux | grep 8810
gilles 9476 0.0 0.0 12728 2296 pts/0 S+ 16:23 0:00 grep 8810
thumbor 16094 0.0 0.0 18272 2100 ? Ss 13:05 0:00 /usr/bin/firejail --profile=/etc/firejail/thumbor.profile --env=TMPDIR=/srv/thumbor/tmp/thumbor@8810 --env=MAGICK_TEMPORARY_PATH=/srv/thumbor/tmp/thumbor@8810 --env=MAGICK_DISK_LIMIT=900MB --env=MAGICK_MEMORY_LIMIT=900MB -- /usr/bin/thumbor --port 8810 --ip 127.0.0.1 --keyfile /etc/thumbor.key --conf /etc/thumbor.d/
thumbor 16125 0.0 0.0 18268 1852 ? S 13:05 0:00 /usr/bin/firejail --profile=/etc/firejail/thumbor.profile --env=TMPDIR=/srv/thumbor/tmp/thumbor@8810 --env=MAGICK_TEMPORARY_PATH=/srv/thumbor/tmp/thumbor@8810 --env=MAGICK_DISK_LIMIT=900MB --env=MAGICK_MEMORY_LIMIT=900MB -- /usr/bin/thumbor --port 8810 --ip 127.0.0.1 --keyfile /etc/thumbor.key --conf /etc/thumbor.d/
thumbor 16243 1.6 0.2 372300 168152 ? Sl 13:05 3:10 /usr/bin/python /usr/bin/thumbor --port 8810 --ip 127.0.0.1 --keyfile /etc/thumbor.key --conf /etc/thumbor.d/

gilles@thumbor1001:/var/log/nginx$ date
Mon Mar 27 16:24:01 UTC 2017
So, by the time the 16:20 series of timeouts happens, the process has been running for more than 3 hours.
Now, if I try one of those requests:
gilles@thumbor1001:/var/log/nginx$ curl -I http://127.0.0.1:8810/wikipedia/en/thumb/0/0c/Red_pog.svg/8px-Red_pog.svg.png
HTTP/1.1 200 OK
Engine: wikimedia_thumbor.engine.imagemagick
Content-Length: 464
Swift-Time: 12
Content-Disposition: inline;filename*=UTF-8''Red_pog.svg.png
Wikimedia-Original-Container: wikipedia-en-local-public.0c
Thumbor-Parameters: {"width": "8", "image": "http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-en-local-public.0c/0/0c/Red_pog.svg", "filters": "format(png)"}
Server: Thumbor/6.0.1
Wikimedia-Thumbnail-Container: wikipedia-en-local-thumb.0c
Etag: "ae65f3cca4a54904eb6d0ac845ed1203d800e56c"
Date: Mon, 27 Mar 2017 16:26:09 GMT
Xkey: File:Red_pog.svg
Wikimedia-Original-Path: 0/0c/Red_pog.svg
Content-Type: image/png
Exiftool-Time: 88
Exiftool-Time: 89
Wikimedia-Thumbnail-Path: thumbor/0/0c/Red_pog.svg/8px-Red_pog.svg.png
It felt like it took several seconds, though. It seems like subprocess processing time isn't part of the debug headers; I've filed a task for that: T161531: Track time spent in Thumbor subprocesses with header(s)
At any rate, it doesn't time out when retried, probably because that particular Thumbor instance is "less busy" now.
I was reviewing the load-balancing strategy we're using and discovered least_time, which sounded great until I saw that it's only available in the paid version of nginx...
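The open-source version does ship least_conn, though, which might approximate what I'm after: send each new request to the instance with the fewest in-flight requests. A minimal sketch, assuming the instances are declared as a local upstream group (not our actual config):

```
upstream thumbor {
    # pick the upstream with the fewest active connections instead of
    # plain round-robin; an instance stuck on an expensive thumbnail
    # keeps its connection open, so new requests tend to land elsewhere
    least_conn;
    server 127.0.0.1:8810;
    server 127.0.0.1:8811;
    # ... one entry per Thumbor instance
}
```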
To verify my theory, I would have to be able to log requests to Thumbor when they come in. Since Thumbor is single-threaded, I doubt it's capable of such logging itself; by the time Thumbor gets to pick up a request, it has probably been waiting for some time in Tornado. I wonder whether nginx can add a header to the request before passing it to the upstream? That would be a foolproof way to account for any queueing delay experienced at the Thumbor level.
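To answer my own question: nginx can stamp the request before proxying it, since the $msec variable holds the current time with millisecond resolution when the upstream request is built. The header name below is made up for illustration and the upstream name is assumed; Thumbor would compare the value against its own clock when it finally picks the request up:

```
location / {
    # hypothetical header carrying the time at which nginx handed the
    # request to the upstream; $msec is seconds.milliseconds
    proxy_set_header X-Request-Start "t=${msec}";
    proxy_pass http://thumbor;
}
```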
So, in theory our nginx config retries on the next upstream:
# fallback to the next upstream at most once, and no longer than 30s
proxy_next_upstream_timeout 30;
proxy_next_upstream_tries 1;
However, these two directives probably don't do anything at the moment... because we don't have a proxy_next_upstream directive. Which would explain why I don't see any sign of retries in the logs when a 504 happens for a particular image:
gilles@thumbor1001:/var/log/nginx$ cat access.log.1 | grep 120px-75-Jahre-Wolfsburg.jpg
10.64.32.220 - - [26/Mar/2017:14:28:34 +0000] "GET /wikipedia/commons/thumb/1/1a/75-Jahre-Wolfsburg.jpg/120px-75-Jahre-Wolfsburg.jpg HTTP/1.1" 404 0 "-" "WikipediaApp/5.4.0.1095 (iOS 10.2.1; Phone)"
10.64.32.221 - - [26/Mar/2017:14:49:29 +0000] "GET /wikipedia/commons/thumb/1/1a/75-Jahre-Wolfsburg.jpg/120px-75-Jahre-Wolfsburg.jpg HTTP/1.1" 504 183 "-" "WikipediaApp/5.4.0.1095 (iOS 10.2.1; Phone)"
gilles@thumbor1001:/var/log/nginx$ cat error.log.1 | grep 120px-75-Jahre-Wolfsburg.jpg
2017/03/26 14:49:29 [error] 105388#105388: *15852276 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.64.32.221, server: , request: "GET /wikipedia/commons/thumb/1/1a/75-Jahre-Wolfsburg.jpg/120px-75-Jahre-Wolfsburg.jpg HTTP/1.1", upstream: "http://127.0.0.1:8816/wikipedia/commons/thumb/1/1a/75-Jahre-Wolfsburg.jpg/120px-75-Jahre-Wolfsburg.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
Which means that if my theory of poor request scheduling holds true, simply fixing the nginx retry directives should mitigate the issue.
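For reference, spelling the retry behaviour out explicitly might look like the sketch below. The directive names are real nginx, but the values are illustrative guesses rather than what we should necessarily deploy:

```
# retry on another instance when the connection fails or the upstream
# times out (these are GETs, so retrying is safe)
proxy_next_upstream error timeout;
# don't spend more than 30s in total on retries
proxy_next_upstream_timeout 30s;
# as far as I can tell the counter includes the first attempt, so 2
# allows exactly one retry; worth double-checking against the docs
proxy_next_upstream_tries 2;
```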
Ah, there is a default for that directive:
proxy_next_upstream error timeout;
But maybe that's not enough for the situation these 504s run into?
One should bear in mind that passing a request to the next server is only possible if nothing has been sent to a client yet. That is, if an error or timeout occurs in the middle of the transferring of a response, fixing this is impossible.
Maybe that's what's getting in the way? It's possible that Thumbor has already sent response headers for those requests (since some of that happens in the request handler) and is waiting for the actual image processing to get picked up.
Seems like I've found some low-hanging fruit in the meantime, though: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout
With its default of 60s, the other (higher) timeout we configured is never reached, because Thumbor stays completely silent while generating an image. That might explain why increasing keepalive_timeout didn't do anything.
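In other words, the directive that actually governs how long nginx waits for Thumbor to start responding is proxy_read_timeout, which resets between successive reads from the upstream. A sketch of the kind of change that should help; the values are placeholders, not necessarily what the patch uses:

```
# maximum silence allowed between two reads from Thumbor; since Thumbor
# sends nothing while rendering, this is effectively the per-request
# rendering budget as seen by nginx
proxy_read_timeout 180s;
# connecting to a local Thumbor port should be near-instant
proxy_connect_timeout 5s;
```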
Change 344999 had a related patch set uploaded (by Gilles):
[operations/puppet@production] Improve Thumbor nginx timeout settings
Change 344999 merged by Filippo Giunchedi:
[operations/puppet@production] Improve Thumbor nginx timeout settings
Alright, the timeouts are solved now: no more 504s. The only errors left in nginx's logs are when a Thumbor instance dies, which results in a 502. That suggests there is no retry in that situation, which needs to be fixed, but it's a much smaller problem (1,600 requests per day, 0.02% of requests), for which I'll file a separate task.