Page MenuHomePhabricator

Thumbor times out on large files sometimes
Closed, ResolvedPublic

Description

This djvu is 112M, it runs into a 20s limit for the download of the original:

Oct  5 11:46:22 thumbor1001 thumbor@8833[123915]: thumbor:WARNING ERROR retrieving image http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4b/4/4b/The_Film_Daily,_Jan-Jun_1924.djvu: HTTP 599: Timeout
Oct  5 11:46:22 thumbor1001 thumbor@8833[123915]: tornado.access:ERROR 504 GET /wikipedia/commons/thumb/4/4b/The_Film_Daily%2C_Jan-Jun_1924.djvu/page220-3013px-The_Film_Daily%2C_Jan-Jun_1924.djvu.jpg (127.0.0.1) 20111.68ms
Oct  5 11:46:22 thumbor1001 thumbor@8809[110891]: thumbor:WARNING ERROR retrieving image http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4b/4/4b/The_Film_Daily,_Jan-Jun_1924.djvu: HTTP 599: Timeout
Oct  5 11:46:22 thumbor1001 thumbor@8809[110891]: tornado.access:ERROR 504 GET /wikipedia/commons/thumb/4/4b/The_Film_Daily%2C_Jan-Jun_1924.djvu/page223-180px-The_Film_Daily%2C_Jan-Jun_1924.djvu.jpg (127.0.0.1) 20056.65ms
Oct  5 11:48:07 thumbor1001 thumbor@8807[67008]: thumbor:WARNING ERROR retrieving image http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4b/4/4b/The_Film_Daily,_Jan-Jun_1924.djvu: HTTP 599: Timeout
Oct  5 11:48:07 thumbor1001 thumbor@8807[67008]: tornado.access:ERROR 504 GET /wikipedia/commons/thumb/4/4b/The_Film_Daily%2C_Jan-Jun_1924.djvu/page226-180px-The_Film_Daily%2C_Jan-Jun_1924.djvu.jpg (127.0.0.1) 20155.31ms
Oct  5 11:48:27 thumbor1001 thumbor@8820[112135]: thumbor:WARNING ERROR retrieving image http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4b/4/4b/The_Film_Daily,_Jan-Jun_1924.djvu: HTTP 599: Timeout
Oct  5 11:48:27 thumbor1001 thumbor@8820[112135]: tornado.access:ERROR 504 GET /wikipedia/commons/thumb/4/4b/The_Film_Daily%2C_Jan-Jun_1924.djvu/page226-180px-The_Film_Daily%2C_Jan-Jun_1924.djvu.jpg (127.0.0.1) 20043.39ms

Downloading it on its own takes 4ish seconds, which could easily hit 20s with concurrent requests:

gilles@thumbor1001:~$ wget http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4b/4/4b/The_Film_Daily,_Jan-Jun_1924.djvu
--2016-10-05 11:54:05--  http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.4b/4/4b/The_Film_Daily,_Jan-Jun_1924.djvu
Resolving ms-fe.svc.eqiad.wmnet (ms-fe.svc.eqiad.wmnet)... 10.2.2.27
Connecting to ms-fe.svc.eqiad.wmnet (ms-fe.svc.eqiad.wmnet)|10.2.2.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118078853 (113M) [image/vnd.djvu]
Saving to: ‘The_Film_Daily,_Jan-Jun_1924.djvu’

The_Film_Daily,_Jan-Jun_1924.djvu     100%[========================================================================>] 112.61M  27.1MB/s   in 4.3s   

2016-10-05 11:54:09 (26.5 MB/s) - ‘The_Film_Daily,_Jan-Jun_1924.djvu’ saved [118078853/118078853]

We'll probably want to increase that 20s limit since this is a legit use case.

Event Timeline

Since we increased the HTTP_LOADER_REQUEST_TIMEOUT value from 20 to 60 seconds this issue only happened once in 4ish hours, on a 413MB PDF:

Oct  5 16:11:05 thumbor1001 thumbor@8840[56922]: tornado.access:ERROR 504 GET /wikipedia/commons/thumb/0/09/Tallapaka_Adhyatmika_Sankeertanalu.pdf/page566-1877px-Tallapaka_Adhyatmika_Sankeertanalu.pdf.jpg (127.0.0.1) 60064.09ms

Since this thumbnail was rendered fine by Mediawiki (which perhaps has no download timeout?) we should probably increase the timeout limit a bit more.

File: https://commons.wikimedia.org/wiki/File:Tallapaka_Adhyatmika_Sankeertanalu.pdf
Thumbnail: https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Tallapaka_Adhyatmika_Sankeertanalu.pdf/page566-1877px-Tallapaka_Adhyatmika_Sankeertanalu.pdf.jpg

At 120 seconds, still happens for some giant TIFs:

gilles@thumbor1001:~$ cat /var/log/thumbor/thumbor.error.log | grep 504 | grep 120
Oct 12 06:28:20 thumbor1001 thumbor@8814[74871]: 2016-10-12 06:28:20,991 8814 tornado.access:ERROR 504 GET /wikipedia/commons/thumb/a/a1/Fall_aspens_in_the_Colorado_Rockies_LCCN2011636277.tif/lossy-page1-141px-Fall_aspens_in_the_Colorado_Rockies_LCCN2011636277.tif.jpg (127.0.0.1) 120127.24ms
Oct 12 07:10:14 thumbor1001 thumbor@8809[134591]: 2016-10-12 07:10:14,120 8809 tornado.access:ERROR 504 GET /wikipedia/commons/thumb/1/1c/Little_Italy_neighborhood%2C_New_York%2C_New_York_LCCN2011630156.tif/lossy-page1-100px-Little_Italy_neighborhood%2C_New_York%2C_New_York_LCCN2011630156.tif.jpg (127.0.0.1) 120132.49ms
Oct 12 07:25:55 thumbor1001 thumbor@8824[134346]: 2016-10-12 07:25:55,069 8824 tornado.access:ERROR 500 GET /wikipedia/commons/thumb/archive/f/f3/20050410171847%21Himejijou_castle_grounds.jpg/120px-Himejijou_castle_grounds.jpg (127.0.0.1) 325.14ms
Oct 12 09:04:10 thumbor1001 thumbor@8824[134346]: 2016-10-12 09:04:10,573 8824 tornado.access:ERROR 504 GET /wikipedia/commons/thumb/0/02/Golden_Gate_Bridge%2C_San_Francisco%2C_California_LCCN2013633032.tif/lossy-page1-240px-Golden_Gate_Bridge%2C_San_Francisco%2C_California_LCCN2013633032.tif.jpg (127.0.0.1) 120093.72ms
Oct 12 09:04:10 thumbor1001 thumbor@8818[91233]: 2016-10-12 09:04:10,955 8818 tornado.access:ERROR 504 GET /wikipedia/commons/thumb/d/d8/Great_Hall._Detail_of_ceiling_and_cove_showing_Milton_plaque._Library_of_Congress_Thomas_Jefferson_Building%2C_Washington%2C_D.C._LCCN2007684308.tif/lossy-page1-240px-Great_Hall._Detail_of_ceiling_and_cove_showing_Milton_plaque._Library_of_Congress_Thomas_Jefferson_Building%2C_Washington%2C_D.C._LCCN2007684308.tif.jpg (127.0.0.1) 120419.90ms

The TIFFs encountered above are 190MB, 154MB, 458MB and 556MB.

I'm not sure what we can do here, besides increase the limit significantly again. Mediawiki is still fine with generating those thumbnails.

Looking at Mediawiki code, SwiftFileBackend uses the default request timeout for MultiHttpClient, which is 300 seconds. Let's try that value in Thumbor.

If the issue persists, I have a feeling that Mediawiki runs into it sometimes, but that won't show up in the Thumbor logs. I.e. it's likely to be due to concurrency or intermittent congestion, which would happen at different times for Mediawiki and result in failures on other files.

I think this is another argument for logging at the rewrite.py level which files fail for Mediawiki and not for Thumbor and vice versa. This way we would be able to tell if Mediawiki is subject to that same problem in the same random/intermittent fashion. As well as surface more clearly which images Mediawiki can handle that Thumbor can't yet. I'll file a task for that.

Only happened once on a 450MB TIFF since we increased the limit. I'm going to go ahead and assume that my theory is correct and that it happens to Mediawiki as well sometimes. Which will be confirmed by T147918