
Nginx doesn't retry on a different Thumbor instance when the current one dies
Closed, Invalid · Public

Description

This happens for about 1600 requests per day, or 0.02% of all requests.

This scenario results in Nginx serving a 502. The nginx error log shows entries like the following, where several pending requests on the same Thumbor instance die at the same time:

2017/03/28 15:05:15 [error] 2318#2318: *30599 upstream prematurely closed connection while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/commons/thumb/e/ee/The_Holy_Bible_faithfvlly_translated_into_English_ovt_of_the_authentical_Latin%2C_diligently_conferred_with_the_Hebrew%2C_Greek%2C_%26_other_Editions_in_diuers_languages.pdf/page482-659px-thumbnail.pdf.jpg HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/e/ee/The_Holy_Bible_faithfvlly_translated_into_English_ovt_of_the_authentical_Latin%2C_diligently_conferred_with_the_Hebrew%2C_Greek%2C_%26_other_Editions_in_diuers_languages.pdf/page482-659px-thumbnail.pdf.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2340#2340: *30319 upstream prematurely closed connection while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/commons/thumb/2/22/Las_Estrellas_%282016_Logo%29.png/375px-Las_Estrellas_%282016_Logo%29.png HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/2/22/Las_Estrellas_%282016_Logo%29.png/375px-Las_Estrellas_%282016_Logo%29.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2338#2338: *36089 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.32.221, server: , request: "GET /wikipedia/commons/thumb/7/76/Logo_Tournament_of_Champions_2012.jpg/130px-Logo_Tournament_of_Champions_2012.jpg HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/7/76/Logo_Tournament_of_Champions_2012.jpg/130px-Logo_Tournament_of_Champions_2012.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2339#2339: *36127 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.32.220, server: , request: "GET /wikipedia/commons/thumb/c/c7/Modigiliani_13.jpg/344px-Modigiliani_13.jpg HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/c/c7/Modigiliani_13.jpg/344px-Modigiliani_13.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2337#2337: *36167 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.0.39, server: , request: "GET /wikipedia/commons/thumb/7/7f/Absence_makes_the_heart_grow_fonder_ HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/7/7f/Absence_makes_the_heart_grow_fonder_", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2331#2331: *36199 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/en/thumb/5/5a/Seymourstill.JPG/200px-Seymourstill.JPG HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/en/thumb/5/5a/Seymourstill.JPG/200px-Seymourstill.JPG", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2336#2336: *36611 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/en/thumb/a/af/Time_crystal_phase_transition.png/320px-Time_crystal_phase_transition.png HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/en/thumb/a/af/Time_crystal_phase_transition.png/320px-Time_crystal_phase_transition.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2335#2335: *36709 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.0.39, server: , request: "GET /wikipedia/commons/thumb/e/e1/Portal_icon.svg/15px-Portal_icon.svg.png HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/e/e1/Portal_icon.svg/15px-Portal_icon.svg.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2326#2326: *37371 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.0.39, server: , request: "GET /wikipedia/en/thumb/0/0c/Red_pog.svg/6px-Red_pog.svg.png HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/en/thumb/0/0c/Red_pog.svg/6px-Red_pog.svg.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2333#2333: *37255 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.32.220, server: , request: "GET /wikipedia/commons/thumb/3/33/Korean.food-Seolleungtang-01.jpg/260px-Korean.food-Seolleungtang-01.jpg HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/3/33/Korean.food-Seolleungtang-01.jpg/260px-Korean.food-Seolleungtang-01.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2334#2334: *37035 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.32.220, server: , request: "GET /wikipedia/commons/thumb/9/9b/Seneca-berlinantikensammlung-1.jpg/220px-Seneca-berlinantikensammlung-1.jpg HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/9/9b/Seneca-berlinantikensammlung-1.jpg/220px-Seneca-berlinantikensammlung-1.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2332#2332: *37751 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.32.220, server: , request: "GET /wikipedia/id/thumb/5/53/Pinkan_Mambo.png/720px-Pinkan_Mambo.png HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/id/thumb/5/53/Pinkan_Mambo.png/720px-Pinkan_Mambo.png", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2330#2330: *37589 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.0.38, server: , request: "GET /wikipedia/commons/thumb/5/50/TehranIranSkyline.jpg/800px-TehranIranSkyline.jpg HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/5/50/TehranIranSkyline.jpg/800px-TehranIranSkyline.jpg", host: "thumbor.svc.eqiad.wmnet:8800"
2017/03/28 15:05:15 [error] 2301#2301: *38695 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.32.221, server: , request: "GET /wikipedia/commons/thumb/3/3c/Pharoah%27s_Daughter_-Pas_de_Fleche_-Mathilde_Kschessinska_-1898_-2.JPG/383px-Pharoah%27s_Daughter_-Pas_de_Fleche_-Mathilde_Kschessinska_-1898_-2.JPG HTTP/1.1", upstream: "http://127.0.0.1:8806/wikipedia/commons/thumb/3/3c/Pharoah%27s_Daughter_-Pas_de_Fleche_-Mathilde_Kschessinska_-1898_-2.JPG/383px-Pharoah%27s_Daughter_-Pas_de_Fleche_-Mathilde_Kschessinska_-1898_-2.JPG", host: "thumbor.svc.eqiad.wmnet:8800"

What's unclear is why the proxy_next_upstream settings aren't kicking in and resulting in a retry on a different upstream.
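
For context, a minimal sketch of the relevant configuration, assuming a local upstream pool along the lines of what the logs imply; the pool name and port numbers are illustrative, not copied from the deployed config:

upstream thumbor {
    # Hypothetical local Thumbor instances; ports inferred from the
    # log lines above (nginx listens on 8800, instances on 880x).
    server 127.0.0.1:8801;
    server 127.0.0.1:8802;
    server 127.0.0.1:8806;
}

server {
    listen 8800;

    location / {
        proxy_pass http://thumbor;
        # nginx's documented default: pass the request to the next
        # server in the pool on connection errors and timeouts.
        proxy_next_upstream error timeout;
    }
}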

Event Timeline

Gilles triaged this task as Medium priority. Mar 28 2017, 3:20 PM
Gilles lowered the priority of this task from Medium to Low.

Even with an explicit declaration of "proxy_next_upstream error timeout http_502;", nginx never retries requests that die because the Thumbor process dies, or that fail because the Thumbor process isn't ready yet (those generate a "111: Connection refused" in the nginx error log).

I'm surprised that those don't fall under what the "error" parameter of proxy_next_upstream implies. They are clearly errors from nginx's point of view, since they end up in the error log...
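
For the record, a sketch of that explicit declaration in context, with the retry budget spelled out; the proxy_next_upstream_tries value here is an illustrative choice, not necessarily what is deployed:

location / {
    proxy_pass http://thumbor;
    # Retry on connection-level errors, timeouts, and upstream 502s.
    proxy_next_upstream error timeout http_502;
    # Cap the total number of attempts across upstreams (0 = no limit);
    # available since nginx 1.7.5. The value here is illustrative.
    proxy_next_upstream_tries 2;
}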

OK, so something that should have been obvious, which I just noticed: it's a bad idea to restart all Thumbor instances at once. They all start up at the same time and are all unavailable to handle requests during the startup period, which means the retry does happen, but it lands on an equally unavailable instance that is still starting up. We should avoid restarting all Thumbor instances at once in the future and do a graceful rolling restart if possible. I'll create a low-priority task for that.

After further investigation, the retry mechanism is working properly; the only requests served back as errors are those that failed on two different upstreams (such as what happens when all instances get restarted at once).