Page MenuHomePhabricator

thumbor.svc intermittent timeout on pybal and icinga
Closed, ResolvedPublic

Description

Last night thumbor.svc.eqiad.wmnet paged and immediately recovered, it looks like the service periodically goes into CRITICAL for one check interval and then recovers, sometimes makes it to HARD and pages:

einsteinium:~$ grep thumbor.svc. /var/log/icinga/icinga.log | perl -pe 's/(\d+)/localtime($1)/e'  
[Mon Jul 24 06:36:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 06:37:47 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.005 second response time
[Mon Jul 24 06:52:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 06:52:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.113 second response time
[Mon Jul 24 07:01:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 07:01:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time
[Mon Jul 24 07:17:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 07:18:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time
[Mon Jul 24 07:21:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 07:22:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.004 second response time
[Mon Jul 24 07:25:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 07:26:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.007 second response time
[Mon Jul 24 08:06:26 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 08:07:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time
[Mon Jul 24 08:13:26 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 08:14:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.004 second response time
[Mon Jul 24 09:12:37 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 09:13:36 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.004 second response time
[Mon Jul 24 10:19:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 10:20:46 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time
[Mon Jul 24 10:30:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 10:31:47 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time
[Mon Jul 24 11:22:07 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 11:23:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time
[Mon Jul 24 12:10:26 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[Mon Jul 24 12:11:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.016 second response time

There was a shortage of memory for thumbor instances which lead to swapping, fixed in https://gerrit.wikimedia.org/r/#/c/367373/ but it looks like there's something else too

Event Timeline

Change 367687 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: fix connections-per-backend in nginx

https://gerrit.wikimedia.org/r/367687

Change 367687 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: fix connections-per-backend in nginx

https://gerrit.wikimedia.org/r/367687

The patch applied above has fixed the intermittent behavior, now pybal and icinga both have stopped flapping