Last night thumbor.svc.eqiad.wmnet paged and immediately recovered, it looks like the service periodically goes into CRITICAL for one check interval and then recovers, sometimes makes it to HARD and pages:
einsteinium:~$ grep thumbor.svc. /var/log/icinga/icinga.log | perl -pe 's/(\d+)/localtime($1)/e' [Mon Jul 24 06:36:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 06:37:47 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.005 second response time [Mon Jul 24 06:52:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 06:52:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.113 second response time [Mon Jul 24 07:01:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 07:01:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time [Mon Jul 24 07:17:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 07:18:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time [Mon Jul 24 07:21:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 07:22:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.004 second response time [Mon Jul 24 07:25:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 07:26:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.007 second response time [Mon Jul 24 08:06:26 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 08:07:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time [Mon Jul 24 08:13:26 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 08:14:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.004 second response time [Mon Jul 24 09:12:37 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 09:13:36 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.004 second response time [Mon Jul 24 10:19:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 10:20:46 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time [Mon Jul 24 10:30:56 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 10:31:47 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time [Mon Jul 24 11:22:07 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 11:23:06 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time [Mon Jul 24 12:10:26 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [Mon Jul 24 12:11:16 2017] SERVICE ALERT: thumbor.svc.eqiad.wmnet;LVS HTTP IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.016 second response time
There was a shortage of memory for thumbor instances which lead to swapping, fixed in https://gerrit.wikimedia.org/r/#/c/367373/ but it looks like there's something else too