When DNS lookups are unavailable for even very short periods, several nodejs and/or restbase-related services seem to fall into a pattern of persistent failure, even long after DNS service has returned to a normal state. Usually only service daemon restarts seem to fix this state. It's almost as if they persistently-cache the failed hostname lookup in some indirect way.
Old ticket text for posterity, before we understood this a little better:--------------------------
The actual problem here is an unknown, and I don't even have any good working theories.
Background:
- /etc/resolv.conf gets templated from puppet to set recdns for client machines. The first entry in codfw or eqiad is an LVS service IP for local recdns.
- The LVS recdns service is backend by two machines: in eqiad it's hydrogen and chromium. in codfw it's acamar and achernar.
- When we depool a recdns machine, theoretically it should be transparent to clients. UDP DNS queries should continue flowing correctly throughout the depool transaction (but now all to one server). If a recdns client were using persistent TCP (which we don't see any evidence of), it would get a RST but be able to immediately reconnect successfully.
Observations when depooling eqiad recdns:
[note throughout this, hydrogen's recdns service is never stopped or restarted. Just the depool action happened] 2017-04-10 23:14:31 <wikibugs__> (CR) BBlack: [C: 2] eqiad lvs: do not directly use hydrogen temporarily [puppet] - https://gerrit.wikimedia.org/r/347532 (owner: BBlack) 2017-04-10 23:18:53 <logmsgbot> !log bblack@neodymium conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org,service=pdns_recursor 2017-04-10 23:19:01 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 2017-04-10 23:21:18 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:18 <icinga-wm> PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:28 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:28 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:28 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:28 <icinga-wm> PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:38 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:38 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:38 <icinga-wm> PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:38 <icinga-wm> PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:38 <icinga-wm> PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:52 <icinga-wm> PROBLEM - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds 2017-04-10 23:21:58 <icinga-wm> PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down! 2017-04-10 23:21:58 <icinga-wm> PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:58 <icinga-wm> PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:21:58 <icinga-wm> PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down! 2017-04-10 23:22:29 <icinga-wm> PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-10 23:22:51 <icinga-wm> RECOVERY - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2247 bytes in 4.661 second response time 2017-04-10 23:23:18 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy 2017-04-10 23:23:28 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy 2017-04-10 23:23:29 <icinga-wm> RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy 2017-04-10 23:23:38 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy 2017-04-10 23:23:38 <icinga-wm> RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy 2017-04-10 23:23:38 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy 2017-04-10 23:23:38 <icinga-wm> RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy 2017-04-10 23:23:38 <icinga-wm> RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy 2017-04-10 23:23:48 <icinga-wm> RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy 2017-04-10 23:23:48 <icinga-wm> RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy 2017-04-10 23:23:48 <icinga-wm> RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy 2017-04-10 23:24:08 <icinga-wm> RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy 2017-04-10 23:24:18 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy 2017-04-10 23:24:18 <icinga-wm> RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy 2017-04-10 23:24:28 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy 2017-04-10 23:24:28 <icinga-wm> RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[in this case I did proceed with the upgrade-restart before the messages started coming in] 2017-04-11 02:31:52 <logmsgbot> !log bblack@neodymium conftool action : set/pooled=no; selector: name=chromium.wikimedia.org,service=pdns_recursor 2017-04-11 02:32:03 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 2017-04-11 02:32:52 <bblack> !log upgrading chromium to pdns-recursor 4.x 2017-04-11 02:33:00 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 2017-04-11 02:34:28 <icinga-wm> PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:28 <icinga-wm> PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:28 <icinga-wm> PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:28 <icinga-wm> PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:28 <icinga-wm> PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:29 <icinga-wm> PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:29 <icinga-wm> PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:30 <icinga-wm> PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:30 <icinga-wm> PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:31 <icinga-wm> PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:31 <icinga-wm> PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:32 <icinga-wm> PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:32 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:33 <icinga-wm> PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:48 <icinga-wm> PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:48 <icinga-wm> PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:48 <icinga-wm> PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:48 <icinga-wm> PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:34:48 <icinga-wm> PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2017-04-11 02:36:18 <icinga-wm> RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy 2017-04-11 02:36:18 <icinga-wm> RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy 2017-04-11 02:36:28 <icinga-wm> RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy 2017-04-11 02:36:29 <icinga-wm> RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 2017-04-11 02:36:29 <icinga-wm> RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy 2017-04-11 02:36:30 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy 2017-04-11 02:36:38 <icinga-wm> RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy 2017-04-11 02:36:39 <icinga-wm> RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy 2017-04-11 02:36:39 <icinga-wm> RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy 2017-04-11 02:36:40 <icinga-wm> RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy 2017-04-11 02:36:40 <icinga-wm> RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy 2017-04-11 02:36:41 <icinga-wm> RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy 2017-04-11 02:37:19 <icinga-wm> RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy 2017-04-11 02:37:28 <icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 2017-04-11 02:37:28 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy 2017-04-11 02:37:28 <icinga-wm> RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy 2017-04-11 02:37:38 <icinga-wm> RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy



