
nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability
Open, High · Public

Description

When DNS lookups are unavailable for even very short periods, several nodejs and/or restbase-related services seem to fall into a pattern of persistent failure, even long after DNS service has returned to a normal state. Usually only service daemon restarts seem to fix this state. It's almost as if they persistently cache the failed hostname lookup in some indirect way.
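For illustration only (a hypothetical sketch, not code from mobileapps, restbase, or any other affected service): one indirect way a Node.js service can end up "caching" a failed lookup is by memoizing the lookup promise itself at module scope, so a rejection that happens during a brief DNS outage is handed to every later caller until the process restarts.

```
// Hypothetical illustration only: a module-level cache that memoizes the
// *promise* of a DNS lookup. If the first lookup fails (e.g. during a short
// recdns blip), the rejected promise stays cached and every later request
// fails until the process is restarted.
import { promises as dns } from 'node:dns';

const lookupCache = new Map<string, Promise<string>>();

export function cachedLookup(hostname: string): Promise<string> {
  let p = lookupCache.get(hostname);
  if (!p) {
    p = dns.lookup(hostname).then((r) => r.address);
    lookupCache.set(hostname, p); // BUG: a rejected promise is cached forever
  }
  return p;
}

// A variant that forgets failures, so the next call retries the lookup:
export function cachedLookupRetrying(hostname: string): Promise<string> {
  let p = lookupCache.get(hostname);
  if (!p) {
    p = dns.lookup(hostname).then((r) => r.address);
    p.catch(() => lookupCache.delete(hostname)); // drop failed entries
    lookupCache.set(hostname, p);
  }
  return p;
}
```

Whether the affected services do anything like this is exactly what debugging would need to confirm; keep-alive agents or other connection pools that resolve a hostname once at startup would show similar symptoms.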

Old ticket text for posterity, before we understood this a little better:
--------------------------

The actual problem here is an unknown, and I don't even have any good working theories.

Background:

  1. /etc/resolv.conf is templated by puppet to set recdns for client machines. The first entry in codfw or eqiad is an LVS service IP for the local recdns service.
  2. The LVS recdns service is backed by two machines: in eqiad it's hydrogen and chromium; in codfw it's acamar and achernar.
  3. When we depool a recdns machine, it should theoretically be transparent to clients. UDP DNS queries should continue flowing correctly throughout the depool transition (but now all to one server). If a recdns client were using persistent TCP (which we don't see any evidence of), it would get a RST but be able to immediately reconnect successfully. (A probe sketch for this expectation follows below.)
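A minimal probe for the expectation in point 3, assuming nothing about the actual tooling in use: query each recdns backend directly, bypassing the LVS service IP, and confirm that each one still answers within the normal timeout. The server IPs below are placeholders, not the real hydrogen/chromium or acamar/achernar addresses.

```
// Hedged sketch (Node 16+): query each recdns backend directly via c-ares,
// bypassing the LVS service IP. The server IPs are placeholders.
import { Resolver } from 'node:dns/promises';

async function probe(server: string, name = 'en.wikipedia.org'): Promise<void> {
  const resolver = new Resolver({ timeout: 2000, tries: 1 });
  resolver.setServers([server]);
  const t0 = Date.now();
  try {
    const addrs = await resolver.resolve4(name);
    console.log(`${server}: ${name} -> ${addrs[0]} in ${Date.now() - t0}ms`);
  } catch (err) {
    console.log(`${server}: FAILED after ${Date.now() - t0}ms (${err})`);
  }
}

async function main(): Promise<void> {
  // Substitute the real recdns backend IPs for the site under test.
  await Promise.all(['192.0.2.1', '192.0.2.2'].map((s) => probe(s)));
}

void main();
```

Since the depooled backend's recursor keeps running during these depools (see the note below), both direct probes would normally still succeed; the interesting question is whether clients using only the LVS service IP see anything worse than a single lost-and-retried UDP query.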

Observations when depooling eqiad recdns:

[Note: throughout this, hydrogen's recdns service was never stopped or restarted; only the depool action happened.]
2017-04-10 23:14:31	<wikibugs__>	(CR) BBlack: [C: 2] eqiad lvs: do not directly use hydrogen temporarily [puppet] - https://gerrit.wikimedia.org/r/347532 (owner: BBlack)
2017-04-10 23:18:53	<logmsgbot>	!log bblack@neodymium conftool action : set/pooled=no; selector: name=hydrogen.wikimedia.org,service=pdns_recursor
2017-04-10 23:19:01	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2017-04-10 23:21:18	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:18	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:28	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:28	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:28	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:28	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:38	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:38	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:38	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:38	<icinga-wm>	PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:38	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:52	<icinga-wm>	PROBLEM - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
2017-04-10 23:21:58	<icinga-wm>	PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!
2017-04-10 23:21:58	<icinga-wm>	PROBLEM - eventlogging-service-eventbus endpoints health on kafka1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:58	<icinga-wm>	PROBLEM - eventlogging-service-eventbus endpoints health on kafka1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:21:58	<icinga-wm>	PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka1002.eqiad.wmnet because of too many down!
2017-04-10 23:22:29	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-10 23:22:51	<icinga-wm>	RECOVERY - LVS HTTP IPv4 on eventbus.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2247 bytes in 4.661 second response time
2017-04-10 23:23:18	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
2017-04-10 23:23:28	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
2017-04-10 23:23:29	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
2017-04-10 23:23:38	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
2017-04-10 23:23:38	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
2017-04-10 23:23:38	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
2017-04-10 23:23:38	<icinga-wm>	RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy
2017-04-10 23:23:38	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
2017-04-10 23:23:48	<icinga-wm>	RECOVERY - eventlogging-service-eventbus endpoints health on kafka1002 is OK: All endpoints are healthy
2017-04-10 23:23:48	<icinga-wm>	RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
2017-04-10 23:23:48	<icinga-wm>	RECOVERY - eventlogging-service-eventbus endpoints health on kafka1003 is OK: All endpoints are healthy
2017-04-10 23:24:08	<icinga-wm>	RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
2017-04-10 23:24:18	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
2017-04-10 23:24:18	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
2017-04-10 23:24:28	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
2017-04-10 23:24:28	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[In this case I did proceed with the upgrade-restart before the messages started coming in.]
2017-04-11 02:31:52	<logmsgbot>	!log bblack@neodymium conftool action : set/pooled=no; selector: name=chromium.wikimedia.org,service=pdns_recursor
2017-04-11 02:32:03	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2017-04-11 02:32:52	<bblack>	!log upgrading chromium to pdns-recursor 4.x
2017-04-11 02:33:00	<stashbot>	Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2017-04-11 02:34:28	<icinga-wm>	PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:28	<icinga-wm>	PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:28	<icinga-wm>	PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:28	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:28	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:29	<icinga-wm>	PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:29	<icinga-wm>	PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:30	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:30	<icinga-wm>	PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:31	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:31	<icinga-wm>	PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:32	<icinga-wm>	PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:32	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:33	<icinga-wm>	PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:48	<icinga-wm>	PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:48	<icinga-wm>	PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:48	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:48	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:34:48	<icinga-wm>	PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2017-04-11 02:36:18	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
2017-04-11 02:36:18	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
2017-04-11 02:36:28	<icinga-wm>	RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
2017-04-11 02:36:29	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
2017-04-11 02:36:29	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
2017-04-11 02:36:30	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
2017-04-11 02:36:38	<icinga-wm>	RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
2017-04-11 02:36:39	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
2017-04-11 02:36:39	<icinga-wm>	RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
2017-04-11 02:36:40	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
2017-04-11 02:36:40	<icinga-wm>	RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
2017-04-11 02:36:41	<icinga-wm>	RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
2017-04-11 02:37:19	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
2017-04-11 02:37:28	<icinga-wm>	PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
2017-04-11 02:37:28	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
2017-04-11 02:37:28	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
2017-04-11 02:37:38	<icinga-wm>	RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy

Event Timeline

Restricted Application added a subscriber: Aklapper.

There are big drops in *action* API backend requests and huge spikes in latency around both times.

The median REST API backend request latency as observed by Varnish remained stable around the same time.

The same action API latency spikes were observed by RESTBase.

The rate of external requests hitting RESTBase & served from storage did not change at all during the same time.

Going through the rates and latencies of REST API endpoints, it seems that primarily endpoints that directly or indirectly hit the action API (for example, via Parsoid) slowed down. Others, like pageviews, math or citations, did not change in rate or latency.

Overall, the data points towards high latency in the action API after the DNS de-pools, especially in the median latency. There does not seem to be a slow-down in node DNS resolution in general, as evidenced by the lack of issues between node services (like mathoid, citoid, aqs).
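One way to firm up the "no general slow-down in node DNS resolution" observation during a future depool would be a probe along these lines (a sketch; the hostname is an example to substitute). dns.lookup() goes through getaddrinfo on the libuv threadpool, which is what Node HTTP agents use by default, while dns.resolve4() queries the resolv.conf nameservers directly via c-ares, so a slowdown visible only in lookup() would point at threadpool or getaddrinfo behaviour rather than at the recursors.

```
// Hedged sketch: periodically time dns.lookup (getaddrinfo / libuv threadpool)
// against dns.resolve4 (c-ares, talks to the resolv.conf nameservers directly).
// The hostname is an example; substitute whatever the services actually resolve.
import { promises as dns } from 'node:dns';

const host = 'en.wikipedia.org';

async function timed(label: string, fn: () => Promise<unknown>): Promise<void> {
  const t0 = process.hrtime.bigint();
  let outcome = 'ok';
  try {
    await fn();
  } catch (err) {
    outcome = `failed (${err})`;
  }
  const ms = Number(process.hrtime.bigint() - t0) / 1e6;
  console.log(`${new Date().toISOString()} ${label}: ${outcome} in ${ms.toFixed(1)}ms`);
}

setInterval(() => {
  void timed('lookup/getaddrinfo', () => dns.lookup(host));
  void timed('resolve4/c-ares   ', () => dns.resolve4(host));
}, 5000);
```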

Do we know which exact URLs failed in icinga? If so, we could double-check whether those entry points do indeed rely on the action API. We got mobileapps and RB alerts on action API outages before, so this looks very likely, but it would be good to eliminate the possibility that one of the other checks failed as well.

One thing that I don't have an explanation for is the eventlogging-service-eventbus issue. Afaik this is hitting the python eventbus service, which should not depend on the action API.

FWIW, I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case. So this seems to also be eqiad-specific (but that could just be an effect of all the real user load being in eqiad; maybe the same problem, whatever it is, happens in codfw too, but it's a non-issue due to light load).

Let's perhaps tackle T123854, so that icinga also keeps an eye on the action API?

BBlack changed the task status from Open to Stalled. · Apr 18 2017, 6:18 PM

@GWicke yeah we should.

Regardless, after we're done with the current codfw switchover/switchback, I think we should get some relevant parties lined up to look into the appropriate logs/traffic in realtime and test another recdns depool in eqiad and investigate this a little further experimentally.
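For the proposed live experiment, something as small as the following could run on a couple of scb/restbase hosts while recdns is depooled, so service-level failures can be lined up against the depool timestamps in SAL. The URL is an example health endpoint, not the exact check icinga runs, and Node 18+ is assumed for the built-in fetch.

```
// Hedged sketch (Node 18+ for global fetch): poll a service health endpoint
// every 2s and log timestamped results, to correlate failures with the
// recdns depool window. The default URL is an example, not the icinga check.
const url = process.argv[2] ?? 'http://localhost:8888/_info';

async function poll(): Promise<void> {
  const t0 = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    console.log(`${new Date().toISOString()} HTTP ${res.status} in ${Date.now() - t0}ms`);
  } catch (err) {
    console.log(`${new Date().toISOString()} ERROR after ${Date.now() - t0}ms: ${err}`);
  }
}

setInterval(() => void poll(), 2000);
```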

BBlack renamed this task from "icinga alerts on nodejs services when a recdns server is depooled" to "nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability". · Dec 6 2019, 2:11 PM
BBlack updated the task description.

While we'll work on improvements that make this less likely in the first place in various DNS infra tickets like T171498, and we're happy to help debug this with someone, ultimately this is an applayer problem independent of our DNS infra, so I'm moving it over to the Watching column.

Aklapper changed the task status from Stalled to Open. · Nov 2 2020, 5:52 PM

The previous comments no longer explain who or what (which task?) this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"), as T123854 was resolved in the meantime. Hence resetting the task status, as tasks should not remain stalled (and then potentially be forgotten) for years for unclear reasons.