follow-up task to incident 20190920-d2_switch_failure
https://etherpad.wikimedia.org/p/incident-d2-hosts
services AQS, mobileapps and recommendation-api were down longer than needed after a switch failure
The error shown in systemctl status (for mobileapps and others) was failed DNS lookups (for host statsd.eqiad.wmnet or logstash).
A service restart across the scb* cluster and on AQS hosts fixed the issue.
"Some services (e.g. AQS, mobileapps) continued to be impacted long after the initial blip, and even after recovery of the failed switch. The general theme here for at least some of the nodejs services seems to be related to application-layer caching of transient DNS failures (from the very brief blip of their network reachability towards recursive DNS (and everything)), and they were fixed with service restarts. Other related APIs and services (e.g. restbase and such, and MW APIs that interact with these things) also had alerts due to this indirectly."