
mobileapps/aqs/recommendation-api (nodejs services) improve resilience against short network outages
Closed, Duplicate (Public)

Description

Follow-up task for incident 20190920-d2_switch_failure.

https://etherpad.wikimedia.org/p/incident-d2-hosts

The AQS, mobileapps and recommendation-api services were down longer than necessary after a switch failure.

The errors shown in systemctl status (for mobileapps and others) were failed DNS lookups (for the host statsd.eqiad.wmnet or for logstash).

A service restart across the scb* cluster and on AQS hosts fixed the issue.

"Some services (e.g. AQS, mobileapps) continued to be impacted long after the initial blip, and even after recovery of the failed switch. The general theme here for at least some of the nodejs services seems to be related to application-layer caching of transient DNS failures (from the very brief blip of their network reachability towards recursive DNS (and everything)), and they were fixed with service restarts. Other related APIs and services (e.g. restbase and such, and MW APIs that interact with these things) also had alerts due to this indirectly."
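One possible mitigation for the failure mode described in the quote above, sketched in TypeScript under stated assumptions. This is not how the affected services actually resolve hosts. The class name ResilientLookup, its parameters and the default values are hypothetical. The only real API used is Node's dns.promises.lookup. The idea is to cache successful lookups for a short TTL, never cache failures, and retry with exponential backoff so that a brief loss of reachability to the recursive DNS servers does not stick until a restart.

```typescript
import { promises as dns } from "dns";

// A resolver maps a hostname to one address string.
type Resolver = (host: string) => Promise<string>;

interface CacheEntry {
  address: string;
  expires: number; // epoch milliseconds
}

// Default resolver: Node's dns.promises.lookup, keeping only the address.
const systemResolver: Resolver = async (host) => (await dns.lookup(host)).address;

// Hypothetical helper: caches successes only and retries transient failures.
class ResilientLookup {
  private readonly cache = new Map<string, CacheEntry>();
  private readonly resolve: Resolver;
  private readonly ttlMs: number;
  private readonly retries: number;
  private readonly baseDelayMs: number;

  constructor(
    resolve: Resolver = systemResolver,
    ttlMs = 30_000, // assumed TTL, not taken from the task
    retries = 3, // assumed retry count
    baseDelayMs = 100, // assumed initial backoff
  ) {
    this.resolve = resolve;
    this.ttlMs = ttlMs;
    this.retries = retries;
    this.baseDelayMs = baseDelayMs;
  }

  async lookup(host: string): Promise<string> {
    const hit = this.cache.get(host);
    if (hit && hit.expires > Date.now()) {
      return hit.address; // fresh successful result
    }

    let lastError: unknown;
    for (let attempt = 0; attempt <= this.retries; attempt++) {
      try {
        const address = await this.resolve(host);
        // Only successes are stored; a failure never enters the cache.
        this.cache.set(host, { address, expires: Date.now() + this.ttlMs });
        return address;
      } catch (err) {
        lastError = err;
        if (attempt < this.retries) {
          // Exponential backoff: baseDelayMs, 2x, 4x, ...
          const delay = this.baseDelayMs * 2 ** attempt;
          await new Promise<void>((done) => setTimeout(done, delay));
        }
      }
    }
    // All attempts failed. The next call starts a fresh resolution
    // instead of replaying a remembered error.
    throw lastError;
  }
}
```

A service could call `new ResilientLookup().lookup("statsd.eqiad.wmnet")` before opening its metrics socket. Because errors are never stored, a lookup that fails during a network blip would be retried on the next call rather than requiring a service restart.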

Event Timeline

Dzahn renamed this task from mobileapps/aqs/receommendation-api (nodejs services) improve resilience against short network outages to mobileapps/aqs/recommendation-api (nodejs services) improve resilience against short network outages. Sep 23 2019, 8:38 PM
Dzahn added projects: serviceops, Services.