mobileapps/aqs/recommendation-api (nodejs services) improve resilience against short network outages
Closed, Duplicate · Public

Assigned To

None

Authored By

	Dzahn
	Sep 23 2019, 8:00 PM

Description

Follow-up task to incident 20190920-d2_switch_failure:

https://etherpad.wikimedia.org/p/incident-d2-hosts

The AQS, mobileapps and recommendation-api services were down longer than necessary after a switch failure.

The errors shown in systemctl status (for mobileapps and others) were failed DNS lookups (for the hosts statsd.eqiad.wmnet or logstash).

A service restart across the scb* cluster and on AQS hosts fixed the issue.

"Some services (e.g. AQS, mobileapps) continued to be impacted long after the initial blip, and even after recovery of the failed switch. The general theme here for at least some of the nodejs services seems to be related to application-layer caching of transient DNS failures (from the very brief blip of their network reachability towards recursive DNS (and everything)), and they were fixed with service restarts. Other related APIs and services (e.g. restbase and such, and MW APIs that interact with these things) also had alerts due to this indirectly."
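One way to avoid the failure mode described above is to retry transient lookup failures with backoff and to cache only successful results, so a brief DNS outage is never "remembered" once the network recovers. The following is a minimal sketch of that idea, not the fix actually applied to these services: lookupWithRetry, createCachingLookup and all option names are hypothetical, and it assumes the standard Node.js dns.promises.lookup API as the default resolver.

```javascript
'use strict';
const dns = require('dns').promises;

// Hypothetical helper: resolve `hostname`, retrying failures with
// exponential backoff instead of giving up on the first error.
// For simplicity this retries on every error; a real service might retry
// only on temporary failures (for example err.code === 'EAI_AGAIN').
async function lookupWithRetry(hostname, options = {}) {
  const {
    resolver = (host) => dns.lookup(host), // injectable for testing
    retries = 5,                           // extra attempts after the first
    baseDelayMs = 100,                     // delay doubles on each attempt
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await resolver(hostname);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}

// Hypothetical cache wrapper: only successful lookups are stored, so a
// failed lookup is always retried on the next call.
function createCachingLookup(options = {}) {
  const { ttlMs = 30000, now = () => Date.now(), ...retryOptions } = options;
  const cache = new Map();

  return async function cachedLookup(hostname) {
    const hit = cache.get(hostname);
    if (hit && hit.expires > now()) {
      return hit.value;
    }
    // If this throws, nothing is written to the cache.
    const value = await lookupWithRetry(hostname, retryOptions);
    cache.set(hostname, { value, expires: now() + ttlMs });
    return value;
  };
}
```

A caller would create one cachedLookup per process, for example const lookup = createCachingLookup({ ttlMs: 10000 }), and call await lookup('statsd.eqiad.wmnet') before opening a connection. The resolver and sleep options exist only so the behaviour can be exercised without real DNS traffic.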

Related Objects

Mentioned Here: T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability

Event Timeline

Dzahn created this task. Sep 23 2019, 8:00 PM

Restricted Application added a subscriber: Aklapper. Sep 23 2019, 8:00 PM

Dzahn updated the task description. Sep 23 2019, 8:15 PM

Dzahn renamed this task from "mobileapps/aqs/receommendation-api (nodejs services) improve resilience against short network outages" to "mobileapps/aqs/recommendation-api (nodejs services) improve resilience against short network outages". Sep 23 2019, 8:38 PM

Dzahn added projects: serviceops, Services.

Is this a duplicate or superset of T162818?

jijiki subscribed. Dec 2 2019, 7:47 AM

BBlack closed this task as a duplicate of T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability. Dec 6 2019, 2:08 PM
