
Database hosts in the active DC should page when they go down
Closed, Duplicate · Public

Description

Follow-up from https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-25_s3_db_recentchanges_replica

db1112 went down, but no paging alert fired until the host was manually rebooted. In follow-up discussion, @Marostegui said that database hosts in the active DC should page when they go down.

Event Timeline

We generally do not page on "Host down/up" events in Icinga, but we do page on MySQL replica lag. What paged us was the replica lag check: when the host came back up and was still booting, Icinga tried to check the replica lag and got a connection refused. The page had nothing to do with the actual "HOST down" alert in Icinga.

If we want to start paging on the "host down" (ping) check, which the base module adds to all hosts, then we need to add some logic that lets us set "critical => true" based on a Hiera key.

But it seemed to me that we don't want ALL database hosts to page, only masters in the active data center, or am I wrong? That would have to be sorted out at the Puppet role level or at the Hiera per-host level.
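
To make the decision logic concrete, here is a rough Python sketch of the lookup order described above. It is not Puppet code and not how the base module works today. The role name `db_master`, the host and data center names, and the rule that a per-host setting wins over the role default are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    role: str        # hypothetical role name, e.g. "db_master"
    datacenter: str  # hypothetical data center name, e.g. "dc1"


def host_down_pages(host, active_dc, host_overrides, role_defaults):
    """Return True if a "host down" alert for this host should page.

    host_overrides: per-host settings (stand-in for a Hiera host-level key).
    role_defaults:  per-role settings (stand-in for a Hiera role-level key).
    """
    # Most specific level first: an explicit per-host setting always wins
    # (assumption: this holds even outside the active data center).
    if host.name in host_overrides:
        return host_overrides[host.name]
    # Role-level defaults only apply in the active data center.
    if host.datacenter != active_dc:
        return False
    # Anything without a role-level setting keeps today's behaviour: no page.
    return role_defaults.get(host.role, False)


# Hypothetical data: masters page by default, one non-master is opted in.
role_defaults = {"db_master": True}
host_overrides = {"db-special-1": True}

master = Host("db-master-1", "db_master", "dc1")
replica = Host("db-replica-1", "db_replica", "dc1")
special = Host("db-special-1", "db_replica", "dc1")

for h in (master, replica, special):
    print(h.name, host_down_pages(h, "dc1", host_overrides, role_defaults))
```

With these inputs only the master and the explicitly opted-in host would page. A master in the passive data center would not, unless it also had a per-host setting.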

We've already discussed this in the past, and we have T233684: Make primary DB masters page on HOST DOWN alert for that. That task only talks about masters, which I would already see as a good thing. I don't know how difficult it is, but from what we've seen it is not easy to page on HOST DOWN, whether for masters only or for all databases. Any help would be much appreciated.
As Daniel says, a Hiera key would be ideal for deciding whether to also page on a specific host that is not a master (e.g. the sanitarium master).

I am going to merge this task with the one pasted above.