Database hosts in the active DC should page when they go down
Closed, Duplicate · Public

Assigned To

None

Authored By

	Legoktm
	Oct 27 2021, 8:49 PM

Description

Follow-up from https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-25_s3_db_recentchanges_replica

db1112 went down, but did not trigger an alert that paged until it was manually rebooted. In follow-up discussion, @Marostegui said that database hosts in the active DC should page when they go down.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		LSobanski	T295154 Incident: 2021-10-25 s3 db recentchanges replica
		Duplicate		None	T294490 Database hosts in the active DC should page when they go down

Event Timeline

Legoktm created this task. Oct 27 2021, 8:49 PM

Restricted Application added a subscriber: Aklapper. Oct 27 2021, 8:49 PM

We generally do not page on "Host down/up" events in Icinga, but we do page on MySQL replica lag. The page came while the host was booting after the reboot: Icinga tried to check the replica lag and got a connection refused at that moment. The page had nothing to do with the actual "HOST down" alert in Icinga.

If we want to start paging on the "host down" (ping) check that the base module adds to all hosts, then we need some logic that lets us set "critical => true" based on a Hiera key.

But it seemed to me that we don't want ALL database hosts to do that, only masters in the active data center, or am I wrong? That would have to be sorted out at the Puppet role or Hiera hostname level.
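
To make the proposed rule concrete, here is a minimal Python sketch of the decision being discussed: page on host down only for masters in the active data center, with a per-host override standing in for the Hiera key. It is not the Puppet or Icinga implementation; `DbHost`, `pages_on_host_down`, the override mapping, and all host and data center names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DbHost:
    """Hypothetical description of a database host."""
    name: str
    datacenter: str
    is_master: bool


def pages_on_host_down(
    host: DbHost,
    active_dc: str,
    overrides: Optional[Mapping[str, bool]] = None,
) -> bool:
    """Return True if a host-down alert for this host should page.

    `overrides` plays the role of a per-host Hiera key (an assumption):
    an explicit entry wins over the default rule.
    """
    if overrides and host.name in overrides:
        return overrides[host.name]
    # Default rule from the discussion: only masters in the active DC page.
    return host.is_master and host.datacenter == active_dc


# Hypothetical inventory; none of these names come from the real fleet.
master_active = DbHost("db-master-a", "dc-a", is_master=True)
master_passive = DbHost("db-master-b", "dc-b", is_master=True)
replica_active = DbHost("db-replica-a", "dc-a", is_master=False)
special_replica = DbHost("db-special-a", "dc-a", is_master=False)

# Opt a single non-master host into paging, as a host-level key would.
overrides = {"db-special-a": True}

for h in (master_active, master_passive, replica_active, special_replica):
    print(h.name, pages_on_host_down(h, active_dc="dc-a", overrides=overrides))
```

Keeping the override separate from the default rule means a non-master host can opt in without changing the rule for every replica.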

We've discussed this in the past already, and we actually have T233684: Make primary DB masters page on HOST DOWN alert for that. That task only covers masters, which I would already see as a good step. I don't know how difficult it is, but from what we've seen it is not easy to page on HOST DOWN, whether for masters only or for all databases. Any help would be much appreciated.
As Daniel says, a Hiera key would be ideal for deciding whether to also page on a specific host that is not a master (e.g. the sanitarium master).

I am going to merge this task with the one pasted above.

Marostegui closed this task as a duplicate of T233684: Make primary DB masters page on HOST DOWN alert. Oct 28 2021, 4:27 AM

Dzahn mentioned this in T233684: Make primary DB masters page on HOST DOWN alert. Oct 29 2021, 8:09 PM

OK, thanks! Continued here: T233684#7468879

herron added a parent task: T295154: Incident: 2021-10-25 s3 db recentchanges replica. Nov 5 2021, 2:43 PM
