Move icinga alarm for the EventStreams external endpoint to SRE
Closed, Resolved | Public | 1 Estimated Story Points

Description

This is a follow-up to https://wikitech.wikimedia.org/wiki/Incident_documentation/20190628-EventStreams

The EventStreams external service health check currently alerts only the Analytics team. This was done in the past to avoid spamming the SRE team with notifications while the service was not yet stable enough. It would be useful to add the SRE team to the contact groups of the check, but first we need to create a proper runbook and link it to the alert. The runbook should be created in:

https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration

As part of this task we should also decide whether a service health check that reports an UNKNOWN state in Icinga should alert after some time (or whether the check itself should be changed). The incident report describes a long outage of the codfw EventStreams backend service that was not fixed until somebody happened to look at the Icinga UI and investigated the source of the UNKNOWNs.
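
To make the UNKNOWN question concrete, here is a minimal sketch of one possible policy: tolerate short UNKNOWNs, but notify once the state has persisted past a grace period. This is plain Python for illustration, not Icinga configuration. The names `CheckStatus`, `should_alert` and `UNKNOWN_GRACE` are hypothetical, and the 30-minute threshold is an assumption rather than a value agreed in this task.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical grace period; this task does not specify a value.
UNKNOWN_GRACE = timedelta(minutes=30)


@dataclass
class CheckStatus:
    state: str       # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    since: datetime  # when the check entered this state


def should_alert(status: CheckStatus, now: datetime,
                 grace: timedelta = UNKNOWN_GRACE) -> bool:
    """Return True if contacts should be notified under this sketch policy."""
    if status.state == "CRITICAL":
        return True
    if status.state == "UNKNOWN":
        # Brief UNKNOWNs are ignored; persistent ones are escalated.
        return now - status.since >= grace
    # OK and WARNING do not notify in this simplified sketch.
    return False
```

Under this policy, an UNKNOWN that lasts for hours, as in the incident, would notify the contacts instead of waiting for somebody to notice it in the UI.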

Event Timeline

+1 I think this alarm should alert SRE.

Change 520475 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] eventstreams: add admins contact to eventstreams check

https://gerrit.wikimedia.org/r/520475

Milimetric raised the priority of this task from Medium to High.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Change 520475 merged by Ottomata:
[operations/puppet@production] eventstreams: add admins contact to eventstreams check

https://gerrit.wikimedia.org/r/520475

Nuria set the point value for this task to 1.
elukey lowered the priority of this task from High to Medium.

We didn't discuss whether SERVICE UNKNOWN should alarm for some services :)