This is a follow-up to https://wikitech.wikimedia.org/wiki/Incident_documentation/20190628-EventStreams
The EventStreams external service health check currently sends an alarm only to analytics. This was set up in the past to avoid spamming the SRE team with notifications while the service was not stable enough. It would be useful to add the SRE team to the check's contact groups, but first we need to create a proper runbook and link it to the alert. The runbook should be created in:
As part of this task we should also work out whether a service health check that reports an UNKNOWN state in Icinga should raise an alarm after a while (or whether the check itself should be changed). The incident report describes a long outage of the codfw EventStreams backend service that was not fixed until somebody happened to look at the Icinga UI and investigated the source of the UNKNOWNs.
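One possible shape for the "alarm after a while" idea, as a rough sketch only: treat an UNKNOWN that has persisted beyond some threshold as CRITICAL for alerting purposes. The function name `effective_state` and the 30-minute default are hypothetical and not taken from the existing check or from the Icinga configuration; only the plugin exit codes (0 to 3) follow the standard Nagios/Icinga convention.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def effective_state(state: int, seconds_in_state: float,
                    max_unknown_seconds: float = 1800.0) -> int:
    """Return the state to alert on (hypothetical helper).

    An UNKNOWN that has lasted at least max_unknown_seconds is escalated
    to CRITICAL so that it notifies the contact groups; every other state
    is passed through unchanged. The 1800 s default is an assumption.
    """
    if state == UNKNOWN and seconds_in_state >= max_unknown_seconds:
        return CRITICAL
    return state
```

Whether this belongs in a wrapper around the check, in the check itself, or in the notification settings is part of what this task needs to decide.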