Related to T258336 (db1082 crashed), and specifically to db1082's BBU, which failed (and paged) on Jul 18th, then failed again on Jul 25th without a page being issued to VO.
In fact, the pages from the 25th were deduplicated into the existing incidents on VO, which had been acknowledged but not resolved, specifically:
- Critical:db1082/MariaDB Replica SQL: s5 #page - https://portal.victorops.com/client/wikimedia#/incident/265/incidentTimeline
- Critical:db1082/MariaDB Replica IO: s5 #page - https://portal.victorops.com/client/wikimedia#/incident/266/incidentTimeline
- Critical:db1082/mysqld processes #page - https://portal.victorops.com/client/wikimedia#/incident/267/incidentTimeline
Normally, incidents resolve themselves in VO when Icinga issues the recovery. In this case, however, notifications were disabled for db1082 shortly after the incident, so the recoveries never made it to VO and the incidents of the 18th were never resolved.
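
To make the failure mode concrete, here is a minimal toy model of the deduplication behaviour described above. It is a sketch under assumptions, not the VictorOps API: the `IncidentTracker` class, its method names, and the state strings are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentTracker:
    """Toy model of incident deduplication (hypothetical, not the VO API)."""

    # Maps an alert entity to its open incident state: "triggered" or "acked".
    open_incidents: dict = field(default_factory=dict)
    # Entities for which a page was actually sent, in order.
    pages_sent: list = field(default_factory=list)

    def alert(self, entity: str, kind: str) -> str:
        if kind == "CRITICAL":
            if entity in self.open_incidents:
                # An unresolved incident already exists (even if acked),
                # so the new alert is folded into it and nobody is paged.
                return "deduplicated"
            self.open_incidents[entity] = "triggered"
            self.pages_sent.append(entity)
            return "paged"
        if kind == "RECOVERY":
            # A recovery closes the incident, so the next CRITICAL pages again.
            self.open_incidents.pop(entity, None)
            return "resolved"
        raise ValueError(f"unknown alert kind: {kind}")

    def ack(self, entity: str) -> None:
        if entity in self.open_incidents:
            self.open_incidents[entity] = "acked"


tracker = IncidentTracker()
entity = "db1082/mysqld processes"

first = tracker.alert(entity, "CRITICAL")   # Jul 18th: new incident, pages
tracker.ack(entity)
# Notifications are disabled here, so no RECOVERY ever reaches the tracker.
second = tracker.alert(entity, "CRITICAL")  # Jul 25th: folded into the acked incident
```

In this model, the second failure pages only if a `RECOVERY` (or a manual resolve) has closed the first incident beforehand.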
Non-exhaustive list of solutions (not mutually exclusive):
- "disable notifications" should still issue recoveries
- auto-resolve ack'd incidents after a threshold (options for 1h-24h are built in to VO)
- remember to manually resolve VO incidents
- auto-retrigger ack'd incidents after a threshold (options for 1h-24h are built in to VO). This will cause the alert to page again, until resolved.
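
The two threshold-based options differ only in what happens once an acknowledged incident has sat unresolved for too long. The sketch below illustrates that difference; the function name, policy strings, and return values are hypothetical and do not correspond to actual VO settings, and only the 1h-24h range comes from the options listed above.

```python
from datetime import timedelta


def action_for_acked_incident(elapsed: timedelta, threshold: timedelta, policy: str) -> str:
    """Decide what to do with an incident that was acked `elapsed` ago.

    Hypothetical helper: `policy` is "auto-resolve" or "auto-retrigger".
    """
    # The built-in options mentioned above range from 1h to 24h.
    if not timedelta(hours=1) <= threshold <= timedelta(hours=24):
        raise ValueError("threshold must be between 1h and 24h")
    if policy not in ("auto-resolve", "auto-retrigger"):
        raise ValueError(f"unknown policy: {policy}")
    if elapsed < threshold:
        return "wait"
    # Auto-resolve closes the incident, so a later failure opens a new one and pages.
    # Auto-retrigger pages again for the same incident until someone resolves it.
    return "resolve" if policy == "auto-resolve" else "page-again"


# Seven days passed between the first failure (Jul 18th) and the second (Jul 25th).
week = timedelta(days=7)
with_auto_resolve = action_for_acked_incident(week, timedelta(hours=24), "auto-resolve")
with_auto_retrigger = action_for_acked_incident(week, timedelta(hours=24), "auto-retrigger")
```

Under either policy, the incidents from the 18th would not have stayed silently acknowledged for a week, even with the longest 24h threshold.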