
incident 20170323-wikibase did not trigger Icinga paging
Closed, Resolved · Public


Incident 20170323-wikibase was a real site outage, but it did not cause Icinga to page. Icinga alerted on IRC, and users reported the problem there, but none of the affected service checks were set to page (critical => true), so ops did not get SMS notifications.

This is a follow-up task to ensure SMS notifications are sent for this kind of outage, either through Icinga or Catchpoint (maybe with a check that actually logs in on the wiki like a user would).

Event Timeline

Dzahn triaged this task as Medium priority. Mar 28 2017, 12:34 AM
lmata added a subscriber: lmata.

Untagging the Observability project for now as there doesn't seem to be an action item for the team. Please add back if there is anything we missed.

Well, the action item would be "ensure wikibase alerts are sending pages".

I am not sure what else it should be tagged with. Icinga and alerting seem pretty relevant to observability.

Alternative, better action item: "ensure SRE gets paged if more than X application servers are returning '500 Internal Server Error'".
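To make the proposed rule concrete, here is a minimal sketch of the "more than X servers returning 500" condition. It is not an existing Icinga check: the function names, the host names, and the threshold value are all hypothetical, and it assumes something else has already collected the latest HTTP status per application server.

```python
from typing import Mapping


def count_failing_servers(last_status: Mapping[str, int]) -> int:
    """Count servers whose most recent probe returned HTTP 500."""
    return sum(1 for status in last_status.values() if status == 500)


def should_page(last_status: Mapping[str, int], max_failing: int) -> bool:
    """Page only when more than max_failing servers are returning 500."""
    return count_failing_servers(last_status) > max_failing


# Hypothetical probe results: host name -> latest HTTP status code.
probes = {"mw1001": 500, "mw1002": 500, "mw1003": 200, "mw1004": 500}

# Three servers are failing, which exceeds the assumed threshold of 2.
print(should_page(probes, max_failing=2))
```

Counting servers rather than alerting on a single host keeps one broken machine from paging anyone, while a site-wide failure like this incident would cross the threshold quickly.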

Dzahn added a subscriber: fgiunchedi.

I thought Icinga alerting was a core part of observability work, so a bit confused here.

Dzahn added a project: serviceops.
Dzahn removed subscribers: fgiunchedi, lmata.
lmata raised the priority of this task from Medium to High. Jun 9 2021, 3:20 AM
lmata moved this task from Inbox to Backlog on the observability board.

Apologies, I seem to have been confused. Scheduling for review.

Marostegui lowered the priority of this task from High to Medium. Sep 24 2021, 9:51 AM
Marostegui added a subscriber: Marostegui.

What should we do with this task? (I don't think this is high anymore)

Thanks for the follow-up. I agree with your assessment; this is still an open risk, so I'm bumping the scheduling.

fgiunchedi claimed this task.

SRE does get paged nowadays when there's a "low" (FSVO low) availability (i.e. high number of 5xx compared to 2xx), which I believe would have caught this incident. I'm going to boldly resolve the task, though feel free to reopen if I'm off the mark here.
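As a rough illustration of the availability idea described above (a high number of 5xx responses compared to 2xx), the following sketch computes a simple ratio and compares it to a threshold. It is not the production alert rule: the function names and the 0.999 threshold are assumptions chosen for the example.

```python
def availability(count_2xx: int, count_5xx: int) -> float:
    """Fraction of responses that succeeded, counting only 2xx and 5xx."""
    total = count_2xx + count_5xx
    if total == 0:
        # No traffic observed: treat as fully available rather than dividing by zero.
        return 1.0
    return count_2xx / total


def is_low_availability(count_2xx: int, count_5xx: int,
                        min_availability: float = 0.999) -> bool:
    """True when availability drops below the (assumed) paging threshold."""
    return availability(count_2xx, count_5xx) < min_availability


# 10 errors out of 1000 responses is 99% availability, below the assumed 99.9%.
print(is_low_availability(count_2xx=990, count_5xx=10))
```

A ratio-based check like this depends on aggregate traffic rather than any single service check, which is why it could catch an outage even when no individual check is marked as paging.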