
incident 20170323-wikibase did not trigger Icinga paging
Open, Medium, Public

Description

Incident 20170323-wikibase was a real site outage, but it did not cause Icinga to page. Icinga alerted on IRC, and users reported the outage there, but none of the affected service checks were configured to page (critical => true), so ops did not get SMS notifications.

This is a follow-up task to ensure SMS notifications are sent in this kind of outage, either through Icinga or Catchpoint (maybe with a check that actually logs in on the wiki as a user would).

Event Timeline

Dzahn triaged this task as Medium priority. Mar 28 2017, 12:34 AM
lmata added a subscriber: lmata.

Untagging the Observability project for now as there doesn't seem to be an action item for the team. Please add back if there is anything we missed.

Well, the action item would be "ensure wikibase alerts are sending pages".

I am not sure what else it should be tagged with. Icinga and alerting seem pretty relevant to observability.

Alternative, better action item: "ensure SRE gets paged if more than X application servers are returning '500 Internal Server Error'".
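
To make that proposed rule concrete, here is a minimal sketch of the threshold logic in Python. It is only an illustration, not an existing Icinga check: the function name `should_page`, the `max_failing` parameter (standing in for "X"), and the host names are all hypothetical, and how the per-server status codes would be collected is left out.

```python
from typing import Mapping


def should_page(status_by_server: Mapping[str, int], max_failing: int) -> bool:
    """Return True when more than `max_failing` servers report HTTP 500.

    `status_by_server` maps a (hypothetical) application server name to the
    last HTTP status code observed for it. `max_failing` is the "X" from the
    proposed action item.
    """
    failing = [host for host, status in status_by_server.items() if status == 500]
    return len(failing) > max_failing


if __name__ == "__main__":
    # Hypothetical sample data: two of three servers return 500.
    sample = {"app1": 500, "app2": 500, "app3": 200}
    print(should_page(sample, max_failing=1))
```

A real check would presumably need to tolerate a few failing servers (depooled or restarting hosts) without paging, which is why the comparison is strictly "more than X" rather than "any".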

Dzahn added a subscriber: fgiunchedi.

I thought Icinga alerting was a core part of observability work, so a bit confused here.

Dzahn added a project: serviceops.
Dzahn removed subscribers: fgiunchedi, lmata.
lmata raised the priority of this task from Medium to High. Jun 9 2021, 3:20 AM
lmata moved this task from Inbox to Backlog on the observability board.

Apologies, I seem to have been confused. Scheduling for review.

Marostegui lowered the priority of this task from High to Medium. Fri, Sep 24, 9:51 AM
Marostegui added a subscriber: Marostegui.

What should we do with this task? (I don't think this is high priority anymore.)