Paging setup for WMCS
Open, Medium, Public

Description

WMCS has had scads of tickets and discussions around paging over the past couple of years, all looking to get us to a place where we have:

  • pages that go to our clinic-duty tech
  • an escalation that reaches the whole WMCS team if that tech isn't able to respond
  • a possible escalation for certain machines (dumps and wiki replicas in particular) that includes other SREs/DBAs with a direct interest in those

With the VictorOps migration, perhaps this is now possible. We would like to use it to try setting this up.

Event Timeline

Bstorm triaged this task as Medium priority. Apr 20 2020, 4:37 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Apr 20 2020, 4:37 PM
Bstorm updated the task description. Apr 20 2020, 4:49 PM
Bstorm updated the task description. Apr 20 2020, 4:53 PM
fgiunchedi moved this task from Inbox to In progress on the observability board. Apr 27 2020, 12:21 PM

I've invited WMCS folks to VO now, so you should all have invites in your inbox! Please see https://docs.google.com/document/d/1oP5tNdZKLGVqI-9Vgp0GhnzmBKs4UaIPAkPU_weahMc/edit for additional instructions/setup. Note that you are in the WMCS team instead, and there will be additional configuration to do (e.g. escalation, rotation, and the icinga contact to add). I'm not sure what your preferences are there, but it should be fairly straightforward. Let us know!

bd808 added a subscriber: bd808. Apr 28 2020, 10:16 PM

Thanks @fgiunchedi!


I started playing with the setup on the VO side and made some tweaks to the process that the core SRE folks are using. This is just me playing around, and not canonical yet, so NOBODY PANIC! :)

I made 2 rotations: "work hours" and "awake hours". Within each I added a "bd808" shift and set the days of the week and the times of day (partial hours are allowed) when I would normally be willing and able to handle a notification. Then I tweaked the "WMCS default" escalation policy to do:

  • Immediately: notify on-duty users in the "work hours" rotation
  • Unacked after 30 minutes: notify on-duty users in the "awake hours" rotation

This is a really, really crude escalation policy, but it is at least a straw dog to start thinking about how to make it better. The first obvious addition would be a "clinic duty" rotation where we actually put in our weekly rotation, and then stick that in the escalation policy before poking folks in the "work hours" rotation. I'm sure the team will come up with some other ideas as well.
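
As a rough illustration only, the two-step policy above can be modelled in a few lines of Python. This is a sketch of the logic, not how VO implements or exposes it: the `Shift`, `Step`, and `notifications` names are made up for this example, and the shift days and hours are placeholder assumptions rather than the values actually configured.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class Shift:
    """One person's availability window inside a rotation (hypothetical model)."""
    user: str
    weekdays: frozenset  # Monday=0 ... Sunday=6
    start: time
    end: time

    def covers(self, when: datetime) -> bool:
        return when.weekday() in self.weekdays and self.start <= when.time() < self.end


@dataclass(frozen=True)
class Step:
    """One escalation step: after `delay` without an ack, notify `rotation`."""
    delay: timedelta
    rotation: str


# Placeholder hours: the real shift times are not stated in this task.
ROTATIONS = {
    "work hours": [Shift("bd808", frozenset(range(0, 5)), time(9), time(17))],
    "awake hours": [Shift("bd808", frozenset(range(0, 7)), time(7), time(22))],
}

# The "WMCS default" policy as described: immediately, then after 30 minutes.
POLICY = [
    Step(timedelta(0), "work hours"),
    Step(timedelta(minutes=30), "awake hours"),
]


def on_duty(rotation: str, when: datetime) -> list:
    """Users in `rotation` whose shift covers `when`."""
    return [s.user for s in ROTATIONS[rotation] if s.covers(when)]


def notifications(triggered: datetime, now: datetime, acked: bool = False) -> list:
    """Return (rotation, users) for every step that has fired by `now`.

    On-duty users are evaluated at the moment each step fires.
    An acknowledged incident produces no notifications.
    """
    if acked:
        return []
    fired = []
    for step in POLICY:
        fire_time = triggered + step.delay
        if fire_time <= now:
            fired.append((step.rotation, on_duty(step.rotation, fire_time)))
    return fired
```

A "clinic duty" step would just be another `Step` placed at the front of `POLICY`, with its own entry in `ROTATIONS`. Note that in this sketch a step whose rotation has nobody on duty yields an empty user list, which is exactly the gap the later steps are meant to cover.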

JHedden assigned this task to Bstorm. May 5 2020, 4:42 PM
JHedden moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
fgiunchedi moved this task from In progress to Radar on the observability board. May 11 2020, 1:46 PM

Update: I chatted with @aborrero today and created a routing key 'wmcs' linked to the default WMCS escalation policy. Icinga emails can now be sent to the VO address (which is in the private repo).

Change 597047 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] nagios: add victorops-wmcs contact to the wmcs team

https://gerrit.wikimedia.org/r/597047

Change 597047 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] nagios: add victorops-wmcs contact to the wmcs team

https://gerrit.wikimedia.org/r/597047