On-call batphone escalation configuration holidays FY2023-24
Open, In Progress, Low, Public

Description

This is a tracking task to outline changes to paging routing for on-call rotation during specific holidays as outlined in https://wikitech.wikimedia.org/wiki/SRE/Oncall, favoring the batphone for these instances. Notices will be sent to internal mailing lists.

  • November 1st, 2023 (EMEA)
  • November 10th, Veterans Day (Americas)
  • November 23rd/24th, Thanksgiving (Americas)
  • December 25th-29th, End of Year Holiday (Global)
  • Monday, January 1st, New Year's Day (Americas)
  • Monday, January 15th Martin Luther King Jr Day (Americas)
  • Monday, February 19th U.S. Presidents' Day (Americas)
  • Friday, March 29th, Good Friday (EMEA)
  • Monday, April 1st, Easter Monday (EMEA)
  • Monday, April 22nd, Earth Day (Americas)
  • Wednesday, May 1st, Intl Labor Day (EMEA)
  • Thursday, May 9th, Feast of the Ascension (EMEA)
  • Monday, May 27th Memorial Day (Americas)
  • Wednesday, June 19th Juneteenth (Americas)

This list may not be complete; please help by expanding it with broadly known holidays not yet listed (for EMEA, Canada, Mexico, et al.).
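As a rough illustration only, the list above could be kept as data so that "which rotation is replaced by the batphone today?" becomes a lookup. This is a sketch, not the team's tooling: the names `HOLIDAYS` and `batphone_regions` are hypothetical, the years are inferred from the FY2023-24 title and the weekdays given, and "Global" is assumed to mean both EMEA and Americas.

```python
from datetime import date

# Assumption: "Global" in the list means both regional rotations.
EMEA, AMERICAS = "EMEA", "Americas"
GLOBAL = frozenset({EMEA, AMERICAS})

# (first day, last day, regions), transcribed from the task description.
# Years are an assumption inferred from "FY2023-24".
HOLIDAYS = [
    (date(2023, 11, 1), date(2023, 11, 1), {EMEA}),
    (date(2023, 11, 10), date(2023, 11, 10), {AMERICAS}),   # Veterans Day
    (date(2023, 11, 23), date(2023, 11, 24), {AMERICAS}),   # Thanksgiving
    (date(2023, 12, 25), date(2023, 12, 29), GLOBAL),       # End of year
    (date(2024, 1, 1), date(2024, 1, 1), {AMERICAS}),       # New Year's Day
    (date(2024, 1, 15), date(2024, 1, 15), {AMERICAS}),     # MLK Jr Day
    (date(2024, 2, 19), date(2024, 2, 19), {AMERICAS}),     # Presidents' Day
    (date(2024, 3, 29), date(2024, 3, 29), {EMEA}),         # Good Friday
    (date(2024, 4, 1), date(2024, 4, 1), {EMEA}),           # Easter
    (date(2024, 4, 22), date(2024, 4, 22), {AMERICAS}),     # Earth Day
    (date(2024, 5, 1), date(2024, 5, 1), {EMEA}),           # Intl Labor Day
    (date(2024, 5, 9), date(2024, 5, 9), {EMEA}),           # Ascension
    (date(2024, 5, 27), date(2024, 5, 27), {AMERICAS}),     # Memorial Day
    (date(2024, 6, 19), date(2024, 6, 19), {AMERICAS}),     # Juneteenth
]


def batphone_regions(day):
    """Return the set of regions whose rotation the batphone replaces on `day`."""
    regions = set()
    for start, end, who in HOLIDAYS:
        if start <= day <= end:
            regions |= who
    return regions
```

For example, `batphone_regions(date(2024, 5, 9))` yields only EMEA, which matches the later correction in this task that May 9th applies to EMEA and not Americas.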

Ref: Employing the same methods/process as T340763: Adjusting On-Call Escalation Policies in Splunk for Upcoming 2023 July 4th

Event Timeline

  • Removed the EMEA rotation from the escalation; it will need to be re-added on the afternoon of Nov 1st (Americas).
lmata updated the task description.
lmata changed the task status from Open to In Progress. Nov 2 2023, 5:22 AM
lmata triaged this task as Low priority.

One additional request: the last week of the year is WMF holidays and I was scheduled for clinic duty. I still want to do my part, so could you override that (nobody is supposed to be around during Christmas week) and add me at the end of the schedule? Or, if that is too much change, I can just be the person to ping if someone cannot do their clinic duty turn, or help the person after Christmas, whatever you see fit.

@jcrespo no worries, I'll take care of the override. Due to Splunk config/UI restrictions, the calendar shows active members, but those folks have been "marked" in the internal roster as unavailable for that week. I'll add you to the rota for the next round. Thanks!

Aklapper renamed this task from On-call batphone escalation configuration holidays Q2 FY2023 to On-call batphone escalation configuration holidays Q2 FY2023-24. Nov 6 2023, 1:54 PM

[Corrected title as Q2 FY2023 was in late 2022]

many thanks!

Added Veterans Day (US) to the list.

lmata renamed this task from On-call batphone escalation configuration holidays Q2 FY2023-24 to On-call batphone escalation configuration holidays FY2023-24. Nov 9 2023, 5:48 PM
lmata updated the task description.

Monday, January 1st, New Year's Day (Americas)
Monday, April 22nd, Earth Day (Americas)

These are actually global holidays as of https://office.wikimedia.org/wiki/HR_Corner/Holiday_List

lmata updated the task description.

@lmata the UI looks a bit like a lot of individuals are on-call rather than batphone. Is that intentional, or am I confused?

@MatthewVernon, it does look weird. I think it's just the UI: when I added Batphone to the escalation path instead of the EMEA/Americas rotation, it seems to have expanded the Batphone list as the escalation, showing folks as on-call or off-call according to their individual schedules within the Batphone rotation.

Batphone has been removed, and the business-hours on-call rota is enabled again in Splunk on-call.

batphone enabled for MLK

We are back to regular on-call

Screenshot 2024-02-18 at 7.06.33 PM.png (776×2 px, 111 KB)

Just need to add Americas back on Tue Feb 20th

lmata updated the task description.

Set batphone for today until Monday COB

Mentioned in SAL (#wikimedia-operations) [2024-04-02T08:02:49Z] <godog> restore SRE business hours routing/escalation after the holidays - T350192

We will need to re-enable regular on-call tomorrow

Mentioned in SAL (#wikimedia-operations) [2024-04-23T08:04:47Z] <godog> restore sre business hour escalation policy - T350192

@lmata FYI, today VO thought that Jaime and I were both on-call instead of the people we had swapped with. I fixed it for this week, but also noticed that for all the existing overrides, the EMEA escalation routing has been unassigned. This is possibly related to the May 1st batphone change for EMEA.
I didn't change the other overrides (which hence still need fixing), and I also wanted to bring up the issue because the current procedure for putting the batphone in place for a day might be the cause of this.

thanks! filed task T364006: Splunk On-Call resets overrides after changes to the escalation to track this separately

Mentioned in SAL (#wikimedia-operations) [2024-05-09T08:13:23Z] <godog> set batphone oncall for May 9th - T350192

Mentioned in SAL (#wikimedia-operations) [2024-05-09T08:30:48Z] <godog> set batphone oncall for May 9th only for EMEA, not Americas - T350192

Mentioned in SAL (#wikimedia-operations) [2024-05-10T08:30:40Z] <godog> restore SRE business hours oncall for EMEA - T350192

I noticed that batphone wasn't in the steps for the SRE business hours escalation. I've added back a step to route to the SRE EMEA rotation, and this worked. I believe removing the step resets the overrides when we put the step back again. I've then reset the overrides and put back the folks that were there previously, which restored things as they were.

I couldn't find a documented process to temporarily set batphone for EMEA or Americas, though we should definitely have one! cc @lmata

To clarify, what happens is that overrides get unset for the escalation "SRE Business Hours (Escalation)" and folks need to be added back (e.g. for shift swaps)
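One way to make the "add folks back" step less error-prone would be to record the overrides before touching the escalation and compare afterwards. The following is only a toy sketch over plain Python data: it does not call Splunk On-Call, and the function name, the `(scheduled_person, covering_person)` tuple shape, and the example names are all hypothetical.

```python
def overrides_to_restore(before, after):
    """List overrides that existed before the batphone change but are gone after it.

    Each override is a hypothetical (scheduled_person, covering_person) tuple
    for a single escalation policy.
    """
    remaining = set(after)
    return [override for override in before if override not in remaining]


# Placeholder names, not real roster entries.
snapshot = [("alice", "bob"), ("carol", "dave")]  # noted before the change
current = [("carol", "dave")]                      # seen after restoring the step
lost = overrides_to_restore(snapshot, current)
```

Here `lost` holds the shift swaps that would need to be re-entered by hand for "SRE Business Hours (Escalation)".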

Also extra confusingly (to me) we have the following escalations:

  • SRE Business Hours (Escalation)
  • Business Hours Americas (no routing keys associated)
  • Business Hours EMEA (no routing keys associated)

I'm not sure how/if the last two are used; if they are not, I think we should remove them.