
improve controls for on-call rotation management
Closed, Resolved (Public)

Description

We've had several events where folks with Splunk on-call access modified the on-call rotation instead of creating overrides. This creates drift that ripples across schedules and leads to collisions (i.e., people on the same team being on-call for the same shift/week).
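To make "collision" concrete, here is a minimal sketch of a check for it. The shift labels, names, and team mapping are hypothetical illustration data, not taken from the real schedule, and nothing here talks to Splunk On-Call.

```python
from collections import defaultdict

def find_team_collisions(shifts, team_of):
    """Return {shift: {team: [people]}} for shifts where one team appears twice.

    shifts  -- mapping of shift label -> list of people on call (hypothetical shape)
    team_of -- mapping of person -> team name (hypothetical shape)
    """
    collisions = {}
    for shift, people in shifts.items():
        by_team = defaultdict(list)
        for person in people:
            by_team[team_of[person]].append(person)
        # Keep only teams with more than one person on the same shift.
        clashes = {team: members for team, members in by_team.items() if len(members) > 1}
        if clashes:
            collisions[shift] = clashes
    return collisions
```

Running a check like this against each snapshot of the schedule would flag a drifted rotation before the shift starts.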

I see two options:

  • Send alert/notification when on-call rotation is altered
  • Identify some way to lock down editing of the rotation to a set of admins.

Event Timeline

example:
{F36967393}

I had explicitly addressed this at the start of the quarter, as it had happened in Q3. So something changed or was reverted, because this rotation does not match my records from Apr 1st.

I've linked T313958 since overall both of the options outlined in the description should be features present in the oncall system itself.

Wrt near term options...

Send alert/notification when on-call rotation is altered -- I don't see an option for native VO notifications on schedule changes. There are webhooks for things like on/off call events, but nothing for the schedule itself. We may be able to enable email notifications for watched page edits on wikitech to generate notifications on edits to the oncall schedule, although I *think* we'd need to first enable wgEnotifWatchlist on wikitech.

Identify some way to lock down editing for rotation to a set of admins -- AIUI the same permission that allows self-service editing of shift start/end times and overrides also allows editing of the rotation order. So I'm doubtful that permissions are a viable option in VO; restricting them would lead to a lot of manual requests for changes and would take away oncallers' ability to manage their own schedule.

Another option worth mentioning is sending reminder mails if/when that happens, to remind folks not to edit the roster order and instead to use overrides.

@Dzahn sort of a random question but do you know about the history of wgEnotifWatchlist on wikitech, if that's something we'd considered enabling before?

Hi @herron, @lmata

I don't know about email notifications (wgEnotifWatchlist) on wikitech, but I have other feedback:

re:

Send alert/notification when on-call rotation is altered

There is the IRC bot called "snitch" that is sitting on channel #wikimedia-tech. It does exactly this thing, it "snitches" on IRC when a certain page on a wiki is edited.

This should be usable / reusable to get the feature so you have notifications on a channel of your choice about a page of your choice.

I think this might be the best solution, tbh.

re:

identify some way to lock down editing for rotation to a set of admins.

This _would_ normally be easy. All we'd have to do is go to https://wikitech.wikimedia.org/wiki/SRE/Oncall/Schedule, click the "More" menu and select "Protect".

Then only members of "content-administrators" and above (full blown administrators, bureaucrats) can edit the page.

We use this for example to protect the pages with SSH fingerprints so that regular users can't mess with them.

It's not like all SRE is in there by default but some individuals are. So it would mostly be about who should be in the list.

See:

https://wikitech.wikimedia.org/wiki/Special:ListGroupRights

https://wikitech.wikimedia.org/wiki/Wikitech:Content_administrators

https://wikitech.wikimedia.org/w/index.php?title=Special:ListUsers&group=contentadmin

I said "would" though because it would also mean having to go through the following two groups that have a higher level than content-admin.

https://wikitech.wikimedia.org/w/index.php?title=Special:ListUsers&group=bureaucrat

https://wikitech.wikimedia.org/w/index.php?title=Special:ListUsers&group=interface-admin

It probably wouldn't hurt to review group memberships anyways, one way or another.

It's also possible to create new groups I believe, but have never done that.

> It's also possible to create new groups I believe, but have never done that.

The hardest part is coming up with the name

I would like to express my gratitude for all the feedback provided.

It appears that protecting the page may not lead to the outcomes we desire as it is updated programmatically. The page serves as a "workaround" solution to the lack of appropriate visualization for calendaring in our system. Nevertheless, I believe that setting up alerts on the page could aid in detecting unauthorized changes to the rotation and correcting them in a timely manner.

Thanks!

Considering that the page is frequently updated by a bot (so any changes would be overwritten), and that Wikitech vandalism in general is detected via RC very quickly, I'm not sure there is any real issue with that page being unprotected. Or is there some sync from Wikitech to VO that would cause problems?

> Considering that the page is frequently updated by a bot (so any changes would be overwritten), and that Wikitech vandalism in general is detected via RC very quickly, I'm not sure there is any real issue with that page being unprotected. Or is there some sync from Wikitech to VO that would cause problems?

Unfortunately, we lack audit logging for changes to the rotation on the Splunk side, and everyone has editing access. To keep track of historical records of edits, we rely on this page. Hence, it is a roundabout method to track updates if any changes occur on the Splunk side.
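As a rough sketch of using the wiki page as the audit trail, the page's revision history can be read through the standard MediaWiki action API. The `api.php` path is an assumption based on the `/w/index.php` URLs in this task, and the sample response in the usage note is hypothetical apart from the revision ID and bot name that appear elsewhere in this thread.

```python
from urllib.parse import urlencode

# Assumed endpoint: the usual action API location next to /w/index.php.
API_URL = "https://wikitech.wikimedia.org/w/api.php"

def revisions_query_url(title="SRE/Oncall/Schedule", limit=20):
    """Build a URL that lists the most recent revisions of the page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,  # with version 2, "pages" is returned as a list
    }
    return f"{API_URL}?{urlencode(params)}"

def summarize_revisions(response):
    """Flatten a decoded API response into (timestamp, user, revid, comment) rows."""
    rows = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            rows.append((rev["timestamp"], rev["user"], rev["revid"], rev.get("comment", "")))
    return rows
```

Fetching the URL and passing the decoded JSON to `summarize_revisions` gives a compact list of when the roster changed; the diff for any row can then be opened on the wiki by revision ID.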

> There is the IRC bot called "snitch" that is sitting on channel #wikimedia-tech. It does exactly this thing, it "snitches" on IRC when a certain page on a wiki is edited.
>
> This should be usable / reusable to get the feature so you have notifications on a channel of your choice about a page of your choice.
>
> I think this might be the best solution, tbh.

Thanks, this sounds just about perfect to give a consistent heads up when the page is edited. Do you know where to find more info about adding to that config, or deploying another instance?

> identify some way to lock down editing for rotation to a set of admins.
>
> This _would_ normally be easy. All we'd have to do is go to https://wikitech.wikimedia.org/wiki/SRE/Oncall/Schedule, click the "More" menu and select "Protect".

Good to know, thanks. For the context of this task it'd have to be locking down editing within the VictorOps/Splunk interface itself. Our bot fetches the current schedule from VO and then updates the oncall schedule on wikitech, so any manual changes would be overwritten by the bot anyway. AFAICT it's a nonstarter since there would be negative side effects to changing the VO permissions.

> > There is the IRC bot called "snitch" that is sitting on channel #wikimedia-tech. It does exactly this thing, it "snitches" on IRC when a certain page on a wiki is
> >
> > ...
>
> Thanks, this sounds just about perfect to give a consistent heads up when the page is edited. Do you know where to find more info about adding to that config, or deploying another instance?

Not very much but I found Snitch on https://meta.wikimedia.org/wiki/IRC/Bots and that led me to https://github.com/mzmcbride/irc-bots

cc: I am not sure if @MZMcBride is around. He used to do a lot, but the user account seems to be disabled.

also see: https://wikiindex.org/MZMcBride

> .. adding to that config, or deploying another instance?

I found this in Wikitech SALs:

Nova Resource:Tools/SAL/Archive 2
Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job.

So seems like it might run out of Toolforge.

wmcs-team might know.

P.S. Unrelatedly I keep thinking we should have a production ganeti VM just for IRC bots that we use in SRE. Possibly my team could help getting one if there is interest.

Found a generalized approach, similar to the snitch bot, that will generate IRC notifications on edits to the wikitech SRE/Oncall/Schedule:

  • Connect to irc.wikimedia.org
  • Join channel #wikitech.wikimedia
  • Enable an IRC notification for the term [[SRE/Oncall/Schedule]]

There is a bot here which logs wikitech edits to this channel, and edits to the oncall schedule will be prefixed with [[SRE/Oncall/Schedule]]

For example:

@rc-pmtpa> [[SRE/Oncall/Schedule]] M https://wikitech.wikimedia.org/w/index.php?diff=2073504&oldid=2073503 * Onfirebot * (-4) updating roster table
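The filtering step can be sketched as a small parser for the feed messages. The line shape is inferred only from the single example above, so the regular expression is an assumption about the feed format; stripping mIRC colour codes is likewise a precaution rather than something confirmed in this task, and the function names are illustrative.

```python
import re
from typing import NamedTuple, Optional

# mIRC colour/format control codes that a feed bot may embed (assumption).
_IRC_FORMATTING = re.compile(r"\x03\d{0,2}(?:,\d{1,2})?|[\x02\x0f\x16\x1d\x1f]")

# Shape inferred from the example: [[Title]] flags URL * user * (delta) summary
_RC_LINE = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s+(?P<flags>\S*)\s+(?P<url>https?://\S+)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]?\d+)\)\s*(?P<summary>.*)$"
)

class RcEdit(NamedTuple):
    title: str
    flags: str
    url: str
    user: str
    delta: int
    summary: str

def parse_rc_line(message: str) -> Optional[RcEdit]:
    """Parse one feed message (without the nick prefix); None if it does not match."""
    clean = _IRC_FORMATTING.sub("", message).strip()
    m = _RC_LINE.match(clean)
    if m is None:
        return None
    return RcEdit(m["title"], m["flags"], m["url"], m["user"], int(m["delta"]), m["summary"])

def is_schedule_edit(message: str, watched: str = "SRE/Oncall/Schedule") -> bool:
    """True when the message reports an edit to the watched page."""
    edit = parse_rc_line(message)
    return edit is not None and edit.title == watched
```

An IRC client highlight on the term does the same job with no code; a parser like this only becomes useful if the notification should be relayed elsewhere, e.g. to another channel.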

> P.S. Unrelatedly I keep thinking we should have a production ganeti VM just for IRC bots that we use in SRE. Possibly my team could help getting one if there is interest.

I like this idea and can see many benefits of a dedicated irc bot VM.

> Found a generalized approach, similar to the snitch bot, that will generate IRC notifications on edits to the wikitech SRE/Oncall/Schedule:
>
>   • Connect to irc.wikimedia.org
>   • Join channel #wikitech.wikimedia
>   • Enable an IRC notification for the term [[SRE/Oncall/Schedule]]
>
> There is a bot here which logs wikitech edits to this channel, and edits to the oncall schedule will be prefixed with [[SRE/Oncall/Schedule]]
>
> For example:
>
> @rc-pmtpa> [[SRE/Oncall/Schedule]] M https://wikitech.wikimedia.org/w/index.php?diff=2073504&oldid=2073503 * Onfirebot * (-4) updating roster table

Thank you! I'll try this out. Separately we should keep this discussion going to figure out this immutable rotation problem :-)

> So seems like it might run out of Toolforge.

snitch runs on one of MZ's servers. A lot of people have copied/forked it and deployed their own versions.

> P.S. Unrelatedly I keep thinking we should have a production ganeti VM just for IRC bots that we use in SRE. Possibly my team could help getting one if there is interest.

Maintenance of IRC bots has been a pretty important entry point for volunteers to contribute and access has usually been given broadly. I would be concerned with any plan to move them to production that doesn't take that into consideration.

> Found a generalized approach, similar to the snitch bot, that will generate IRC notifications on edits to the wikitech SRE/Oncall/Schedule:

This is basically all that snitch (and siblings) do: act as a relay from irc.wikimedia.org to Libera Chat (previously Freenode), with some filtering options.

> Maintenance of IRC bots has been a pretty important entry point for volunteers to contribute and access has usually been given broadly. I would be concerned with any plan to move them to production that doesn't take that into consideration.

From my side the plan always involved that volunteers (especially the ones currently running bots) also have shell access to this box through an admin group that restricts them to this host.

Keeping this open but lowering priority as the IRC notification workaround is good enough(tm) for the time being while we assess a longer-term strategy

Cloud VPS has technical issues right now. The bots are down.

We have more auditability now with the wikitech diff + IRC alerting, so we're "good enough" (tm) for the time being.

herron claimed this task.

Resolving, as AFAIK we've been in a steady state here for some time.