Grant slightly broader access to Klaxon
Open, Needs Triage, Public

Description

Currently, Klaxon is available to LDAP members of wmf, wmde, nda, and ops. When anyone else needs to manually page the SRE team, they have to find someone in one of those groups and ask for help.

It's intentional that access is limited: Klaxon can be used to page us when we're not working, and that should be available only to people we trust to use it properly. Even well-intended but incorrect usage, like paging SRE for something that another team needs to handle, would be serious. SREs are able to react urgently to paging alerts because they're rare. If we got spurious pages more than very occasionally, it would tend to create alert fatigue and reduce responsiveness for critical issues, on top of being an unfair intrusion into non-working hours.

But for appropriately trusted community members, Klaxon access would make it easier to get hold of us when we do need to act. Granting access to stewards might be a reasonable place to start, particularly because some stewards have had an easy time getting NDA access and getting access to the tool that way (but full NDA access isn't strictly necessary here).

Separately, we may also want to be able to add trusted individuals such as wiki admins ad hoc, while those individuals are working on active abuse incidents that might require them to page us.

Things to do here, as far as I can tell:

  • Find out whether there's agreement in principle, probably via discussion at the SRE meeting, that we're comfortable with a larger number of trusted community members being able to reach us directly in an emergency
  • Create at least an LDAP group for individual Klaxon access (klaxon-users?) and possibly also a more descriptive group like stewards for role-based access
  • Grant those groups access to Klaxon in puppet and update Klaxon docs
  • Update SRE clinic duty docs so that clinicians can handle steward access requests
  • Communicate to stewards that they can use Klaxon to page us, and what they should use it for; include instructions to create a developer account and get it added
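As a rough illustration of the group-based access described above, here is a minimal sketch of the membership check. It is not Klaxon's actual code or its puppet configuration: the function name `can_use_klaxon` is hypothetical, and `klaxon-users` and `stewards` are only the group names floated in this task, not existing groups.

```python
# Groups that can use Klaxon today, per the task description.
CURRENT_GROUPS = frozenset({"wmf", "wmde", "nda", "ops"})

# Hypothetical groups proposed in this task; the names are not final.
PROPOSED_GROUPS = frozenset({"klaxon-users", "stewards"})


def can_use_klaxon(user_groups, allowed_groups=CURRENT_GROUPS | PROPOSED_GROUPS):
    """Return True if the user belongs to at least one allowed LDAP group."""
    return not allowed_groups.isdisjoint(user_groups)


if __name__ == "__main__":
    print(can_use_klaxon({"stewards"}))                  # allowed under the proposal
    print(can_use_klaxon({"stewards"}, CURRENT_GROUPS))  # not allowed today
```

Keeping the allowed set as data rather than logic means that adding an individual to `klaxon-users` ad hoc, for example during an active abuse incident, would need no code change.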

Event Timeline

Only two blockers were raised at the August 7 SRE meeting:

  • Training/docs: We should make sure that anyone who can use Klaxon fully understands what situations it should and shouldn't be used for. The existing language at wikitech:Klaxon and on Klaxon's UI itself is intended to cover this, but we'll make sure it's still sufficiently clear to a broader audience.
  • Technical: If we can control access directly via steward usergroup membership, then we won't have to manually sync access and add a bunch of clinic duty tickets. The downside is that Klaxon currently only uses developer SSO accounts, so some extra engineering might be required. (This might increase Klaxon's dependency on the rest of our infrastructure, but as previously documented (§Klaxon is hosted on...) that's okay; existing monitoring will still automatically alert us to the kinds of sweeping outages that would threaten Klaxon.)

Both of those should be workable. Any further SRE feedback is still welcome (including feedback like "I think we shouldn't do this at all, because...") and I'm happy to proxy it anonymously on behalf of anyone who wants to share it with me confidentially.

One issue that I raised, but that perhaps was not captured anywhere, is adding some guidance to the documentation on how the folks being paged can communicate with the person who used Klaxon. For instance, should I assume the person using Klaxon is on IRC and there is a channel we can chat in about the incident?

Is keeping an LDAP group up-to-date with all the stewards something the new IDM could possibly do in the future?

Oops, sorry for missing that. The form does say "include how to best get in touch with you," and it's probably already going to be on the user's mind too -- they're there because they want to hear from you, pretty urgently!

But if they forget, their LDAP email address is automatically included in the page, so you could get hold of them that way. (Depending on how we set this up, "LDAP email address" might be replaced with something else, but your point is a good one, and we should make sure to replace it with some sort of direct contact information, not just a username.)

Definitely plausible. But it won't be a feature of the new IDM at launch, and we don't want the new IDM to be a blocker for any progress on this task, so we should probably save that for later and build something without it in the nearer term even if it's imperfect. (Adding @SLyngshede-WMF and @joanna_borun to confirm that sounds right.)

Email doesn't seem like a great way to communicate for page-worthy incidents. Would it be possible to insist on IRC and have them input their handle as part of the Klaxon flow, since we already recommend they chat with us in #wikimedia-sre?

I'd want @CDanis to weigh in on that, since it's really a Klaxon design decision, but personally I don't think a required field is the right solution. Email isn't our usual medium, but it's better than nothing: if you imagine a trusted user who needs to report a genuine emergency, but doesn't use IRC, they wouldn't be able to contact us at all. Either they'd just give up, or they'd have to go learn what an IRC client is before even being able to tell us something is wrong.

We wouldn't want to exclusively use email during an incident, but I think it's a reasonable way to get in touch and bootstrap to something snappier ("hi, I got your message, please join us in #-sre") -- and I also think it'll be rare that we get a page from someone who doesn't mention it in the first place.

But I get that there are conflicting priorities here ("it should be easy to reach us" vs. "it should be easy for us to reply", both of which are right!) and I'm open to discussing -- and also open to revisiting later if it turns out we get more pages like that than I thought.

If it's just a matter of managing an LDAP group, then that's perfectly within scope of the IDM. It's one of the features that already have a first-attempt implementation. How users are added to the LDAP group until this feature is released is less important, as we're importing/reading data from LDAP.

Yeah, for sure -- I meant specifically managing an LDAP group by syncing it from a MediaWiki usergroup, in this case the stewards global group. Is anything like that on the roadmap?

(I understand that's not a fully-specified feature request, since some important details like account mapping are left out -- if we go that route we can discuss more in another task.)

I'd like to highlight that I filed T344164: VMs requested for stewards a few days ago. My goal is to automate steward/functionary onboarding needs with a script on that future production VM (there's a whole lot of credentials to issue to new stewards, unfortunately). In theory, that script could have LDAP write permissions and handle the LDAP part as well. Alternatively, it could expose a list of stewards internally to the cluster, and IDM (or whatever) could read it and automatically give stewards additional privileges. Would that be helpful?

As a first step, I don't think we need to give all stewards access to Klaxon immediately. This is already the case for technical-ish permissions (such as access to the Toolforge tool for stewards). Considering that stewards themselves have a paging mechanism, a steward without Klaxon access who needs to use it could page other stewards and escalate the problem that way.

Certainly, that is not an ideal long-term solution, but it is definitely an improvement over the status quo. Assuming SRE clinic docs get updated, the group of stewards with access could grow organically for now. Once there is an automation system for onboarding (from T344164 or elsewhere) that we could wire into, we would be able to easily expand access to all stewards. Would that make sense?
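To make the syncing idea from this thread concrete (deriving LDAP group membership from the stewards global group), here is a minimal sketch of the diff step only. Everything in it is an assumption for illustration: `plan_group_sync`, `SyncPlan`, and the `account_map` from wiki usernames to developer account uids are hypothetical, and it deliberately makes no LDAP, MediaWiki, or IDM calls, since the account mapping and the write path are still undecided.

```python
from typing import Dict, Iterable, NamedTuple, Set


class SyncPlan(NamedTuple):
    to_add: Set[str]     # developer account uids to add to the LDAP group
    to_remove: Set[str]  # uids in the LDAP group that are no longer stewards
    unmapped: Set[str]   # stewards with no known developer account


def plan_group_sync(
    stewards: Iterable[str],
    account_map: Dict[str, str],
    ldap_members: Iterable[str],
) -> SyncPlan:
    """Compare the steward list with the LDAP group and return the changes needed."""
    steward_set = set(stewards)
    # Stewards without a mapped developer account cannot be added automatically.
    unmapped = {name for name in steward_set if name not in account_map}
    desired = {account_map[name] for name in steward_set - unmapped}
    current = set(ldap_members)
    return SyncPlan(
        to_add=desired - current,
        to_remove=current - desired,
        unmapped=unmapped,
    )


if __name__ == "__main__":
    plan = plan_group_sync(
        stewards=["Alice", "Bob", "Carol"],        # made-up wiki usernames
        account_map={"Alice": "alice", "Bob": "bob"},
        ldap_members=["bob", "dave"],
    )
    print(plan)
```

Separating the plan from the write step would let a clinician review `to_remove` and `unmapped` by hand at first, which fits the suggestion to start small and automate later.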