Pairing tool for new SREs using sudo under supervision
Open, Medium, Public

Description

We're moving toward a process where new SREs don't get global root immediately. That's an important element of being able to confidently hire people who are earlier in their careers, but it makes it hard for new SREs to learn hands-on in production, because currently you need root to do most SRE work.

As a training tool, we would benefit from being able to pair a new SRE with an experienced buddy:

  1. Both SREs SSH to the same host.
  2. The new SRE (who can't use sudo directly) runs some other very sudo-like command, giving the command they want to run as root.
  3. The experienced SRE approves the command, triggering it to actually run as root (or, if it's wrong, they decline and the new person fixes it and tries again).
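The three steps above can be sketched as a toy model. This is only an illustration of the request/approve/decline loop, not how sudo_pair or any real tool implements it: `PairedSession`, `Decision`, and `cautious_buddy` are hypothetical names, and nothing here actually runs a command as root.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Decision(Enum):
    APPROVE = "approve"
    DECLINE = "decline"


@dataclass
class PairedSession:
    """Toy model of one pairing session; nothing here runs real commands."""

    requester: str
    approver: str
    executed: List[str] = field(default_factory=list)

    def request(self, command: str, review: Callable[[str], Decision]) -> bool:
        """Show the command to the approver and 'run' it only on approval."""
        if self.requester == self.approver:
            raise ValueError("requester and approver must be different people")
        if review(command) is Decision.APPROVE:
            # A real tool would execute as root here; the sketch only records it.
            self.executed.append(command)
            return True
        return False


def cautious_buddy(command: str) -> Decision:
    # Hypothetical review rule standing in for a human reading the command.
    return Decision.DECLINE if "rm -rf" in command else Decision.APPROVE


session = PairedSession(requester="new_sre", approver="buddy")
first_try = session.request("rm -rf /var/log", cautious_buddy)
second_try = session.request("systemctl restart rsyslog", cautious_buddy)
```

In this sketch the declined command is simply dropped, and the new SRE submits a corrected one, which matches the "fix it and try again" loop in step 3.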

The advantage is that the new SRE is much more actively involved than just watching on a screenshare, but still can't make mistakes quite as impactful as if they had root.

Note that defending against intentional attacks by the new SRE is out of scope. They could probably trick their buddy into approving a command that does something unexpected, but that doesn't mean this system is defective: we'll only give this access to people we trust with it, as we do now with root. The goal is to provide a guardrail against mistakes, not malfeasance.

Originally I thought we would need to build this tool ourselves, but @CDanis found sudo_pair which was open-sourced by Square and is listed as a sudo plugin. I haven't dug into the implementation at all, but the description looks promising, as does the demo where all the output is shown to the buddy, with a killswitch -- so something like "sudo bash" is still supervised.

(For clarity: It sounds like in Square's use of sudo_pair, they require pairing for all SREs except in emergencies. I don't propose we do that here -- I only want it as a training tool.)

Can Infrastructure Foundations evaluate whether and how we can run this in prod? And if we can't, can we investigate building a solution of our own? I'm happy to consult and would love to be involved, but I don't have the domain expertise to configure this safely on my own.

Event Timeline

One future alternative may be to use the new approval plugin API introduced in sudo 1.9, which would allow us to write a custom approval check as a relatively simple Python script: https://www.sudo.ws/posts/2020/08/sudo-1.9-using-the-new-approval-api-from-python/ The downside is that this is only available in Bullseye and Python support isn't enabled yet (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=990855), so it's currently not really an interesting option (we should not deviate from the default Debian sudo packages). We could build the sudo plugin into a separate sudo-python binary package (built from a separate source package), but it would only be available for the sudo version in Bullseye.

There's no sudo-dev headers package, but the plugin infrastructure in sudo seems relatively stable. It's worth keeping in mind that we'd typically need to cover the sudo versions of three different Debian releases (any new SRE starting in Search would need to use this plugin with Stretch etc.) and the upstream choice to write this in Rust may make this tricky. But we can probably also cover this by building on Bullseye and reusing the build for older distros (the generated plugin .so file will probably work fine on older distros as well).

We can definitely have a closer look at sudo_pair. What's the time frame for when this is needed?

Thanks for checking it out! Even if sudo_pair were only available in bullseye at first, that would be a huge step forward, because we'd be able to use it on cumin2002. (But we do want to get it everywhere eventually, so I agree the custom approval check isn't attractive yet.)

The time frame mostly depends on hiring: I'd really like to have it ready (including new-hire-friendly documentation) in time for the next new SRE, so that they can start using it in their first few days. Based on what I hear about the hiring pipeline, I'd expect that to be sometime around April, maybe March at the earliest, so we have some time.

Ack, I'll have a closer look over the course of February

jbond triaged this task as Medium priority. Feb 16 2022, 4:55 PM
jbond subscribed.

@MoritzMuehlenhoff Checking in -- have you had any time to take a look at this?

Checking in about this again, as it'd be useful for intern project work. Even just being able to use it on part of the fleet would be nice :)

Sorry, I've had no chance to look into this so far.

To keep archives happy: T360356#9949479

We filed a proposal to basically implement sudo_pair "socially", as a starting experiment. While at it (to unblock dcops), we could prioritize this task as well.

Is there an update on this? We have a new team member joining us and this will be super helpful as we onboard them.

@Kappakayala Hi! As I wrote in my last comment some days ago, we have a proposal for a middle-ground solution in T360356#9949479

The problem is that sudo_pair needs a very extensive security review, since it would change a critical tool of our infrastructure and we want to make sure that we don't open any security holes in the process. So we are likely going to restart the work for this task in September, when most of the I/F team will be back (including Moritz, who should definitely be involved in the conversation).

In the meantime, for new SREs, this is my proposal:

  • They are added to the ops-limited posix group (still WIP in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054894), getting access to production.
  • Permission-wise, they are able to ssh/access every server, check syslog and other restricted files, and run basic sudo commands (like dmesg, etc.).
  • Every other sudo-privileged operation needs to be done while pairing with another SRE who has full root.
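As a rough illustration of the split described in the list above, the rule boils down to "allowlisted commands run directly, everything else goes through a pairing partner". The sketch below assumes a hypothetical allowlist containing only dmesg (the one command named here); the real list lives in the puppet change linked above, and `needs_pairing` is an illustrative helper, not an existing tool.

```python
import shlex

# Hypothetical allowlist: dmesg is the only command named in the proposal.
LIMITED_SUDO_COMMANDS = frozenset({"dmesg"})


def needs_pairing(command_line: str) -> bool:
    """Return True when an SRE with full root has to run the command instead."""
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command")
    # Only the program name is checked; arguments are ignored in this sketch.
    return argv[0] not in LIMITED_SUDO_COMMANDS
```

For example, `needs_pairing("dmesg --ctime")` is false under this assumed allowlist, while `needs_pairing("systemctl restart nginx")` is true.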

It's not the same as having sudo_pair, but it should be a good starting point. Lemme know!

The new ops-limited group is live; I just sent an email to all SREs about it.

Today we saw another good use case for sudo_pair: while troubleshooting and firefighting a Phatality deploy gone wrong (T374880), several different times @dancy had to ask SRE to run commands as root on logstash collector hosts. Having sudo_pair available likely would have sped up both diagnosis and a fix.