
Toolforge: Re-evaluate root and user SSH access to nodes
Closed, Invalid, Public

Description

Currently, the Toolforge cluster allows:

  • Administrators to login as root to all nodes
  • Toolforge users to login to all nodes (Grid & Kubernetes), besides just the bastions

This situation creates a few problems:

  • It's hard to audit who is using the root account
  • Users can run processes outside the standard channels (jsub/jstart & k8s deployments), which also makes those processes hard to audit or account for

Event Timeline

Isn't the SSHd config standard across all Wikimedia machines? I think prod roots can log in there directly as root over SSH...

"Toolforge users to login to all nodes (Grid & Kubernetes), besides just the bastions"

No? https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/toolforge/infrastructure.pp#L13:

# Infrastructure instances are limited to an (arbitrarily picked) local
# service group and root.
security::access::config { 'labs-admin-only':
    content => "-:ALL EXCEPT (${::labsproject}.admin) root:ALL\n",
}

I think that includes k8s workers (they have the infrastructure banner).

That is included in k8s workers, redis, proxy, k8s master, grid masters, shadow masters, the docker registry, the builder host and the mail host. On those, Toolforge users cannot log in unless they are admins. The root login option is a bit trickier because that's done through cloud root or production root (which I think provides cloud root).

So yeah, users cannot log into any of those. That's why I have restricted grid admin hosts to the masters.

On the grid, host-based auth is a thing, and it is required for the grid to function. Again, this is why admin commands only work on the masters on the new grid.

As for root login directly, that's sort of standard at Wikimedia as long as you can do it through a bastion. I can confirm that there is a case where this is not so and if you are a root, you can root right on in there. I suspect this is true across bastions, though.

It has saved us many times in several ways, and the VMs haven't had a very good "console" story so far, which is how people generally get around root ssh logins (or using jump hosts). The other way I've seen it dodged is by using an infrastructure that is especially reliable for authentication (like a well-managed AD over Centrify or sssd or something). I'm now curious about our LDAP setup.

If admins can be excepted from cgroup restrictions (as is being worked on) and the efforts to improve VM fixing pan out well, we could lock down root on Toolforge? I personally kind of feel like it is a good thing, despite what the rest of the foundation does.

NOTE: @Bstorm learned about the infrastructure login restriction when trying to get into the grid master on toolsbeta for the first time 😂

Missed one: clush master also includes the infrastructure restriction. Only toolforge admins can log in. Does that account for everything? We cannot effectively restrict cronrunners for grid without disabling them. We could perhaps add it to services nodes! I don't think they have it, probably because they used to be part of the grid.
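
For illustration only, a rough sketch of what adding the restriction to services nodes could look like, reusing the resource quoted from infrastructure.pp earlier in this thread. The class name profile::toolforge::services_login_restriction is hypothetical and not taken from the puppet repository; only the security::access::config resource and its content string come from the quoted snippet.

```puppet
# Hypothetical class name, used here only for illustration.
class profile::toolforge::services_login_restriction {
    # Same rule as the infrastructure profile quoted above: deny ("-")
    # logins for ALL users from ALL origins, EXCEPT members of the
    # "<project>.admin" group and root.
    # Assumption: this class is not applied to a node that already
    # includes the infrastructure profile, since declaring the same
    # resource title twice would fail catalog compilation.
    security::access::config { 'labs-admin-only':
        content => "-:ALL EXCEPT (${::labsproject}.admin) root:ALL\n",
    }
}
```

Whether services nodes can take this rule without breaking anything inherited from their grid days would still need checking.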

@Bstorm looks like we got it pretty much covered on the new grid, thanks for listing those restrictions here :)

Considering the feedback on IRC, changing our root policy will probably not be a welcome change unless we also change the overall Wikimedia root policy on this.

GTirloni triaged this task as Medium priority.