
Access to restbase / cassandra cluster
Closed, Resolved · Public

Description

In our meeting this morning we agreed that the cassandra / restbase cluster will initially be run by services. This means that both @mobrovac and @GWicke will need sufficient access to the cluster to:

  • restart cassandra
  • temporarily disable puppet & test changes in the cassandra config on one of the boxes
  • restart restbase
  • attach the node debugger to restbase, perform heap dumps
  • run some root-only debug tools like netstat -p

The simplest way to support this (at least initially) is to give us local root on these boxes. Since we are going to be responsible for running this cluster, I think this is reasonable.

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.

Setting priority to high as this blocks us from starting testing on prod hardware & Jessie.

Checked data.yaml and we have an existing "cassandra-test-roots"

    description: users with root on cassandra hosts
    members: [gwicke, ssastry, jdouglas, mobrovac]
    privileges: ['ALL = (ALL) NOPASSWD: ALL']

What is the difference from that, since this ticket also mentions Cassandra and testing?

I suppose we just need to make a similar one, "restbase-roots", and apply it to the nodes.
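Presumably something like the following, mirroring the cassandra-test-roots entry above (the member list is only an assumption based on who is named in this task):

    restbase-roots:
      description: users with root on restbase hosts
      members: [gwicke, mobrovac]
      privileges: ['ALL = (ALL) NOPASSWD: ALL']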

gerritbot subscribed.

Change 190500 had a related patch set uploaded (by Dzahn):
create admin group restbase-roots

https://gerrit.wikimedia.org/r/190500


@Dzahn, this is for the production cluster which we are just spinning up (T76986). Otherwise the structure is pretty much the same (only renamed cassandra-roots to cassandra-test-roots last week).

@GWicke alright, cassandra-roots vs. cassandra-test-roots makes sense now.

I think it makes sense for the services team to be able to debug this as root, yet we need auditing/logging/etc for system-level operations.

I am proposing root access through sudo but no root shell; this way we get free auditing and accountability through sudo without compromising debugging, testing, etc.

to clarify, I am proposing that we allow all commands except shells to be executed through sudo with the intent to have an audit trail of what happened

I am proposing that we allow all commands except shells to be executed through sudo with the intent to have an audit trail of what happened

Seems like a sane idea. But, what do you mean specifically with shell? Are we talking about things like sudo su? I am fine with that, my concern is that we need to be able to run shell scripts as root too (so anything with a shebang really). Would that be allowed?

Correct, interactive things like su, bash and so on; scripts with shebangs would be fine, since the kernel interprets the shebang, not sudo or userspace.
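For illustration, such a rule could be expressed in the same data.yaml style quoted above; the shell paths below are assumptions, and sudo command blacklists are easy to work around, so this only sketches the audit-trail intent rather than a hard security boundary:

    privileges: ['ALL = (ALL) NOPASSWD: ALL, !/bin/sh, !/bin/bash, !/bin/dash, !/bin/su']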

Ah, now I see, you meant direct root access :) Ok, this looks good to me.

OK, first off let me say that I'm generally open to the idea of (local) root. We've already done this for Parsoid without much extra thinking. I should also note that shell/deployment access to the RESTBase machines for the people developing them is a given; there's no need to even discuss this.

That said, I have the following concerns:

  • "disable puppet" — that's a bad idea. Puppet being disabled is a problem; we should never do that as a matter of process (i.e. /planning/ for it is bad) for both code-review & maintainability reasons. This is especially of concern since you two can't +2 to Puppet, so you'd be even more inclined to disable puppet until "ops gets around to it".
  • "test changes in the cassandra config on one of the boxes" — also a bad idea. If we are to rely on Cassandra for something as essential as this, testing changes live on prod should not happen.
  • This is a new service, a new abstraction layer, two new pieces of software including a database & a new team, with no prior experience (as a whole) of interacting with us or running a Wikimedia production service. I'm worried that an additional degree of freedom here might create/reinforce a silo and distance service/ops apart.
  • Similarly, I'm worried that internally we'll end up, as a team, not giving the necessary priority and effort because "they can do it themselves". We've seen this happen before on multiple occasions, either successful or failed. As an example, the ElasticSearch project worked like this for a while but now that team has been essentially dismantled and we're being called to perform maintenance tasks without having participated much in the earlier phases of the project (cf. T88354). This wasn't a fault of the search team's but the reality of them moving quickly and not strictly needing our help. It worked really well for a while but hasn't really left us in a well-maintained state.
  • The immediate availability of root essentially means that no effort is going to be placed on operating the service with an unprivileged user. No one will care (or even notice) if logs are root-owned, for example. We've seen this before where Gabriel had root but the rest of the Parsoid team didn't, so some of the tasks were only possible by Gabriel and not the rest of the team, for no good reason. This essentially means that future access (e.g. to other, possibly future members of the services team) will also have to be provided with an "all or nothing" approach. I don't think this is good — I know containers will be mentioned as a counter-argument but I'm still a big fan of good ol' Unix permissions and I do think we should exhaust these to their limits.
  • Similarly, this also sets the precedent that all teams writing services will need root to operate them, which in turn means that services will be *designed* with root in mind. If the Services team can't set the good example, what can we expect of other foundation teams (e.g. the language engineering team)?
  • While we could give root with the catch of "please don't do this, that and the other", in practice this hasn't worked well in the past. It's easy to gradually introduce dependencies in your workflows and, frankly, it's easy to be sloppy for the infrastructure as a whole, especially when there's no real accountability on the team-level (there is accountability in the engineering department as a whole, but this means escalating an issue higher in the management chain which can be very messy).
  • Frankly, I don't think this "will initially be run by services" is ever going to change. Besides the incremental gradual workflow dependencies as mentioned above, the thing with extra permissions is that once they're given, it's impossible to revoke them without making a big deal out of it or even making the once-privileged user feel demoted and in the end, hurt or insulted.

Note that I haven't mentioned a single security-related reason so far (although there are a couple). That's because at this point of the deployment and standing of the Services team, I'm more worried about the impact on cross-team collaboration that this will have. If these concerns can be sufficiently addressed, I will personally have no objections.

@faidon, thank you for raising these important concerns (and taking the time to describe them so clearly and thoroughly). I'll try to chime in and give my two cents.

  • "disable puppet" — that's a bad idea. Puppet being disabled is a problem; we should never do that as a matter of process (i.e. /planning/ for it is bad) for both code-review & maintainability reasons. This is especially of concern since you two can't +2 to Puppet, so you'd be even more inclined to disable puppet until "ops gets around to it".

While this may seem like an evil request, I think the key word, temporarily, is missing from the quoted request. As you know, we generally lack enough running-Cassandra-in-production knowledge. Thus, the request is more in the sense of allowing us to put out fires temporarily until a proper solution is found in accordance with Ops.

  • "test changes in the cassandra config on one of the boxes" — also a bad idea. If we are to rely on Cassandra for something as essential as this, testing changes live on prod should not happen.

In the general case I would agree 100% with you. This is a particular case, though. We replicate data in Cassandra, meaning that one server can easily be taken out (at the cost of a re-sync later, of course) and thus tested separately with the new config on only a percentage of the overall traffic. However, I do feel that these ideas should be discussed further with you guys (both now, before going into prod, and later, should the need to test some config options arise).

  • This is a new service, a new abstraction layer, two new pieces of software including a database & a new team, with no prior experience (as a whole) of interacting with us or running a Wikimedia production service. I'm worried that an additional degree of freedom here might create/reinforce a silo and distance service/ops apart.

Yup. Thus our initial request to have someone from Ops work with us throughout the deployment cycle and beyond.

  • Similarly, I'm worried that internally we'll end up, as a team, not giving the necessary priority and effort because "they can do it themselves". We've seen this happen before on multiple occasions, either successful or failed. As an example, the ElasticSearch project worked like this for a while but now that team has been essentially dismantled and we're being called to perform maintenance tasks without having participated much in the earlier phases of the project (cf. T88354). This wasn't a fault of the search team's but the reality of them moving quickly and not strictly needing our help. It worked really well for a while but hasn't really left us in a well-maintained state.

We realise you are already stretched thin as is, which increases the priority of your point. However, our goal is to provide other teams with the means to easily create services and prepare them for deployment, which cannot be done successfully without (regular) collaboration with the Ops team.

  • The immediate availability of root essentially means that no effort is going to be placed on operating the service with an unprivileged user. No one will care (or even notice) if logs are root-owned, for example. We've seen this before where Gabriel had root but the rest of the Parsoid team didn't, so some of the tasks were only possible by Gabriel and not the rest of the team, for no good reason. This essentially means that future access (e.g. to other, possibly future members of the services team) will also have to be provided with an "all or nothing" approach. I don't think this is good — I know containers will be mentioned as a counter-argument but I'm still a big fan of good ol' Unix permissions and I do think we should exhaust these to their limits.
  • Similarly, this also sets the precedent that all teams writing services will need root to operate them, which in turn means that services will be *designed* with root in mind. If the Services team can't set the good example, what can we expect of other foundation teams (e.g. the language engineering team)?

I believe our request is unique, mainly because, as you mentioned earlier, there is a lack of operational knowledge when it comes to Cassandra. Secondly, I think (and correct me if I'm wrong) that the infrastructure we are putting in place now for RESTBase ought to host other services as well, thus easing their overall management.

With respect to other teams creating services needing root access, we are actually quite in favour of services not needing root rights (as we all know what kind of risks that exposes us to). In fact, we have started work on creating templates for future services (cf. T88585), the base of it being a supervisor module spawning service workers under unprivileged users.

  • While we could give root with the catch of "please don't do this, that and the other", in practice this hasn't worked well in the past. It's easy to gradually introduce dependencies in your workflows and, frankly, it's easy to be sloppy for the infrastructure as a whole, especially when there's no real accountability on the team-level (there is accountability in the engineering department as a whole, but this means escalating an issue higher in the management chain which can be very messy).
  • Frankly, I don't think this "will initially be run by services" is ever going to change. Besides the incremental gradual workflow dependencies as mentioned above, the thing with extra permissions is that once they're given, it's impossible to revoke them without making a big deal out of it or even making the once-privileged user feel demoted and in the end, hurt or insulted.

All valid points playing in favour of the idea that Services and Ops teams should be closely collaborating.

With regard to our general root access request, the idea is to be able to react quickly if/when problems arise, as I think we can agree that there are plenty of unknowns regarding this deployment on both sides. Personally, though, I believe that having root access should not be regarded as a privilege, but as a necessity, and from my experience Ops should try to have more people in the latter category :)

@faidon, we all agree that this is a project that ops & services are doing together. Even more than with Parsoid, we'll be responsible for preventing & dealing with issues.

Such shared responsibility implies giving people the permissions to do their share of the devops work. Now, this does not necessarily need to be root. Realistically, though, services like cassandra and others leverage unix users to separate services from each other, along with their configs, data and logs. This is a good thing from a security pov, and we should keep doing that. It also typically implies that you need sudo (at least inside of a container) to restart services. I think this is the right trade-off, as our most important concern should be to have defense in depth against exploits rather than a vague and realistically not very effective attempt to defend against rogue engineers.
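As a concrete sketch of the narrower end of that trade-off, a restart-only grant in data.yaml could look roughly like this (group name and command paths are assumptions, not an actual proposal):

    restbase-admins:
      description: users who can restart restbase and cassandra without full root
      members: [gwicke, mobrovac]
      privileges: ['ALL = NOPASSWD: /usr/sbin/service restbase restart',
                   'ALL = NOPASSWD: /usr/sbin/service cassandra restart']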

We should absolutely look into ways to minimize the capabilities needed to perform administrative tasks while maintaining cross-service isolation. We have been talking about unprivileged containers for a while. I'm looking forward to your ideas in that area.

In the meantime, I would appreciate it if the restbase deploy wasn't delayed much further, as the VE performance project depends on its availability. There is a bunch of testing to be done before we can sanely declare this ready for production, and we should start on that ASAP.

Finally, let me address some of your points individually:

Similarly, I'm worried that internally we'll end up, as a team, not giving the necessary priority and effort because "they can do it themselves".

This sounds more like a prioritization / resourcing issue. Using a lack of permissions to get engineers to bug opsens on IRC on a regular basis is not the right way to deal with that IMHO. The costs of doing it that way are significant.

Frankly, I don't think this "will initially be run by services" is ever going to change.

I do think that we should have more shared responsibility for running things in the longer term. How that should look in detail is something we need to figure out.

Hi folks, I'm generally supportive of Marko and Gabriel getting root on these machines, but I share many of Faidon's concerns.

Planning to use root to disable Puppet seems like a bad idea on the surface. Is this something that anyone in Ops who has +2 access to operations/puppet ever does?

Regarding a repeat of the ElasticSearch problem, I fully agree with Gabriel on this point. Withholding permissions is the wrong tool for the job. I think it would be interesting to talk to @Manybubbles about what the right tool for the job is, since I suspect there are things we could be doing now to mitigate the problems this might introduce.

Faidon brings up a very important point here:

  • The immediate availability of root essentially means that no effort is going to be placed on operating the service with an unprivileged user. No one will care (or even notice) if logs are root-owned, for example.

Seems like an explicit mitigation plan should be in place for this. Ideas?

  • Similarly, this also sets the precedent that all teams writing services will need root to operate them, which in turn means that services will be *designed* with root in mind. If the Services team can't set the good example, what can we expect of other foundation teams (e.g. the language engineering team)?

Slippery slope arguments are a slippery slope :-) Let's evaluate each case on its merits.

One response to Marko:

With respect to other teams creating services needing root access, we are actually quite in favour of services not needing root rights (as we all know what kind of risks that exposes us to). In fact, we have started work on creating templates for future services (cf. T88585), the base of it being a supervisor module spawning service workers under unprivileged users.

I think Servisor needs a much wider vetting (perhaps an RFC?). This sounds like it's on track to be a core service used not only by RESTBase, but by all services. I don't think this vetting needs to block a provisional deployment for RESTBase, but we need to treat its deployment as experimental until such a vetting happens.

The Servisor case is a possible example of the kind of thing that Faidon is talking about when he says "I'm worried that an additional degree of freedom here might create/reinforce a silo and distance service/ops apart." I'm not sure I would have fully understood that a generic service layer router was being deployed as part of this deployment had it not been part of this access request. With this, I'm better able to answer Giuseppe's last mail from February 5 on the "Investigating building an apps content service using RESTBase and Node.js" thread on Wikitech-l, but it bothers me greatly that that thread died on that message. That's especially because Tim's mail also made it clear he thought that Varnish should be our service routing layer. Tim may very well be wrong, but he at least deserved a response, and he'd be forgiven for thinking "they're going to do what they want regardless of what I think".

All that said, I don't think blocking a root request is the right thing to do, but we need to come up with effective communication mechanisms that don't involve access requests.

  • "disable puppet" — that's a bad idea. Puppet being disabled is a problem; we should never do that as a matter of process (i.e. /planning/ for it is bad) for both code-review & maintainability reasons. This is especially of concern since you two can't +2 to Puppet, so you'd be even more inclined to disable puppet until "ops gets around to it".

+1. I believe the only time we disable puppet on the Elasticsearch machines is when the service is hosed somehow, like when the disk is busted. It would be better to support disabling the service on those machines with puppet re-enabled.

  • "test changes in the cassandra config on one of the boxes" — also a bad idea. If we are to rely on Cassandra for something as essential as this, testing changes live on prod should not happen.

Until we get a load testing environment I think we should work _something_ out.

  • Similarly, I'm worried that internally we'll end up, as a team, not giving the necessary priority and effort because "they can do it themselves". We've seen this happen before on multiple occasions, either successful or failed. As an example, the ElasticSearch project worked like this for a while but now that team has been essentially dismantled and we're being called to perform maintenance tasks without having participated much in the earlier phases of the project (cf. T88354). This wasn't a fault of the search team's but the reality of them moving quickly and not strictly needing our help. It worked really well for a while but hasn't really left us in a well-maintained state.

Having Chad and me do the restarts was really good for giving us a sense of how the running system reacted to changes. My only regret here is not getting ops involved earlier. It would have helped with the cluster restart we're doing now, for example.

Funny thing: the only thing we ever use sudo for on those boxes is to update elasticsearch and bounce it.

  • The immediate availability of root essentially means that no effort is going to be placed on operating the service with an unprivileged user. No one will care (or even notice) if logs are root-owned, for example. We've seen this before where Gabriel had root but the rest of the Parsoid team didn't, so some of the tasks were only possible by Gabriel and not the rest of the team, for no good reason. This essentially means that future access (e.g. to other, possibly future members of the services team) will also have to be provided with an "all or nothing" approach. I don't think this is good — I know containers will be mentioned as a counter-argument but I'm still a big fan of good ol' Unix permissions and I do think we should exhaust these to their limits.
  • Similarly, this also sets the precedent that all teams writing services will need root to operate them, which in turn means that services will be *designed* with root in mind. If the Services team can't set the good example, what can we expect of other foundation teams (e.g. the language engineering team)?

Would the plan be to add some sticky-bit scripts into puppet for upgrading them, and just never need root at all? That sounds like a pretty good idea to me. Not because upgrades will be complicated, but because the scripts would be documentation.

  • While we could give root with the catch of "please don't do this, that and the other", in practice this hasn't worked well in the past. It's easy to gradually introduce dependencies in your workflows and, frankly, it's easy to be sloppy for the infrastructure as a whole, especially when there's no real accountability on the team-level (there is accountability in the engineering department as a whole, but this means escalating an issue higher in the management chain which can be very messy).
  • Frankly, I don't think this "will initially be run by services" is ever going to change. Besides the incremental gradual workflow dependencies as mentioned above, the thing with extra permissions is that once they're given, it's impossible to revoke them without making a big deal out of it or even making the once-privileged user feel demoted and in the end, hurt or insulted.

Please, take my Elasticsearch permissions back. I don't know when I'll have time to babysit another cluster restart and that's all I use them for.

I'm all for getting ops more involved up front in service development. I do think that the team who owns the service should have to do some maintenance on it. Things like the Elasticsearch rolling restarts build a culture of shared suffering that prevents folks from throwing things over the wall. It also gives us more vitriol when complaining about bugs that make administration hard, like Elasticsearch's super slow rolling restart.

Do what you want with root access. Just make sure to share the suffering.

Let me respond to the point on puppet & config changes:

"disable puppet" — that's a bad idea. Puppet being disabled is a problem; we should never do that as a matter of process (i.e. /planning/ for it is bad) for both code-review & maintainability reasons. This is especially of concern since you two can't +2 to Puppet, so you'd be even more inclined to disable puppet until "ops gets around to it".

We need a way to react quickly to issues. That might involve emergency changes to the configs, which without +2 rights can sometimes only be done by disabling puppet temporarily. As you know, puppet will otherwise revert the changes on the next run. And, let me clarify once more: Yes, disabling puppet should only be done in exceptional circumstances for good reasons. It is however important to have the option for the rare case where the service would otherwise be down.

It is sometimes also useful to evaluate small config tweaks (especially those that only affect performance) directly on regular boxes before rolling them out to all nodes via puppet. The advantage of testing on a single machine only is that you can directly compare with the previous setting while receiving the same production traffic & using the same hardware and data. Again, this is a temporary thing on a short time scale. It should also be very rare, as we can test most such things (and anything that could break stuff) with synthetic traffic on the test cluster. And yes, in a perfect world we might have a replica test cluster & tee the full production traffic to it for testing. Not right now though.
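For reference, "temporarily" here means the standard agent toggle on a single host, roughly along these lines (the reason string is illustrative):

    puppet agent --disable "testing a cassandra config tweak, see T76986"
    # ... make the change, measure, then revert or puppetize it ...
    puppet agent --enable
    puppet agent --test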

I think Servisor needs a much wider vetting (perhaps an RFC?). This sounds like it's on track to be a core service used not only by RESTBase, but by all services.

This is OT here, but let me respond anyway. Servisor is a small (318 lines) library I just wrote this weekend and not a service at all. It basically extracts some of the common boilerplate that's shared between parsoid, restbase & other *oids, generalizes it slightly & packages it so that it can be used by other services as well. The main benefit is in standardizing some things like command line parameters, logging, metrics & config file loading, so that we can build tools and procedures around those standard ways of doing things.

I'll kick off a discussion in the next few days to get broader input from everybody involved.

supervisor module spawning service workers under unprivileged users

To clarify: This module does nothing about user rights, but leaves that to init systems as usual. The main benefit is that we can generate a standard init script / systemd unit as the actual runner interface is uniform.
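To make the unprivileged-user point concrete, a generated systemd unit would be expected to look roughly like this minimal sketch (all paths and names here are hypothetical, not Servisor's actual output):

    [Unit]
    Description=RESTBase
    After=network.target

    [Service]
    User=restbase
    Group=restbase
    ExecStart=/usr/bin/nodejs /srv/deployment/restbase/deploy/restbase/server.js -c /etc/restbase/config.yaml
    Restart=always

    [Install]
    WantedBy=multi-user.target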

Planning to use root to disable Puppet seems like a bad idea on the surface. Is this something that anyone in Ops who has +2 access to operations/puppet ever does?

Yes, especially during things like downtime and upgrades. Sometimes you need to do things in a very planned and specific order, which includes running one-time commands that are not puppetized. The only way to be sure that your upgrade proceeds in the proper order is to disable puppet and do it yourself.

Alright, thanks to everyone who chimed in, I think I have a pretty good understanding where everyone's coming from. I'm okay with moving this forward. I'm also okay with making an exception to our process and not waiting until the next ops meeting on Monday, considering the combination of the US holiday, amount of time that has already passed and importance of the project. Apologies for the delay so far — in my defense this was a very last-minute request.

On the puppet front, I wasn't talking about disabling puppet for some limited time, but more about the "CRITICAL: Puppet last ran 6 days ago" alert that is right now active for one of the RESTBase test boxes (which is fine for a test box; not for prod).

I'd like the following strings attached:

  • Effort to be placed for services to be properly written to operate with the assumption that they can be developed/deployed without root. This means no root-owned logfiles, trebuchet shortcomings (e.g. wrt reloading a service after deployment) identified and filed as Phabricator tasks etc. etc. That's a matter of good security practice for our ecosystem as a whole, nothing that relates to trusting teams or individual developers.
  • Re: "need a way to react quickly to issues. That might involve emergency changes to the configs" — these should be coordinated with us. We need to be able to react without your help (as you won't do 24/7 obviously) and as, @Manybubbles eloquently put it, "share the suffering". If every time there's an issue we don't even hear about it, we'll never be able to respond as a team alone. Let's repeat the good parts of the ElasticSearch rollout without the bad parts :)
  • No unpuppetized, uncoordinated changes in the system config, including anything ranging from installing packages to Linux kernel tweaks, even if it's "until this puppet patch gets merged". Let's not repeat the "beta puppetmaster" drift but, even worse, without a commit trail. There are legitimate cases obviously (esp. during outages); apply common sense.
  • Not using those systems for anything else than they're intended for. I don't mean (just!) installing your IRC client of choice :), but also things such as running benchmarks. IOW, these are not the test boxes you're used to; these were quite noisy alert-wise but this can't remain like that.

Note how in each of the above I tried to give a concrete example of cases that have gone wrong in the past; this has nothing to do with "defend[ing] against rogue engineers".

That's all I have for now. I'll ignore the services/ops merge ideas for the purposes of this task and assume there's no agenda behind this :)

Thanks @faidon. Your points are all pretty uncontroversial and reflect established practice, so no problem there.

Change 190500 merged by Ottomata:
create admin group restbase-roots

https://gerrit.wikimedia.org/r/190500

Thank you, @Ottomata! For some reason the login doesn't work yet. Maybe just adding it in https://gerrit.wikimedia.org/r/#/c/190500/3/hieradata/role/common/restbase.yaml isn't enough?

Change 191508 had a related patch set uploaded (by Dzahn):
don't include restbase-roots in restbase.yaml

https://gerrit.wikimedia.org/r/191508


Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Conflicting value for admin::groups found in role cassandra
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
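The conflict presumably comes from two role-level hiera files both setting the same key on hosts that include both roles, roughly like this (file names and contents are assumptions based on the patches above and below):

    # hieradata/role/common/restbase.yaml
    admin::groups:
      - restbase-roots

    # hieradata/role/common/cassandra.yaml
    admin::groups:
      - cassandra-test-roots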

Change 191508 merged by Dzahn:
don't include restbase-roots in restbase.yaml

https://gerrit.wikimedia.org/r/191508

Change 191512 had a related patch set uploaded (by Dzahn):
remove cassandra-test-roots

https://gerrit.wikimedia.org/r/191512


On the puppet front, I wasn't talking about disabling puppet for some limited time, but more about the "CRITICAL: Puppet last ran 6 days ago" alert that is right now active for one of the RESTBase test boxes (which is fine for a test box; not for prod).

So I noticed something was wrong with the access request change because Icinga was saying the puppet run broke on restbase1003, xenon and others.

Then, after merging a partial revert, I was wondering why restbase1004 and restbase1005 would not recover while restbase1003 did.

And looking at site.pp I see "node /^restbase100[1-6]\.eqiad\", which made me wonder how it was even possible that they showed different results. That was a bit confusing until I noticed the duration in Icinga was > 2d, meaning this must be completely unrelated and due to testing.

I fixed the puppet runs on 1004 and 1005 and got the restbase service to start again; details are in T89922. Not sure, though, how to prevent it from needing a manual command.

Change 191512 merged by Filippo Giunchedi:
remove cassandra-test-roots

https://gerrit.wikimedia.org/r/191512

Change 191591 had a related patch set uploaded (by Filippo Giunchedi):
use restbase-roots with role restbase

https://gerrit.wikimedia.org/r/191591


Change 191617 had a related patch set uploaded (by Filippo Giunchedi):
remove cassandra-test-roots group from cassandra role

https://gerrit.wikimedia.org/r/191617


Change 191617 merged by Filippo Giunchedi:
remove cassandra-test-roots group from cassandra role

https://gerrit.wikimedia.org/r/191617

Change 191618 had a related patch set uploaded (by Filippo Giunchedi):
restbase: grant access to restbase-roots

https://gerrit.wikimedia.org/r/191618


Change 191591 abandoned by Filippo Giunchedi:
use restbase-roots with role restbase

https://gerrit.wikimedia.org/r/191591

Change 191618 merged by Filippo Giunchedi:
restbase: grant access to restbase-roots

https://gerrit.wikimedia.org/r/191618

fgiunchedi claimed this task.

This should be fixed; access granted.