wikireplicas root access
Closed, Resolved, Public

Description

This task is to explore how we can provide root access to the wikireplica db servers in a safe manner. When this was discussed previously, it was decided that root on these systems meant root on all production dbs:

Giving root to labsdbs would be equivalent to giving root to all mysql servers for many reasons. No problem with that, but he should be added to the paging system (if he is not already there) and respond to the database alerts.

This was decided at the time to be too much of a risk. However, we would like to explore further what these risks are, whether we could mitigate them, and whether we could restrict users to just the wikireplicas dbs without leaking access to other production hosts. I think the best way forward would be:

  • DBA: expand on what the issue with root access is, and explore options to mitigate this risk so that wmcs users can have full root on the wikireplicas
  • cloud-services-team: define precisely which permissions are required but missing on the wikireplica hosts. This should allow us to better provision access if full root continues to be unviable

Event Timeline

@jbond As far as I know, the only things that need to be run as root on the wikireplicas hosts are the scripts that create the views/indexes (which Data Persistence isn't an owner of). Other than that, nothing else requires root except stopping/starting mariadb and its replication, which requires a mysql prompt, and which I guess we could just allow sudo for on those specific hosts.

We as a team don't own this service, we just own (for now - until this is further clarified) some of its responsibilities, which include stop/start mariadb, upgrades and such and that does require root.
If there is a way to give root to WMCS (or whoever ends up owning this service) but NOT give root to all db* (or cumin) hosts, I'd be okay with that. That being said, I'd like those who have root to respond to pages and/or assume full ownership of this service (which doesn't mean we, DBAs, won't help if there's a need).

@Marostegui thanks for the response

If there is a way to give root to WMCS (or whoever ends up owning this service) but NOT giving root to all db* (or cumin) hosts

Yes that should definitely be possible however

Giving root to labsdbs would be equivalent to giving root to all mysql servers for many reasons.

I was worried about this point, originally brought up by @jcrespo.

I'll leave ownership clarifications to @nskaggs; I believe there is already an ongoing ownership discussion about this.

I'll raise a separate ticket to discuss paging responsibilities, however I think it's a reasonable request.

Not speaking on behalf of Jaime here, but giving my opinion of what I think the problem was at the time:
At the time we didn't have much separation between the old hosts and production.
However, now, even the root password is different. And the data that arrives at clouddb* hosts is filtered, so having access to the mysql prompt of the replicas doesn't imply having access to any PII.

Ahh great, I'll wait for confirmation from @jcrespo, but I wondered if it was something like that and I'm glad to hear it's sorted now :)

Yes, that was exactly what I meant back then. Not only that, passwords used to be written to the filesystem in plain text. Since then, most things may have changed: passwords have been removed in favor of other authentication methods (unix_socket), and the remaining passwords have been changed.

I am no longer involved in DBA work, so I don't know the details of the current state. I hold the DBAs in high esteem, but they are sometimes overloaded with work, and there have been so many years of assuming that only global roots can access mysql data that I am sure unintended data is still there. Why do I know this? Because I have been there and made those mistakes myself, and some may be directly my fault, so this is not an accusation but a recognition that it is hard to get right. In particular, I would suggest asking @Ladsgroup, our grant checking expert, for additional input in case there is something missing regarding realm separation.

Two examples:

  • I believe the replication password is still shared, which means there is access to production hosts through that account (all events, even private ones, potentially still exposed).
  • Non-public data (such as suppressed edits or bans) is possibly still available to roots, just not the sensitive kind like passwords and IPs. I asked a long time ago for an audit from security: T103011, but that has been pending for 8 years, so this is not trustworthy in general.

Knowing these deficiencies, I am not going to be the one who says there are no outstanding issues, but I won't block if someone can take responsibility for that risk.

For example, I've taken a host at random, clouddb1018, I've run:

root@clouddb1018[mysql]> select user, host, password FROM user WHERE user not rlike '[spu][0-9]+' LIMIT 10;

And I recognize a potential (I'm unsure if it is the right one) mediawiki admin password on cloud db, among others (unsalted, easily reversed password hashes). Even if that is fixed, who can ensure that it is not going to happen again? What monitoring is in place? What filtering? Not only for MySQL, but for puppet secrets in general? Please note I am not trying to be hard to work with; I just want to expose deficiencies and long-held assumptions that may not be known, that affect production, and that will be non-trivial (although not impossible) to overcome, especially as some may be the result of emergency work done under less than ideal conditions!
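To make the filter in that query concrete, here is a minimal Python sketch of the same check applied to a list of accounts. The sample rows and the function name are hypothetical (they are not real data from any host); only the pattern `[spu][0-9]+` comes from the query above. It assumes that `rlike` matches anywhere in the name, which is why `re.search` is used.

```python
import re

# Same pattern as in the query above. RLIKE is an unanchored match,
# so re.search (not re.fullmatch) is the closest Python equivalent.
TOOL_ACCOUNT = re.compile(r"[spu][0-9]+")


def unexpected_accounts(rows):
    """Return (user, host) pairs whose user name does not match the pattern."""
    return [(user, host) for user, host in rows if not TOOL_ACCOUNT.search(user)]


# Hypothetical sample rows, for illustration only.
sample = [
    ("s52788", "%"),
    ("u1234", "%"),
    ("example_admin", "10.%"),
    ("root", "localhost"),
]

for user, host in unexpected_accounts(sample):
    print(f"review: {user}@{host}")
```

One caveat of an unanchored pattern: a name such as `backup2` contains `p2` and would be filtered out as well, so an anchored pattern (for example `^[spu][0-9]+$`) may be a safer choice for a real audit.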

jbond triaged this task as Medium priority. Aug 23 2023, 11:35 AM

Change 923681 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] wmcs: add wmcs-roots to roles where it is missing

https://gerrit.wikimedia.org/r/923681

Our grants are a mess, doubly so in cloud replicas. It's hard to actually remove them because we are not sure whether they are needed or not. We have cleaned up a lot, but way more needs to be done. So, as a first step, I'd really like to remove mw-related grants and users; after that it should be safer to open root in cloud replicas. On top of that, I'd like to clean up some security issues on two mysql users as well.

If the need is just to run maintain-views, we can add that to sudo policies of wmcs-roots.

Change 923681 merged by Jbond:

[operations/puppet@production] wmcs: add wmcs-roots to roles where it is missing

https://gerrit.wikimedia.org/r/923681

I think that members of wmcs-roots can now circumvent this by using the cloudcumin hosts, and run a command as root through Cumin.

I also discovered that a new group wikireplica-roots was created recently by @MoritzMuehlenhoff. At the moment it only includes @joanna_borun but it can be used if somebody else needs root access to clouddb hosts.

I would still like to give root access to clouddb* to everybody in wmcs-roots (which at the moment only includes WMF staff but could potentially include volunteers). In this way we can envision a future where WMCS SREs don't need global root at all.

I'll try recapping the concerns that were listed in the discussion above:

  • replication password is shared between clouddb and production hosts
  • other SQL passwords in clouddb hosts are shared with production hosts (mediawiki admin password and other unsalted, easily reversed hash passwords)
  • some users and grants in clouddb hosts need to be cleaned up
  • non-public data (such as suppressed edits or bans) are possibly available if you have root access to clouddbs

I'm not sure about the last one (non-public data), but I think the other concerns can only be addressed by the DBA team. Shall we create one or more DBA tasks to address these concerns?

While these concerns are valid and still an issue, I don't think they are all the same in terms of priority:

  • replication password is shared between clouddb and production hosts

This is not a super big deal, you cannot really do much with it.

  • other SQL passwords in clouddb hosts are shared with production hosts (mediawiki admin password and other unsalted, easily reversed hash passwords)

Neither wikiadmin nor wikiuser is replicated to clouddb* hosts - not sure which users you have in mind?

  • some users and grants in clouddb hosts need to be cleaned up

You mean users as in volunteer users?

  • non-public data (such as suppressed edits or bans) are possibly available if you have root access to clouddbs

Non-public data (like PII) shouldn't be there in the first place. We might have some public data which is restricted to certain queries (by the views). But this is by far the most important point and something we really need to think about before giving root to everyone, especially non-staff.

replication password is shared between clouddb and production hosts

This is not a super big deal, you cannot really do much with it.

This concern was raised by @jcrespo, who wrote that "there is access to production hosts through that account (all events, even private ones potentially still exposed)". I think the concern is that you could theoretically use the replication password to replicate from a production host rather than from a sanitarium, unless we have some network firewall, but I'll let @jcrespo clarify.

other SQL passwords in clouddb hosts are shared with production hosts (mediawiki admin password and other unsalted, easily reversed hash passwords)

Neither wikiadmin nor wikiuser is replicated to clouddb* hosts - not sure which users you have in mind?

This was a copy-paste from another comment by @jcrespo, I'll let him clarify.

some users and grants in clouddb hosts need to be cleaned up

You mean users as in volunteer users?

This one was me summarizing a previous comment from @Ladsgroup, who wrote "We have cleaned a lot but way more needs to be done"

non-public data (such as suppressed edits or bans) are possibly available if you have root access to clouddbs

this is by far the most important point

I've created a subtask for this specific concern: T368136: [wikireplicas] Make sure there is no sensitive data in clouddb hosts

Neither wikiadmin nor wikiuser is replicated to clouddb* hosts - not sure which users you have in mind?

There were grants and users (wikiuser and wikiadmin) on cloud hosts until two weeks ago.

You mean users as in volunteer users?

Besides those, there are grants and production users in cloud where they shouldn't be. I mentioned wikiuser and wikiadmin, but I have to go through all of them to make sure.

cloud-services-team define precisely what permissions are required but missing from the wikireplica hosts

If the need is just to run maintain-views, we can add that to sudo policies of wmcs-roots.

To clarify this point, the goal here is not to have more people running maintain-views, but to make clouddb* hosts less "special" and treat them as all the other cloud hosts that are managed by the WMCS team. That means that anyone in the wmcs-roots group can SSH as root and do any maintenance task on the host, including reboots, debugging logs, and in the future reimages too (though reimaging is blocked by T344412).

wmcs-roots is defined in admin/data/data.yaml and at the moment only includes WMCS staff, but I think it did include trusted volunteers in the past and could potentially include trusted volunteers in the future. If even after the various security improvements discussed in this task we are not confident that we can give clouddb* root access to wmcs-roots, then we should also remove access to clouddb* hosts through cloudcumin, because at the moment cloudcumin hosts can be used by members of the wmcs-roots group to run any command as root on clouddb*.

there are grants and production users in cloud where they shouldn't be. I mentioned wikiuser and wikiadmin but I have to go through all of them to make sure.

I created the subtask T368748: [wikireplicas] Review grants and views to track this specific topic.

cloud-services-team define precisely what permissions are required but missing from the wikireplica hosts

If the need is just to run maintain-views, we can add that to sudo policies of wmcs-roots.

To clarify this point, the goal here is not to have more people running maintain-views, but to make clouddb* hosts less "special" and treat them as all the other cloud hosts that are managed by the WMCS team. That means that anyone in the wmcs-roots group can SSH as root and do any maintenance task on the host, including reboots, debugging logs, and in the future reimages too (though reimaging is blocked by T344412).

I think that's a good approach - whoever has root should be comfortable to at least react to pages/alerts/triaging things for these hosts too.

If I may @fnegri, the issue is that those hosts are in a way special, because they are pieces (data) of production (meaning here mediawiki) on cloud realm, so it may not be easy to solve with the current architecture. If there was an implementation where absolutely all non-public data and configuration was deleted on production side (e.g. a message protocol that cleans up everything and reconstructs them again on cloud network), that would solve all concerns- but that would be way more complex and will require a lot of work. And only now there is the start of a proper inventory where each table and column will document its privacy and concerns for global usage and editing.

If I may @fnegri, the issue is that those hosts are in a way special

@jcrespo you absolutely may :) I'm starting to think that clouddb* hosts will remain "special" until we completely reimplement replicas with a different architecture as you described.

It's probably still worth doing T368748: [wikireplicas] Review grants and views (it's up to DBAs really) but even after that subtask is done, it looks like there is an understandable desire to keep clouddb* hosts more restricted than other cloud* hosts.

If that's the case (I'm again deferring to DBAs here), then my preferred solution would be to document this decision in wikitech and in code comments in Puppet, and remove ::profile::base::cloud_production from clouddb* hosts (that's the class that allows root access from cloudcumins).

whoever has root should be comfortable to at least react to pages/alerts/triaging things for these hosts too.

@Marostegui personally I'm fine with requiring people with root access to clouddb* hosts to also respond to pages for those hosts.

Right now the wmcs-roots group does not coincide with people getting pages and we probably don't want to change that. If we decide to treat clouddb* hosts as "special" (as described above), my preferred solution would be not to give root access to wmcs-roots and instead give root access to the separate wikireplicas-roots group, documenting that whoever gets added to that group should also get pages for clouddb hosts.

There's one issue: that group currently includes only @joanna_borun and she's not getting pages. :) @joanna_borun do you still need root access to clouddb* hosts?
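The rule proposed here (everyone with root on clouddb* hosts should also get pages for them) can be checked mechanically. The following is a minimal sketch under stated assumptions: the user names, the two lists, and the function name are all hypothetical, and how group membership or paging rosters would actually be loaded is not covered.

```python
def members_missing_pages(root_group, paged_users):
    """Return members of the root group who are not on the paging list, sorted by name."""
    return sorted(set(root_group) - set(paged_users))


# Hypothetical data, for illustration only.
wikireplicas_roots = ["alice", "bob"]  # stand-in for the wikireplicas-roots group
paged_for_clouddb = ["bob", "carol"]   # stand-in for whoever receives clouddb pages

missing = members_missing_pages(wikireplicas_roots, paged_for_clouddb)
if missing:
    print("root without pages:", ", ".join(missing))
```

A check like this would flag exactly the situation described above, where a member of the group has root access but is not being paged.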

If I may @fnegri, the issue is that those hosts are in a way special

@jcrespo you absolutely may :) I'm starting to think that clouddb* hosts will remain "special" until we completely reimplement replicas with a different architecture as you described.

Yes, they are special and will always remain like that for many reasons including:

  • Multi-instance
  • views
  • triggers
  • replication filters
  • non DC redundancy

It's probably still worth doing T368748: [wikireplicas] Review grants and views (it's up to DBAs really) but even after that subtask is done, it looks like there is an understandable desire to keep clouddb* hosts more restricted than other cloud* hosts.

We probably don't have much to say there, but we can help. The only grants we've historically taken care of are the ones belonging to the labsdb role, which would be easy to change, but they are only SELECT.

If that's the case (I'm again deferring to DBAs here), then my preferred solution would be to document this decision in wikitech and in code comments in Puppet, and remove ::profile::base::cloud_production from clouddb* hosts (that's the class that allows root access from cloudcumins).

whoever has root should be comfortable to at least react to pages/alerts/triaging things for these hosts too.

@Marostegui personally I'm fine with requiring people with root access to clouddb* hosts to also respond to pages for those hosts.

Right now the wmcs-roots group does not coincide with people getting pages and we probably don't want to change that. If we decide to treat clouddb* hosts as "special" (as described above), my preferred solution would be not to give root access to wmcs-roots and instead give root access to the separate wikireplicas-roots group, documenting that whoever gets added to that group should also get pages for clouddb hosts.

That is probably for WMCS to decide. I'd be fine either way as long as expectations are clear.

There's one issue: that group currently includes only @joanna_borun and she's not getting pages. :) @joanna_borun do you still need root access to clouddb* hosts?

I don't think we are expecting any SRE manager to ssh and troubleshoot issues anyway, so probably it is fine to leave her there if removing/creating a different group is time consuming. This is also for WMCS to decide how to organize or work - I don't have any strong opinions here.

Yes, they are special and will always remain like that for many reasons including:

I want to clarify that in my previous comment "special" was meant as "different from other cloud*" hosts and not as "different from other db*" hosts.

That is probably for WMCS to decide. I'd be fine either way as long as expectations are clear.

My favourite solution would be to treat them like any other cloud* host and give root access to everyone in wmcs-roots, but from the comments above it looks like several people would prefer to keep access more restricted (no root access for people without paging, no root access for NDA volunteers). So my second favourite solution is to use the wikireplicas-roots group.

The downsides I can see in using wmcs-roots for root access:

  • more people get access to some private information (see T368136)
  • it gives root access to non-SREs that are not part of the on-call shifts

The downsides I can see in using wikireplicas-roots for root access:

Given there is not a consensus for using wmcs-roots, if there are no objections I will send patches to implement the wikireplicas-roots solution.

Change #1072755 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] R:wmcs::db::wikireplicas remove access from cloudcumin

https://gerrit.wikimedia.org/r/1072755

Change #1072755 merged by FNegri:

[operations/puppet@production] R:wmcs::db::wikireplicas remove access from cloudcumin

https://gerrit.wikimedia.org/r/1072755