Page MenuHomePhabricator

[wikireplicas] Make sure there is no sensitive data in clouddb hosts
Closed, DeclinedPublic

Description

We want to make sure there is no sensitive data stored in the databases hosted in clouddb* before we give root access to more people (see parent task).

Quoting from the discussion in the parent task:

non-public data (such as suppressed edits or bans) are possibly available if you have root access to clouddbs

Non public data (like PII) shouldn't be there in the first place. We might have some public data which is restricted to certain queries (by the views).

Event Timeline

What sort of data y'all are concerned about exposing to new roots on the replica db hosting nodes themselves? These boxes already expose less data than are exposed to those in the deployment, restricted, or analytics-privatedata-users groups by virtue of the Sanatarium sanitization steps that are semi-obviously not present on the un-redacted source hosts those users can reach with various query mechanisms.

This was linked in the parent task but I'm not sure if it's really a blocker here: T103011

What sort of data y'all are concerned about exposing to new roots on the replica db hosting nodes themselves? These boxes already expose less data than are exposed to those in the deployment, restricted, or analytics-privatedata-users groups by virtue of the Sanatarium sanitization steps that are semi-obviously not present on the un-redacted source hosts those users can reach with various query mechanisms.

I think in general not exposing more data that we should strictly should to. Also, the issue with root is that that user can make changes to replication, grants, stop/start processes and all that. I think the combination of how powerful root access is and not only for reading data - as you said just with deployment you can access the same (and more data), but with root you can pretty much do anything you want to the mariadb service.
I assume wmcs-roots is just WMCS staff and those would be the ones having root access?

I assume wmcs-roots is just WMCS staff and those would be the ones having root access?

wmcs-roots is defined in admin/data/data.yaml, at the moment it's only WMCS staff, but I think it did include trusted volunteers in the past and could potentially include trusted volunteers in the future.

Is volunteers vs staff a critical point? Would a volunteer be acceptable if they signed the NdA? What about contractors? I think we should start by defining the requirements for having root access to wikireplica hosts and to other WMCS hosts: ideally they would be the same requirements, but if we want "wikireplica roots" to have additional requirements that's also fine.

Also, the issue with root is that that user can make changes to replication, grants, stop/start processes and all that.

That is true, but also not clearly in the scope of this ticket which seems to be specifically about addressing claims of data privacy raised in other tasks.

That is true, but also not clearly in the scope of this ticket which seems to be specifically about addressing claims of data privacy raised in other tasks.

Agreed, let's keep the more general discussion in the parent task, and focus on which data we can remove (if any).

So in terms of data, my recap is:

  • root password is differentr from production
  • the data that is present there is sanitized and there's not data there that cannot be queried publicly. Although there's some data data there that we filter via the views and not only via sanitarium, but I guess that's fine
  • replication user password I don't think it is such a big deal as the risk of having someone to set up a new replica directly from production is minimum and there would be lots of others things that would need to be done for that to be successful.
  • wikiuser and wikiadmin are no longer there.
  • non-public data (such as suppressed edits or bans) are possibly available - but I don't know enough MW to be able to say if this is still doable or not @Ladsgroup would you know?

That last point would be the only one we need to clarify before moving to the parent task where we can discuss about the OS (and responsibility) implications of those having root.

there's some data data there that we filter via the views and not only via sanitarium, but I guess that's fine

Do you know what data, and why it's not filtered via sanitarium? I imagine it's easier to filter using views, but it should be possible to add a trigger in the sanitarium for any column that is currently filtered through the views.

there's some data data there that we filter via the views and not only via sanitarium, but I guess that's fine

Do you know what data, and why it's not filtered via sanitarium? I imagine it's easier to filter using views, but it should be possible to add a trigger in the sanitarium for any column that is currently filtered through the views.

The sanitarium layer drops/blanks tables and columns that should never be shown because of privacy/secrecy. Redaction and deletion in MediaWiki are generally soft actions controlled by bit field permissions. These are handled in the view layer generated by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/wmcs/db/wikireplicas/views/maintain-views.py and configured by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/wmcs/db/wikireplicas/maintain-views.yaml. The views for tables not listed in fullviews apply various WHERE clause and/or JOIN restrictions to omit data that is not available to the general public via the MediaWiki UI and/or API.

Some of these query time filtering restrictions are relatively elaborate. This is a direct side effect of exposing an operational schema to the replica users rather than a curated datamart of some kind.

actor:
  source: actor
  view: select actor_id, actor_user, actor_name
  where: >
    exists( select 1 from user where user_id = actor_user )
    AND ( select 1 from block join block_target on bt_id=bl_target where bl_deleted=1 AND bt_user=actor_user ) IS NULL
    OR exists( select 1 from archive where ar_actor = actor_id AND ar_deleted&4 = 0 )
    OR exists( select 1 from block where bl_by_actor = actor_id AND bl_deleted=0 )
    OR exists( select 1 from image where img_actor = actor_id )
    OR exists( select 1 from oldimage where oi_actor = actor_id AND oi_deleted&4 = 0 )
    OR exists( select 1 from filearchive where fa_actor = actor_id AND fa_deleted&4 = 0 )
    OR exists( select 1 from recentchanges where rc_actor = actor_id AND rc_deleted&4 = 0 )
    OR exists( select 1 from logging where log_actor = actor_id AND log_deleted&4 = 0 AND $INSERTED_EXPR$ )
    OR exists( select 1 from revision WHERE rev_actor = actor_id AND rev_deleted&4 = 0 )
archive:
  source: archive
  view: >
    select ar_id, ar_namespace, ar_title, NULL as ar_text, NULL as ar_comment_id,
    if(ar_deleted&4,0,ar_actor) as ar_actor, ar_timestamp, ar_minor_edit, NULL as ar_flags, ar_rev_id,                                                              ar_deleted, if(ar_deleted&1,null,ar_len) as ar_len,
    ar_page_id, ar_parent_id, if(ar_deleted&1,null,ar_sha1) as ar_sha1
archive_compat:
  source: archive
  view: >
    select ar_id, ar_namespace, ar_title, NULL as ar_text, NULL as ar_comment, NULL as ar_comment_id,                                                    
    case when ar_deleted&4 != 0 then null else COALESCE( actor_user, 0 ) END AS ar_user,
    case when ar_deleted&4 != 0 then null else actor_name END AS ar_user_text,
    if(ar_deleted&4,0,ar_actor) as ar_actor, ar_timestamp, ar_minor_edit, NULL as ar_flags, ar_rev_id,
    if(ar_deleted&1,null,content_id) as ar_text_id,
    ar_deleted, if(ar_deleted&1,null,ar_len) as ar_len,
    ar_page_id, ar_parent_id, if(ar_deleted&1,null,ar_sha1) as ar_sha1,
    if(ar_deleted&1,null,model_name) as ar_content_model,
    NULL as ar_content_format
  join:
    - table:
      - table: slots
      - type: JOIN
        table: slot_roles
        condition: on (slot_role_id = role_id AND role_name = 'main')
      - type: JOIN
        table: content
        condition: on (slot_content_id = content_id)
      - type: JOIN
        table: content_models
        condition: on (content_model = model_id)
      condition: on slot_revision_id = ar_rev_id
    - table: actor
      condition: on ar_actor = actor_id

So in terms of data, my recap is:

  • non-public data (such as suppressed edits or bans) are possibly available - but I don't know enough MW to be able to say if this is still doable or not @Ladsgroup would you know?

That last point would be the only one we need to clarify before moving to the parent task where we can discuss about the OS (and responsibility) implications of those having root.

Revision content does not exist in the replicas at all. Meta data for suppressed edits exists in the raw tables but is hidden from normal users by view layer restrictions as highlighted in T368136#9928727. The root db user would be able to see suppressed data by querying the underlying tables in the exact same way that sql enwiki would expose that data to folks in the deployment, restricted, and ops groups from the mwmaint* boxes.

I personally have no issue with giving root rights to people who have restricted or deployment rights in production (where they have way broader access to the data) but I don't feel comfortable about giving such access to general NDA (I don't think it's mentioned here but to make it clear).

I think several things here needs to untangled:

  • We need to review grants in the cloud. That is for general protection of privacy to make sure not anyone in the internet could access PII data we hid behind a flag (because of knowing the old wikiuser password). That's not really about root access where the users are NDA'ed and already have access to more private data via production boxes.
    • Someone with root could extract those hashes and use them but at the same time, if you have access to production boxes. there isn't much you can't already do.
  • We need to review the views rules. This is also the exact same thing as above, it's about general privacy and making sure non NDA'ed people won't have access to PII data. It's not about root either.
  • There are other risks associated with giving root access to cloud replicas, like access to hash of replication password, etc. but as Manuel said above. That is minimal and if someone has access to production boxes, they can do much more harm than the cloud replicas.

HTH

@bd808 @Ladsgroup thanks for your replies!

I will reiterate that the general goal is to make root access to clouddb* hosts as safe as possible. If we make root access to clouddb* safer, we can then feel more confident in giving this access to more people. Exactly who should gain access is something we can discuss separately in the parent task.

This task is about the additional data that you can access if you have root on clouddb* hosts, for example the additional data that is present in the database but is not available to end users, because it is filtered at the view layer. Can we somehow remove the data that is currently filtered at the view layer, and instead blank out this data at the sanitarium level?

Given the complex rules mentioned by @bd808 above, I assume it's not gonna be easy, but I wanted to double check if there is any technical solution that you can think of. If not, we can close this task as "Declined".

Re: reviewing grants and views rules, I created a separate task: T368748: [wikireplicas] Review grants and views

Just one addition: sanitarium hosts also have replication filters to exclude tables or entire databases (private wikis).

Can we somehow remove the data that is currently filtered at the view layer, and instead blank out this data at the sanitarium level?

We do that in many places already, as Manuel mentioned, private wikis are completely filtered. Individual columns also get fully filtered (e.g. user_password and user_email) but there are other cases that can't really be cleaned up via replication and honestly some are so complicated that are not possible to remove them. e.g. orphan rows of linktarget require left join with two of our largest tables (templatelinks and pagelinks). I highly doubt it'd be possible honestly for everything.

I highly doubt it'd be possible honestly for everything.

I tend to agree, I underestimated the amount of complex filtering that is handled by views.

I will close this task as "Declined", as apparently all the private/sensitive data that was possible to remove at the Sanitarium level is already removed, and what is left in clouddb* databases is very difficult to remove completely, at least with the current wikireplicas architecture.

The other thing that was mentioned in this task was cleaning up grants, but I split that into a separate task: T368748: [wikireplicas] Review grants and views.