
Replicate & sanitize wikitech data
Closed, Resolved (Public)

Description

We are about to move wikitech onto a 'normal' DB server which (as I understand it) means it will automatically show up in dumps and replica views.

This task is to collect opinions on whether wikitech contains anything unique that we need to redact and that is not already covered by the standard redaction applied to other wikis.

From the DB migration process doc:

Create replication filters on sanitarium hosts to avoid replicating wikitech (or maybe we do want to replicate it?) TO BE CONFIRMED with WMCS/Security?

Event Timeline

Before replication, @Reedy suggests that we drop all openstack* tables and all oai* tables.

To really speak on this, I'd have to look at the schema for wikitech for anything that doesn't match other mediawikis. The views on wikireplicas would only include tables expressly defined in the yaml config, so anything that does not exist on the main wikis would not get exposed.
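The schema comparison described above could be sketched roughly as follows. This is only an illustration: the table and column sets are hypothetical and heavily abbreviated, and `schema_diff` is a made-up helper name. In practice the inputs could be read from `information_schema.columns` on each database.

```python
def schema_diff(reference, candidate):
    """Return tables and columns present in `candidate` but not in `reference`.

    Both arguments map a table name to the set of its column names.
    """
    extra_tables = sorted(set(candidate) - set(reference))
    extra_columns = {
        table: sorted(candidate[table] - reference[table])
        for table in sorted(set(candidate) & set(reference))
        if candidate[table] - reference[table]
    }
    return extra_tables, extra_columns


# Hypothetical schemas for illustration only; the column names are assumptions.
standard_wiki = {
    "user": {"user_id", "user_name"},
    "page": {"page_id", "page_title"},
}
wikitech = {
    "user": {"user_id", "user_name", "example_extra_column"},
    "page": {"page_id", "page_title"},
    "ldap_domains": {"domain_id", "domain_name"},
}

extra_tables, extra_columns = schema_diff(standard_wiki, wikitech)
print(extra_tables)   # tables that exist only on wikitech
print(extra_columns)  # columns that exist only on wikitech's copy of a shared table
```

Anything reported here would be a candidate for review before replication is enabled.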

I presume there is some sanitization of the user table on the sanitarium level, but at the wikireplicas, you only get this info (directly from the config):

user:
  source: user
  view: >
    select user_id, user_name, user_real_name, NULL as user_password, NULL as user_newpassword,
    NULL as user_email, NULL as user_options, NULL as user_touched, NULL as user_token,
    NULL as user_email_authenticated, NULL as user_email_token, NULL as user_email_token_expires,
    user_registration, NULL as user_newpass_time, user_editcount, NULL as user_password_expires
  where: (SELECT 1 from ipblocks where ipb_auto=0 AND ipb_deleted=1 AND ipb_user=user_id) is NULL

Columns not listed in that view definition are not exposed.
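To illustrate how a view like this masks data, here is a minimal sketch using an in-memory SQLite database as a stand-in for the real MariaDB hosts. The column list is abbreviated, the view name `user_public` and the sample rows are assumptions, and only the NULL masking and the `ipblocks` filter mirror the config above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    user_id INTEGER PRIMARY KEY,
    user_name TEXT,
    user_email TEXT,
    user_editcount INTEGER
);
CREATE TABLE ipblocks (ipb_user INTEGER, ipb_auto INTEGER, ipb_deleted INTEGER);

-- Sample data (hypothetical): user 2 has a block with ipb_deleted=1.
INSERT INTO user VALUES (1, 'Alice', 'alice@example.org', 10);
INSERT INTO user VALUES (2, 'Hidden', 'hidden@example.org', 3);
INSERT INTO ipblocks VALUES (2, 0, 1);

-- Private columns are replaced by NULL; hidden users are filtered out entirely.
CREATE VIEW user_public AS
    SELECT user_id, user_name, NULL AS user_email, user_editcount
    FROM user
    WHERE (SELECT 1 FROM ipblocks
           WHERE ipb_auto=0 AND ipb_deleted=1 AND ipb_user=user_id) IS NULL;
""")

rows = conn.execute("SELECT * FROM user_public ORDER BY user_id").fetchall()
print(rows)
```

The email address never appears in the view output, and the user with a matching `ipblocks` row is omitted altogether.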

> Before replication, @Reedy suggests that we drop all openstack* tables and all oai* tables.

That sounds like the sort of thing the sanitarium does, but I don't know the details on that. @Marostegui would know.

Again, any tables that aren't expressly defined in the maintain-views.yaml are not exposed to users in mysql grants, though that doesn't mean there aren't other ways to lock them down.

For dumps, I don't know how those are generated at all. @ArielGlenn may have more useful comments on that.

I would also say: if anyone knows of columns in the user table (or similar tables) that differ from other wikis, it may be wise to explicitly NULL them in the views if they are not public, so that future iterations of the config make it clear they must not be exposed.

I can also add a comment in the views yaml for anything we must not ever expose (like an ldap table, for instance). This is presuming such things are not stripped out at the sanitarium.
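The allowlist behaviour, plus a guard for tables that must never be exposed, might be checked along these lines. This is a sketch under assumptions: the config is represented as a plain dict rather than parsed from the real YAML file, and the helper names and the `never_expose` set are hypothetical.

```python
def exposed_tables(schema_tables, view_config):
    """Only tables that are named in the view config can be reached through views."""
    return sorted(set(schema_tables) & set(view_config))


def forbidden_views(view_config, never_expose):
    """Tables that are defined as views even though they are marked never-expose."""
    return sorted(set(view_config) & set(never_expose))


# Hypothetical inputs for illustration.
schema_tables = ["user", "page", "ldap_domains"]
view_config = {"user": {"source": "user"}, "page": {"source": "page"}}
never_expose = {"ldap_domains"}

exposed = exposed_tables(schema_tables, view_config)
violations = forbidden_views(view_config, never_expose)
print(exposed, violations)
```

A non-empty `violations` list would mean someone added a view for a table that was flagged as private.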

> For dumps, I don't know how those are generated at all. @ArielGlenn may have more useful comments on that.

There's a cron job. See: https://github.com/wikimedia/puppet/blob/production/modules/openstack/files/wikitech/mw-xml.sh

>> Before replication, @Reedy suggests that we drop all openstack* tables and all oai* tables.

> That sounds like the sort of thing the sanitarium does, but I don't know the details on that. @Marostegui would know.

> Again, any tables that aren't expressly defined in the maintain-views.yaml are not exposed to users in mysql grants, though that doesn't mean there aren't other ways to lock them down.

These are the replication filters we currently have on sanitarium hosts:

The filter format is: $database.$table

Do not replicate any table on these given databases:

replicate-wild-ignore-table = mysql.%
replicate-wild-ignore-table = oai.%

replicate-wild-ignore-table = advisorswiki.%
replicate-wild-ignore-table = arbcom_cswiki.%
replicate-wild-ignore-table = arbcom_dewiki.%
replicate-wild-ignore-table = arbcom_enwiki.%
replicate-wild-ignore-table = arbcom_fiwiki.%
replicate-wild-ignore-table = arbcom_nlwiki.%
replicate-wild-ignore-table = arbcom_ruwiki.%
replicate-wild-ignore-table = auditcomwiki.%
replicate-wild-ignore-table = boardgovcomwiki.%
replicate-wild-ignore-table = boardwiki.%
replicate-wild-ignore-table = chairwiki.%
replicate-wild-ignore-table = chapcomwiki.%
replicate-wild-ignore-table = checkuserwiki.%
replicate-wild-ignore-table = collabwiki.%
replicate-wild-ignore-table = ecwikimedia.%
replicate-wild-ignore-table = electcomwiki.%
replicate-wild-ignore-table = execwiki.%
replicate-wild-ignore-table = fdcwiki.%
replicate-wild-ignore-table = grantswiki.%
replicate-wild-ignore-table = id_internalwikimedia.%
replicate-wild-ignore-table = iegcomwiki.%
replicate-wild-ignore-table = ilwikimedia.%
replicate-wild-ignore-table = internalwiki.%
replicate-wild-ignore-table = legalteamwiki.%
replicate-wild-ignore-table = movementroleswiki.%
replicate-wild-ignore-table = noboard_chapterswikimedia.%
replicate-wild-ignore-table = officewiki.%
replicate-wild-ignore-table = ombudsmenwiki.%
replicate-wild-ignore-table = otrs_wikiwiki.%
replicate-wild-ignore-table = projectcomwiki.%
replicate-wild-ignore-table = searchcomwiki.%
replicate-wild-ignore-table = spcomwiki.%
replicate-wild-ignore-table = stewardwiki.%
replicate-wild-ignore-table = sysop_itwiki.%
replicate-wild-ignore-table = techconductwiki.%
replicate-wild-ignore-table = transitionteamwiki.%
replicate-wild-ignore-table = wg_enwiki.%
replicate-wild-ignore-table = wikimaniateamwiki.%
replicate-wild-ignore-table = zerowiki.%

Do not replicate any of these tables in any database:

replicate-wild-ignore-table = %.__wmf_checksums
replicate-wild-ignore-table = %.accountaudit_login
replicate-wild-ignore-table = %.arbcom1_vote
replicate-wild-ignore-table = %.archive_old
replicate-wild-ignore-table = %.blob_orphans
replicate-wild-ignore-table = %.blob_tracking
replicate-wild-ignore-table = %.bot_passwords
replicate-wild-ignore-table = %.bv2009_edits
replicate-wild-ignore-table = %.categorylinks_old
replicate-wild-ignore-table = %.click_tracking
replicate-wild-ignore-table = %.cu_changes
replicate-wild-ignore-table = %.cu_log
replicate-wild-ignore-table = %.cur
replicate-wild-ignore-table = %.discussiontools_subscription
replicate-wild-ignore-table = %.echo_email_batch
replicate-wild-ignore-table = %.echo_event
replicate-wild-ignore-table = %.echo_target_page
replicate-wild-ignore-table = %.echo_unread_wikis
replicate-wild-ignore-table = %.echo_notification
replicate-wild-ignore-table = %.echo_push_subscription
replicate-wild-ignore-table = %.edit_page_tracking
replicate-wild-ignore-table = %.email_capture
replicate-wild-ignore-table = %.exarchive
replicate-wild-ignore-table = %.exrevision
replicate-wild-ignore-table = %.globalnames
replicate-wild-ignore-table = %.growthexperiments_link_recommendations
replicate-wild-ignore-table = %.growthexperiments_link_submissions
replicate-wild-ignore-table = %.growthexperiments_mentor_mentee
replicate-wild-ignore-table = %.growthexperiments_mentee_data
replicate-wild-ignore-table = %.hidden
replicate-wild-ignore-table = %.image_old
replicate-wild-ignore-table = %.job
replicate-wild-ignore-table = %.linkscc
replicate-wild-ignore-table = %.localnames
replicate-wild-ignore-table = %.log_search
replicate-wild-ignore-table = %.logging_old
replicate-wild-ignore-table = %.long_run_profiling
replicate-wild-ignore-table = %.migrateuser_medium
replicate-wild-ignore-table = %.moodbar_feedback
replicate-wild-ignore-table = %.moodbar_feedback_response
replicate-wild-ignore-table = %.msg_resource
replicate-wild-ignore-table = %.oathauth_users
replicate-wild-ignore-table = %.oauth_accepted_consumer
replicate-wild-ignore-table = %.oauth_ratelimit_client_tier
replicate-wild-ignore-table = %.oauth_registered_consumer
replicate-wild-ignore-table = %.oauth2_access_tokens
replicate-wild-ignore-table = %.objectcache
replicate-wild-ignore-table = %.old_growth
replicate-wild-ignore-table = %.oldimage_old
replicate-wild-ignore-table = %.optin_survey
replicate-wild-ignore-table = %.prefstats
replicate-wild-ignore-table = %.prefswitch_survey
replicate-wild-ignore-table = %.profiling
replicate-wild-ignore-table = %.querycache
replicate-wild-ignore-table = %.querycache_info
replicate-wild-ignore-table = %.querycache_old
replicate-wild-ignore-table = %.querycachetwo
replicate-wild-ignore-table = %.reading_list
replicate-wild-ignore-table = %.reading_list_entry
replicate-wild-ignore-table = %.securepoll_cookie_match
replicate-wild-ignore-table = %.securepoll_elections
replicate-wild-ignore-table = %.securepoll_entity
replicate-wild-ignore-table = %.securepoll_lists
replicate-wild-ignore-table = %.securepoll_msgs
replicate-wild-ignore-table = %.securepoll_options
replicate-wild-ignore-table = %.securepoll_properties
replicate-wild-ignore-table = %.securepoll_questions
replicate-wild-ignore-table = %.securepoll_strike
replicate-wild-ignore-table = %.securepoll_voters
replicate-wild-ignore-table = %.securepoll_votes
replicate-wild-ignore-table = %.spoofuser
replicate-wild-ignore-table = %.text
replicate-wild-ignore-table = %.titlekey
replicate-wild-ignore-table = %.transcache
replicate-wild-ignore-table = %.translate_cache
replicate-wild-ignore-table = %.uploadstash
replicate-wild-ignore-table = %.urlshortcodes
replicate-wild-ignore-table = %.user_newtalk
replicate-wild-ignore-table = %.vote_log
replicate-wild-ignore-table = %.watchlist
replicate-wild-ignore-table = %.watchlist_expiry
replicate-wild-ignore-table = %.wikimedia_editor_tasks_counts
replicate-wild-ignore-table = %.wikimedia_editor_tasks_keys
replicate-wild-ignore-table = %.wikimedia_editor_tasks_targets_passed
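For anyone checking whether a given table would be caught by these filters, the matching can be approximated with the sketch below. It assumes the usual MySQL wildcard semantics for `replicate-wild-ignore-table` (`%` matches any run of characters, `_` matches exactly one character, and a backslash escapes the next character). The helper names and the `examplewiki` database are hypothetical, and only a handful of patterns from the list are included.

```python
import re


def wild_to_regex(pattern):
    """Translate a MySQL-style wildcard pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character literal.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def is_ignored(database, table, patterns):
    """True if `database.table` matches any of the ignore patterns."""
    name = f"{database}.{table}"
    return any(wild_to_regex(p).fullmatch(name) for p in patterns)


# A small subset of the filters listed above.
patterns = ["mysql.%", "oai.%", "officewiki.%", "%.watchlist", "%.cu_changes"]

print(is_ignored("examplewiki", "watchlist", patterns))   # private table
print(is_ignored("examplewiki", "page", patterns))        # ordinary table
print(is_ignored("officewiki", "page", patterns))         # private wiki
```

Note that the unescaped `_` in a pattern such as `%.cu_changes` is itself a single-character wildcard, so the pattern matches slightly more than the literal table name.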

Change 698718 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] realm.pp: Add ldap_domains table to the private list

https://gerrit.wikimedia.org/r/698718

@Andrew this should take care of excluding the ldap_domains table from replication. Take a look whenever you can: https://gerrit.wikimedia.org/r/c/operations/puppet/+/698718/. It can be merged at any time (it needs a MySQL restart).

Change 698718 merged by Marostegui:

[operations/puppet@production] realm.pp: Add ldap_domains table to the private list

https://gerrit.wikimedia.org/r/698718

Mentioned in SAL (#wikimedia-operations) [2021-06-08T14:08:28Z] <marostegui> Restart sanitarium hosts (db2094, db2095, db1154, db1155) to pick up new filters T284106

The above patch is merged and replication restarted.
The following new filter is now in place:

replicate-wild-ignore-table = %.ldap_domains

With the above patch, this can be closed. I will create a task to track the creation of the views once the data has arrived on the clouddb* hosts.