maintain-replicas.pl unmaintained, unmaintainable
Closed, ResolvedPublic

Description

The final step when setting up labs replication for a wiki is to run maintain-replicas.pl, which creates redacted views of the replicated data.

This script is currently unmaintained, and it has several issues:

  • It is written in Perl. Perl is a fine language, but it's not one we tend to use, so proficiency and tooling are scarce.
  • It is poorly factored, which makes it hard to follow. The bulk of the work is done at file scope; explanatory comments are few and far between; and the source alternates between Perl and large blocks of SQL.
  • There is no separation of code and configuration. The list of tables to be mirrored and the columns to expose or redact are all hard-coded. This hurts readability, and it is liable to break on future schema changes. Such breakage could result in leakage of private user data.
  • It makes assumptions about the execution environment that are not always documented or correct. For example, the script attempts to load the database credentials from ~/.my.cnf, which it parses incorrectly.
  • All the work is done in a single shot. The script does not pause or ask for confirmation before performing destructive operations. Because the work is not factored into discrete subroutines, there is no way to resume it if the script is interrupted.
  • It blows up and rebuilds all wikis on every run. There is no way to run it for just one wiki. There is no "dry run" mode.

It should be rewritten, such that:

  • run-time configuration can be specified on the command line, via the environment, or by some other means that does not involve modifying the script;
  • table and column definitions and visibility settings are defined in separate configuration files;
  • the script asks for confirmation before performing destructive operations;
  • users can preview operations with a "dry run" flag;
  • users can limit the scope of operations to a subset of dbs;
  • the script is verbose about what it is doing.

Ideally, the script will clean up after itself in case of failure, or at least provide some useful hints about what needs to be done.
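
As an illustration of the requirements above, here is a minimal sketch of what the rewrite's command-line surface could look like, assuming Python and argparse (all flag names are hypothetical, not a committed design):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description='Maintain redacted views of replicated wiki databases.')
    parser.add_argument('--config', default='view-definitions.yaml',
                        help='table/column visibility settings, kept out of the code')
    parser.add_argument('--databases', nargs='*',
                        help='limit operations to these wiki dbs (default: all)')
    parser.add_argument('--dry-run', action='store_true',
                        help='print the statements that would run, without executing them')
    parser.add_argument('--yes', action='store_true',
                        help='skip the confirmation prompt before destructive operations')
    parser.add_argument('--verbose', action='store_true',
                        help='report each operation as it happens')
    return parser.parse_args()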

Event Timeline

Krenair subscribed.

I wouldn't call this strictly blocked-on-ops yet (since although ops would have to approve the code, the next step is writing it and lots of other people could do that).

As @ori pointed out earlier in IRC, I actually already rewrote the meta_p parts in Python (see the .py file in operations/software.git maintain-replicas). We could rewrite the rest in Python too (and I think it should be easier to allow a subset of DBs if you don't have to deal with meta_p)

AlexMonk-WMF edited subscribers, added: AlexMonk-WMF; removed: Krenair.

I'm having a go at this.

It blows up and rebuilds all wikis on every run.
The script does not pause or ask for confirmation before performing destructive operations.

It truncates the meta_p.wiki table but it doesn't drop any wiki's database/tables.

Change 295607 had a related patch set uploaded (by Alex Monk):
[WIP/POC/POS] Add python version of maintain-replicas script

https://gerrit.wikimedia.org/r/295607

I have to add a view to a newly created labs-only table, so it is created for new wikis, too:

MariaDB LABS localhost enwiki_p > SHOW CREATE VIEW watchlist_count\G
*************************** 1. row ***************************
                View: watchlist_count
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `watchlist_count` AS select `enwiki`.`watchlist_count`.`watchers` AS `watchers`,`enwiki`.`watchlist_count`.`wl_namespace` AS `wl_namespace`,`enwiki`.`watchlist_count`.`wl_title` AS `wl_title` from `enwiki`.`watchlist_count`
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0.01 sec)

Should I try to send a patch to the old script or to the new one?

It blows up and rebuilds all wikis on every run.

It truncates the meta_p.wiki table but it doesn't drop any wiki's database/tables.

He meant the views.

Also, if I may criticize this without being useful, so you can hate me: I think we should completely change the model here and make it closer to MediaWiki's: embrace migrations.

The reason this was applied to all wikis/objects is that sometimes a change had to be applied to all wikis (all ~900 of them), like the one I asked about above, not just when adding new wikis. I think adding versioning and migrations to create new views when appropriate would simplify its handling and maintenance.
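
A rough sketch of what that could mean in practice, with entirely hypothetical table and function names (one possible shape, not a design):

# Each change to the view definitions becomes a numbered migration; a
# bookkeeping table records the highest version applied per db, so a run
# executes only the delta instead of rebuilding all ~900 wikis.
MIGRATIONS = [
    # (version, SQL to apply) -- both entries here are placeholders
    (1, "CREATE OR REPLACE DEFINER=viewmaster VIEW ... AS ..."),
    (2, "DROP VIEW IF EXISTS ..."),
]

def pending_migrations(conn, dbname):
    cur = conn.cursor()
    cur.execute("SELECT MAX(version) FROM meta_p.view_migrations WHERE db = %s",
                (dbname,))
    current = cur.fetchone()[0] or 0
    return [(version, sql) for version, sql in MIGRATIONS if version > current]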

I would also add some smoke tests: check that after a change is applied, no user IPs or hashed passwords are accessible. It doesn't need to be complete, only to detect easy mistakes. The problem with running only on localhost is that you can apply it to labsdb1001 and labsdb1003 and forget about labsdb1008 (which will eventually be labsdb1002's replacement).
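
A sketch of what such a smoke test could look like (the table and column choices are illustrative, assuming the usual _p view naming): connect with an unprivileged account against every labsdb host, and assert that sensitive columns come back empty or NULL.

def smoke_test(conn, dbname):
    """Cheap post-apply sanity checks; deliberately not exhaustive."""
    cur = conn.cursor()
    # Hashed passwords must never be visible through the views.
    cur.execute("SELECT user_password FROM `%s_p`.`user` LIMIT 10" % dbname)
    assert all(pw in (None, '') for (pw,) in cur.fetchall()), \
        "%s: user_password leaked" % dbname
    # IP addresses in recentchanges must be redacted.
    cur.execute("SELECT rc_ip FROM `%s_p`.`recentchanges` LIMIT 10" % dbname)
    assert all(ip in (None, '') for (ip,) in cur.fetchall()), \
        "%s: rc_ip leaked" % dbname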

In summary, I can offer to improve this too, but I cannot be its sole owner (I already have a lot of work maintaining the production side of filtering and the replication itself: replication filters, alter tables, etc.). And the main reason we have filtering in two places is that this wasn't reliable enough.

It blows up and rebuilds all wikis on every run.

It truncates the meta_p.wiki table but it doesn't drop any wiki's database/tables.

He meant the views.

Yeah it does CREATE OR REPLACE I guess... :/

The reason this was applied to all wikis/objects is that sometimes a change had to be applied to all wikis (all ~900 of them), like the one I asked about above, not just when adding new wikis. I think adding versioning and migrations to create new views when appropriate would simplify its handling and maintenance.

I'm struggling to understand exactly what this would look like in practice.

I would also add some smoke tests: check that after a change is applied, no user IPs or hashed passwords are accessible. It doesn't need to be complete, only to detect easy mistakes.

Sounds like a good idea but let's do that in a separate task/commit.

The problem with running only on localhost is that you can apply it to labsdb1001 and labsdb1003 and forget about labsdb1008 (which will eventually be labsdb1002's replacement).

The script doesn't run only on localhost, it has a configurable list of labsdb servers to run on.

In summary, I can offer to improve this too, but I cannot be its sole owner (I already have a lot of work maintaining the production side of filtering and the replication itself: replication filters, alter tables, etc.). And the main reason we have filtering in two places is that this wasn't reliable enough.

Who in the ops group could be its 'sole owner'? No one else has any access to these systems.

Who in the ops group could be its 'sole owner'? No one else has any access to these systems.

Maybe labs admins have something to say about how labs resources are used and accessed?

I'm 100% on board with being on the hook for this process, or at least being a partner. We can co-parent :) The fell-through-the-cracks outcome of T135029 in particular has a few layered reasons, I think: the task lacked description and context; that script was really difficult for me to understand, coming in without context on the use case; there was basically no surrounding documentation; and I had been away for two weeks, which led straight into Wikimania, where I'm solo. This is just one of those historical things that is difficult.

But in general I want to help out here as a team.

So in addition to a rewrite, or reimagining, let's get a general outline doc going.

A few questions I have:

  • For this, where was it run from? Was there a ~/.my.cnf file with creds that were used?
  • I have heard rumblings of a "new wiki setup checklist". Is that a thing now where we could add a note about this to the process? Maybe this?
  • @yuvipanda, where is that etherpad we started braindumping on for this setup?

I once thought of a tool that does something like diff -u <(mysqldump --no-data) <(what-views-and-triggers-and-what-not-for-all-wikis-would-look-like), forged into some Puppet. Puppet would then check every 30 minutes whether, and which, views/triggers/etc. needed to be created/amended/deleted.
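
In rough Python terms, with generate_definitions() standing in for a hypothetical function that renders what the views should look like:

import subprocess

def live_definitions(dbname):
    # --no-data dumps only the CREATE statements (tables and views), no rows.
    return subprocess.check_output(
        ['mysqldump', '--no-data', '--skip-comments', dbname + '_p']).decode('utf-8')

def has_drift(dbname):
    wanted = generate_definitions(dbname)       # hypothetical: the desired state
    return wanted != live_definitions(dbname)   # a real tool would diff and act on the delta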

Also, @jcrespo is aware of this :-), but others might not: There is [[https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/redactatron|redactatron]] which isn't that well documented either.

@scfc redactatron is a horrible piece of software, and we do not want to expand it but to kill it. It has its function in production filtering (which I maintain), but this is a different issue.

I have to add a view to a newly created labs-only table, so it is created for new wikis, too:

MariaDB LABS localhost enwiki_p > SHOW CREATE VIEW watchlist_count\G
*************************** 1. row ***************************
                View: watchlist_count
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `watchlist_count` AS select `enwiki`.`watchlist_count`.`watchers` AS `watchers`,`enwiki`.`watchlist_count`.`wl_namespace` AS `wl_namespace`,`enwiki`.`watchlist_count`.`wl_title` AS `wl_title` from `enwiki`.`watchlist_count`
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0.01 sec)

Should I try to send a patch to the old script or to the new one?

Obviously this 'Create View' query is generated by MySQL and doesn't appear to be as simple as it could be. Do you have a simpler example that would fit our current generated SQL query and could just be added into the config? If so, we could put it in the existing script, replacing the existing watchlist_counts entry in there (which appears to be broken due to the watchlist table not being replicated), and then update my WIP commit to reflect the changes to the existing script.
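
(For what it's worth, the generated definition above exposes all three columns unredacted, so in config terms it presumably belongs with the full views rather than the custom ones; something like the following, though the exact config shape is an assumption:)

# Hypothetical config shape: "full" views expose every column of the
# underlying table, so no custom SELECT body is needed.
fullviews = [
    # ...
    'watchlist_count',
]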

@jcrespo: I was wrong in my last comment and have uploaded https://gerrit.wikimedia.org/r/295751 which, if I understand correctly, should do this.

  • For this, where was it run from? Was there a ~/.my.cnf file with creds that were used?

@ori, please would you address this one?

  • I have heard rumblings of a "new wiki setup checklist". Is that a thing now where we could add a note about this to the process? Maybe this?

This Add_a_wiki page is effectively the 'checklist' that I've been using (and attempting to keep up to date); admittedly, it is missing information about the process of getting labs replication and views going.

What sort of information are you looking for, exactly? How the views are set up, or how replication is configured? Which server does what?
I'm sure that together we have a complete understanding of the systems; I'm not convinced that we each understand the entirety (for example, I'm not familiar with how MySQL replication works in detail, though Jaime is, but I am now very familiar with how the view-maintenance script turns the configured views into actual SQL CREATE DATABASE / CREATE OR REPLACE DEFINER=viewmaster VIEW statements, etc.).
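
In sketch form, the core of that generation amounts to something like this (simplified and hypothetical; the real script also handles column redaction and per-wiki database creation):

def view_statement(db, view, select_body):
    # e.g. view_statement('enwiki', 'watchlist_count',
    #          'SELECT watchers, wl_namespace, wl_title FROM enwiki.watchlist_count')
    return ("CREATE OR REPLACE DEFINER=viewmaster VIEW `%s_p`.`%s` AS %s"
            % (db, view, select_body))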

@scfc BTW, I actually documented redactatron, but that is off-topic for this issue (it is only related to sanitarium filtering).

  • For this, where was it run from? Was there a ~/.my.cnf file with creds that were used?

@ori, please would you address this one?

I ran it from osmium, but it's not relevant -- https://gerrit.wikimedia.org/r/#/c/295564/ (which is the version I ran) includes a change to read the credentials from the process environment instead of ~/.my.cnf, which the script was parsing incorrectly anyway.
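
(Reading the credentials from the environment amounts to no more than the following; the variable names are illustrative, not necessarily the ones the change uses:)

import os

# No config-file parsing involved, which was the failure mode with ~/.my.cnf.
db_user = os.environ['MYSQL_USER']
db_password = os.environ['MYSQL_PASSWORD']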

I feel comfortable that https://gerrit.wikimedia.org/r/#/c/295607/ is a faithful replication of https://github.com/wikimedia/operations-software/blob/master/maintain-replicas/maintain-replicas.pl

I did a few things to validate, including diff -uw with the lists of full views and logging_whitelists. The custom views were a bit more convoluted to compare.

I created this to dump the array from Perl to JSON:

https://phabricator.wikimedia.org/P4161

Creating a file named json_custom_views_from_pl, I then wrote this ugliness to get an idea of the exactness of the translation: https://phabricator.wikimedia.org/P4162
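
(Conceptually, the comparison boils down to something like this; the second file name is hypothetical, and the real paste prints a bit more detail:)

import json

# P4161 dumped the perl array to json_custom_views_from_pl; assume a matching
# dump of the python script's view definitions.
with open('json_custom_views_from_pl') as f:
    pl_views = json.load(f)
with open('json_custom_views_from_py') as f:
    py_views = json.load(f)

for name in sorted(set(pl_views) | set(py_views)):
    if pl_views.get(name) != py_views.get(name):
        print('MISMATCH')
        print('pl', pl_views.get(name))
        print('py', py_views.get(name))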

The output is:

python compare_custom_views.py
24
MISMATCH
pl (log_deleted&4)=0 and log_type IN ('ARRAY(0x1807e78)')
py ANYTHING YOU WRITE HERE WILL BE OVERRIDDEN BY THE SCRIPT TO PULL FROM THE `logging_whitelist` CONFIG
-1
MISMATCH
pl (log_deleted&1)=0 and log_type IN ('ARRAY(0x1807e78)')
py ANYTHING YOU WRITE HERE WILL BE OVERRIDDEN BY THE SCRIPT TO PULL FROM THE `logging_whitelist` CONFIG
-1
MISMATCH
pl log_type IN ('ARRAY(0x1807e78)')
py ANYTHING YOU WRITE HERE WILL BE OVERRIDDEN BY THE SCRIPT TO PULL FROM THE `logging_whitelist` CONFIG
1

because I punted a bit on the logging_whitelist comparison, leaving these 4 to look at manually, and they seem fine. I have also run through testing this against an enwiki replica DB. I came up with T147413, which has been fixed in the wild. I think we should merge the literal translation as-is and then revise, so the history is cleaner.

Tables I see in prod but not in test (enwiki -- and I believe this is mostly noise):

+aft_article_answer
+aft_article_answer_text
+aft_article_feedback
+aft_article_feedback_properties
+aft_article_feedback_ratings_rollup
+aft_article_feedback_select_rollup
+aft_article_field
+aft_article_field_group
+aft_article_field_option
+aft_article_filter_count
+aft_article_revision_feedback_ratings_rollup
+aft_article_revision_feedback_select_rollup
+article_feedback
+article_feedback_pages
+article_feedback_properties
+article_feedback_ratings
+article_feedback_revisions
+article_feedback_stats
+article_feedback_stats_types
+povwatch_log
+povwatch_subscribers
+user_properties_anon

Of these, +user_properties_anon is curious, but I believe it's due to the missing meta_p db in my test setup.

Tables that exist in test and not prod:

-abuse_filter_history (addressed already)
-watchlist_count

I believe watchlist_count is explained by https://github.com/wikimedia/operations-software/commit/c886c32dcd94fb8937b016638036a1a6322ea95a

This is all pretty messy due to a few factors:

  • Manual changes have been made to prod outside of the maintain-replicas paradigm that were not backported, and some of those that were may be applied incompletely across wiki DBs in prod.
  • Several versions of maintain-replicas.pl exist in the wild that were run at various stages. There doesn't seem to have been a deploy strategy historically, so this is currently a mystery to me.
  • There are safety modifications we need to make, such as addressing T147413, that make sense to couple with the Python rewrite going forward. I think we should declare the Perl version here a relic and thank @Krenair profusely for getting us out of it.
  • I have changes I want to make, such as moving this to ops/puppet for deploy, managing the mediawiki-config checkout used as reference separately, moving to a scheme of local apply for each labsdb host, creating a user specifically for maintain-replicas work, and renaming to maintain-views (or some such). Those make more sense to me after the literal Python translation, rather than mixing thought streams.

So, in essence, I am going to merge https://gerrit.wikimedia.org/r/#/c/295607/, but we still shouldn't run this against anything yet, because at the moment it's a literal translation of a script that should itself not be run.

Change 295607 merged by Rush:
Add python version of maintain-replicas script

https://gerrit.wikimedia.org/r/295607

AlexMonk-WMF removed a subscriber: Krenair.

Chase is working on figuring out what else we need to do before we can run the script. https://gerrit.wikimedia.org/r/#/c/314305/ was done which was the most important difference between the script and current production state.

https://gerrit.wikimedia.org/r/#/c/314304/ was also done