
Provision db1095 with at least 1 shard, sanitize and test slave-side triggers
Closed, Resolved (Public)

Description

db1095 will be a temporary sanitarium while we set up the new db1095 -> [labsdb1009, labsdb1010, labsdb1011] replication.

  • Probably 4 production shards can be done easily if copied from dbstore2001 (but it is not essential)
  • Run the sanitization script
  • Create the triggers
  • Enable slave-only triggers activated on row-based replication and test that they work as advertised in STATEMENT, MIXED and ROW binlog formats (see the sketch after this list)
  • Audit the tables
  • Finally, copy the result data to labsdb1009/10/11 and drop there the triggers
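
As a reference for the trigger work, this is a minimal sketch of what one sanitizing trigger could look like. The table and column (abuse_filter_log.afl_ip) match the redact_sanitarium.sh output quoted later in this task; the trigger body and the reliance on MariaDB's slave_run_triggers_for_rbr setting are assumptions, not the exact generated DDL.

-- Assumption: slave_run_triggers_for_rbr=YES so the trigger also fires for
-- row-based replication events applied by the slave threads.
DROP TRIGGER IF EXISTS testwiki.abuse_filter_log_insert;
CREATE TRIGGER testwiki.abuse_filter_log_insert
  BEFORE INSERT ON testwiki.abuse_filter_log
  FOR EACH ROW SET NEW.afl_ip = '';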

Event Timeline

Change 321735 had a related patch set uploaded (by Jcrespo):
Add new labs (sanitarium) host db1095

https://gerrit.wikimedia.org/r/321735

Change 321735 merged by Jcrespo:
prometheus-mysql-exporter: Add new labs (sanitarium) host db1095

https://gerrit.wikimedia.org/r/321735

The tests on T150960 are looking good, so we need to discuss the next steps. This is what we have discussed on IRC as next steps for db1095:

  • Import S3 via normal transfer (netcat for instance) to db1095
  • Import S1 by moving its tablespace (see the sketch after this list).
  • Sanitize both shards
  • Compress both shards
  • Transfer that via nc to the new labs servers
  • Keep importing shards (no need to do it on this first run)
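
A rough sketch of the tablespace move for S1, assuming InnoDB file-per-table and a depooled, quiesced source; the table name (revision) is only illustrative:

-- On the source host: quiesce the table and export its tablespace metadata.
FLUSH TABLES enwiki.revision FOR EXPORT;
-- (copy revision.ibd and revision.cfg to db1095, then release the lock)
UNLOCK TABLES;
-- On db1095, with an identical empty table definition already created:
ALTER TABLE enwiki.revision DISCARD TABLESPACE;
-- (move the copied .ibd/.cfg files into place in the datadir, then)
ALTER TABLE enwiki.revision IMPORT TABLESPACE;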

Looks good. s3 will need lots of checking. Maybe I could work on a heuristic alert for private data?

That would be a massive win!!!

Candidate masters for db1095:

S3:
Depooled servers:
db1035
db1044

S1:
We can stick to db1052, which needs the following 3 tables rebuilt if possible, as they have old timestamp formats: categorylinks, objectcache, __wmf_checksums
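
If those three tables do get rebuilt, a null ALTER should be enough; a sketch, assuming rebuilding in place on the depooled host is acceptable:

-- ALTER TABLE ... FORCE performs a no-op rebuild that rewrites every row,
-- which also normalizes old on-disk formats.
ALTER TABLE enwiki.categorylinks FORCE;
ALTER TABLE enwiki.objectcache FORCE;
ALTER TABLE enwiki.__wmf_checksums FORCE;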

Change 323504 had a related patch set uploaded (by Marostegui):
site.pp: db1044's binlog changed to ROW

https://gerrit.wikimedia.org/r/323504

Change 323505 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Added comment for db1044

https://gerrit.wikimedia.org/r/323505

Change 323525 had a related patch set uploaded (by Jcrespo):
[WIP] Create script to check that sanitarium filtering is working

https://gerrit.wikimedia.org/r/323525

Change 323504 merged by Marostegui:
site.pp: Change db1044 binlog to ROW

https://gerrit.wikimedia.org/r/323504

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:03:58Z] <marostegui> Stop MySQL on db1044 - (depooled) maintenance - T150802

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:08:04Z] <marostegui> Stop MySQL on db1095 - maintenance - T150802

The transfer of s3 has started, from db1044 to db1095.

Change 323505 merged by jenkins-bot:
db-eqiad.php: Added comment for db1044

https://gerrit.wikimedia.org/r/323505

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:18:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Added comments to db1044 status - T150802 (duration: 00m 45s)

  • data transferred
  • ran mysql_upgrade on it, all good
  • replication is now flowing, using ROW-based replication

pending: enable pt-heartbeat on db1044

I am going to import s1 (enwiki) now.
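
For the record, adding the s1 thread means a second replication connection on db1095, roughly like this (a sketch using MariaDB multi-source syntax; the coordinates and credentials are placeholders):

CHANGE MASTER 's1' TO
  MASTER_HOST='db1052.eqiad.wmnet',
  MASTER_USER='repl',
  MASTER_PASSWORD='********',
  MASTER_LOG_FILE='db1052-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G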

Mentioned in SAL (#wikimedia-operations) [2016-11-28T12:35:26Z] <marostegui> Stop replication db1052 (depooled) - maintenance - T150802

S3 and S1 are now replicating in db1095.
There were some issues with the heartbeat table when replicating ROW-based, which @jcrespo fixed (maybe we can post here what happened and how it was fixed, just for the record?).

I will leave this replicating overnight; tomorrow, early in the morning, if everything worked fine during the night, I will start the sanitization script for enwiki only.

Basically, the heartbeat table is shared between shards, so the replicated REPLACE command failed twice, once per DC master and shard, as there was no previous row. heartbeat is a complex replication setup because it is the only table updated by several DCs at the same time (which makes it easy to control replication from several shards with a single SELECT). heartbeat normally forces statement-based replication so that it overwrites whatever is there before, but that did not work in this case because the row-based replication comes from a second-tier slave (db1044 converted the statement into a row update). If this were a direct slave of the master, this issue wouldn't happen.

To avoid problems next time, there are 2 ways:

  • copy the 2 shard rows consistently with the backup, if such a thing is possible

or (especially if the problem has already been created):

  • Just insert 2 random rows with the right server_ids (the ones from each master on each dc)
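
A sketch of the second option, assuming the usual pt-heartbeat table layout and placeholder server_id values:

-- Seed one row per master so the replicated row-format updates find a row to overwrite.
-- 1 and 2 are placeholders: use the real server_id of the eqiad and codfw masters.
INSERT INTO heartbeat.heartbeat (ts, server_id)
VALUES (NOW(6), 1), (NOW(6), 2);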

Mentioned in SAL (#wikimedia-operations) [2016-11-29T06:41:13Z] <marostegui> Stopping replication db1095 - s1 instance for maintenance - T150802

@jcrespo thanks for the detailed explanation! :)

The replication for both threads had no issues during the night so what I have done:

  • stopped replication on s1 channel
  • from a screen called 23347.sanitize_db1095 on neodymium I am running the sanitize script on the enwiki database only.

Change 324040 had a related patch set uploaded (by Jcrespo):
[WIP] labsdb: Move repo from views to common; add private data check

https://gerrit.wikimedia.org/r/324040

Change 323525 merged by Jcrespo:
Create script to check that sanitarium filtering is working

https://gerrit.wikimedia.org/r/323525

I've done a quick restart to update replication filters for T151752. I checked no background operation was running, in particular the above screen, or any other alter table. I've left everything how I found it: s1 stopped (all threads), s3 running.

Change 324040 merged by Jcrespo:
labsdb: Move repo from views to common; add private data check

https://gerrit.wikimedia.org/r/324040

Change 324208 had a related patch set uploaded (by Jcrespo):
Migrate redact_sanitarium.sh script from software to puppet

https://gerrit.wikimedia.org/r/324208

Change 324208 merged by Jcrespo:
Migrate redact_sanitarium.sh script from software to puppet

https://gerrit.wikimedia.org/r/324208

This now "just works" (TM)- no need for passwords, hosts or anything:

db1095:~$ redact_sanitarium.sh testwiki
-- abuse_filter_log
-- afl_ip found
DROP TRIGGER IF EXISTS testwiki.abuse_filter_log_insert;
...

This now "just works" (TM)- no need for passwords, hosts or anything:

db1095:~$ redact_sanitarium.sh testwiki
-- abuse_filter_log
-- afl_ip found
DROP TRIGGER IF EXISTS testwiki.abuse_filter_log_insert;
...

Nice one!!!
We will test it with s3, which is still pending sanitization, so the script will have some databases to take care of :-)

Note that the wiki lists are already available locally, too. To get a list of public wikis (private.dblist is given twice so that uniq -u, which only keeps lines appearing exactly once, drops every private wiki):

db1095:~$ cd /usr/local/lib/mediawiki-config/dblists/
db1095:/usr/local/lib/mediawiki-config/dblists$ sort all.dblist private.dblist private.dblist | uniq -u

The following queries on enwiki returned no rows, so the first iteration of the sanitization went fine:

select * from abuse_filter_log where afl_ip !='';
select * from archive where ar_text !='';
select * from archive where ar_comment !='';
select * from mark_as_helpful where mah_system_type !='';
select * from mark_as_helpful where mah_user_agent !='';
select * from mark_as_helpful where mah_locale !='';
select * from recentchanges where rc_ip !='';
select * from revision where rev_text_id != 0;
select * from user where user_password !='';
select * from user where user_newpassword !='';
select * from user where user_email !='';
select * from user where user_options !='';
select * from user where user_token !='';
select * from user where user_email_authenticated !='';
select * from user where user_email_token !='';
select * from user where user_email_token_expires !='';
select * from user where user_newpass_time !='';
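
The same audit can be condensed into per-column counts, so a clean run is just a column of zeros (a hedged variant of the queries above, not what was actually run):

SELECT 'recentchanges.rc_ip' AS col, COUNT(*) AS leaked FROM recentchanges WHERE rc_ip != ''
UNION ALL SELECT 'user.user_email', COUNT(*) FROM user WHERE user_email != ''
UNION ALL SELECT 'user.user_token', COUNT(*) FROM user WHERE user_token != '';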

I have started replication on s1, and in a few hours I will run the queries again to check that the triggers are working properly.

Mentioned in SAL (#wikimedia-operations) [2016-11-29T18:02:03Z] <marostegui> Stopping replication db1095 (new sanitarium, not in use) on s1 instance for maintenance - T150802

The server caught up and I can see new records coming in and being sanitized just fine for those tables listed in the triggers.
So I have stopped replication and started to compress enwiki.

I wanted to sanitize this for T151756, but I realized the database hasn't been created yet because replication is stopped (at first I thought the script was broken).

Sorry about this, I stopped s3 replication yesterday at around 15:24 (UTC) to help a bit with the IO so the compression on s1 would run faster. I thought I had logged it in SAL, but apparently I did not.

I have started replication on both threads as s1 finished its compression today at around 4AM.

I think we are ready to sanitize s3 now, after dropping all the private/unused databases and tables. Shall I run the local redact_sanitarium.sh instead of the one we used to run from neodymium?

s3 caught up and fiwikivoyage is now present in db1095

mysql:root@localhost [(none)]> use fiwikivoyage;
Database changed
mysql:root@localhost [fiwikivoyage]>

Shall I run the local redact_sanitarium.sh instead of the one we used to run from neodymium?

It should be the same, except for the host changes needed to run it on localhost with just a db parameter.

Mentioned in SAL (#wikimedia-operations) [2016-12-01T07:24:36Z] <marostegui> Stop replication db1095 (sanitarium2) on s3 instance - T150802

I have started to sanitize s3 using the local script in a local screen called: sanitize_s3

The script has finished. It took around 1:35h to finish. I am going to check how it went and start replication if everything is sanitized.

Change 324685 had a related patch set uploaded (by Jcrespo):
mariadb: ignore '' in private data check, print results as we get them

https://gerrit.wikimedia.org/r/324685

Change 324685 merged by Jcrespo:
mariadb: ignore '' in private data check, print results as we get them

https://gerrit.wikimedia.org/r/324685

The data has been sanitized correctly and I have started replication in s3.
We have identified some issues with the check_private_data.py script and the binary vs varbinary columns, but other than that the sanitization went fine and the data was correctly nuked.
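
One likely explanation for the binary vs. varbinary difference, shown as a small illustration (an assumption about what the script tripped on, not a confirmed diagnosis): fixed-length BINARY columns are right-padded with 0x00, so a blanked value still compares as different from '':

CREATE TABLE pad_demo (b BINARY(4), vb VARBINARY(4));
INSERT INTO pad_demo VALUES ('', '');
-- b != '' evaluates to 1 (the 0x00 padding differs from the empty string),
-- vb != '' evaluates to 0, so a naive != '' check flags BINARY columns as "not empty".
SELECT b != '' AS binary_flagged, vb != '' AS varbinary_flagged FROM pad_demo;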

The server caught up and the data is being sanitized as it comes in, so it looks good!
I have started the compression on localhost, in a screen called compression.
It will compress the top 3 tables (revision, pagelinks, templatelinks) as we did here: T139055#2780291
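
For reference, the per-table compression amounts to statements like these (a sketch; the compressed row format and 8K key block size are assumptions, and require innodb_file_per_table with the Barracuda file format):

ALTER TABLE enwiki.revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
ALTER TABLE enwiki.pagelinks ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
ALTER TABLE enwiki.templatelinks ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;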

Compression is done.
s1 and s3 are now compressed (1.3T in total).

I am going to start the transfer to labsdb1010 and once it is done I will load the events for it.

Let's load the events and close this as resolved.

Yes, I just wanted to load the events once the first data copy is done, so I don't have to drop them on the labs servers and then load the labs-specific ones.

Mentioned in SAL (#wikimedia-operations) [2016-12-02T11:39:42Z] <marostegui> Stop MySQL db1095 for maintenance - T150802

The data has been copied over.
Events have been loaded.

I messed up the data a bit when setting up the slave, so I will transfer it again (it only takes about 1:30h). I prefer to start with a fresh copy rather than one that might be corrupted.

The data has been copied over and labsdb1010 is now catching up.