
Provision db1095 with at least 1 shard, sanitize and test slave-side triggers
Closed, Resolved (Public)

Description

db1095 will be a temporary sanitarium while we set up the new db1095 -> [labsdb1009, labsdb1010, labsdb1011] replication.

  • Probably 4 production shards can be done easily if copied from dbstore2001 (but it is not essential)
  • Run the sanitization script
  • Create the triggers
  • Enable slave-only triggers activated on row-based replication and test that they work as advertised in STATEMENT, MIXED and ROW binlog formats (see the sketch after this list)
  • Audit the tables
  • Finally, copy the result data to labsdb1009/10/11 and drop there the triggers
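
As a reference for the trigger work, this is a minimal sketch of what one sanitizing trigger could look like. The table and column (abuse_filter_log.afl_ip) match the redact_sanitarium.sh output quoted later in this task; the trigger body and the reliance on MariaDB's slave_run_triggers_for_rbr setting are assumptions, not the exact generated DDL.

-- Assumption: slave_run_triggers_for_rbr=YES so the trigger also fires for
-- row-based replication events applied by the slave threads.
DROP TRIGGER IF EXISTS testwiki.abuse_filter_log_insert;
CREATE TRIGGER testwiki.abuse_filter_log_insert
  BEFORE INSERT ON testwiki.abuse_filter_log
  FOR EACH ROW SET NEW.afl_ip = '';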

Event Timeline

Change 321735 had a related patch set uploaded (by Jcrespo):
Add new labs (sanitarium) host db1095

https://gerrit.wikimedia.org/r/321735

Change 321735 merged by Jcrespo:
prometheus-mysql-exporter: Add new labs (sanitarium) host db1095

https://gerrit.wikimedia.org/r/321735

The tests on T150960 are looking good, so we need to discuss the next steps. This is what we have discussed on IRC as next steps for db1095:

  • Import S3 via normal transfer (netcat for instance) to db1095
  • Import S1 by moving its tablespace (see the sketch after this list).
  • Sanitize both shards
  • Compress both shards
  • Transfer that via nc to the new labs servers
  • Keep importing shards (no need to do it on this first run)
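
A rough sketch of the tablespace move for S1, assuming InnoDB file-per-table and a depooled, quiesced source; the table name (revision) is only illustrative:

-- On the source host: quiesce the table and export its tablespace metadata.
FLUSH TABLES enwiki.revision FOR EXPORT;
-- (copy revision.ibd and revision.cfg to db1095, then release the lock)
UNLOCK TABLES;
-- On db1095, with an identical empty table definition already created:
ALTER TABLE enwiki.revision DISCARD TABLESPACE;
-- (move the copied .ibd/.cfg files into place in the datadir, then)
ALTER TABLE enwiki.revision IMPORT TABLESPACE;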

Looks good. s3 will need lots of checking. Maybe I could work on a heuristic alert for private data?

That would be a massive win!!!

Candidate masters for db1095:

S3:
Depooled servers:
db1035
db1044

S1:
We can stick to db1052, which needs the following 3 tables rebuilt if possible, as they have old timestamp formats: categorylinks, objectcache, __wmf_checksums
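
If those three tables do get rebuilt, a null ALTER should be enough; a sketch, assuming rebuilding in place on the depooled host is acceptable:

-- ALTER TABLE ... FORCE performs a no-op rebuild that rewrites every row,
-- which also normalizes old on-disk formats.
ALTER TABLE enwiki.categorylinks FORCE;
ALTER TABLE enwiki.objectcache FORCE;
ALTER TABLE enwiki.__wmf_checksums FORCE;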

Change 323504 had a related patch set uploaded (by Marostegui):
site.pp: db1044's binlog changed to ROW

https://gerrit.wikimedia.org/r/323504

Change 323505 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Added comment for db1044

https://gerrit.wikimedia.org/r/323505

Change 323525 had a related patch set uploaded (by Jcrespo):
[WIP] Create script to check that sanitarium filtering is working

https://gerrit.wikimedia.org/r/323525

Change 323504 merged by Marostegui:
site.pp: Change db1044 binlog to ROW

https://gerrit.wikimedia.org/r/323504

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:03:58Z] <marostegui> Stop MySQL on db1044 - (depooled) maintenance - T150802

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:08:04Z] <marostegui> Stop MySQL on db1095 - maintenance - T150802

The transfer of s3 has started, from db1044 to db1095.

Change 323505 merged by jenkins-bot:
db-eqiad.php: Added comment for db1044

https://gerrit.wikimedia.org/r/323505

Mentioned in SAL (#wikimedia-operations) [2016-11-28T07:18:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Added comments to db1044 status - T150802 (duration: 00m 45s)

  • data transferred
  • ran mysql_upgrade on it, all good
  • replication is now flowing, using ROW-based replication

pending: enable pt-heartbeat on db1044

I am going to import s1 (enwiki) now.
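
For the record, adding the s1 thread means a second replication connection on db1095, roughly like this (a sketch using MariaDB multi-source syntax; the coordinates and credentials are placeholders):

CHANGE MASTER 's1' TO
  MASTER_HOST='db1052.eqiad.wmnet',
  MASTER_USER='repl',
  MASTER_PASSWORD='********',
  MASTER_LOG_FILE='db1052-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G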

Mentioned in SAL (#wikimedia-operations) [2016-11-28T12:35:26Z] <marostegui> Stop replication db1052 (depooled) - maintenance - T150802

S3 and S1 are now replicating in db1095.
There were some issues with the heartbeat table when replicating ROW-based, which @jcrespo fixed (maybe we can post here what happened and how it was fixed, just for the record?).

I will leave this replicating overnight; tomorrow, early in the morning, if everything worked fine during the night, I will start the sanitization script for enwiki only.

Basically, the heartbeat table is shared between shards, so the replicated REPLACE command failed twice, once per DC master and shard, as there was no previous row. heartbeat is a complex replication setup because it is the only table updated by several DCs at the same time (which makes it easy to control replication from several shards with a single SELECT). heartbeat normally forces statement-based replication so that it overwrites whatever is there before, but that did not work in this case because the row-based replication comes from a second-tier slave (db1044 converted the statement into a row update). If this were a direct slave of the master, this issue wouldn't happen.

To avoid problems next time, there are 2 ways:

  • copy the 2 shard rows consistently with the backup, if such a thing is possible

or (especially if the problem has already been created):

  • Just insert 2 random rows with the right server_ids (the ones from each master on each dc)
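
A sketch of the second option, assuming the usual pt-heartbeat table layout and placeholder server_id values:

-- Seed one row per master so the replicated row-format updates find a row to overwrite.
-- 1 and 2 are placeholders: use the real server_id of the eqiad and codfw masters.
INSERT INTO heartbeat.heartbeat (ts, server_id)
VALUES (NOW(6), 1), (NOW(6), 2);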

Mentioned in SAL (#wikimedia-operations) [2016-11-29T06:41:13Z] <marostegui> Stopping replication db1095 - s1 instance for maintenance - T150802

@jcrespo thanks for the detailed explanation! :)

The replication for both threads had no issues during the night so what I have done:

  • stopped replication on s1 channel
  • from a screen called 23347.sanitize_db1095 on neodymium I am running the sanitize script on the enwiki database only.

Change 324040 had a related patch set uploaded (by Jcrespo):
[WIP] labsdb: Move repo from views to common; add private data check

https://gerrit.wikimedia.org/r/324040

Change 323525 merged by Jcrespo:
Create script to check that sanitarium filtering is working

https://gerrit.wikimedia.org/r/323525

I've done a quick restart to update replication filters for T151752. I checked no background operation was running, in particular the above screen, or any other alter table. I've left everything how I found it: s1 stopped (all threads), s3 running.

Change 324040 merged by Jcrespo:
labsdb: Move repo from views to common; add private data check

https://gerrit.wikimedia.org/r/324040

Change 324208 had a related patch set uploaded (by Jcrespo):
Migrate redact_sanitarium.sh script from software to puppet

https://gerrit.wikimedia.org/r/324208

Change 324208 merged by Jcrespo:
Migrate redact_sanitarium.sh script from software to puppet

https://gerrit.wikimedia.org/r/324208

This now "just works" (TM)- no need for passwords, hosts or anything:

db1095:~$ redact_sanitarium.sh testwiki
-- abuse_filter_log
-- afl_ip found
DROP TRIGGER IF EXISTS testwiki.abuse_filter_log_insert;
...

This now "just works" (TM)- no need for passwords, hosts or anything:

db1095:~$ redact_sanitarium.sh testwiki
-- abuse_filter_log
-- afl_ip found
DROP TRIGGER IF EXISTS testwiki.abuse_filter_log_insert;
...

Nice one!!!
We will test it with s3, which is still pending sanitization, so the script will have some databases to take care of :-)

Note that the wiki lists are already available locally, too. To get a list of public wikis (private.dblist is given twice so that uniq -u, which only keeps lines appearing exactly once, drops every private wiki):

db1095:~$ cd /usr/local/lib/mediawiki-config/dblists/
db1095:/usr/local/lib/mediawiki-config/dblists$ sort all.dblist private.dblist private.dblist | uniq -u

The following queries on enwiki returned no rows, so the first iteration of the sanitization went fine:

select * from abuse_filter_log where afl_ip !='';
select * from archive where ar_text !='';
select * from archive where ar_comment !='';
select * from mark_as_helpful where mah_system_type !='';
select * from mark_as_helpful where mah_user_agent !='';
select * from mark_as_helpful where mah_locale !='';
select * from recentchanges where rc_ip !='';
select * from revision where rev_text_id != 0;
select * from user where user_password !='';
select * from user where user_newpassword !='';
select * from user where user_email !='';
select * from user where user_options !='';
select * from user where user_token !='';
select * from user where user_email_authenticated !='';
select * from user where user_email_token !='';
select * from user where user_email_token_expires !='';
select * from user where user_newpass_time !='';
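
The same audit can be condensed into per-column counts, so a clean run is just a column of zeros (a hedged variant of the queries above, not what was actually run):

SELECT 'recentchanges.rc_ip' AS col, COUNT(*) AS leaked FROM recentchanges WHERE rc_ip != ''
UNION ALL SELECT 'user.user_email', COUNT(*) FROM user WHERE user_email != ''
UNION ALL SELECT 'user.user_token', COUNT(*) FROM user WHERE user_token != '';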

I have started replication on s1, and in a few hours I will run the queries again to check that the triggers are working properly.

Mentioned in SAL (#wikimedia-operations) [2016-11-29T18:02:03Z] <marostegui> Stopping replication db1095 (new sanitarium, not in use) on s1 instance for maintenance - T150802

The server caught up and I can see new records coming in and being sanitized just fine for those tables listed in the triggers.
So I have stopped replication and started to compress enwiki.

I wanted to sanitize this for T151756, but I realized the database hasn't been created yet because replication is stopped (at first I thought the script was broken).

Sorry about this, I stopped s3 replication yesterday at around 15:24 (UTC) to help a bit with the IO so the compression on s1 would run faster. I thought I had logged it in SAL, but apparently I did not.

I have started replication on both threads as s1 finished its compression today at around 4AM.

I think we are ready to sanitize s3 now, after dropping all the private/unused databases and tables. Shall I run the local redact_sanitarium.sh instead of the one we used to run from neodymium?

s3 caught up and fiwikivoyage is now present in db1095

mysql:root@localhost [(none)]> use fiwikivoyage;
Database changed
mysql:root@localhost [fiwikivoyage]>

Shall I run the local redact_sanitarium.sh instead of the one we used to run from neodymium?

It should be the same, except for the host changes needed to run it on localhost with just a db parameter.

Mentioned in SAL (#wikimedia-operations) [2016-12-01T07:24:36Z] <marostegui> Stop replication db1095 (sanitarium2) on s3 instance - T150802

I have started to sanitize s3 using the local script in a local screen called: sanitize_s3

The script has finished. It took around 1:35h to finish. I am going to check how it went and start replication if everything is sanitized.

Change 324685 had a related patch set uploaded (by Jcrespo):
mariadb: ignore '' in private data check, print results as we get them

https://gerrit.wikimedia.org/r/324685

Change 324685 merged by Jcrespo:
mariadb: ignore '' in private data check, print results as we get them

https://gerrit.wikimedia.org/r/324685

The data has been sanitized correctly and I have started replication in s3.
We have identified some issues with the check_private_data.py script and the binary vs varbinary columns, but other than that the sanitization went fine and the data was correctly nuked.
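
One likely explanation for the binary vs. varbinary difference, shown as a small illustration (an assumption about what the script tripped on, not a confirmed diagnosis): fixed-length BINARY columns are right-padded with 0x00, so a blanked value still compares as different from '':

CREATE TABLE pad_demo (b BINARY(4), vb VARBINARY(4));
INSERT INTO pad_demo VALUES ('', '');
-- b != '' evaluates to 1 (the 0x00 padding differs from the empty string),
-- vb != '' evaluates to 0, so a naive != '' check flags BINARY columns as "not empty".
SELECT b != '' AS binary_flagged, vb != '' AS varbinary_flagged FROM pad_demo;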

The server caught up and the data is being sanitized as it comes in, so it looks good!
I have started the compression on localhost, in a screen called compression.
It will compress the top 3 tables (revision, pagelinks, templatelinks) as we did here: T139055#2780291
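
For reference, the per-table compression amounts to statements like these (a sketch; the compressed row format and 8K key block size are assumptions, and require innodb_file_per_table with the Barracuda file format):

ALTER TABLE enwiki.revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
ALTER TABLE enwiki.pagelinks ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
ALTER TABLE enwiki.templatelinks ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;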

Compression is done.
s1 and s3 are now compressed (1.3T in total).

I am going to start the transfer to labsdb1010 and once it is done I will load the events for it.

Let's load the events and close this as resolved.

Yes, I just wanted to load the events once the first data copy is done, so I don't have to drop them on the labs servers and then load the labs-specific ones.

Mentioned in SAL (#wikimedia-operations) [2016-12-02T11:39:42Z] <marostegui> Stop MySQL db1095 for maintenance - T150802

The data has been copied over.
Events have been loaded.

I messed up the data a bit when setting up the slave, so I will transfer it again (it only takes about 1:30h). I prefer to start with a fresh copy rather than one that might be corrupted.

The data has been copied over and labsdb1010 is now catching up.