
Initial data tests for db1095 (temporary db1069 - sanitarium replacement)
Closed, Resolved (Public)

Description

As Jaime and I discussed over Hangouts, we should start testing all the steps described in the parent task, but with a small wiki, so we can restart them if we find something that breaks:

  • Import enwikivoyage - 68G
  • Run the sanitization script
  • Create the triggers
  • Audit the tables

Event Timeline

The enwikivoyage database has been imported.
I have added the following temporary flag to my.cnf: replicate-do-db=enwikivoyage (not added to puppet yet).
Replication is flowing fine for just that database.

Mentioned in SAL (#wikimedia-operations) [2016-11-21T12:17:26Z] <marostegui> Stopping MySQL db1095 - maintenance - T150960

replicate-do-db has been replaced with: replicate-wild-do-table = enwikivoyage.%

@jcrespo looking at my notes, I have that the script to sanitize is: redact_standard_output.sh
Before running it tomorrow I will ping you anyway so we are on the same page about what is going to be run, but I just wanted to make sure I was looking at the right one.

Yes, but there are private tables that have to be dropped (they are on the replication filters). Also, non-wiki databases have a different process.

Non-regular wikis? What does that refer to?

@Marostegui, is it replicating from the s3 master? We should think of a strategy to test ROW-based replication triggers (from dbstore1002? from another slave?).

Thanks - per the replication filters, these tables would need to be dropped from enwikivoyage then:

click_tracking
cu_changes
cu_log
echo_email_batch
echo_event
echo_notification
filejournal
hidden
job
log_search
msg_resource
objectcache
pr_index
querycache
querycache_info
querycachetwo
securepoll_cookie_match
securepoll_elections
securepoll_entity
securepoll_lists
securepoll_msgs
securepoll_options
securepoll_properties
securepoll_questions
securepoll_strike
securepoll_voters
securepoll_votes
spoofuser
text
titlekey
transcache
uploadstash
user_newtalk
watchlist

@Marostegui, is it replicating from the s3 master? We should think of a strategy to test ROW-based replication triggers (from dbstore1002? from another slave?).

Yes, it is now replicating from s3 codfw master.

I have enabled GTID on it right now, so maybe we can move it under another slave in codfw (so we do not need to mess with eqiad) with ROW based replication.
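For reference, a minimal sketch of what such a move could look like with MariaDB GTID (the codfw host FQDN here is my assumption for illustration; the exact commands used may differ):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'db2057.codfw.wmnet',   -- hypothetical codfw slave to hang db1095 from
  MASTER_USE_GTID = slave_pos;          -- let GTID pick up the correct position
START SLAVE;
SHOW SLAVE STATUS\G                     -- check Slave_IO_Running / Slave_SQL_Running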

Thanks - per the replication filters, these tables would need to be dropped from enwikivoyage then:

That is the part I do not think is 100% correct. Those tables have to be filtered, of course, but I think there could be others that should be filtered too (the point I have been repeating about a whitelist rather than a blacklist). A similar thing will happen with wikis. I would start with the ones that currently have views and then check their logical dependencies to create such a whitelist. @ArielGlenn (maybe Dan too) may help here, as I think he has been tasked to look into this at some point (I will comment at the next meeting).

I have enabled GTID on it right now, so maybe we can move it under another slave in codfw

While that could be possible (I think it could be too much overhead once we get more shards), it should only happen if we can guarantee the connection is TLS-encrypted.

jcrespo renamed this task from Initial data tests for db1095 to Initial data tests for db1095 (temporary db1069 - sanitarium replacement).Nov 21 2016, 2:24 PM

Thanks - per the replication filters, these tables would need to be dropped from enwikivoyage then:

That is the part I do not think is 100% correct. Those tables have to be filtered, of course, but I think there could be others that should be filtered too (the point I have been repeating about a whitelist rather than a blacklist). A similar thing will happen with wikis. I would start with the ones that currently have views and then check their logical dependencies to create such a whitelist. @ArielGlenn (maybe Dan too) may help here, as I think he has been tasked to look into this at some point (I will comment at the next meeting).

Understood, makes sense! Although for the initial test run maybe we can leave the list as it is to see if we find other obvious errors. Once we have filtered everything we can start digging a bit more and tune it nicely.

I have enabled GTID on it right now, so maybe we can move it under another slave in codfw

While that could be possible (I think it could be too much overhead once we get more shards), it should only happen if we can guarantee the connection is TLS-encrypted.

Yes, agreed. Keep in mind that I am treating this as a very first test to see what happens with the script (and as you pointed out, with row based replication and triggers).
The idea of doing the initial test with a codfw slave was basically to avoid changing stuff in eqiad right now.
(I just enabled SSL)
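For the record, a minimal sketch of flipping an existing replication connection to TLS (assuming the replication user already has a REQUIRE SSL grant and the CA is configured; not necessarily the exact commands used):

STOP SLAVE;
CHANGE MASTER TO MASTER_SSL = 1;
START SLAVE;
SHOW SLAVE STATUS\G   -- Master_SSL_Allowed should now report Yes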

I have changed db1095 to replicate from db2057, which has ROW-based binlogs:

MariaDB PRODUCTION s3 localhost (none) > show global variables like 'binlog_format';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+
1 row in set (0.00 sec)

Mentioned in SAL (#wikimedia-operations) [2016-11-22T06:54:06Z] <marostegui> Stopping Replication on db1095 for maintenance - T150960

Mentioned in SAL (#wikimedia-operations) [2016-11-22T06:57:42Z] <marostegui> Stopping Replication on db2057 for maintenance - T150960

Replication broke on db1095 last night (reminder: it was imported from dbstore2001):

2016-11-21 16:14:01 139805028624128 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table enwikivoyage.user_newtalk; Can't find record in 'user_newtalk', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db2057-bin.001239, end_log_pos 763383308, Gtid 0-171966669-3442783003, Internal MariaDB error code: 1032
2016-11-21 16:14:01 139805028624128 [Warning] Slave: Can't find record in 'user_newtalk' Error_code: 1032
2016-11-21 16:14:01 139805028624128 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2057-bin.001239' position 763383149; GTID position '0-171966669-3442783002'

So I have taken a snapshot from db2057, which is the server it will replicate from (with ROW-based replication), and imported it. db1095 is replicating again with ROW-based replication.

I have run the first iteration of the script.
By the way, I have created a new script that bypasses the safety measures of redact_standard_output.sh. Not a big deal, it is called: /home/jynus/software/redactatron/scripts/redact_standard_output_T150960.sh

So this is what I ran:

root@neodymium:/home/jynus/software/redactatron/scripts# ./redact_standard_output_T150960.sh db1095 3306 enwikivoyage | mysql -h db1095 -P 3306 enwikivoyage

The dry run complained about a few missing columns, but I assumed that was fine:

-- af_user_ip MISSING
-- mah_system_type MISSING
-- mah_user_agent MISSING
-- mah_locale MISSING
-- user_options MISSING

Triggers are now in place:

root@db1095:~# mysql -e "show triggers in enwikivoyage\G"
*************************** 1. row ***************************
             Trigger: abuse_filter_log_insert
               Event: INSERT
               Table: abuse_filter_log
           Statement: SET NEW.afl_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 2. row ***************************
             Trigger: abuse_filter_log_update
               Event: UPDATE
               Table: abuse_filter_log
           Statement: SET NEW.afl_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 3. row ***************************
             Trigger: archive_insert
               Event: INSERT
               Table: archive
           Statement: SET NEW.ar_text = '', NEW.ar_comment = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 4. row ***************************
             Trigger: archive_update
               Event: UPDATE
               Table: archive
           Statement: SET NEW.ar_text = '', NEW.ar_comment = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 5. row ***************************
             Trigger: recentchanges_insert
               Event: INSERT
               Table: recentchanges
           Statement: SET NEW.rc_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 6. row ***************************
             Trigger: recentchanges_update
               Event: UPDATE
               Table: recentchanges
           Statement: SET NEW.rc_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 7. row ***************************
             Trigger: revision_insert
               Event: INSERT
               Table: revision
           Statement: SET NEW.rev_text_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 8. row ***************************
             Trigger: revision_update
               Event: UPDATE
               Table: revision
           Statement: SET NEW.rev_text_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 9. row ***************************
             Trigger: user_insert
               Event: INSERT
               Table: user
           Statement: SET NEW.user_password = '', NEW.user_newpassword = '', NEW.user_email = '', NEW.user_token = '', NEW.user_email_authenticated = '', NEW.user_email_token = '', NEW.user_email_token_expires = '', NEW.user_newpass_time = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 10. row ***************************
             Trigger: user_update
               Event: UPDATE
               Table: user
           Statement: SET NEW.user_password = '', NEW.user_newpassword = '', NEW.user_email = '', NEW.user_token = '', NEW.user_email_authenticated = '', NEW.user_email_token = '', NEW.user_email_token_expires = '', NEW.user_newpass_time = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary

Key tables look sanitized in db1095:

mysql:root@localhost [enwikivoyage]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1095     |
+------------+
1 row in set (0.00 sec)
mysql:root@localhost [enwikivoyage]> SELECT * FROM recentchanges WHERE rc_ip != '';
Empty set (0.00 sec)

mysql:root@localhost [enwikivoyage]> SELECT * FROM user WHERE user_password != '';
Empty set (1.41 sec)

Whereas the same queries do return results on db1095's master (temporarily db2057):

MariaDB PRODUCTION s3 localhost enwikivoyage > select @@hostname;
+------------+
| @@hostname |
+------------+
| db2057     |
+------------+
1 row in set (0.00 sec)
MariaDB PRODUCTION s3 localhost enwikivoyage > SELECT count(*) FROM recentchanges WHERE rc_ip != '';
+----------+
| count(*) |
+----------+
|    27874 |
+----------+
1 row in set (0.05 sec)

MariaDB PRODUCTION s3 localhost enwikivoyage > SELECT count(*) FROM user WHERE user_password != '';
+----------+
| count(*) |
+----------+
|    18405 |
+----------+
1 row in set (1.10 sec)

I have checked the following list of table.column pairs, making sure that the column has no records or that the ID is set to what the log suggests:

abuse_filter_log.afl_ip - no records
archive.text - no records
archive.ar_comment - no records
archive.ar_text - no records
recentchanges.rc_ip - new records are getting inserted with the column not sanitized (the triggers are in place)
revision.rev_text_id - new records are getting inserted with a value different from 0
user.user_password - new records are getting inserted with the password hashes
user.user_email - new records are getting inserted with the emails
user.XX - in general, new records are arriving without the sanitization applied.

Looks like the triggers are in place, but not working.
It could be because of the ROW binlog format. I suggest we do the same procedure but with statement-based replication binlogs.

MariaDB [(none)]> SHOW GLOBAL VARIABLES like 'slave_run_triggers_for_rbr';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| slave_run_triggers_for_rbr | NO    |
+----------------------------+-------+
1 row in set (0.01 sec)

Also, please upload a redact_standard_output.sh which does not bypass the checks; only add db1095 as a valid sanitarium host.

MariaDB [(none)]> SHOW GLOBAL VARIABLES like 'slave_run_triggers_for_rbr';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| slave_run_triggers_for_rbr | NO    |
+----------------------------+-------+
1 row in set (0.01 sec)

Thanks

I have re-sanitized the DB to clean everything up, and now:

mysql:root@localhost [enwikivoyage]> SHOW GLOBAL VARIABLES like 'slave_run_triggers_for_rbr';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| slave_run_triggers_for_rbr | YES   |
+----------------------------+-------+
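For reference, a minimal sketch of how this variable can be enabled; the persistent change went through the sanitarium2.my.cnf.erb puppet patch mentioned below, so the my.cnf lines here are only illustrative:

-- At runtime, on the slave:
SET GLOBAL slave_run_triggers_for_rbr = YES;

# And in my.cnf so it survives restarts:
[mysqld]
slave_run_triggers_for_rbr = yes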

Also, please upload a redact_standard_output.sh which does not bypass the checks; only add db1095 as a valid sanitarium host.

Changed to only allow db1095.

and now
Changed to only allow db1095.

Please add me as reviewer of both changes.

Change 322855 had a related patch set uploaded (by Marostegui):
sanitarium2.my.cnf.erb: Enable triggers in RBR

https://gerrit.wikimedia.org/r/322855

Change 322855 merged by Marostegui:
sanitarium2.my.cnf.erb: Enable triggers in RBR

https://gerrit.wikimedia.org/r/322855

Thanks - per the replication filters, these tables would need to be dropped from enwikivoyage then:

click_tracking
cu_changes
cu_log
echo_email_batch
echo_event
echo_notification
filejournal
hidden
job
log_search
msg_resource
objectcache
pr_index
querycache
querycache_info
querycachetwo
securepoll_cookie_match
securepoll_elections
securepoll_entity
securepoll_lists
securepoll_msgs
securepoll_options
securepoll_properties
securepoll_questions
securepoll_strike
securepoll_voters
securepoll_votes
spoofuser
text
titlekey
transcache
uploadstash
user_newtalk
watchlist

I have renamed those tables in db1095 to see if replication would break for any reason.

So far the triggers are working fine; I see new records on recentchanges, for instance, and the IP column is being sanitized correctly.

I would say that if replication keeps going fine, maybe we should consider importing a full shard tomorrow? Ideas?

And as soon as I sent that:

2016-11-22 14:04:55 139805000005376 [ERROR] Slave SQL: Error executing row event: 'Table 'enwikivoyage.watchlist' doesn't exist', Gtid 0-171966669-3445690675, Internal MariaDB error code: 1146
2016-11-22 14:04:55 139805000005376 [Warning] Slave: Table 'enwikivoyage.watchlist' doesn't exist Error_code: 1146
2016-11-22 14:04:55 139805000005376 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2057-bin.001241' position 894975756; GTID position '0-171966669-3445690673'

In my.cnf:

replicate-wild-ignore-table = %.watchlist

I guess it is because of the explicit: Replicate_Wild_Do_Table: enwikivoyage.%
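To illustrate, this is roughly the filter combination in play (a sketch of the relevant my.cnf lines; per the table-level filtering rules linked further down, a matching wild-do rule wins):

replicate-wild-do-table     = enwikivoyage.%
replicate-wild-ignore-table = %.watchlist
# With row events, the wild-do match on enwikivoyage.% takes precedence over the ignore rule,
# so enwikivoyage.watchlist events are still applied.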

these tables would need to be dropped

I have some comments about some of the tables that I know about:

  • Some of those tables are empty (or should be empty) in production and are not used, like the job queue or the translation cache tables; as long as they are truncated we can replicate them
  • Others, like watchlist should be dropped, as they contain private information
  • We have to check the full wiki list (especially s3) as we will have to delete a lot of them for being private or for not being wikis at all
  • We could do a table comparison with the original labs to avoid mistakes
  • Centralauth is the most problematic db, and it has its own filtering method
  • We may be missing other things

And as soon as I sent that:

We need to study this:

http://dev.mysql.com/doc/refman/5.6/en/replication-rules-table-options.html

Yes, it is clear there: if there is a Replicate_Wild_Do_Table rule, it will execute the event and disregard the ignore rules.
However, we have replicate_wild_do_table only for testing, as db1095 only has one DB from s3. That is why I was suggesting that maybe we include a full shard (maybe s1 or s5?) and do the testing with the real my.cnf we would probably end up with.

The problem is that, as do rules take preference over ignore rules, and ROW-based and STATEMENT-based replication behave differently, it is not easy to write a whitelist unless we explicitly allow every valid combination of database and table (which is not viable).

Should we stick to STATEMENT for now (as that is what we have today) instead of exploring territory where we are not completely sure what we will run into?

That is why I was suggesting that maybe we include a full shard (maybe s1 or s5?)

Sure, as long as it is not s3. Of course, that has the danger of breaking the imported dataset and having to reimport it.

Should we stick to STATEMENT

No, STATEMENT will break things even worse, and it is almost the whole reason for the reimport. We can stick to a blacklist, which is what we have now, though.

Change 322903 had a related patch set uploaded (by Marostegui):
site.pp: db1052's binlog changed to ROW

https://gerrit.wikimedia.org/r/322903

Change 322906 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1052 for maintenance

https://gerrit.wikimedia.org/r/322906

Change 322903 merged by Marostegui:
site.pp: db1052's binlog changed to ROW

https://gerrit.wikimedia.org/r/322903

Change 322906 merged by jenkins-bot:
db-eqiad.php: Depool db1052 for maintenance

https://gerrit.wikimedia.org/r/322906

Mentioned in SAL (#wikimedia-operations) [2016-11-22T16:11:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1052 - T150960 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2016-11-23T07:14:36Z] <marostegui> Stopping replication on db1052 (depooled) for maintenance - T150960

Mentioned in SAL (#wikimedia-operations) [2016-11-23T07:29:25Z] <marostegui> Stopping replication on db1095 (depooled) for maintenance - T150960

As we discussed yesterday, I am using db1052 (which was depooled yesterday) to import s1's tablespaces into db1095 right now.

Before changing the binlog format to ROW, I checked just in case there was a slave connected to it, and I was surprised to see that dbstore1001 is a slave of db1052.
@jcrespo is this expected?

I am not touching the binlog format until this is clarified (it is MIXED right now).
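For reference, this is roughly how a connected downstream replica can be spotted before touching the binlog format (a sketch, not necessarily the exact check I ran):

SHOW SLAVE HOSTS;   -- only lists replicas that set report_host
SELECT id, user, host, command
  FROM information_schema.processlist
 WHERE command LIKE 'Binlog Dump%';   -- replication client threads feeding other slaves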

dbstore1001 doesn't use GTID, and it is a delayed slave that starts replication automatically, so it is not simple to migrate it from the old master (db1052) to the new one.

I will move the remaining dbstore1001 replication channels to the right master, hopefully without breaking anything on the way.

@ArielGlenn don't add yourself to this ticket, as it will be closed soon. Check T150802 instead.

db1052 and the others should be clear to be used.

All done. db1052 is now the master of db1095, which is replicating with ROW-based replication and is catching up with db1052 (which is itself still catching up with the master).
Once they are up to date, I will run the sanitization script.

There were some issues when importing three tables:

categorylinks
objectcache
__wmf_checksums

All related to the timestamp columns.

Schema mismatch (Column cl_timestamp precise type mismatch.)

Looks like only those tables are using the old timestamp format and were never rebuilt:

MariaDB PRODUCTION s1 localhost enwiki > select t.table_schema, t.engine, t.table_name, c.column_name, c.column_type
    from information_schema.tables t
    inner join information_schema.columns c
        on c.table_schema = t.table_schema and c.table_name = t.table_name
    left outer join information_schema.innodb_sys_tables ist
        on ist.name = concat(t.table_schema, '/', t.table_name)
    left outer join information_schema.innodb_sys_columns isc
        on isc.table_id = ist.table_id and isc.name = c.column_name
    where c.column_type in ('time', 'timestamp', 'datetime')
      and t.table_schema not in ('mysql', 'information_schema', 'performance_schema', 'ops', 'sys')
      and t.table_type = 'base table'
      and (t.engine != 'innodb' or (t.engine = 'innodb' and isc.mtype = 6))
    order by t.table_schema, t.table_name, c.column_name;
+--------------+--------+-----------------+--------------+-------------+
| table_schema | engine | table_name      | column_name  | column_type |
+--------------+--------+-----------------+--------------+-------------+
| enwiki       | InnoDB | categorylinks   | cl_timestamp | timestamp   |
| enwiki       | InnoDB | objectcache     | exptime      | datetime    |
| enwiki       | InnoDB | __wmf_checksums | ts           | timestamp   |
+--------------+--------+-----------------+--------------+-------------+
3 rows in set (0.09 sec)

I faced this issue in previous jobs and it was fixed by rebuilding the table. However, I have not done that here because the master (db1057) has the same issue, and db1052 had dbstore1001 hanging off it (when I was doing the import, before Jaime moved dbstore).
As a temporary workaround I have dumped those tables and imported them the normal way.
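For reference, the rebuild itself would just be a no-op ALTER on the affected table, which rewrites it with the new temporal-column format (a sketch; on a table like categorylinks this would need a depooled host because of its size):

ALTER TABLE categorylinks ENGINE=InnoDB;
-- ALTER TABLE categorylinks FORCE; would achieve the same rebuild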

db1095 now has parallel replication disabled.
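A minimal sketch of how that is done at runtime, assuming the MariaDB slave_parallel_threads setting is what is being toggled here:

STOP SLAVE;
SET GLOBAL slave_parallel_threads = 0;   -- 0 = single-threaded apply; changing it requires the slave to be stopped
START SLAVE;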

We also need to bear in mind that we are importing tables from MariaDB 10.0 into 10.1.

__wmf_checksums can be dropped at any time; it is the table I use for running pt-table-checksum and it is not used by MediaWiki. I do not know what objectcache is; it is probably unused. categorylinks probably has to be rebuilt. We can run mysql_upgrade --force if that helps with anything.

Mentioned in SAL (#wikimedia-operations) [2016-11-23T14:31:51Z] <marostegui> Stopping replication db1095 (not pooled) - maintenance - T150960

__wmf_checksums can be dropped at any time; it is the table I use for running pt-table-checksum and it is not used by MediaWiki. I do not know what objectcache is; it is probably unused. categorylinks probably has to be rebuilt. We can run mysql_upgrade --force if that helps with anything.

Thanks for the context!
My main fear was: what happens if the master has the old format and the slave has the new one? It is easy to try; if we rebuild the table on db1095 we will have
db1052 (master, old format) -> db1095 (slave, new format).
I will rebuild the __wmf_checksums table and force an insert on db1052; if that table breaks replication we do not care and we can skip it.

I will try that once the sanitization script has finished (I just started it). We will see how it goes!

The script finished and now db1095 is catching up with the master (replication was stopped while the script ran).
I will audit the tables.

The following tables have been renamed, as they are in the list of ignored tables (a RENAME TABLE sketch follows the list):

root@db1095:~# mysql -A enwiki -e "show tables like 'TO_DROP%';" -B
Tables_in_enwiki (TO_DROP%)
TO_DROP_arbcom1_vote
TO_DROP_blob_orphans
TO_DROP_blob_tracking
TO_DROP_bv2009_edits
TO_DROP_click_tracking
TO_DROP_cu_changes
TO_DROP_cu_log
TO_DROP_cur
TO_DROP_edit_page_tracking
TO_DROP_exarchive
TO_DROP_exrevision
TO_DROP_filejournal
TO_DROP_hidden
TO_DROP_job
TO_DROP_linkscc
TO_DROP_log_search
TO_DROP_long_run_profiling
TO_DROP_moodbar_feedback
TO_DROP_moodbar_feedback_response
TO_DROP_msg_resource
TO_DROP_objectcache
TO_DROP_old_growth
TO_DROP_pr_index
TO_DROP_prefstats
TO_DROP_prefswitch_survey
TO_DROP_profiling
TO_DROP_querycache
TO_DROP_querycache_info
TO_DROP_querycachetwo
TO_DROP_securepoll_cookie_match
TO_DROP_securepoll_elections
TO_DROP_securepoll_entity
TO_DROP_securepoll_lists
TO_DROP_securepoll_msgs
TO_DROP_securepoll_options
TO_DROP_securepoll_properties
TO_DROP_securepoll_questions
TO_DROP_securepoll_strike
TO_DROP_securepoll_voters
TO_DROP_securepoll_votes
TO_DROP_spoofuser
TO_DROP_text
TO_DROP_titlekey
TO_DROP_transcache
TO_DROP_uploadstash
TO_DROP_user_newtalk
TO_DROP_vote_log
TO_DROP_watchlist
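A minimal sketch of the rename for one table (the real run covered the whole list above); keeping the data under a TO_DROP_ prefix means nothing is lost if a table turns out to be needed:

RENAME TABLE watchlist TO TO_DROP_watchlist;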

Replication is looking good.

At first glance, it looks like the triggers are also working fine:

mysql:root@localhost [enwiki]> SELECT * FROM user WHERE user_password != '';
Empty set (19.68 sec)

mysql:root@localhost [enwiki]> SELECT * FROM recentchanges WHERE rc_ip != '';
Empty set (0.00 sec)

I just tested what happens with a master with the old timestamp format and a slave with the new one.

So the __wmf_checksums table on db1052 has the old format and the same table on db1095 has the new one:

MariaDB PRODUCTION s1 localhost enwiki > select @@hostname;
+------------+
| @@hostname |
+------------+
| db1052     |
+------------+
1 row in set (0.00 sec)
MariaDB PRODUCTION s1 localhost enwiki > select t.table_schema, t.engine, t.table_name, c.column_name, c.column_type
    from information_schema.tables t
    inner join information_schema.columns c
        on c.table_schema = t.table_schema and c.table_name = t.table_name
    left outer join information_schema.innodb_sys_tables ist
        on ist.name = concat(t.table_schema, '/', t.table_name)
    left outer join information_schema.innodb_sys_columns isc
        on isc.table_id = ist.table_id and isc.name = c.column_name
    where c.column_type in ('time', 'timestamp', 'datetime')
      and t.table_schema not in ('mysql', 'information_schema', 'performance_schema', 'ops', 'sys')
      and t.table_type = 'base table'
      and (t.engine != 'innodb' or (t.engine = 'innodb' and isc.mtype = 6))
    order by t.table_schema, t.table_name, c.column_name;
+--------------+--------+-----------------+--------------+-------------+
| table_schema | engine | table_name      | column_name  | column_type |
+--------------+--------+-----------------+--------------+-------------+
| enwiki       | InnoDB | categorylinks   | cl_timestamp | timestamp   |
| enwiki       | InnoDB | objectcache     | exptime      | datetime    |
| enwiki       | InnoDB | __wmf_checksums | ts           | timestamp   |
+--------------+--------+-----------------+--------------+-------------+

mysql:root@localhost [(none)]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1095     |
+------------+
1 row in set (0.00 sec)

mysql:root@localhost [(none)]> select t.table_schema, t.engine, t.table_name, c.column_name, c.column_type
    from information_schema.tables t
    inner join information_schema.columns c
        on c.table_schema = t.table_schema and c.table_name = t.table_name
    left outer join information_schema.innodb_sys_tables ist
        on ist.name = concat(t.table_schema, '/', t.table_name)
    left outer join information_schema.innodb_sys_columns isc
        on isc.table_id = ist.table_id and isc.name = c.column_name
    where c.column_type in ('time', 'timestamp', 'datetime')
      and t.table_schema not in ('mysql', 'information_schema', 'performance_schema', 'ops', 'sys')
      and t.table_type = 'base table'
      and (t.engine != 'innodb' or (t.engine = 'innodb' and isc.mtype = 6))
    order by t.table_schema, t.table_name, c.column_name;
Empty set (0.06 sec)

Doing an insert on the master with:

insert into __wmf_checksums  (db,tbl,chunk,chunk_time,chunk_index,lower_boundary,upper_boundary,this_crc,this_cnt,master_crc,master_cnt,ts) values   ('test','test2',3,4.5,'test1','8safas','asjfsao9','test',1,'test',1,NULL)

I put NULL because:

`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

It worked fine:

MariaDB PRODUCTION s1 localhost enwiki > select @@hostname;
+------------+
| @@hostname |
+------------+
| db1052     |
+------------+
1 row in set (0.00 sec)

MariaDB PRODUCTION s1 localhost enwiki > select * from __wmf_checksums order by ts desc limit 1\G
*************************** 1. row ***************************
            db: test
           tbl: test2
         chunk: 3
    chunk_time: 4.5
   chunk_index: test1
lower_boundary: 8safas
upper_boundary: asjfsao9
      this_crc: test
      this_cnt: 1
    master_crc: test
    master_cnt: 1
            ts: 2016-11-24 08:01:39
1 row in set (0.00 sec)


select @@hostname;
+------------+
| @@hostname |
+------------+
| db1095     |
+------------+
1 row in set (0.00 sec)

mysql:root@localhost [enwiki]> select * from __wmf_checksums order by ts desc limit 1\G
*************************** 1. row ***************************
            db: test
           tbl: test2
         chunk: 3
    chunk_time: 4.5
   chunk_index: test1
lower_boundary: 8safas
upper_boundary: asjfsao9
      this_crc: test
      this_cnt: 1
    master_crc: test
    master_cnt: 1
            ts: 2016-11-24 08:01:39
1 row in set (0.00 sec)

So rebuilding the table (before the import) looks like the solution for this issue.

I have audited all the tables and columns appearing in the triggers below and they are all working fine. There is no private data there, since I restarted replication after running the script yesterday evening:

mysql:root@localhost [enwiki]> show triggers\G
*************************** 1. row ***************************
             Trigger: abuse_filter_log_insert
               Event: INSERT
               Table: abuse_filter_log
           Statement: SET NEW.afl_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 2. row ***************************
             Trigger: abuse_filter_log_update
               Event: UPDATE
               Table: abuse_filter_log
           Statement: SET NEW.afl_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 3. row ***************************
             Trigger: archive_insert
               Event: INSERT
               Table: archive
           Statement: SET NEW.ar_text = '', NEW.ar_comment = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 4. row ***************************
             Trigger: archive_update
               Event: UPDATE
               Table: archive
           Statement: SET NEW.ar_text = '', NEW.ar_comment = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 5. row ***************************
             Trigger: mark_as_helpful_insert
               Event: INSERT
               Table: mark_as_helpful
           Statement: SET NEW.mah_system_type = '', NEW.mah_user_agent = '', NEW.mah_locale = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 6. row ***************************
             Trigger: mark_as_helpful_update
               Event: UPDATE
               Table: mark_as_helpful
           Statement: SET NEW.mah_system_type = '', NEW.mah_user_agent = '', NEW.mah_locale = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 7. row ***************************
             Trigger: recentchanges_insert
               Event: INSERT
               Table: recentchanges
           Statement: SET NEW.rc_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 8. row ***************************
             Trigger: recentchanges_update
               Event: UPDATE
               Table: recentchanges
           Statement: SET NEW.rc_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 9. row ***************************
             Trigger: revision_insert
               Event: INSERT
               Table: revision
           Statement: SET NEW.rev_text_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 10. row ***************************
             Trigger: revision_update
               Event: UPDATE
               Table: revision
           Statement: SET NEW.rev_text_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 11. row ***************************
             Trigger: user_insert
               Event: INSERT
               Table: user
           Statement: SET NEW.user_password = '', NEW.user_newpassword = '', NEW.user_email = '', NEW.user_options = '', NEW.user_token = '', NEW.user_email_authenticated = '', NEW.user_email_token = '', NEW.user_email_token_expires = '', NEW.user_newpass_time = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 12. row ***************************
             Trigger: user_update
               Event: UPDATE
               Table: user
           Statement: SET NEW.user_password = '', NEW.user_newpassword = '', NEW.user_email = '', NEW.user_options = '', NEW.user_token = '', NEW.user_email_authenticated = '', NEW.user_email_token = '', NEW.user_email_token_expires = '', NEW.user_newpass_time = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
12 rows in set (0.00 sec)

db1095 is ignoring 'heartbeat.%', and we need that for replication lag control to work (it may need manual importing, though).

Yes, you are right. It was ignored in the previous tests because ROW-based replication was crashing, as heartbeat had not been imported (only enwiki was imported).
I can import it in a bit and restore the normal replication filters.
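A minimal sketch of what that import could look like (host FQDNs and mysqldump flags are my assumptions, not necessarily what was run):

# Dump heartbeat from the master and load it into db1095
mysqldump -h db1052.eqiad.wmnet --single-transaction --databases heartbeat | mysql -h db1095.eqiad.wmnet
# Then drop the heartbeat.% entry from the replication ignore filters and restart replication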

Mentioned in SAL (#wikimedia-operations) [2016-11-24T09:38:14Z] <marostegui> Stopping replication db1052 (depooled) for maintenance - T150960

heartbeat is no longer being ignored.

Why was it being ignored, puppet? If so, we need to change it there.

Sorry, I didn't read the previous comment. Ignore.

No, it was me, manually, after importing s1.

So far so good; replication is going fine and I can see new records being inserted and sanitized correctly. Two examples:

mysql:root@localhost [enwiki]> select rc_timestamp,rc_ip,rc_title from recentchanges order by rc_timestamp desc limit 3;
+----------------+-------+-----------------------------+
| rc_timestamp   | rc_ip | rc_title                    |
+----------------+-------+-----------------------------+
| 20161124135715 |       | removed                     |
| 20161124135715 |       | removed                     |
| 20161124135714 |       | removed                     |
+----------------+-------+-----------------------------+
3 rows in set (0.00 sec)


mysql:root@localhost [enwiki]> select rev_timestamp,rev_page,rev_text_id from revision order by rev_timestamp desc limit 3;
+----------------+----------+-------------+
| rev_timestamp  | rev_page | rev_text_id |
+----------------+----------+-------------+
| 20161124135904 | removed   |          0 |
| 20161124135904 | removed   |          0 |
| 20161124135903 | removed   |          0 |
+----------------+----------+-------------+
3 rows in set (0.00 sec)

I believe you, I am just unsure why it works. BTW, I've seen replication working on the main "channel" (no channel name). We should test it with multi-source (named channel).

I would like to leave it like this for some time and then do a more comprehensive pt-table-checksum. If master and slave are depooled/out of production, that should be relatively safe to do. Do you think that is OK/needed before importing all the shards?

I believe you, I am just unsure why it works. BTW, I've seen replication working on the main "channel" (no channel name). We should test it with multi-source (named channel).

Yes, this is because of this: T150960#2814221
There are two options here:

  1. For testing, leave those tables there so replication keeps working
  2. Test normal multi-source replication on Monday, when I will import S3 and S1 (I would be surprised if we had issues when using named channels)

I would like to leave it like this for some time and then do a more comprehensive pt-table-checksum. If master and slave are depooled/out of production, that should be relatively safe to do. Do you think that is OK/needed before importing all the shards?

We can do that, but how can we compare them properly if the content on db1052 and db1095 is different (because of the sanitization)?

We can do that, but how can we compare them properly if the content on db1052 and db1095 is different (because of the sanitization)?

We can check the public tables as normal; for the filtered tables, it has the option to checksum only some columns. Then, with another check, we verify that the sanitized columns are all empty.
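A minimal sketch of what that could look like with pt-table-checksum (the table and column names here are just one example, and the exact flags for this setup would need checking):

pt-table-checksum h=db1052.eqiad.wmnet \
  --tables enwiki.recentchanges \
  --columns rc_id,rc_timestamp,rc_title \
  --no-check-binlog-format
# And separately, on db1095, verify the sanitized column is empty:
# SELECT count(*) FROM recentchanges WHERE rc_ip != '';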

This is still looking good, so on Monday I will start copying S3 and importing S1.

jcrespo assigned this task to Marostegui.

I would say the testing is done; let's keep tracking the rest of the tasks on T150802.