
Data drifts between superset_production on an-coord1001 and db1108
Open, High, Public

Description

Last Friday I worked a bit on db1108's meta replication, since for some reason it was broken after Razzi's upgrade of Superset. The interesting thing was that the replica on an-coord1002 was ok, and only the one on db1108 broke.

The failures were related to statements like "Cannot add column X to table Y because X is already present". I checked the failing table on db1108's superset_production and it was empty, so I had to disable the binlog and drop the column manually to allow the replication to restart (I assumed they were new tables and that the upgrade/rollback/upgrade actions somehow caused inconsistency).
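For reference, the manual fix on db1108 was along these lines (a sketch with placeholder table/column names, not the exact statements that were run):

SET SESSION sql_log_bin = 0;  -- don't write the manual fix to db1108's own binlog
ALTER TABLE superset_production.some_table DROP COLUMN some_column;  -- placeholder names
SET SESSION sql_log_bin = 1;
START SLAVE;
SHOW SLAVE STATUS\G  -- check that Slave_SQL_Running is back to Yes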

Today I checked the row counts across tables on an-coord1001, 1002 and 1108; they are not the same:

1001:

MariaDB [(none)]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          499940 |
+-----------------+
1 row in set (0.00 sec)

1002:

MariaDB [(none)]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          495349 |
+-----------------+
1 row in set (0.001 sec)

1108:

MariaDB [(none)]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          376950 |
+-----------------+
1 row in set (0.002 sec)

The main difference seems to be in the log table's entries; I still have to check what those are.
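To pin down which tables differ, a per-table breakdown of the same query should help (keeping in mind that TABLE_ROWS is only an estimate for InnoDB, so exact counts would still need a SELECT COUNT(*) per table):

SELECT TABLE_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'superset_production'
ORDER BY TABLE_NAME;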

On db1108 the binlog_format is MIXED, while on 1001/1002 it is ROW. I suspect that db1108 drifted over time for some reason, and that we have now ended up with problems while migrating/upgrading Superset.
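For the record, the setting can be verified on each host with:

SHOW GLOBAL VARIABLES LIKE 'binlog_format';

With MIXED, most statements are replicated as SQL statements and only the ones flagged as unsafe fall back to row-based logging, so statements whose effect differs between master and replica can slowly leave the replica with different data; ROW avoids this by always shipping the actual row changes.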

We should probably re-create the meta instance on db1108 with binlog_format ROW, letting it replicate again from an-coord1001.

Event Timeline

fdans triaged this task as High priority. Apr 15 2021, 5:26 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

@Ottomata @razzi I think that we should do this sooner rather than later. Do you want me to do it, or do you prefer to do it during May?

@elukey no preference, but if you do it, can you sync with Razzi so he learns how as well? TY!

I want to work on this! Is it ok to drop superset_production on db1108 in order to do this? If so, I think I'll be able to figure it out with some trial and error.

In terms of the scope of this ticket, is the 1001 / 1002 difference acceptable, or is that worth inspecting? I think the best way to check if replication is working would be to attempt to restore from 1002 or failover to 1002.

> I want to work on this! Is it ok to drop superset_production on db1108 in order to do this? If so, I think I'll be able to figure it out with some trial and error.

Sure :) but as always, we need to understand the issue first and then come up with a plan. This problem is not related only to the Superset database, but to all the ones that are replicated from an-coord1001 (and possibly also from matomo1002). As outlined in the description, the binlog format should be ROW instead of MIXED to guarantee consistency over time, and since we are running (on an-coord1001) a multi-db instance (rather than a multi-instance setup) the binlog is the same for all databases. So the high level approach could be:

  • stop mariadb on db1108
  • wipe the meta replication instance on db1108
  • merge the change to set the binlog_format to ROW
  • copy the entire db data from an-coord1002 (also stopped etc..)
  • restart the replication

I have never done it before without Data Persistence's assistance, so we'll have to ask them to be sure. From a past chat with Manuel, these are the options:

  • delete srv/sqldata.whatever on db1108
  • stop replica on an-coord1002, and run show slave status\G and grab the output
  • stop mysql on 1002
  • copy srv/sqldata.whatever to db1108
  • start mysql on db1108, and run: stop slave, reset slave all;
  • and then ping me to help you with setting up replication
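That last step typically boils down to a CHANGE MASTER TO on db1108, pointing it at an-coord1001 with the coordinates captured from an-coord1002. A sketch (host, user and coordinates are placeholders to be confirmed with Data Persistence):

CHANGE MASTER TO
  MASTER_HOST='an-coord1001',   -- placeholder, use the FQDN
  MASTER_USER='...',            -- replication user from our grants
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='<Relay_Master_Log_File from the saved slave status>',
  MASTER_LOG_POS=<Exec_Master_Log_Pos from the saved slave status>;
START SLAVE;
SHOW SLAVE STATUS\G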

The other option would be to simply start fresh on db1108, but we'd need to re-set up all users/grants/etc.
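If we went that route, the grants could be read from an-coord1002 and re-applied on db1108 with something like the following (pt-show-grants would also work; the account name below is only an example, not necessarily a real one):

SELECT User, Host FROM mysql.user;  -- list the accounts to recreate
SHOW GRANTS FOR 'superset'@'%';     -- example account, repeat for every user/host pair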

Last but not least: the MIXED vs ROW problem could also be affecting Matomo, so we should check it too.

> In terms of the scope of this ticket, is the 1001 / 1002 difference acceptable, or is that worth inspecting? I think the best way to check if replication is working would be to attempt to restore from 1002 or failover to 1002.

I wouldn't fail over to 1002, since we haven't done it yet and there are a lot of moving parts (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta#Failover). We should do it sooner or later, but if our DB architecture changes in Q1/Q2 of the next fiscal year to a single active/standby node (so no more an-coord100X hostnames hardcoded into puppet), the failover procedure will surely become easier. We can use 1002 as a good replica: it runs ROW replication like 1001, and it didn't fail when you operated on the superset_production db.

I'll let you do some research to scope the work and come up with a plan; if I were you, I'd follow up with Data Persistence on IRC to ask for suggestions (only once you have a plan, etc.).

Instead of doing this work to recreate the replicas with a different binlog format now, could we wait for the new db hardware, set up multi-instance MariaDBs, and enable the proper binlog format at that point? We'll basically be recreating each replica anyway, so it might be more worth our time to just do that then.

I would do it anyway since these are the dbs that we back up periodically, and it may take a while (namely months) to get everything set up and running and migrated. Since it is mostly my fault I can spend the time on it, but if the team thinks it is not worth it I can drop the ball and decline :)

Before we pull the trigger, I'd like to run pt-table-checksum (or any equivalent tool) between an-coord1001 and db1108 to see if we have data drifts and where. I will sync with Manuel to understand how/when to do it, but it seems like a good intermediate step to figure out the current state of db1108.
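Something along these lines, presumably run from an-coord1001 (a sketch; exact connection options, grants and the checksum database are to be agreed with Data Persistence):

pt-table-checksum --databases=superset_production --replicate=percona.checksums h=an-coord1001 --ask-pass

Differences would then show up in the DIFFS column of the tool's per-table output, and pt-table-sync could be used afterwards to inspect/fix them.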