
Data drifts between superset_production on an-coord1001 and db1108
Open, High, Public

Description

Last Friday I worked a bit on db1108's meta replication, since for some reason it was broken after Razzi's upgrade of Superset. The interesting thing was that the replica on an-coord1002 was ok, and only the one on db1108 broke.

The failures were related to statements like "Cannot add column X to table Y because X is already present". I checked the failing table on db1108's superset_production and it was empty, so I had to disable the binlog and drop the column manually to allow the replication to restart (I assumed they were new tables and that the upgrade/rollback/upgrade actions somehow caused inconsistency).
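For reference, the manual fix on db1108 was along these lines (a sketch with placeholder table/column names, not the exact statements that were run):

SET SESSION sql_log_bin = 0;  -- don't write the manual fix to db1108's own binlog
ALTER TABLE superset_production.some_table DROP COLUMN some_column;  -- placeholder names
SET SESSION sql_log_bin = 1;
START SLAVE;
SHOW SLAVE STATUS\G  -- check that Slave_SQL_Running is back to Yes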

Today I checked the row counts across tables on an-coord1001, 1002 and 1108; they are not the same:

1001:

MariaDB [(none)]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          499940 |
+-----------------+
1 row in set (0.00 sec)

1002:

MariaDB [(none)]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          495349 |
+-----------------+
1 row in set (0.001 sec)

1108:

MariaDB [(none)]> SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          376950 |
+-----------------+
1 row in set (0.002 sec)

The main difference seems to be in the log table's entries; I still have to check what those are.
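To pin down which tables differ, a per-table breakdown of the same query should help (keeping in mind that TABLE_ROWS is only an estimate for InnoDB, so exact counts would still need a SELECT COUNT(*) per table):

SELECT TABLE_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'superset_production'
ORDER BY TABLE_NAME;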

On db1108 the binlog_format is MIXED, while on 1001/1002 it is ROW. I suspect that db1108 drifted over time for some reason, and that we have now ended up with problems while migrating/upgrading Superset.
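For the record, the setting can be verified on each host with:

SHOW GLOBAL VARIABLES LIKE 'binlog_format';

With MIXED, most statements are replicated as SQL statements and only the ones flagged as unsafe fall back to row-based logging, so statements whose effect differs between master and replica can slowly leave the replica with different data; ROW avoids this by always shipping the actual row changes.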

We should probably re-create the meta instance on db1108 with binlog_format ROW, letting it replicate again from an-coord1001.

Event Timeline

fdans triaged this task as High priority. Apr 15 2021, 5:26 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

@Ottomata @razzi I think that we should do this sooner rather than later. Do you want me to do it, or do you prefer to do it during May?

@elukey no preference, but if you do it, can you sync with Razzi so he learns how as well? TY!

I want to work on this! Is it ok to drop superset_production on db1108 in order to do this? If so, I think I'll be able to figure it out with some trial and error.

In terms of the scope of this ticket, is the 1001 / 1002 difference acceptable, or is that worth inspecting? I think the best way to check if replication is working would be to attempt to restore from 1002 or failover to 1002.

> I want to work on this! Is it ok to drop superset_production on db1108 in order to do this? If so, I think I'll be able to figure it out with some trial and error.

Sure :) but as always, we need to understand the issue first and then come up with a plan. This problem is not related only to the Superset database, but to all the ones that are replicated from an-coord1001 (and possibly also from matomo1002). As outlined in the description, the binlog format should be ROW instead of MIXED to guarantee consistency over time, and since we are running (on an-coord1001) a multi-db instance (rather than a multi-instance setup) the binlog is the same for all databases. So the high level approach could be:

  • stop mariadb on db1108
  • wipe the meta replication instance on db1108
  • merge the change to set the binlog_format to ROW
  • copy the entire db data from an-coord1002 (also stopped etc..)
  • restart the replication

I have never done it before without Data Persistence's assistance, so we'll have to ask them to be sure. From a past chat with Manuel, these are the options:

  • delete srv/sqldata.whatever on db1108
  • stop replica on an-coord1002, and run show slave status\G and grab the output
  • stop mysql on 1002
  • copy srv/sqldata.whatever to db1108
  • start mysql on db1108, and run: stop slave, reset slave all;
  • and then ping me to help you with setting up replication
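That last step typically boils down to a CHANGE MASTER TO on db1108, pointing it at an-coord1001 with the coordinates captured from an-coord1002. A sketch (host, user and coordinates are placeholders to be confirmed with Data Persistence):

CHANGE MASTER TO
  MASTER_HOST='an-coord1001',   -- placeholder, use the FQDN
  MASTER_USER='...',            -- replication user from our grants
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='<Relay_Master_Log_File from the saved slave status>',
  MASTER_LOG_POS=<Exec_Master_Log_Pos from the saved slave status>;
START SLAVE;
SHOW SLAVE STATUS\G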

The other option would be to simply start fresh on db1108, but we'd need to re-set up all users/grants/etc.
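If we went that route, the grants could be read from an-coord1002 and re-applied on db1108 with something like the following (pt-show-grants would also work; the account name below is only an example, not necessarily a real one):

SELECT User, Host FROM mysql.user;  -- list the accounts to recreate
SHOW GRANTS FOR 'superset'@'%';     -- example account, repeat for every user/host pair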

Last but not least: the MIXED vs ROW problem could also be affecting Matomo, so we should check it too.

> In terms of the scope of this ticket, is the 1001 / 1002 difference acceptable, or is that worth inspecting? I think the best way to check if replication is working would be to attempt to restore from 1002 or failover to 1002.

I wouldn't fail over to 1002, since we haven't done it yet and there are a lot of moving parts (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta#Failover). We should do it sooner or later, but if our DB architecture changes in Q1/Q2 of the next fiscal year to a single active/standby node (so no more an-coord100X hostnames hardcoded into puppet), the failover procedure will surely become easier. We can use 1002 as a good replica: it runs ROW replication like 1001, and it didn't fail when you operated on the superset_production db.

I'll let you do some research to scope the work and come up with a plan; if I were you, I'd follow up with Data Persistence on IRC to ask for suggestions (only once you have a plan, etc.).

Instead of doing this work to recreate the replicas with a different binlog format now, could we wait for the new db hardware, set up multi-instance MariaDBs, and enable the proper binlog format at that point? We'll basically be recreating each replica anyway, so it might be more worth our time to just do that then.

I would do it anyway since these are the dbs that we back up periodically, and it may take a while (namely months) to get everything set up and running and migrated. Since it is mostly my fault I can spend the time on it, but if the team thinks it is not worth it I can drop the ball and decline :)

Before we pull the trigger, I'd like to run pt-table-checksum (or any equivalent tool) between an-coord1001 and db1108 to see if we have data drifts and where. I will sync with Manuel to understand how/when to do it, but it seems like a good intermediate step to figure out the current state of db1108.
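Something along these lines, presumably run from an-coord1001 (a sketch; exact connection options, grants and the checksum database are to be agreed with Data Persistence):

pt-table-checksum --databases=superset_production --replicate=percona.checksums h=an-coord1001 --ask-pass

Differences would then show up in the DIFFS column of the tool's per-table output, and pt-table-sync could be used afterwards to inspect/fix them.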