Page MenuHomePhabricator

Validate integrity of the failover replica database an-coord1002 against its primary an-coord1001
Closed, ResolvedPublic

Description

As described in T279440#7489667, we have observed some drift between the primary analytics_meta database server (an-coord1001) and its failover replica (an-coord1002).

This may be limited to the superset_production database, but it would be useful to ascertain whether any other databases have been affected and as such, whether we can trust the veracity of the replica.

If it turns out that the drift is greater than we expect, we can rebuild the replica on an-coord1002.

Event Timeline

BTullis triaged this task as High priority.Nov 11 2021, 5:29 PM

I propose to use pt-table-checksum to carry out this work, but I will limit its recursion to an-coord1002. The reason being that the tool will not be able to connect to db1108 as it is in another vlan.
We just re-created the analytics_meta instance on db1108 as well from a snapshot in T295312 so I'm less concerned about this.

I will create a percona database and a dsns table, then use the -recursion-method=dsn method to limit the tool to this one replica: https://www.percona.com/doc/percona-toolkit/LATEST/pt-table-checksum.html#cmdoption-pt-table-checksum-recursion-method

Also I notice that we are excluding the superset_staging table from replication:
From show slave status\G on an-coord1002:

Replicate_Wild_Ignore_Table: superset\_staging.%

See the following:

Created database and populated dsn table.

MariaDB [(none)]> create database percona;
MariaDB [(none)]> use percona;
MariaDB [percona]> CREATE TABLE `dsns` (
    ->   `id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `parent_id` int(11) DEFAULT NULL,
    ->   `dsn` varchar(255) NOT NULL,
    ->   PRIMARY KEY (`id`)
    MariaDB [percona]> insert into dsns(id,parent_id,dsn) values (1,0,'h=an-coord1002.eqiad.wmnet,u=repl,p=redacted');
-> );

I had to change the dns to use the IPv4 address, because the grant table is inconsistent regarding IPv4 and IPv6 addresses.

As in, this doesn't work from an-coord1001:
mysql -h an-coord1002.eqiad.wmnet -u repl -p
...but this does work:
mysql -h 10.64.36.107 -u repl -p

Something to fix another time.

I had to add the REPLICATION CLIENT privilege to the repl@% user account, since this was unable to execute the SHOW MASTER STATUS command.
I carried out the following on an-coord1001:

MariaDB [mysql]> GRANT REPLICATION CLIENT on *.* to `repl`@`10.%`;
Query OK, 0 rows affected (0.001 sec)

MariaDB [mysql]> flush privileges;

Before the change:

+--------------------------------------------------------------------------------------------------------------------+
| Grants for repl@10.%                                                                                               |
+--------------------------------------------------------------------------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO `repl`@`10.%` IDENTIFIED BY PASSWORD 'redacted'                                  |
+--------------------------------------------------------------------------------------------------------------------+

After the change:

mysql:repl@10.64.36.107 [(none)]> show grants;
+----------------------------------------------------------------------------------------------------------------------------------------+
| Grants for repl@10.%                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------+
| GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO `repl`@`10.%` IDENTIFIED BY PASSWORD 'redacted'                                  |
+----------------------------------------------------------------------------------------------------------------------------------------+

Starting checksum operation:

btullis@an-coord1001:~$ sudo pt-table-checksum --recursion-method=dsn=D=percona,t=dsns --no-check-binlog-format --ignore-databases superset_staging --no-check-replication-filters

A couple of iterations later, this is the result.

btullis@an-coord1001:~$ sudo ./pt-table-checksum --recursion-method=dsn=D=percona,t=dsns --no-check-binlog-format --ignore-databases superset_staging --no-check-replication-filters --chunk-size-limit=4 --quiet
Checking if all tables can be checksummed ...
Starting checksum ...
            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
11-12T15:17:46      0      1       19          4       1       0   0.018 mysql.db
11-12T15:17:46      0      1       23          3       1       0   0.020 mysql.global_priv
11-12T15:17:46      0      1        0         40       1       0   0.022 mysql.gtid_slave_pos
11-12T15:17:46      0      1        5          0       1       0   0.023 mysql.proc
11-12T15:17:46      0      1        2          1       1       0   0.021 mysql.proxies_priv

So on the face of it that's good. It indicates that all databases are fully synced, with the exception of some small discrepancies in the mysql table, which we can investigate and rectify with pt-table-sync.

However, it doesn't change the fact that the number of rows reported by these two queries on the two hosts is still different.

btullis@an-coord1001:~$ sudo mysql -e "SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';"
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          874233 |
+-----------------+
btullis@an-coord1002:~$ sudo mysql -e "SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production';"
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|         1542235 |
+-----------------+

Here's an interesting thing, it appears that the primary (an-coord1001) is under-reporting the values in the INFORMATION_SCHEMA.TABLES and the repilica is over-reporting them.

P17739 shows a full set of row counts from the superset_production database, as reported by pt-table-checksum:

We'll take one table, namely the superset_production.logs table.

            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
11-12T16:15:41      0      0  1214702          0      14       0   6.434 superset_production.logs

A full count of the table shows the correct number (plus a few logs that have been added in the last 30 minutes or so)

btullis@an-coord1001:~$ sudo mysql -e "select count(*) as table_rows from superset_production.logs"
+------------+
| table_rows |
+------------+
|    1214718 |
+------------+

Getting the row count immediately on an-coord1002 also shows the same value:

btullis@an-coord1002:~$ sudo mysql -e "select count(*) as table_rows from superset_production.logs"
+------------+
| table_rows |
+------------+
|    1214718 |
+------------+

However, examining the information schema row count for this table on an-coord1001 returns a value about 400K lower than the actual row count.

btullis@an-coord1001:~$ sudo mysql -e "SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production' AND TABLE_NAME = 'logs'"
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|          831410 |
+-----------------+

Running the same command on an-coord1002 shows a value that is about 38K higher than the actual number of rows.

btullis@an-coord1002:~$ sudo mysql -e "SELECT SUM(TABLE_ROWS) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'superset_production' AND TABLE_NAME = 'logs'"
+-----------------+
| SUM(TABLE_ROWS) |
+-----------------+
|         1499850 |
+-----------------+

So it would appear that these estimates are way off the mark.

I could run an ANALYZE TABLE against these tables, which is a low low risk operation.
It will lock the table briefly and it can either be run locally or it can allow the analyze command to propagate via the binlog.
https://mariadb.com/kb/en/analyze-table/

All in all though, I'm not too concerned about this, now that the pt-table-sync command has reported no drift, apart from minor discrepanies within the mysql database.

I will clean up the permissions that I changed in order to be able to run the command and resolve the ticket.

1Checking if all tables can be checksummed ...
2Starting checksum ...
3 TS ERRORS DIFFS ROWS DIFF_ROWS CHUNKS SKIPPED TIME TABLE
411-12T16:15:33 0 0 109 0 1 0 0.051 superset_production.ab_permission
511-12T16:15:33 0 0 4032 0 1 0 0.064 superset_production.ab_permission_view
611-12T16:15:33 0 0 874 0 1 0 0.051 superset_production.ab_permission_view_role
711-12T16:15:33 0 0 0 0 1 0 0.040 superset_production.ab_register_user
811-12T16:15:33 0 0 6 0 1 0 0.024 superset_production.ab_role
911-12T16:15:33 0 0 289 0 1 0 0.031 superset_production.ab_user
1011-12T16:15:33 0 0 291 0 1 0 0.026 superset_production.ab_user_role
1111-12T16:15:33 0 0 3801 0 1 0 0.068 superset_production.ab_view_menu
1211-12T16:15:33 0 0 0 0 1 0 0.051 superset_production.access_request
1311-12T16:15:33 0 0 1 0 1 0 0.047 superset_production.alembic_version
1411-12T16:15:33 0 0 0 0 1 0 0.047 superset_production.alert_logs
1511-12T16:15:33 0 0 0 0 1 0 0.047 superset_production.alert_owner
1611-12T16:15:33 0 0 0 0 1 0 0.049 superset_production.alerts
1711-12T16:15:33 0 0 6 0 1 0 0.063 superset_production.annotation
1811-12T16:15:33 0 0 5 0 1 0 0.055 superset_production.annotation_layer
1911-12T16:15:34 0 0 0 0 1 0 0.048 superset_production.cache_keys
2011-12T16:15:34 0 0 2 0 1 0 0.047 superset_production.clusters
2111-12T16:15:34 0 0 943 0 1 0 0.055 superset_production.columns
2211-12T16:15:34 0 0 0 0 1 0 0.047 superset_production.css_templates
2311-12T16:15:34 0 0 0 0 1 0 0.053 superset_production.dashboard_email_schedules
2411-12T16:15:34 0 0 0 0 1 0 0.049 superset_production.dashboard_roles
2511-12T16:15:34 0 0 955 0 1 0 0.305 superset_production.dashboard_slices
2611-12T16:15:34 0 0 208 0 1 0 0.045 superset_production.dashboard_user
2711-12T16:15:34 0 0 179 0 1 0 0.051 superset_production.dashboards
2811-12T16:15:34 0 0 52 0 1 0 0.272 superset_production.datasources
2911-12T16:15:35 0 0 6 0 1 0 0.050 superset_production.dbs
3011-12T16:15:35 0 0 1 0 1 0 0.250 superset_production.druiddatasource_user
3111-12T16:15:35 0 0 0 0 1 0 0.042 superset_production.dynamic_plugin
3211-12T16:15:35 0 0 156 0 1 0 0.042 superset_production.favstar
3311-12T16:15:35 0 0 7 0 1 0 0.024 superset_production.keyvalue
3411-12T16:15:41 0 0 1214702 0 14 0 6.434 superset_production.logs
3511-12T16:15:41 0 0 254 0 1 0 0.028 superset_production.metrics
3611-12T16:15:42 0 0 19612 0 1 0 0.362 superset_production.query
3711-12T16:15:42 0 0 0 0 1 0 0.047 superset_production.report_execution_log
3811-12T16:15:42 0 0 0 0 1 0 0.034 superset_production.report_recipient
3911-12T16:15:42 0 0 0 0 1 0 0.031 superset_production.report_schedule
4011-12T16:15:42 0 0 0 0 1 0 0.033 superset_production.report_schedule_user
4111-12T16:15:42 0 0 0 0 1 0 0.030 superset_production.rls_filter_roles
4211-12T16:15:42 0 0 0 0 1 0 0.023 superset_production.rls_filter_tables
4311-12T16:15:42 0 0 0 0 1 0 0.023 superset_production.row_level_security_filters
4411-12T16:15:42 0 0 328 0 1 0 0.027 superset_production.saved_query
4511-12T16:15:42 0 0 0 0 1 0 0.025 superset_production.slice_email_schedules
4611-12T16:15:42 0 0 1655 0 1 0 0.026 superset_production.slice_user
4711-12T16:15:42 0 0 1490 0 1 0 0.062 superset_production.slices
4811-12T16:15:42 0 0 1157 0 1 0 0.029 superset_production.sql_metrics
4911-12T16:15:42 0 0 0 0 1 0 0.022 superset_production.sql_observations
5011-12T16:15:42 0 0 915 0 1 0 0.023 superset_production.sqlatable_user
5111-12T16:15:42 0 0 0 0 1 0 0.028 superset_production.tab_state
5211-12T16:15:42 0 0 5512 0 1 0 0.088 superset_production.table_columns
5311-12T16:15:42 0 0 0 0 1 0 0.024 superset_production.table_schema
5411-12T16:15:42 0 0 1016 0 1 0 0.049 superset_production.tables
5511-12T16:15:42 0 0 59 0 1 0 0.025 superset_production.tag
5611-12T16:15:43 0 0 955 0 1 0 0.030 superset_production.tagged_object
5711-12T16:15:43 0 0 840 0 1 0 0.033 superset_production.url
5811-12T16:15:43 0 0 0 0 1 0 0.037 superset_production.user_attribute

MariaDB [(none)]> select User,Host from mysql.user where User='checksum_user';
+---------------+------+
| User          | Host |
+---------------+------+
| checksum_user | %    |
+---------------+------+
1 row in set (0.002 sec)

MariaDB [(none)]> drop user if exists 'checksum_user'@'%';
Query OK, 0 rows affected (0.004 sec)

MariaDB [(none)]> select User,Host from mysql.user where User='checksum_user';
Empty set (0.002 sec)
MariaDB [(none)]> show grants for 'repl'@'10.%';
+----------------------------------------------------------------------------------------------------------------------------------------+
| Grants for repl@10.%                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------+
| GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO `repl`@`10.%` IDENTIFIED BY PASSWORD '*EFBEB2726B64952011173EBB677E876E9A545A42' |
+----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> REVOKE REPLICATION CLIENT on *.* FROM 'repl'@'10.%';
Query OK, 0 rows affected (0.001 sec)

MariaDB [(none)]> show grants for 'repl'@'10.%';
+--------------------------------------------------------------------------------------------------------------------+
| Grants for repl@10.%                                                                                               |
+--------------------------------------------------------------------------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO `repl`@`10.%` IDENTIFIED BY PASSWORD '*EFBEB2726B64952011173EBB677E876E9A545A42' |
+--------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

So it's only these differences left.

            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
11-12T15:17:46      0      1       19          4       1       0   0.018 mysql.db
11-12T15:17:46      0      1       23          3       1       0   0.020 mysql.global_priv
11-12T15:17:46      0      1        0         40       1       0   0.022 mysql.gtid_slave_pos
11-12T15:17:46      0      1        5          0       1       0   0.023 mysql.proc
11-12T15:17:46      0      1        2          1       1       0   0.021 mysql.proxies_priv

Quickly scanning the mysql.db table it looks like there are 4 additional rows on the master, which are grants for:

  • hue_next@%
  • superset_staging@%
  • oozie@127.0.0.1 (in addition to oozie@localhost)
  • hive_metastore@127.0.0.1 (in addition to hive_metatore@localhost)

I'm not even going to worry about getting these in sync, given that we're going to be migrating from an-coord1002 soon.