run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)
Open, High, Public

Description

Title updated on 24th March 2017: I have changed the title to reflect that this task contains the data for s2, which was the first shard checksummed for testing, so we used this ticket for it. The rest of the shards have their own tickets.

In s3 the following servers will be decommissioned (T134476)

db1038 - old master
db1015 - rc
db1035
db1044 - temporary master for db1095 (sanitarium2)

Before shutting them down for good, we need to make sure we are not losing any data, so we'd need to run pt-table-checksum on them.
Some modifications might be needed in order to run it safely.


Good point: pt-table-checksum 2.2.20

I do not know when that was fixed. The definition shown by grep -A 15 ' CREATE TABLE checksums' $(which pt-table-checksum) should force the table or the char fields to be utf8, or execution will fail on our production (getting stuck forever).

It is fixed in 2.2.19, and 2.2.20 is the one I have in my home directory (but we should indeed upgrade the system one):

root@neodymium:/home/marostegui/percona-toolkit-2.2.20/bin# grep -A 15 ' CREATE TABLE checksums' pt-table-checksum
  CREATE TABLE checksums (
     db             CHAR(64)     NOT NULL,
     tbl            CHAR(64)     NOT NULL,
     chunk          INT          NOT NULL,
     chunk_time     FLOAT            NULL,
     chunk_index    VARCHAR(200)     NULL,
     lower_boundary TEXT             NULL,
     upper_boundary TEXT             NULL,
     this_crc       CHAR(40)     NOT NULL,
     this_cnt       INT          NOT NULL,
     master_crc     CHAR(40)         NULL,
     master_cnt     INT              NULL,
     ts             TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
     PRIMARY KEY (db, tbl, chunk),
     INDEX ts_db_tbl (ts, db, tbl)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
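
The same check could be wrapped in a small helper and run against any copy of the tool before using it in production. This is only a sketch: the function name checksums_table_is_utf8 is hypothetical, and it assumes the CREATE TABLE statement ends within 15 lines, as it does in the 2.2.20 output above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the checksums table definition embedded in
# the given pt-table-checksum script declares a utf8 default charset.
# Note that the pattern also matches utf8mb4.
checksums_table_is_utf8() {
    tool_path="$1"
    grep -A 15 ' CREATE TABLE checksums' "$tool_path" | grep -q 'DEFAULT CHARSET=utf8'
}

# Usage: check the copy found on PATH, if there is one.
if tool="$(command -v pt-table-checksum)"; then
    if checksums_table_is_utf8 "$tool"; then
        echo "$tool: checksums table forces utf8"
    else
        echo "$tool: checksums table does not force utf8, do not use it"
    fi
fi
```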

That makes sense, however we'd need to truncate the table after using it as it will be used to check specific slaves from different shards.

Let's create a dsns database on tendril, we can create one table per shard so we do not have to work twice next time! :-)

Sounds good. I will take care of that

Mentioned in SAL (#wikimedia-operations) [2017-02-15T11:38:07Z] <marostegui> Running pt-table-checksum on db1043 (m3 - phabricator master) - T154485

For the record I have created the dsns tables on tendril (and the first test with pt-table-checksum on m3 is using it).
The only one that has data so far is dsns_m3

MariaDB TENDRIL localhost dsns > show tables;
+----------------+
| Tables_in_dsns |
+----------------+
| dsns_m3        |
| dsns_s1        |
| dsns_s2        |
| dsns_s3        |
| dsns_s4        |
| dsns_s5        |
| dsns_s6        |
| dsns_s7        |
| dsns_s8        |
+----------------+
9 rows in set (0.00 sec)

MariaDB TENDRIL localhost dsns > show create table dsns_m3;
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table   | Create Table                                                                                                                                                                                                                                              |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| dsns_m3 | CREATE TABLE `dsns_m3` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `parent_id` int(11) DEFAULT NULL,
  `dsn` varchar(255) COLLATE utf8mb4_bin NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB TENDRIL localhost dsns > select * from dsns_m3;
+----+-----------+-----------------------------+
| id | parent_id | dsn                         |
+----+-----------+-----------------------------+
|  1 |      NULL | h=db1048.eqiad.wmnet,u=root |
|  2 |      NULL | h=db2012.codfw.wmnet,u=root |
+----+-----------+-----------------------------+
2 rows in set (0.00 sec)
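
For the remaining shards, the rows could be generated rather than typed by hand. A minimal sketch follows, assuming every DSN uses u=root and leaves parent_id NULL like the two dsns_m3 rows above; make_dsn_inserts is a hypothetical name and the output is meant to be reviewed before being run against the dsns database.

```shell
#!/bin/sh
# Hypothetical helper: print one INSERT per replica host for a dsns table.
# Assumes id is AUTO_INCREMENT and parent_id stays NULL, as in dsns_m3,
# and that every DSN connects as u=root.
make_dsn_inserts() {
    table="$1"
    shift
    for host in "$@"; do
        printf "INSERT INTO %s (dsn) VALUES ('h=%s,u=root');\n" "$table" "$host"
    done
}

# Usage: prints statements equivalent to the two dsns_m3 rows shown above.
make_dsn_inserts dsns_m3 db1048.eqiad.wmnet db2012.codfw.wmnet
```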

The first test was with the phabricator_file database, and it generated a peak of 500 seconds of lag on db1048 and db2012 while checksumming the biggest table of the database: file_storageblob (26GB).
I didn't specify any chunk-size or chunk-time, to see how the tool would adjust (that means the defaults of chunk-size=1000 and chunk-time=0.5), but it didn't adjust well (as I said, 500 seconds of lag was generated). The --max-lag was set to "2"
This is what was run:

./pt-table-checksum --replicate-check-only --no-check-binlog-format --replicate=phabricator_file.__wmf_checksums --check-slave-lag db1048.eqiad.wmnet,db2012.codfw.wmnet --max-lag 1 -F /home/marostegui/.my.cnf  --databases phabricator_file h=db1043.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_m3

The biggest table took 1744 seconds to be checksummed.

For the next iteration I have specified a chunk-time of 0.1 which makes the tool slower but safer (we even get a warning about how slow it is).

root@neodymium:/home/marostegui/percona-toolkit-2.2.20/bin# ./pt-table-checksum --no-check-binlog-format --replicate=phabricator_file.__wmf_checksums --check-slave-lag db1048.eqiad.wmnet,db2012.codfw.wmnet --max-lag 1 --chunk-time 0.1 -F /home/marostegui/.my.cnf  --databases phabricator_file h=db1043.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_m3

<snip>

02-15T13:48:41 Checksum queries for table phabricator_file.file_storageblob are executing very slowly.  --chunk-size has been automatically reduced to 1.  Check that the server is not being overloaded, or increase --chunk-time.  The last chunk, number 92 of table phabricator_file.file_storageblob, selected 1 rows and took 0.123 seconds to execute.

Despite this, the replicas got a lag peak of around 40 seconds.
This time the table took 1886 seconds to be checksummed, and it was much gentler on the replicas.

For another run I left chunk-time at its default (0.5) and lowered chunk-size from the default 1000 rows to 500 rows.
It is a lot slower, but there was no lag reported by pt-table-checksum.

./pt-table-checksum --no-check-binlog-format --replicate=phabricator_file.__wmf_checksums --check-slave-lag db1048.eqiad.wmnet,db2012.codfw.wmnet --max-lag 1 --chunk-size 500 -F /home/marostegui/.my.cnf  --databases phabricator_file h=db1043.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_m3

This time the whole check took around 40 minutes, with around 20 minutes spent checksumming the 26G table.
As I said, the tool didn't report any lag this time, but I left a script running to check whether that was true, and it mostly is: the max lag I was able to capture from a direct replica (within the same DC) was 19 seconds:

Seconds_Behind_Master: 13
Seconds_Behind_Master: 11
Seconds_Behind_Master: 13
Seconds_Behind_Master: 10
Seconds_Behind_Master: 16
Seconds_Behind_Master: 19
Seconds_Behind_Master: 13
Seconds_Behind_Master: 10

The time between each check is 5 seconds, so the replica was delayed for around 40 seconds but never by more than 20 seconds at a time, which seems pretty good to me. It is true that those servers do not have any reads, so maybe we need to decrease the chunk-size a bit more for production.
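
The arithmetic above (8 samples, 5 seconds apart, peak of 19) can be reproduced from the captured lines with a small summary helper. This is a sketch only: summarize_lag is a hypothetical name, and it assumes the samples were taken at a fixed interval, which is how the window length is estimated.

```shell
#!/bin/sh
# Hypothetical helper: read "Seconds_Behind_Master: N" lines on stdin and print
# the number of samples, the peak lag, and the approximate time covered,
# assuming one sample every INTERVAL seconds (5 by default).
summarize_lag() {
    interval="${1:-5}"
    awk -v interval="$interval" '
        /Seconds_Behind_Master:/ {
            n++
            if ($2 + 0 > peak) peak = $2 + 0
        }
        END {
            printf "samples=%d peak=%d window=%ds\n", n, peak, n * interval
        }'
}

# Usage with the samples captured during the chunk-size 500 run.
summarize_lag 5 <<'EOF'
Seconds_Behind_Master: 13
Seconds_Behind_Master: 11
Seconds_Behind_Master: 13
Seconds_Behind_Master: 10
Seconds_Behind_Master: 16
Seconds_Behind_Master: 19
Seconds_Behind_Master: 13
Seconds_Behind_Master: 10
EOF
# prints: samples=8 peak=19 window=40s
```

The same helper could be fed the batches from later runs to compare peaks between chunk-size settings.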

The last test was chunk-size 200:

./pt-table-checksum --no-check-binlog-format --replicate=phabricator_file.__wmf_checksums --check-slave-lag db1048.eqiad.wmnet,db2012.codfw.wmnet --max-lag 1 --chunk-size 200 -F /home/marostegui/.my.cnf  --databases phabricator_file h=db1043.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_m3

The whole check took 1 hour, but it generated even less lag. These are the lag spikes it got during the different table checks:

Seconds_Behind_Master: 1
Seconds_Behind_Master: 3
Seconds_Behind_Master: 5
Seconds_Behind_Master: 2
Seconds_Behind_Master: 3
Seconds_Behind_Master: 1

And

Seconds_Behind_Master: 1
Seconds_Behind_Master: 1
Seconds_Behind_Master: 1
Seconds_Behind_Master: 2
Seconds_Behind_Master: 3
Seconds_Behind_Master: 5
Seconds_Behind_Master: 7
Seconds_Behind_Master: 10
Seconds_Behind_Master: 8
Seconds_Behind_Master: 7
Seconds_Behind_Master: 6
Seconds_Behind_Master: 8
Seconds_Behind_Master: 6
Seconds_Behind_Master: 3

And:

Seconds_Behind_Master: 3
Seconds_Behind_Master: 8
Seconds_Behind_Master: 13
Seconds_Behind_Master: 12
Seconds_Behind_Master: 2

The last run with chunk-size took 1:50h to finish the whole dataset. The lag generated was a bit less...

Seconds_Behind_Master: 1
Seconds_Behind_Master: 3
Seconds_Behind_Master: 5
Seconds_Behind_Master: 2
Seconds_Behind_Master: 3
Seconds_Behind_Master: 1

Another batch

Seconds_Behind_Master: 2
Seconds_Behind_Master: 3
Seconds_Behind_Master: 5
Seconds_Behind_Master: 7
Seconds_Behind_Master: 10
Seconds_Behind_Master: 8
Seconds_Behind_Master: 7
Seconds_Behind_Master: 6
Seconds_Behind_Master: 8
Seconds_Behind_Master: 6
Seconds_Behind_Master: 3

Another batch

Seconds_Behind_Master: 3
Seconds_Behind_Master: 8
Seconds_Behind_Master: 13
Seconds_Behind_Master: 12
Seconds_Behind_Master: 2

Another one

Seconds_Behind_Master: 2
Seconds_Behind_Master: 3
Seconds_Behind_Master: 7
Seconds_Behind_Master: 10
Seconds_Behind_Master: 5

So the chunk-size looks safer but it will take quite long (which is fine, I would rather not generate lag in production).

As a side note: --ignore-tables works fine.

dbstore1001 struggled to execute the pt-table-checksum query:

REPLACE INTO `phabricator_file`.`__wmf_checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT 'phabricator_file', 'file_storageblob', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, CRC32(`data`), `datecreated`, `datemodified`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `phabricator_file`.`file_storageblob` /*checksum table*/

It made the server kind of hang (it didn't crash at all, and when I killed that REPLACE it went fine straight away). This was the status:

root@DBSTORE[(none)]> show processlist;
+----------+-----------------+------------------+--------------------+---------+---------+----------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id       | User            | Host             | db                 | Command | Time    | State                            | Info                                                                                                 | Progress |
+----------+-----------------+------------------+--------------------+---------+---------+----------------------------------+------------------------------------------------------------------------------------------------------+----------+
|  7927022 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927024 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927026 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927028 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927030 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927032 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927034 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927036 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927038 | system user     |                  | NULL               | Connect | 1354892 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927050 | event_scheduler | localhost        | NULL               | Daemon  |       1 | Waiting for next activation      | NULL                                                                                                 |    0.000 |
| 10107012 | watchdog        | 10.64.0.15:48258 | mysql              | Sleep   |       6 |                                  | NULL                                                                                                 |    0.000 |
| 10138755 | watchdog        | 10.64.0.15:51997 | information_schema | Query   |     209 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10142021 | watchdog        | 10.64.0.15:34146 | information_schema | Query   |     220 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_VARIABLES`                                     |    0.000 |
| 10142049 | watchdog        | 10.64.0.15:34909 | information_schema | Sleep   |      16 |                                  | NULL                                                                                                 |    0.000 |
| 10187049 | system user     |                  | phabricator_file   | Connect |   77588 | Sending data                     | REPLACE INTO `phabricator_file`.`__wmf_checksums` (db, tbl, chunk, chunk_index, lower_boundary, uppe |    0.000 |
| 10187055 | root            | localhost        | ops                | Connect |     231 | Killing slave                    | stop slave "m3" sql_thread                                                                           |    0.000 |
| 10187058 | watchdog        | 10.64.0.15:51748 | information_schema | Query   |     230 | init                             | SHOW ALL SLAVES STATUS                                                                               |    0.000 |
| 10187059 | nagios          | localhost        | NULL               | Query   |     230 | init                             | show slave status                                                                                    |    0.000 |
| 10187060 | nagios          | localhost        | NULL               | Query   |     230 | init                             | show slave status                                                                                    |    0.000 |
| 10187061 | nagios          | localhost        | NULL               | Query   |     230 | init                             | show slave status                                                                                    |    0.000 |
| 10187062 | nagios          | localhost        | NULL               | Query   |     230 | init                             | show slave status                                                                                    |    0.000 |
| 10187063 | nagios          | localhost        | NULL               | Query   |     228 | init                             | show slave status                                                                                    |    0.000 |
| 10187064 | nagios          | localhost        | NULL               | Query   |     227 | init                             | show slave status                                                                                    |    0.000 |
| 10187068 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187069 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187070 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187071 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187072 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187073 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187074 | nagios          | localhost        | NULL               | Query   |     213 | init                             | show slave status                                                                                    |    0.000 |
| 10187075 | nagios          | localhost        | NULL               | Query   |     212 | init                             | show slave status                                                                                    |    0.000 |
| 10187076 | nagios          | localhost        | NULL               | Query   |     212 | init                             | show slave status                                                                                    |    0.000 |
| 10187077 | nagios          | localhost        | NULL               | Query   |     212 | init                             | show slave status                                                                                    |    0.000 |
| 10187079 | nagios          | localhost        | NULL               | Query   |     208 | init                             | show slave status                                                                                    |    0.000 |
| 10187080 | nagios          | localhost        | NULL               | Query   |     208 | init                             | show slave status                                                                                    |    0.000 |
| 10187081 | nagios          | localhost        | NULL               | Query   |     208 | init                             | show slave status                                                                                    |    0.000 |
| 10187082 | nagios          | localhost        | NULL               | Query   |     207 | init                             | show slave status                                                                                    |    0.000 |
| 10187083 | nagios          | localhost        | NULL               | Query   |     207 | init                             | show slave status                                                                                    |    0.000 |
| 10187084 | nagios          | localhost        | NULL               | Query   |     207 | init                             | show slave status                                                                                    |    0.000 |
| 10187086 | nagios          | localhost        | NULL               | Query   |     205 | init                             | show slave status                                                                                    |    0.000 |
| 10187087 | nagios          | localhost        | NULL               | Query   |     204 | init                             | show slave status                                                                                    |    0.000 |
| 10187088 | nagios          | localhost        | NULL               | Query   |     204 | init                             | show slave status                                                                                    |    0.000 |
| 10187090 | nagios          | localhost        | NULL               | Query   |     201 | init                             | show slave status                                                                                    |    0.000 |
| 10187091 | nagios          | localhost        | NULL               | Query   |     201 | init                             | show slave status                                                                                    |    0.000 |
| 10187092 | prometheus      | localhost        | NULL               | Query   |     199 | Filling schema table             | SHOW GLOBAL STATUS                                                                                   |    0.000 |
| 10187093 | nagios          | localhost        | NULL               | Query   |     198 | init                             | show slave status                                                                                    |    0.000 |
| 10187094 | nagios          | localhost        | NULL               | Query   |     198 | init                             | show slave status                                                                                    |    0.000 |
| 10187095 | nagios          | localhost        | NULL               | Query   |     197 | init                             | show slave status                                                                                    |    0.000 |
| 10187096 | nagios          | localhost        | NULL               | Query   |     197 | init                             | show slave status                                                                                    |    0.000 |
| 10187098 | nagios          | localhost        | NULL               | Query   |     195 | init                             | show slave status                                                                                    |    0.000 |
| 10187099 | nagios          | localhost        | NULL               | Query   |     195 | init                             | show slave status                                                                                    |    0.000 |
| 10187100 | nagios          | localhost        | NULL               | Query   |     195 | init                             | show slave status                                                                                    |    0.000 |
| 10187101 | nagios          | localhost        | NULL               | Query   |     195 | init                             | show slave status                                                                                    |    0.000 |
| 10187102 | nagios          | localhost        | NULL               | Query   |     195 | init                             | show slave status                                                                                    |    0.000 |
| 10187103 | nagios          | localhost        | NULL               | Query   |     192 | init                             | show slave status                                                                                    |    0.000 |
| 10187105 | nagios          | localhost        | NULL               | Query   |     190 | init                             | show slave status                                                                                    |    0.000 |
| 10187106 | prometheus      | localhost        | NULL               | Query   |     189 | Filling schema table             | SHOW GLOBAL STATUS                                                                                   |    0.000 |
| 10187107 | nagios          | localhost        | NULL               | Query   |     189 | init                             | show slave status                                                                                    |    0.000 |
| 10187108 | nagios          | localhost        | NULL               | Query   |     188 | init                             | show slave status                                                                                    |    0.000 |
| 10187109 | nagios          | localhost        | NULL               | Query   |     188 | init                             | show slave status                                                                                    |    0.000 |
| 10187110 | nagios          | localhost        | NULL               | Query   |     188 | init                             | show slave status                                                                                    |    0.000 |
| 10187111 | nagios          | localhost        | NULL               | Query   |     188 | init                             | show slave status                                                                                    |    0.000 |
| 10187113 | nagios          | localhost        | NULL               | Query   |     186 | init                             | show slave status                                                                                    |    0.000 |
| 10187114 | nagios          | localhost        | NULL               | Query   |     186 | init                             | show slave status                                                                                    |    0.000 |
| 10187115 | watchdog        | 10.64.0.15:53215 | information_schema | Query   |     160 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_VARIABLES`                                     |    0.000 |
| 10187116 | nagios          | localhost        | NULL               | Query   |     185 | init                             | show slave status                                                                                    |    0.000 |
| 10187117 | nagios          | localhost        | NULL               | Query   |     184 | init                             | show slave status                                                                                    |    0.000 |
| 10187118 | nagios          | localhost        | NULL               | Query   |     184 | init                             | show slave status                                                                                    |    0.000 |
| 10187119 | nagios          | localhost        | NULL               | Query   |     183 | init                             | show slave status                                                                                    |    0.000 |
| 10187120 | nagios          | localhost        | NULL               | Query   |     182 | init                             | show slave status                                                                                    |    0.000 |
| 10187121 | nagios          | localhost        | NULL               | Query   |     182 | init                             | show slave status                                                                                    |    0.000 |
| 10187122 | nagios          | localhost        | NULL               | Query   |     182 | init                             | show slave status                                                                                    |    0.000 |
| 10187123 | nagios          | localhost        | NULL               | Query   |     182 | init                             | show slave status                                                                                    |    0.000 |
| 10187125 | nagios          | localhost        | NULL               | Query   |     180 | init                             | show slave status                                                                                    |    0.000 |
| 10187126 | watchdog        | 10.64.0.15:53464 | information_schema | Query   |     179 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10187127 | nagios          | localhost        | NULL               | Query   |     177 | init                             | show slave status                                                                                    |    0.000 |
| 10187129 | nagios          | localhost        | NULL               | Query   |     176 | init                             | show slave status                                                                                    |    0.000 |
| 10187130 | nagios          | localhost        | NULL               | Query   |     175 | init                             | show slave status                                                                                    |    0.000 |
| 10187131 | nagios          | localhost        | NULL               | Query   |     174 | init                             | show slave status                                                                                    |    0.000 |
| 10187132 | nagios          | localhost        | NULL               | Query   |     174 | init                             | show slave status                                                                                    |    0.000 |
| 10187133 | nagios          | localhost        | NULL               | Query   |     174 | init                             | show slave status                                                                                    |    0.000 |
| 10187135 | nagios          | localhost        | NULL               | Query   |     169 | init                             | show slave status                                                                                    |    0.000 |
| 10187136 | nagios          | localhost        | NULL               | Query   |     169 | init                             | show slave status                                                                                    |    0.000 |
| 10187137 | nagios          | localhost        | NULL               | Query   |     169 | init                             | show slave status                                                                                    |    0.000 |
| 10187138 | nagios          | localhost        | NULL               | Query   |     169 | init                             | show slave status                                                                                    |    0.000 |
| 10187139 | nagios          | localhost        | NULL               | Query   |     167 | init                             | show slave status                                                                                    |    0.000 |
| 10187140 | nagios          | localhost        | NULL               | Query   |     167 | init                             | show slave status                                                                                    |    0.000 |
| 10187144 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187145 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187146 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187147 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187148 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187149 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187150 | nagios          | localhost        | NULL               | Query   |     153 | init                             | show slave status                                                                                    |    0.000 |
| 10187151 | nagios          | localhost        | NULL               | Query   |     152 | init                             | show slave status                                                                                    |    0.000 |
| 10187152 | nagios          | localhost        | NULL               | Query   |     152 | init                             | show slave status                                                                                    |    0.000 |
| 10187153 | nagios          | localhost        | NULL               | Query   |     152 | init                             | show slave status                                                                                    |    0.000 |
| 10187155 | watchdog        | 10.64.0.15:54473 | information_schema | Query   |     149 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10187156 | nagios          | localhost        | NULL               | Query   |     149 | init                             | show slave status                                                                                    |    0.000 |
| 10187157 | nagios          | localhost        | NULL               | Query   |     149 | init                             | show slave status                                                                                    |    0.000 |
| 10187158 | nagios          | localhost        | NULL               | Query   |     149 | init                             | show slave status                                                                                    |    0.000 |
| 10187159 | nagios          | localhost        | NULL               | Query   |     147 | init                             | show slave status                                                                                    |    0.000 |
| 10187160 | nagios          | localhost        | NULL               | Query   |     147 | init                             | show slave status                                                                                    |    0.000 |
| 10187161 | nagios          | localhost        | NULL               | Query   |     147 | init                             | show slave status                                                                                    |    0.000 |
| 10187163 | nagios          | localhost        | NULL               | Query   |     145 | init                             | show slave status                                                                                    |    0.000 |
| 10187164 | nagios          | localhost        | NULL               | Query   |     144 | init                             | show slave status                                                                                    |    0.000 |
| 10187165 | nagios          | localhost        | NULL               | Query   |     144 | init                             | show slave status                                                                                    |    0.000 |
| 10187167 | nagios          | localhost        | NULL               | Query   |     141 | init                             | show slave status                                                                                    |    0.000 |
| 10187168 | nagios          | localhost        | NULL               | Query   |     141 | init                             | show slave status                                                                                    |    0.000 |
| 10187169 | prometheus      | localhost        | NULL               | Query   |     139 | Filling schema table             | SHOW GLOBAL STATUS                                                                                   |    0.000 |
| 10187170 | nagios          | localhost        | NULL               | Query   |     138 | init                             | show slave status                                                                                    |    0.000 |
| 10187171 | nagios          | localhost        | NULL               | Query   |     138 | init                             | show slave status                                                                                    |    0.000 |
| 10187173 | nagios          | localhost        | NULL               | Query   |     136 | init                             | show slave status                                                                                    |    0.000 |
| 10187174 | nagios          | localhost        | NULL               | Query   |     136 | init                             | show slave status                                                                                    |    0.000 |
| 10187175 | nagios          | localhost        | NULL               | Query   |     135 | init                             | show slave status                                                                                    |    0.000 |
| 10187176 | nagios          | localhost        | NULL               | Query   |     135 | init                             | show slave status                                                                                    |    0.000 |
| 10187177 | nagios          | localhost        | NULL               | Query   |     135 | init                             | show slave status                                                                                    |    0.000 |
| 10187178 | nagios          | localhost        | NULL               | Query   |     135 | init                             | show slave status                                                                                    |    0.000 |
| 10187179 | nagios          | localhost        | NULL               | Query   |     134 | init                             | show slave status                                                                                    |    0.000 |
| 10187180 | nagios          | localhost        | NULL               | Query   |     132 | init                             | show slave status                                                                                    |    0.000 |
| 10187182 | nagios          | localhost        | NULL               | Query   |     130 | init                             | show slave status                                                                                    |    0.000 |
| 10187183 | prometheus      | localhost        | NULL               | Query   |     129 | Filling schema table             | SHOW GLOBAL STATUS                                                                                   |    0.000 |
| 10187184 | nagios          | localhost        | NULL               | Query   |     129 | init                             | show slave status                                                                                    |    0.000 |
| 10187185 | nagios          | localhost        | NULL               | Query   |     128 | init                             | show slave status                                                                                    |    0.000 |
| 10187186 | nagios          | localhost        | NULL               | Query   |     128 | init                             | show slave status                                                                                    |    0.000 |
| 10187187 | nagios          | localhost        | NULL               | Query   |     128 | init                             | show slave status                                                                                    |    0.000 |
| 10187188 | nagios          | localhost        | NULL               | Query   |     128 | init                             | show slave status                                                                                    |    0.000 |
| 10187189 | nagios          | localhost        | NULL               | Query   |     127 | init                             | show slave status                                                                                    |    0.000 |
| 10187190 | nagios          | localhost        | NULL               | Query   |     127 | init                             | show slave status                                                                                    |    0.000 |
| 10187192 | watchdog        | 10.64.0.15:55239 | information_schema | Query   |     100 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_VARIABLES`                                     |    0.000 |
| 10187193 | nagios          | localhost        | NULL               | Query   |     125 | init                             | show slave status                                                                                    |    0.000 |
| 10187194 | nagios          | localhost        | NULL               | Query   |     124 | init                             | show slave status                                                                                    |    0.000 |
| 10187195 | nagios          | localhost        | NULL               | Query   |     123 | init                             | show slave status                                                                                    |    0.000 |
| 10187196 | nagios          | localhost        | NULL               | Query   |     123 | init                             | show slave status                                                                                    |    0.000 |
| 10187197 | nagios          | localhost        | NULL               | Query   |     122 | init                             | show slave status                                                                                    |    0.000 |
| 10187198 | nagios          | localhost        | NULL               | Query   |     122 | init                             | show slave status                                                                                    |    0.000 |
| 10187199 | nagios          | localhost        | NULL               | Query   |     122 | init                             | show slave status                                                                                    |    0.000 |
| 10187200 | nagios          | localhost        | NULL               | Query   |     122 | init                             | show slave status                                                                                    |    0.000 |
| 10187202 | nagios          | localhost        | NULL               | Query   |     119 | init                             | show slave status                                                                                    |    0.000 |
| 10187203 | watchdog        | 10.64.0.15:55486 | information_schema | Query   |     119 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10187205 | nagios          | localhost        | NULL               | Query   |     116 | init                             | show slave status                                                                                    |    0.000 |
| 10187206 | nagios          | localhost        | NULL               | Query   |     116 | init                             | show slave status                                                                                    |    0.000 |
| 10187207 | nagios          | localhost        | NULL               | Query   |     115 | init                             | show slave status                                                                                    |    0.000 |
| 10187208 | nagios          | localhost        | NULL               | Query   |     114 | init                             | show slave status                                                                                    |    0.000 |
| 10187209 | nagios          | localhost        | NULL               | Query   |     114 | init                             | show slave status                                                                                    |    0.000 |
| 10187210 | nagios          | localhost        | NULL               | Query   |     114 | init                             | show slave status                                                                                    |    0.000 |
| 10187213 | nagios          | localhost        | NULL               | Query   |     109 | init                             | show slave status                                                                                    |    0.000 |
| 10187214 | nagios          | localhost        | NULL               | Query   |     109 | init                             | show slave status                                                                                    |    0.000 |
| 10187215 | nagios          | localhost        | NULL               | Query   |     109 | init                             | show slave status                                                                                    |    0.000 |
| 10187216 | nagios          | localhost        | NULL               | Query   |     109 | init                             | show slave status                                                                                    |    0.000 |
| 10187217 | nagios          | localhost        | NULL               | Query   |     107 | init                             | show slave status                                                                                    |    0.000 |
| 10187218 | nagios          | localhost        | NULL               | Query   |     107 | init                             | show slave status                                                                                    |    0.000 |
| 10187222 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187223 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187224 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187225 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187226 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187227 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187228 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187229 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187230 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187231 | nagios          | localhost        | NULL               | Query   |      92 | init                             | show slave status                                                                                    |    0.000 |
| 10187233 | watchdog        | 10.64.0.15:56499 | information_schema | Query   |      89 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10187234 | nagios          | localhost        | NULL               | Query   |      88 | init                             | show slave status                                                                                    |    0.000 |
| 10187235 | nagios          | localhost        | NULL               | Query   |      88 | init                             | show slave status                                                                                    |    0.000 |
| 10187236 | nagios          | localhost        | NULL               | Query   |      88 | init                             | show slave status                                                                                    |    0.000 |
| 10187237 | nagios          | localhost        | NULL               | Query   |      87 | init                             | show slave status                                                                                    |    0.000 |
| 10187238 | nagios          | localhost        | NULL               | Query   |      87 | init                             | show slave status                                                                                    |    0.000 |
| 10187239 | nagios          | localhost        | NULL               | Query   |      87 | init                             | show slave status                                                                                    |    0.000 |
| 10187241 | nagios          | localhost        | NULL               | Query   |      85 | init                             | show slave status                                                                                    |    0.000 |
| 10187242 | nagios          | localhost        | NULL               | Query   |      83 | init                             | show slave status                                                                                    |    0.000 |
| 10187243 | nagios          | localhost        | NULL               | Query   |      83 | init                             | show slave status                                                                                    |    0.000 |
| 10187245 | nagios          | localhost        | NULL               | Query   |      81 | init                             | show slave status                                                                                    |    0.000 |
| 10187246 | nagios          | localhost        | NULL               | Query   |      81 | init                             | show slave status                                                                                    |    0.000 |
| 10187247 | prometheus      | localhost        | NULL               | Query   |      78 | Filling schema table             | SHOW GLOBAL STATUS                                                                                   |    0.000 |
| 10187248 | nagios          | localhost        | NULL               | Query   |      77 | init                             | show slave status                                                                                    |    0.000 |
| 10187249 | nagios          | localhost        | NULL               | Query   |      77 | init                             | show slave status                                                                                    |    0.000 |
| 10187251 | nagios          | localhost        | NULL               | Query   |      76 | init                             | show slave status                                                                                    |    0.000 |
| 10187252 | nagios          | localhost        | NULL               | Query   |      76 | init                             | show slave status                                                                                    |    0.000 |
| 10187253 | nagios          | localhost        | NULL               | Query   |      75 | init                             | show slave status                                                                                    |    0.000 |
| 10187254 | nagios          | localhost        | NULL               | Query   |      75 | init                             | show slave status                                                                                    |    0.000 |
| 10187255 | nagios          | localhost        | NULL               | Query   |      75 | init                             | show slave status                                                                                    |    0.000 |
| 10187256 | nagios          | localhost        | NULL               | Query   |      75 | init                             | show slave status                                                                                    |    0.000 |
| 10187257 | nagios          | localhost        | NULL               | Query   |      75 | init                             | show slave status                                                                                    |    0.000 |
| 10187258 | nagios          | localhost        | NULL               | Query   |      72 | init                             | show slave status                                                                                    |    0.000 |
| 10187260 | nagios          | localhost        | NULL               | Query   |      70 | init                             | show slave status                                                                                    |    0.000 |
| 10187262 | nagios          | localhost        | NULL               | Query   |      69 | init                             | show slave status                                                                                    |    0.000 |
| 10187263 | nagios          | localhost        | NULL               | Query   |      68 | init                             | show slave status                                                                                    |    0.000 |
| 10187264 | nagios          | localhost        | NULL               | Query   |      68 | init                             | show slave status                                                                                    |    0.000 |
| 10187265 | nagios          | localhost        | NULL               | Query   |      68 | init                             | show slave status                                                                                    |    0.000 |
| 10187266 | nagios          | localhost        | NULL               | Query   |      68 | init                             | show slave status                                                                                    |    0.000 |
| 10187267 | nagios          | localhost        | NULL               | Query   |      67 | init                             | show slave status                                                                                    |    0.000 |
| 10187268 | nagios          | localhost        | NULL               | Query   |      67 | init                             | show slave status                                                                                    |    0.000 |
| 10187270 | watchdog        | 10.64.0.15:57272 | information_schema | Query   |      40 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_VARIABLES`                                     |    0.000 |
| 10187271 | nagios          | localhost        | NULL               | Query   |      65 | init                             | show slave status                                                                                    |    0.000 |
| 10187272 | nagios          | localhost        | NULL               | Query   |      64 | init                             | show slave status                                                                                    |    0.000 |
| 10187273 | nagios          | localhost        | NULL               | Query   |      64 | init                             | show slave status                                                                                    |    0.000 |
| 10187274 | nagios          | localhost        | NULL               | Query   |      63 | init                             | show slave status                                                                                    |    0.000 |
| 10187276 | nagios          | localhost        | NULL               | Query   |      61 | init                             | show slave status                                                                                    |    0.000 |
| 10187277 | nagios          | localhost        | NULL               | Query   |      61 | init                             | show slave status                                                                                    |    0.000 |
| 10187278 | nagios          | localhost        | NULL               | Query   |      61 | init                             | show slave status                                                                                    |    0.000 |
| 10187279 | nagios          | localhost        | NULL               | Query   |      61 | init                             | show slave status                                                                                    |    0.000 |
| 10187280 | nagios          | localhost        | NULL               | Query   |      59 | init                             | show slave status                                                                                    |    0.000 |
| 10187281 | watchdog        | 10.64.0.15:57518 | information_schema | Query   |      59 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10187282 | root            | localhost        | NULL               | Killed  |      53 | init                             | show all slaves status                                                                               |    0.000 |
| 10187284 | nagios          | localhost        | NULL               | Query   |      56 | init                             | show slave status                                                                                    |    0.000 |
| 10187285 | nagios          | localhost        | NULL               | Query   |      55 | init                             | show slave status                                                                                    |    0.000 |
| 10187286 | nagios          | localhost        | NULL               | Query   |      55 | init                             | show slave status                                                                                    |    0.000 |
| 10187287 | nagios          | localhost        | NULL               | Query   |      53 | init                             | show slave status                                                                                    |    0.000 |
| 10187288 | nagios          | localhost        | NULL               | Query   |      53 | init                             | show slave status                                                                                    |    0.000 |
| 10187289 | nagios          | localhost        | NULL               | Query   |      53 | init                             | show slave status                                                                                    |    0.000 |
| 10187291 | nagios          | localhost        | NULL               | Query   |      50 | init                             | show slave status                                                                                    |    0.000 |
| 10187292 | nagios          | localhost        | NULL               | Query   |      50 | init                             | show slave status                                                                                    |    0.000 |
| 10187293 | nagios          | localhost        | NULL               | Query   |      50 | init                             | show slave status                                                                                    |    0.000 |
| 10187294 | nagios          | localhost        | NULL               | Query   |      50 | init                             | show slave status                                                                                    |    0.000 |
| 10187295 | nagios          | localhost        | NULL               | Query   |      48 | init                             | show slave status                                                                                    |    0.000 |
| 10187296 | nagios          | localhost        | NULL               | Query   |      47 | init                             | show slave status                                                                                    |    0.000 |
| 10187302 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187303 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187304 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187305 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187306 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187307 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187308 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187309 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187310 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187311 | nagios          | localhost        | NULL               | Query   |      32 | init                             | show slave status                                                                                    |    0.000 |
| 10187313 | watchdog        | 10.64.0.15:58527 | information_schema | Query   |      29 | Filling schema table             | SELECT `VARIABLE_NAME`, `VARIABLE_VALUE` FROM `GLOBAL_STATUS`                                        |    0.000 |
| 10187314 | nagios          | localhost        | NULL               | Query   |      28 | init                             | show slave status                                                                                    |    0.000 |
| 10187315 | nagios          | localhost        | NULL               | Query   |      28 | init                             | show slave status                                                                                    |    0.000 |
| 10187316 | nagios          | localhost        | NULL               | Query   |      28 | init                             | show slave status                                                                                    |    0.000 |
| 10187318 | nagios          | localhost        | NULL               | Query   |      26 | init                             | show slave status                                                                                    |    0.000 |
| 10187319 | nagios          | localhost        | NULL               | Query   |      26 | init                             | show slave status                                                                                    |    0.000 |
| 10187320 | nagios          | localhost        | NULL               | Query   |      26 | init                             | show slave status                                                                                    |    0.000 |
| 10187321 | nagios          | localhost        | NULL               | Query   |      25 | init                             | show slave status                                                                                    |    0.000 |
| 10187322 | nagios          | localhost        | NULL               | Query   |      24 | init                             | show slave status                                                                                    |    0.000 |
| 10187323 | nagios          | localhost        | NULL               | Query   |      24 | init                             | show slave status                                                                                    |    0.000 |
| 10187325 | nagios          | localhost        | NULL               | Query   |      21 | init                             | show slave status                                                                                    |    0.000 |
| 10187326 | nagios          | localhost        | NULL               | Query   |      21 | init                             | show slave status                                                                                    |    0.000 |
| 10187328 | nagios          | localhost        | NULL               | Query   |      17 | init                             | show slave status                                                                                    |    0.000 |
| 10187329 | nagios          | localhost        | NULL               | Query   |      17 | init                             | show slave status                                                                                    |    0.000 |
| 10187331 | nagios          | localhost        | NULL               | Query   |      16 | init                             | show slave status                                                                                    |    0.000 |
| 10187332 | nagios          | localhost        | NULL               | Query   |      16 | init                             | show slave status                                                                                    |    0.000 |
| 10187333 | nagios          | localhost        | NULL               | Query   |      15 | init                             | show slave status                                                                                    |    0.000 |
| 10187334 | nagios          | localhost        | NULL               | Query   |      15 | init                             | show slave status                                                                                    |    0.000 |
| 10187335 | nagios          | localhost        | NULL               | Query   |      15 | init                             | show slave status                                                                                    |    0.000 |
| 10187336 | nagios          | localhost        | NULL               | Query   |      15 | init                             | show slave status                                                                                    |    0.000 |
| 10187337 | nagios          | localhost        | NULL               | Query   |      15 | init                             | show slave status                                                                                    |    0.000 |
| 10187339 | nagios          | localhost        | NULL               | Query   |      11 | init                             | show slave status                                                                                    |    0.000 |
| 10187341 | nagios          | localhost        | NULL               | Query   |       9 | init                             | show slave status                                                                                    |    0.000 |
| 10187342 | nagios          | localhost        | NULL               | Query   |       8 | init                             | show slave status                                                                                    |    0.000 |
| 10187344 | nagios          | localhost        | NULL               | Query   |       8 | init                             | show slave status                                                                                    |    0.000 |
| 10187345 | nagios          | localhost        | NULL               | Query   |       8 | init                             | show slave status                                                                                    |    0.000 |
| 10187346 | nagios          | localhost        | NULL               | Query   |       8 | init                             | show slave status                                                                                    |    0.000 |
| 10187347 | nagios          | localhost        | NULL               | Query   |       8 | init                             | show slave status                                                                                    |    0.000 |
| 10187348 | root            | localhost        | NULL               | Sleep   |       3 |                                  | NULL                                                                                                 |    0.000 |
| 10187349 | nagios          | localhost        | NULL               | Query   |       7 | init                             | show slave status                                                                                    |    0.000 |
| 10187350 | nagios          | localhost        | NULL               | Query   |       7 | init                             | show slave status                                                                                    |    0.000 |
| 10187352 | watchdog        | 10.64.0.15:59297 | information_schema | Sleep   |       6 |                                  | NULL                                                                                                 |    0.000 |
| 10187353 | nagios          | localhost        | NULL               | Query   |       5 | init                             | show slave status                                                                                    |    0.000 |
| 10187354 | root            | localhost        | NULL               | Query   |       0 | init                             | show processlist                                                                                     |    0.000 |
| 10187355 | nagios          | localhost        | NULL               | Query   |       4 | init                             | show slave status                                                                                    |    0.000 |
+----------+-----------------+------------------+--------------------+---------+---------+----------------------------------+------------------------------------------------------------------------------------------------------+----------+
262 rows in set (0.00 sec)
root@DBSTORE[(none)]> kill 10187049;
Query OK, 0 rows affected (0.00 sec)

root@DBSTORE[(none)]> show processlist;
+----------+-----------------+------------------+--------------------+---------+---------+----------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id       | User            | Host             | db                 | Command | Time    | State                            | Info                                                                                                 | Progress |
+----------+-----------------+------------------+--------------------+---------+---------+----------------------------------+------------------------------------------------------------------------------------------------------+----------+
|  7927022 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927024 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927026 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927028 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927030 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927032 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927034 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927036 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927038 | system user     |                  | NULL               | Connect | 1354961 | Waiting for master to send event | NULL                                                                                                 |    0.000 |
|  7927050 | event_scheduler | localhost        | NULL               | Daemon  |       0 | Waiting for next activation      | NULL                                                                                                 |    0.000 |
| 10107012 | watchdog        | 10.64.0.15:48258 | mysql              | Sleep   |       5 |                                  | NULL                                                                                                 |    0.000 |
| 10142049 | watchdog        | 10.64.0.15:34909 | information_schema | Sleep   |      85 |                                  | NULL                                                                                                 |    0.000 |
| 10187348 | root            | localhost        | NULL               | Sleep   |       2 |                                  | NULL                                                                                                 |    0.000 |
| 10187352 | watchdog        | 10.64.0.15:59297 | information_schema | Sleep   |      75 |                                  | NULL                                                                                                 |    0.000 |
| 10187354 | root            | localhost        | NULL               | Query   |       0 | init                             | show processlist                                                                                     |    0.000 |
| 10187443 | system user     |                  | NULL               | Connect |   86703 | closing tables                   | NULL                                                                                                 |    0.000 |
| 10187444 | system user     |                  | cswiki             | Connect |   86899 | update                           | INSERT /* LinksUpdate::incrTableUpdate  */ IGNORE INTO `templatelinks` (tl_from,tl_from_namespace,tl |    0.000 |
| 10187445 | system user     |                  | kawiki             | Connect |   86676 | update                           | INSERT /* Revision::insertOn  */  INTO `revision` (rev_id,rev_page,rev_text_id,rev_comment,rev_minor |    0.000 |
| 10187446 | system user     |                  | NULL               | Connect |   86861 | closing tables                   | NULL                                                                                                 |    0.000 |
| 10187447 | system user     |                  | dewiki             | Connect |   86810 | updating                         | UPDATE /* HTMLCacheUpdateJob::invalidateTitles  */  `page` SET page_touched = 'sanitized' WHERE |    0.000 |
| 10187448 | system user     |                  | ruwiki             | Connect |   86697 | updating                         | UPDATE /* MediaWiki\Auth\TemporaryPasswordPrimaryAuthenticationProvider::providerChangeAuthenticatio |    0.000 |
| 10187449 | system user     |                  | NULL               | Connect |   86787 | closing tables                   | NULL                                                                                                 |    0.000 |
+----------+-----------------+------------------+--------------------+---------+---------+----------------------------------+------------------------------------------------------------------------------------------------------+----------+
22 rows in set (0.00 sec)

I have skipped that query, which is simply related to the pt-table-checksum run from yesterday, and restarted replication. I executed the checksum a few times, so we will see whether it leaves the server in the same state on the next iterations. I will closely monitor it today.

I wonder why it has not happened on dbstore1002 or dbstore2001 though.

Given this cron entry on dbstore1001: 0 1 * * 3 /usr/local/bin/dumps-misc.sh >/srv/dumps-misc.log 2>&1, it might have collided with a backup being taken at the same time, as the replace had been running for 77588 seconds, which is about 21.5 hours.
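As a quick check of that arithmetic (a minimal sketch, the function name is only for illustration), 77588 seconds works out to a bit over 21 and a half hours:

```python
def seconds_to_hours_minutes(seconds):
    """Split a duration in seconds into whole hours and remaining whole minutes."""
    hours, remainder = divmod(seconds, 3600)
    return hours, remainder // 60


# Running time of the replace, as quoted in the comment above
hours, minutes = seconds_to_hours_minutes(77588)
print(f"{hours}h {minutes}m")
```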

The dbstore2001 issue never happened again. It only happened once, and I tested pt-table-checksum several times during the day, so if it were something systematic it should have happened at least 7-8 times yesterday. So that is good news!

Change 338326 had a related patch set uploaded (by Marostegui):
generate_dsns_table.sh: Add new script for dsns.

https://gerrit.wikimedia.org/r/338326

I have populated all the dsns tables, excluding db1069, db1095 and the dbstore servers. Basically whatever is in the .hosts files except those hosts.

# for i in `mysql --skip-ssl dsns -e "show tables;" -BN`; do echo $i; mysql --skip-ssl dsns -e "select count(*) from $i;";done
dsns_m3
+----------+
| count(*) |
+----------+
|        2 |
+----------+
dsns_s1
+----------+
| count(*) |
+----------+
|       19 |
+----------+
dsns_s2
+----------+
| count(*) |
+----------+
|       18 |
+----------+
dsns_s3
+----------+
| count(*) |
+----------+
|       11 |
+----------+
dsns_s4
+----------+
| count(*) |
+----------+
|       14 |
+----------+
dsns_s5
+----------+
| count(*) |
+----------+
|       13 |
+----------+
dsns_s6
+----------+
| count(*) |
+----------+
|       14 |
+----------+
dsns_s7
+----------+
| count(*) |
+----------+
|       14 |
+----------+

I have done so with a small shell script called generate_dsns_table.sh, which will be pushed to the dbtools directory in the operations/software repo: https://gerrit.wikimedia.org/r/#/c/338326/
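For illustration only, here is a rough Python sketch of the kind of thing such a script might do. It is not the actual generate_dsns_table.sh. It assumes the DSN table follows the layout described in the pt-table-checksum documentation for --recursion-method=dsn (an auto-increment id plus a dsn column). The function name and the sample host list are hypothetical.

```python
# Hosts excluded in the comment above; dbstore servers are excluded by prefix.
EXCLUDED_HOSTS = {"db1069", "db1095"}


def build_dsn_inserts(table, hosts):
    """Return one INSERT statement per host that should be listed in the DSN table."""
    statements = []
    for host in hosts:
        host = host.strip()
        if not host:
            continue
        short_name = host.split(".")[0]
        if short_name in EXCLUDED_HOSTS or short_name.startswith("dbstore"):
            continue
        statements.append(f"INSERT INTO {table} (dsn) VALUES ('h={host}');")
    return statements


# Hypothetical sample input, standing in for the contents of a .hosts file
sample_hosts = [
    "db1021.eqiad.wmnet",
    "db1069.eqiad.wmnet",
    "dbstore1001.eqiad.wmnet",
    "db2017.codfw.wmnet",
]
for statement in build_dsn_inserts("dsns_s2", sample_hosts):
    print(statement)
```

Only the two regular replicas survive the filter in this sample; the sanitarium and dbstore hosts are dropped.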

Change 338326 merged by Jcrespo:
generate_dsns_table.sh: Add new script for dsns.

https://gerrit.wikimedia.org/r/338326

Not needed now, but x1 will probably be needed. Also, we should start thinking about merging the hosts lists into a dynamic table and keeping it up to date, as when we decommission older hosts, old masters will become slaves (and we will forget to change all these tables :-/). This belongs more to the inventory + automatic checking tasks; not a concern right now.

Marostegui added a comment.EditedFeb 17 2017, 2:42 PM

I am thinking about running this first iteration over s2 on one specific database at some point next week: nlwiki.
Why nlwiki? Because it is around 100G and its biggest table is revision at 23G, which makes for a good test.

Of its tables, the following would need to be ignored (they do not have a PK, and the list is the same on all the s2 hosts):

categorylinks
change_tag
click_tracking
click_tracking_user_properties
cur
edit_page_tracking
ep_users_per_course
hidden
imagelinks
interwiki
iwlinks
l10n_cache
langlinks
linkscc
log_search
logging_pre_1_10
math
module_deps
msg_resource
msg_resource_links
objectcache
oldimage
pagelinks
prefstats
prefswitch_survey
querycache
querycache_info
querycachetwo
securepoll_lists
securepoll_msgs
securepoll_properties
site_identifiers
site_stats
tag_summary
templatelinks
text
transcache
user_former_groups
user_newtalk
user_properties
watchlist
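One possible way to build such a list and turn it into a --tables argument is sketched below. The information_schema query is an assumption about how tables without a primary key could be found; it is not necessarily how the list above was produced, and it is not executed here. The helper name and the sample table names are illustrative.

```python
# Assumed query to list tables without a PRIMARY KEY (not run in this sketch).
NO_PK_QUERY = """
SELECT t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'nlwiki'
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL
"""


def tables_argument(all_tables, tables_without_pk):
    """Return a comma-separated, sorted list of tables that do have a PK."""
    skip = set(tables_without_pk)
    return ",".join(sorted(t for t in all_tables if t not in skip))


# Small illustrative subset of nlwiki tables
all_tables = ["revision", "text", "logging", "watchlist", "archive"]
no_pk = ["text", "watchlist"]
print("--tables " + tables_argument(all_tables, no_pk))
```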

Excluding that list, we'd still have revision (23G) and logging (17G) as good tests.
If we generated delay, the following wikis would be affected:

bgwiki
cswiki
enwikiquote
eowiki
fiwiki
idwiki
itwiki
l10nwiki
nlwiki
nowiki
plwiki
ptwiki
svwiki
thwiki
trwiki
zhwiki

I will probably run another test on m3 with a chunk-size of 100 instead of 200 and see if we get even less delay than the one we got (reminder: peak of 12-13 seconds for around 20-30 seconds)

We have semisync on core, I would expect much less impact, lag-wise.

Marostegui added a comment.EditedFeb 20 2017, 1:48 PM

The idea would be to run pt-table-checksum with the following tables in the same batch for nlwiki:

-rw-rw---- 1 mysql mysql 112K Feb 20  2014 ep_instructors.ibd
-rw-rw---- 1 mysql mysql 112K Feb 11 17:38 babel.ibd
-rw-rw---- 1 mysql mysql 128K Mar 16  2014 ep_oas.ibd
-rw-rw---- 1 mysql mysql 128K Mar 16  2014 ep_cas.ibd
-rw-rw---- 1 mysql mysql 144K Oct 21 14:27 ores_model.ibd
-rw-rw---- 1 mysql mysql 176K Feb 18 18:32 image.ibd
-rw-rw---- 1 mysql mysql 176K Mar 31  2015 ep_articles.ibd
-rw-rw---- 1 mysql mysql 208K May 24  2015 ep_students.ibd
-rw-rw---- 1 mysql mysql 240K Mar 27  2015 ep_orgs.ibd
-rw-rw---- 1 mysql mysql 272K Mar 27  2015 ep_revisions.ibd
-rw-rw---- 1 mysql mysql 320K Mar 27  2015 ep_courses.ibd
-rw-rw---- 1 mysql mysql 784K Dec  1 03:26 sites.ibd
-rw-rw---- 1 mysql mysql 1.0M Nov 13 11:42 valid_tag.ibd
-rw-rw---- 1 mysql mysql 1.0M Nov 28 13:58 uploadstash.ibd
-rw-rw---- 1 mysql mysql 1.0M Jan  6 06:10 updatelog.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 transcode.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 trackbacks.ibd
-rw-rw---- 1 mysql mysql 1.0M Feb 19 10:35 protected_titles.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 pr_index.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 povwatch_subscribers.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 povwatch_log.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 job.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 global_block_whitelist.ibd
-rw-rw---- 1 mysql mysql 1.0M Sep 25  2013 filejournal.ibd
-rw-rw---- 1 mysql mysql 1.0M Feb 20 13:14 betafeatures_user_counts.ibd
-rw-rw---- 1 mysql mysql 1.0M Feb 20 12:22 abuse_filter.ibd
-rw-rw---- 1 mysql mysql 1.0M Feb 19 16:07 abuse_filter_action.ibd
-rw-rw---- 1 mysql mysql 8.0M Feb 17 19:47 user_groups.ibd
-rw-rw---- 1 mysql mysql 9.0M Sep 24  2013 pif_edits.ibd
-rw-rw---- 1 mysql mysql 9.0M Nov 27 02:58 moodbar_feedback_response.ibd
-rw-rw---- 1 mysql mysql 9.0M Feb 19 16:07 abuse_filter_history.ibd
-rw-rw---- 1 mysql mysql  10M Feb 19 15:55 page_restrictions.ibd
-rw-rw---- 1 mysql mysql  11M May 24  2015 ep_events.ibd
-rw-rw---- 1 mysql mysql  12M Dec 13 03:20 moodbar_feedback.ibd
-rw-rw---- 1 mysql mysql  12M Feb 19 19:56 cu_log.ibd
-rw-rw---- 1 mysql mysql  16M Sep 24  2013 bv2009_edits.ibd
-rw-rw---- 1 mysql mysql  20M Sep 24  2013 bv2011_edits.ibd
-rw-rw---- 1 mysql mysql  24M Sep 24  2013 blob_orphans.ibd
-rw-rw---- 1 mysql mysql  29M May  4  2015 bv2015_edits.ibd
-rw-rw---- 1 mysql mysql  30M Feb 20 13:44 category.ibd
-rw-rw---- 1 mysql mysql  31M Feb 20 13:15 ipblocks.ibd
-rw-rw---- 1 mysql mysql  36M Feb 20 13:43 geo_tags.ibd
-rw-rw---- 1 mysql mysql  76M Feb 20 13:42 spoofuser.ibd
-rw-rw---- 1 mysql mysql  80M Feb 20 13:09 ores_classification.ibd
-rw-rw---- 1 mysql mysql  88M Feb 18 18:32 filearchive.ibd
-rw-rw---- 1 mysql mysql 108M Feb 20 13:29 redirect.ibd
-rw-rw---- 1 mysql mysql 304M Feb 20 13:44 user.ibd
-rw-rw---- 1 mysql mysql 404M Feb 20 13:40 abuse_filter_log.ibd
-rw-rw---- 1 mysql mysql 496M Oct 31  2014 titlekey.ibd
-rw-rw---- 1 mysql mysql 604M Mar 15  2016 updates.ibd
-rw-rw---- 1 mysql mysql 672M Feb 20 13:34 page_props.ibd
-rw-rw---- 1 mysql mysql 800M Feb 20 13:16 wbc_entity_usage.ibd
-rw-rw---- 1 mysql mysql 1.1G Feb 20 13:45 page.ibd
-rw-rw---- 1 mysql mysql 1.1G Sep 24  2013 blob_tracking.ibd
-rw-rw---- 1 mysql mysql 1.3G Feb 20 13:39 archive.ibd
-rw-rw---- 1 mysql mysql 1.7G Feb 20 13:45 recentchanges.ibd
-rw-rw---- 1 mysql mysql 1.7G Feb 20 13:45 cu_changes.ibd
-rw-rw---- 1 mysql mysql 5.0G Feb 20 13:44 externallinks.ibd

And then run an individual "per table" check for the two biggest tables in separate runs:

-rw-rw---- 1 mysql mysql  17G Feb 20 13:44 logging.ibd
-rw-rw---- 1 mysql mysql  23G Feb 20 13:45 revision.ibd

The idea of doing it in different batches is to try to minimize possible lag and overload, at least in these initial tests, until we are sure how to tweak the runs and the table sizes.
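The batching idea above can be sketched roughly as follows: parse the `ls -lh` lines, keep small tables in a shared batch, and give each large table its own run. The 10G threshold and the function names are assumptions made for this example, not values used in the actual runs.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert an `ls -lh` size such as '112K', '1.0M' or '23G' to bytes."""
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def split_batches(ls_lines, threshold_bytes):
    """Return (shared_batch, individual_runs) as lists of table names."""
    shared, individual = [], []
    for line in ls_lines:
        fields = line.split()
        size, filename = parse_size(fields[4]), fields[-1]
        table = filename[: -len(".ibd")] if filename.endswith(".ibd") else filename
        (individual if size >= threshold_bytes else shared).append(table)
    return shared, individual


listing = [
    "-rw-rw---- 1 mysql mysql 5.0G Feb 20 13:44 externallinks.ibd",
    "-rw-rw---- 1 mysql mysql  17G Feb 20 13:44 logging.ibd",
    "-rw-rw---- 1 mysql mysql  23G Feb 20 13:45 revision.ibd",
]
# Hypothetical cut-off: anything of 10G or more gets its own run
shared, individual = split_batches(listing, 10 * UNITS["G"])
print("same batch:", shared)
print("individual runs:", individual)
```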

Mentioned in SAL (#wikimedia-operations) [2017-02-22T09:14:36Z] <marostegui> Run pt-table-checksum on s2.nlwiki over some tables - T154485

I am running the following command now for s2 for the first iteration of testing with real data and real tables (I tested it first with just one table abuse_filter_action and it went fine):

./pt-table-checksum --no-check-binlog-format --replicate=nlwiki.__wmf_checksums --check-slave-lag db2017.codfw.wmnet,db2035.codfw.wmnet,db2041.codfw.wmnet,db2049.codfw.wmnet,db1024.eqiad.wmnet,db1021.eqiad.wmnet,db1036.eqiad.wmnet,db1054.eqiad.wmnet,db1060.eqiad.wmnet --max-lag 1 --chunk-size 100  -F /home/marostegui/.my.cnf  --databases nlwiki --tables abuse_filter_action,abuse_filter_history,abuse_filter,abuse_filter_log,archive,babel,betafeatures_user_counts,blob_orphans,blob_tracking,bv2009_edits,bv2011_edits,bv2015_edits,category,cu_changes,cu_log,ep_articles,ep_cas,ep_courses,ep_events,ep_instructors,ep_oas,ep_orgs,ep_students,externallinks,filearchive,filejournal,geo_tags,global_block_whitelist,image,ipblocks,job,moodbar_feedback,moodbar_feedback_response,ores_classification,ores_model,page,page_props,page_restrictions,pif_edits,povwatch_log,povwatch_subscribers,pr_index,protected_titles,recentchanges,redirect,sites,spoofuser,titlekey,trackbacks,transcode,updatelog,updates,uploadstash,user_groups,user,valid_tag,wbc_entity_usage h=db1018.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_s2

revision and logging tables are excluded, and will be done on individual batches for now.
The biggest table of this batch is externallinks, at 5GB.

The first iteration of the above run finished:

  • No lag generated
  • 1 error with db1047
02-22T10:20:40 Skipping table nlwiki.protected_titles because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
  236 rows on db1047
  • Differences found on db1036 and db1047 only:
root@neodymium:/home/marostegui/percona-toolkit-2.2.20/bin# ./pt-table-checksum --no-check-binlog-format --replicate-check-only --replicate=nlwiki.__wmf_checksums --check-slave-lag db2017.codfw.wmnet,db2035.codfw.wmnet,db2041.codfw.wmnet,db2049.codfw.wmnet,db1024.eqiad.wmnet,db1021.eqiad.wmnet,db1036.eqiad.wmnet,db1054.eqiad.wmnet,db1060.eqiad.wmnet --max-lag 1 --chunk-size 100  -F /home/marostegui/.my.cnf  --databases nlwiki --tables abuse_filter_action,abuse_filter_history,abuse_filter,abuse_filter_log,archive,babel,betafeatures_user_counts,blob_orphans,blob_tracking,bv2009_edits,bv2011_edits,bv2015_edits,category,cu_changes,cu_log,ep_articles,ep_cas,ep_courses,ep_events,ep_instructors,ep_oas,ep_orgs,ep_students,externallinks,filearchive,filejournal,geo_tags,global_block_whitelist,image,ipblocks,job,moodbar_feedback,moodbar_feedback_response,ores_classification,ores_model,page,page_props,page_restrictions,pif_edits,povwatch_log,povwatch_subscribers,pr_index,protected_titles,recentchanges,redirect,sites,spoofuser,titlekey,trackbacks,transcode,updatelog,updates,uploadstash,user_groups,user,valid_tag,wbc_entity_usage h=db1018.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_s2
Differences on db1036
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nlwiki.archive 24161 0 1 PRIMARY 4033673 4033804
nlwiki.archive 24388 0 1 PRIMARY 4070658 4070797
nlwiki.archive 24787 0 1 PRIMARY 4135363 4135480
nlwiki.archive 24788 0 1 PRIMARY 4135481 4135580
nlwiki.archive 26405 0 1 PRIMARY 4375953 4376257
nlwiki.archive 27464 0 1 PRIMARY 4529887 4530040
nlwiki.archive 27654 0 1 PRIMARY 4557604 4557744
nlwiki.archive 28007 0 1 PRIMARY 4606233 4606337

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nlwiki.archive 11253 0 1 PRIMARY 1881326 1881425
nlwiki.archive 23777 0 1 PRIMARY 3974940 3975051
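To follow up on chunks like these, the report can be parsed into records so the affected primary key ranges can be compared by hand between master and replica. This is only a sketch: the record and function names are made up for the example, and the boundaries are kept as strings because they are not always numeric.

```python
from collections import namedtuple

ChunkDiff = namedtuple(
    "ChunkDiff", "table chunk cnt_diff crc_diff chunk_index lower upper"
)


def parse_diff_lines(lines):
    """Parse the data rows of a --replicate-check-only report, skipping headers."""
    diffs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 7 or fields[0] == "TABLE":
            continue
        table, chunk, cnt_diff, crc_diff, index, lower, upper = fields
        diffs.append(
            ChunkDiff(table, int(chunk), int(cnt_diff), int(crc_diff), index, lower, upper)
        )
    return diffs


report = """Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nlwiki.archive 11253 0 1 PRIMARY 1881326 1881425
nlwiki.archive 23777 0 1 PRIMARY 3974940 3975051
""".splitlines()

diffs = parse_diff_lines(report)
for diff in diffs:
    print(f"{diff.table} chunk {diff.chunk}: {diff.chunk_index} {diff.lower}..{diff.upper}")
```

In every row reported here CNT_DIFF is 0 and CRC_DIFF is 1, that is, the row counts match but the content of the chunk differs.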

Significant table times (in seconds) and size:

375.203 nlwiki.archive - 1.3G
821.313 nlwiki.blob_tracking - 1.1G
765.786 nlwiki.externallinks - 5G
499.988 nlwiki.page_props - 672M
316.525 nlwiki.recentchanges - 1.7G
358.347 nlwiki.titlekey - 496M
486.890 nlwiki.updates - 604M
388.817 nlwiki.wbc_entity_usage - 800M

Given that no lag has been generated at all, chunk-size=100 is probably the way to go. We should now try the logging and revision tables, in separate batches too.

For the record:
Replication got broken on db1069 (sanitarium) with:

Last_Error: Error 'Table 'nlwiki.pr_index' doesn't exist' on query. Default database: 'nlwiki'. Query: 'REPLACE INTO `nlwiki`.`__wmf_checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT 'nlwiki', 'pr_index', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `pr_page_id`, `pr_count`, `pr_q0`, `pr_q1`, `pr_q2`, `pr_q3`, `pr_q4`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `nlwiki`.`pr_index` /*checksum table*/'

That happened for all the tables that do exist on production but not in sanitarium/labs

Maybe we should ignore all edits on '%.__wmf_checksums' ? After all, if writes happen on the master, this method will not catch differences due to labs's master using row format...

Yes, it doesn't make much sense to have it on labs/sanitarium anyways as data itself will mostly be different due to the sanitization

Change 339182 had a related patch set uploaded (by Marostegui):
realm.pp: Add __wmf_checksum table to be ignored

https://gerrit.wikimedia.org/r/339182

Change 339182 merged by Marostegui:
realm.pp: Add __wmf_checksums table to be ignored

https://gerrit.wikimedia.org/r/339182

Mentioned in SAL (#wikimedia-operations) [2017-02-22T15:11:57Z] <marostegui> Restart MySQL on db1069 to apply new replication filters - https://phabricator.wikimedia.org/T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-22T15:21:46Z] <marostegui> Restart MySQL on db1095 to apply new replication filters - https://phabricator.wikimedia.org/T154485

I have restarted db1069 and db1095 and they have the new filter applied:

%.__wmf_checksums
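As a rough illustration of what that pattern covers: wild replication filters use LIKE-style wildcards, where % matches any run of characters and _ matches any single character. The sketch below emulates this with a regular expression. It is a simplification (it ignores backslash escapes and case-sensitivity settings), and the function names are invented for the example.

```python
import re


def wild_pattern_to_regex(pattern):
    """Translate a LIKE-style pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def is_ignored(pattern, qualified_table):
    """Return True if db.table would match the wild ignore pattern."""
    return wild_pattern_to_regex(pattern).fullmatch(qualified_table) is not None


for name in ("nlwiki.__wmf_checksums", "dewiki.__wmf_checksums", "nlwiki.revision"):
    print(name, is_ignored("%.__wmf_checksums", name))
```

Note that the underscores in the pattern are wildcards too, so it would also match any other table name of the same shape; in practice that should not matter here.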

Mentioned in SAL (#wikimedia-operations) [2017-02-23T07:59:14Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on logging table - T154485

I have started to run pt-table-checksum only for the logging table (17GB) and the expected time reported by pt-table-checksum is around 40 minutes. I am closely monitoring lag on the shard.
After that if all goes fine I will go for revision (23G)

It finished successfully in 44 minutes:

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:45:43      0      0 26747597  267478       0 2668.835 nlwiki.logging

I am going to go for revision now.

Mentioned in SAL (#wikimedia-operations) [2017-02-23T08:51:04Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on revision table - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-23T08:54:38Z] <marostegui> Stop pt-table-checksum on nlwiki.revision - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-23T09:09:27Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-23T09:23:36Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)

A few seconds after starting pt-table-checksum on the revision table I had to kill it, as both recentchanges slaves in eqiad and codfw (with the revision table partitioned, of course) started to lag behind, to the point where I have had to depool db1036 as it is lagging behind (1000 seconds now).
The logging table is also partitioned and there was no lag on either of those two hosts, so maybe for the revision table we have to go for a smaller chunk size (it is 100 now). The ETA for revision with that chunk size was 1:22h.
The Ctrl+C showed:

Checksumming nlwiki.revision:   0% 01:22:15 remain
^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:54:18      0      0   504100    5041       0  56.792 nlwiki.revision

Which indeed shows chunks of 100 rows.
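As a quick sanity check of the numbers above, here is a minimal sketch that parses the two result lines pasted in this task and derives rows per chunk, runtime in minutes and rows per second. The `ChecksumSummary` and `parse_summary` names are only illustrative and are not part of pt-table-checksum.

```python
from dataclasses import dataclass


@dataclass
class ChecksumSummary:
    ts: str
    errors: int
    diffs: int
    rows: int
    chunks: int
    skipped: int
    seconds: float
    table: str


def parse_summary(line: str) -> ChecksumSummary:
    """Parse one result line: TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE."""
    ts, errors, diffs, rows, chunks, skipped, seconds, table = line.split()
    return ChecksumSummary(ts, int(errors), int(diffs), int(rows),
                           int(chunks), int(skipped), float(seconds), table)


# Result lines copied from the two runs reported in this task.
logging_run = parse_summary(
    "02-23T08:45:43      0      0 26747597  267478       0 2668.835 nlwiki.logging")
revision_run = parse_summary(
    "02-23T08:54:18      0      0   504100    5041       0  56.792 nlwiki.revision")

for run in (logging_run, revision_run):
    print(f"{run.table}: {run.rows / run.chunks:.1f} rows/chunk, "
          f"{run.seconds / 60:.1f} min, {run.rows / run.seconds:.0f} rows/s")
```

Both runs come out at about 100 rows per chunk, which matches the chunk size in use.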

Once the servers have caught up we can try to decrease the chunk size to 10, keep db1036 depooled (db1063 is serving recentchanges now) and see what happens.

db1036 and db2035 were still lagging behind more than 30k:

˜/jynus 19:26> it is now doing a full table scan for every chunk

So we decided to add a replication filter on those two hosts to ignore that table and let the servers catch up:

root@PRODUCTION s2[(none)]> stop slave;
Query OK, 0 rows affected (26.81 sec)

root@PRODUCTION s2[(none)]> set global Replicate_Wild_Ignore_Table = "nlwiki.__wmf_checksums";
Query OK, 0 rows affected (0.00 sec)

root@PRODUCTION s2[(none)]> start slave;
Query OK, 0 rows affected (0.72 sec)

I am tempted to leave those replication filters there and try to run pt-table-checksum again on Monday, to see whether it is at least a viable option for getting the whole schema tested without any lag on the revision table (obviously those two slaves wouldn't get checked for consistency).

I wouldn't be against it, but I would ask to start puppetizing that- Something like a replication filter, and add those through hiera with a new production.my option. If the servers restart, we will create issues (I think replication options are lost).

I am still not completely sure I want this for good, as it would mean making those servers (or all the recentchanges ones) even more special. We can start thinking about it though, so far let's see if the test goes fine on Monday morning and we can discuss how to proceed further.
Ideally we would be able to do
if role == recentchanges then replication filter for __wmf_checksums
Ideal world! :-)

Change 339609 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1036

https://gerrit.wikimedia.org/r/339609

Change 339609 merged by jenkins-bot:
db-eqiad.php: Repool db1036

https://gerrit.wikimedia.org/r/339609

Mentioned in SAL (#wikimedia-operations) [2017-02-24T09:48:14Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1036 - T154485 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-27T07:59:41Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on revision table - T154485

I am running pt-table-checksum again on s2 (keeping the replication filters for the rc slaves - db1036 and db2035.) I have added --no-check-replication-filters in order to bypass the rc slaves.
ETA for revision table to be checked 1:20h

The table has been checksummed already and there was no lag generated, it took around 1:30h in the end. Now it is checking for the differences on the slaves, and that is going to take a huge amount of time which is probably just too much for a 23G table:

Waiting to check replicas for differences:   0% 2+02:53:47 remain

It was increasing all the time.
I have pressed Ctrl+C and will probably run it with PTDEBUG=1 to see what it is actually doing and whether it is getting stuck somewhere, because it seems too much, and if the estimate is accurate there is no way we can run this on enwiki.

@Marostegui Are you checking the dbs with filters? It may be stuck trying to read the rows from those hosts, which will never arrive: http://kedar.nitty-witty.com/blog/pt-table-checksum-waiting-to-check-replicas-for-differences-0-0000-remain

Yes, I was just checking the dsns table on tendril because I thought I had removed those two slaves when we added the filters on Thursday, but apparently I hadn't, so it was definitely checking them.

You can keep them on the replication list, so they cannot lag, but run it with --no-replicate-check, we can do the checks later, independently.

Once those two slaves were removed from the dsns_s2 table, the check completed successfully:

02-27T11:32:17      0      0 45398597  453988       0 5878.306 nlwiki.revision

Checking for differences reveals that all slaves have the same data in the revision table (obviously the rc slaves are excluded for this check, but given that they didn't have any differences for the rest of the tables, it is likely that for this table they are probably consistent as well).

I've got a list of tables per wiki from s2 that do not have a PK (unfortunately they are not the same across all the wikis, although they are pretty similar). I have built one pt-table-checksum command per database, each with its own list of tables to ignore. So I would like to run them (one by one) for the whole s2 shard and see what we get.

I will start every day in the morning to avoid running into the evening. This run will no longer be split by table, and will include the revision and logging tables in the same batch. We haven't seen any delays, so it should be okay to run it, again, one database at a time.
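The per-database commands described above could be generated along these lines. This is only a sketch: the table names in the usage example are placeholders (the real lists of PK-less tables are not in this task), the `--replicate` value is inferred from the `nlwiki.__wmf_checksums` table queried elsewhere in this task, and connection options such as host and credentials are left out.

```python
import shlex


def build_checksum_command(database, ignore_tables,
                           replicate_table="nlwiki.__wmf_checksums",
                           chunk_size=100):
    """Return the argv list for one pt-table-checksum run on one database."""
    args = [
        "pt-table-checksum",
        f"--databases={database}",
        f"--replicate={replicate_table}",
        f"--chunk-size={chunk_size}",
        # Needed because the rc slaves carry replication filters.
        "--no-check-replication-filters",
    ]
    if ignore_tables:
        # Tables without a PK are skipped; names are qualified with the database.
        qualified = ",".join(f"{database}.{table}" for table in ignore_tables)
        args.append(f"--ignore-tables={qualified}")
    return args


# Placeholder table names, not the real list for bgwiki.
command = build_checksum_command("bgwiki", ["table_without_pk_a", "table_without_pk_b"])
print(shlex.join(command))
```

Keeping one command per database makes it easy to stop at the end of the day and continue with `--resume` the next morning.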

Mentioned in SAL (#wikimedia-operations) [2017-02-28T07:12:09Z] <marostegui> run pt-table-checksum on bgwiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T08:24:52Z] <marostegui> run pt-table-checksum on bgwiktionary (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T08:59:33Z] <marostegui> run pt-table-checksum on cswiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T10:23:29Z] <marostegui> run pt-table-checksum on enwikiquote (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T10:53:32Z] <marostegui> run pt-table-checksum on enwiktionary (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T14:32:29Z] <marostegui> run pt-table-checksum on eowiki (s2) - T154485

The following wikis were checksummed in s2 during the day with no issues:

bgwiki - no differences
bgwiktionary - no differences
cswiki - no differences
enwikiquote - no differences
enwiktionary - db1047 has different data in the archive table
eowiki - stopped because it was getting too late and I didn't want it to run if I was not present (will --resume tomorrow, it was already on the revision table, for which the ETA was 40 minutes)

I have resumed the run on eowiki:

Resuming from eowiki.revision chunk 932, timestamp 2017-02-28 17:13:33
Checksumming eowiki.revision:   3% 13:40 remain

Mentioned in SAL (#wikimedia-operations) [2017-03-01T13:16:33Z] <marostegui> run pt-table-checksum on idwiki - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:45:52Z] <marostegui> Resume pt-table-checksum on idwiki (s2) - T154485

Wikis tested today:
eowiki - no differences
fiwiki - no differences
idwiki - stopped on the page table as it is pretty much the end of the day and I didn't want to leave it running, although at this point I am fairly confident that it wouldn't cause any issue, it has been running in the background and unattended the whole day.

Mentioned in SAL (#wikimedia-operations) [2017-03-02T07:09:51Z] <marostegui> Resume pt-table-checksum on idwiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-02T08:27:11Z] <marostegui> Start pt-table-checksum on itwiki (s2)  - T154485

Today's checksums
idwiki - after resuming it today, db1047 has differences on one table:
idwiki.geo_tags 1 0 1 PRIMARY 988797 1916637

itwiki - dbstore1002 has differences on:

itwiki.wbc_entity_usage 26352 1 1 PRIMARY 4982283 4982418
itwiki.wbc_entity_usage 41520 -1 1 PRIMARY 6987621 6987766

Mentioned in SAL (#wikimedia-operations) [2017-03-03T08:22:25Z] <marostegui> Run pt-table-checksum on s2 (nowiki) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-03T10:53:38Z] <marostegui> Start pt-table-checksum on plwiki (s2) - T154485

Today's checksums:
nowiki - dbstore1002 and db1047 have differences:

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068

Differences on dbstore1002
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068
nowiki.archive 6170 0 1 PRIMARY 681436 681558
nowiki.archive 6171 0 1 PRIMARY 681559 681716
nowiki.page_props 9346 -1 1 PRIMARY 1275171,1275171,wikibase_item 1275273,1275273,displaytitle
nowiki.page_props 9347 -1 1 PRIMARY 1275273,1275273,wikibase_item 1275466,1275466,defaultsort
nowiki.page_props 9507 -7 1 PRIMARY 1296037,1296037,wikibase_item 1296130,1296130,wikibase_item
nowiki.page_props 9509 -1 1 PRIMARY 1296213,1296213,defaultsort 1296312,1296312,wikibase_item
nowiki.page_props 9510 -5 1 PRIMARY 1296313,1296313,page_image_free 1296422,1296422,defaultsort
nowiki.page_props 9736 -1 1 PRIMARY 1319889,1319889,displaytitle 1320027,1320027,wikibase_item
nowiki.page_props 9737 -2 1 PRIMARY 1320028,1320028,page_image_free 1320129,1320129,wikibase_item
nowiki.page_props 9738 -14 1 PRIMARY 1320130,1320130,wikibase_item 1320231,1320231,wikibase_item
nowiki.page_props 9739 -2 1 PRIMARY 1320232,1320232,wikibase_item 1320303,1320303,defaultsort
nowiki.page_props 9740 -1 1 PRIMARY 1320303,1320303,page_image_free 1320400,1320400,wikibase_item
nowiki.page_props 9743 -3 1 PRIMARY 1320554,1320554,page_top_level_section_count 1320634,1320634,wikibase_item
nowiki.page_props 9746 -2 1 PRIMARY 1320728,1320728,defaultsort 1320820,1320820,wikibase_item
nowiki.page_props 9750 -1 1 PRIMARY 1321070,1321070,wikibase_item 1321125,1321125,page_image_free
nowiki.page_props 9802 -3 1 PRIMARY 1328742,1328742,wikibase_item 1328896,1328896,wikibase_item
nowiki.page_props 9803 -1 1 PRIMARY 1328897,1328897,defaultsort 1329046,1329046,wikibase_item
nowiki.page_props 9804 -8 1 PRIMARY 1329047,1329047,wikibase_item 1329190,1329190,wikibase_item
nowiki.page_props 9806 -6 1 PRIMARY 1329300,1329300,wikibase_item 1329429,1329429,displaytitle
nowiki.page_props 9808 -1 1 PRIMARY 1329599,1329599,wikibase_item 1329748,1329748,wikibase_item
nowiki.page_props 9811 -1 1 PRIMARY 1330082,1330082,wikibase_item 1330344,1330344,wikibase_item
nowiki.page_props 9812 -5 1 PRIMARY 1330345,1330345,wikibase_item 1330463,1330463,wikibase_item
nowiki.page_props 9818 -8 1 PRIMARY 1331053,1331053,wikibase_item 1331152,1331152,wikibase_item
nowiki.page_props 9819 -1 1 PRIMARY 1331153,1331153,wikibase_item 1331263,1331263,wikibase_item
nowiki.page_props 9820 -1 1 PRIMARY 1331265,1331265,page_top_level_section_count 1331392,1331392,wikibase_item
nowiki.page_props 9821 -3 1 PRIMARY 1331393,1331393,wikibase_item 1331510,1331510,wikibase_item
nowiki.page_props 9824 -1 1 PRIMARY 1331798,1331798,wikibase_item 1331907,1331907,wikibase_item
nowiki.page_props 9833 -1 1 PRIMARY 1332907,1332907,wikibase_item 1333008,1333008,page_image_free
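To get a quick overview of long difference listings like the one above, a small sketch such as the following can count differing chunks and sum CNT_DIFF per table. It only uses a few of the lines pasted above as sample input, and it assumes CNT_DIFF is the replica row count minus the master row count, so a negative total means rows are missing on the replica.

```python
from collections import defaultdict

# A few of the dbstore1002 lines from above, used as sample input.
DIFF_LINES = """\
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068
nowiki.archive 6170 0 1 PRIMARY 681436 681558
nowiki.page_props 9346 -1 1 PRIMARY 1275171,1275171,wikibase_item 1275273,1275273,displaytitle
nowiki.page_props 9507 -7 1 PRIMARY 1296037,1296037,wikibase_item 1296130,1296130,wikibase_item
"""


def summarize(lines):
    """Return {table: {"chunks": n, "cnt_diff": total}} for the pasted diff lines."""
    summary = defaultdict(lambda: {"chunks": 0, "cnt_diff": 0})
    for line in lines.splitlines():
        if not line.strip() or line.startswith("TABLE"):
            continue  # skip blank lines and the header
        table, _chunk, cnt_diff, _crc_diff, *_rest = line.split()
        summary[table]["chunks"] += 1
        summary[table]["cnt_diff"] += int(cnt_diff)
    return dict(summary)


for table, stats in summarize(DIFF_LINES).items():
    print(f"{table}: {stats['chunks']} chunks differ, row count delta {stats['cnt_diff']}")
```

A delta of 0 with differing CRCs (as for archive) points at rows that exist on both sides but with different contents, while a negative delta (as for page_props) points at missing rows.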

plwiki has been stopped and will be resumed on Monday. Starting on Monday I will do all the wikis non-stop, as during this week there have been 0 problems while running them.

Mentioned in SAL (#wikimedia-operations) [2017-03-06T07:22:04Z] <marostegui> Resume pt-table-checksum on plwiki (s2) - T154485

I have started the checksum on plwiki again after all the crashes with db1060

Mentioned in SAL (#wikimedia-operations) [2017-03-08T07:27:12Z] <marostegui> Start pt-table-checksum on plwiki (s2) - T154485

It took a long run to finish plwiki, but here are the results:
differences on db1047 and quite a lot of them on dbstore1002 too.

Marostegui moved this task from Backlog to In progress on the DBA board.Mon, Mar 13, 9:52 AM

ptwiki:
differences on db1047 and dbstore1002.

The checksum on svwiki stopped due to db1054 going down for maintenance. I will resume once it is back up again.

I am sorry :-(.

No worries whatsoever! This is now a background task that doesn't require much babysitting, so no problem at all!

svwiki has no differences
thwiki - differences on db1047 and dbstore1002
trwiki - in progress now

db1047:

db            tbl       total_rows  chunks
enwiktionary  archive          100       1
idwiki        geo_tags         100       1
nlwiki        archive          200       2
nowiki        archive          100       1
plwiki        archive          500       5
ptwiki        geo_tags         100       1
thwiki        geo_tags         100       1
zhwiki        archive          100       1

Good and bad news:
{P5061}
There are no differences in data, but the primary key values differ, probably because those were inserted with the unsafe INSERT...SELECT and different locking patterns happened. We should make sure that is no longer happening.

I intend to run:

(echo "SET SESSION sql_log_bin=0; " && mysqldump --single-transaction -h db1018.eqiad.wmnet enwiktionary archive --where "ar_id >= 1068351 and ar_id <= 1068357" --no-create-info --skip-add-locks --replace | grep REPLACE ) | mysql -h db1047.eqiad.wmnet enwiktionary

to fix db1047:enwiktionary
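Since the same fix will be needed for other chunks, here is a minimal sketch that builds the pipeline above from the chunk boundaries reported by pt-table-checksum. The `build_repair_pipeline` helper is hypothetical; it only assembles the command string and does not run anything. Note that `SET SESSION sql_log_bin=0` keeps the REPLACE statements out of the target's binary log, so the fix stays local to that host.

```python
import shlex


def build_repair_pipeline(source_host, target_host, database, table,
                          pk_column, lower, upper):
    """Build the dump-and-replace shell pipeline for one primary key range."""
    where = f"{pk_column} >= {int(lower)} and {pk_column} <= {int(upper)}"
    dump = " ".join([
        "mysqldump", "--single-transaction", "-h", shlex.quote(source_host),
        shlex.quote(database), shlex.quote(table),
        "--where", shlex.quote(where),
        "--no-create-info", "--skip-add-locks", "--replace",
    ])
    load = f"mysql -h {shlex.quote(target_host)} {shlex.quote(database)}"
    # Only the REPLACE statements are piped into the target host.
    return f'(echo "SET SESSION sql_log_bin=0; " && {dump} | grep REPLACE ) | {load}'


print(build_repair_pipeline("db1018.eqiad.wmnet", "db1047.eqiad.wmnet",
                            "enwiktionary", "archive", "ar_id", 1068351, 1068357))
```

The boundaries are cast to integers, so this only covers tables with a single numeric primary key such as archive; composite keys like the page_props ones would need a different WHERE clause.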

Looks good to me - thanks!! :)

I have done the above, and there is no longer a diff for db1047 on enwiktionary.archive. I will repeat the steps for all detected changes. Some important tables are skipped (text); I may have to check those manually by stopping a slave and exporting its contents.

The last database for s2 was checksummed and these are the results:

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
zhwiki.archive 19830 0 1 PRIMARY 2339129 2339333

There are also differences on dbstore1002.

So we can say that s2 has been checksummed entirely (for all the tables with PKs)

I have actually fixed all db1047 found differences. I am working now on dbstore1002.

dbstore1002:

$ mysql -h $slave nlwiki -e "SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl"
+--------------+------------------+------------+--------+
| db           | tbl              | total_rows | chunks |
+--------------+------------------+------------+--------+
| bgwiki       | page_props       |       8741 |     91 |
| cswiki       | archive          |        100 |      1 |
| cswiki       | page_props       |       3791 |     39 |
| enwiktionary | page_props       |      26740 |    338 |
| eowiki       | flaggedimages    |        757 |     10 |
| eowiki       | flaggedrevs      |       2665 |     27 |
| eowiki       | flaggedtemplates |       1578 |     22 |
| eowiki       | page_props       |       6171 |     63 |
| fiwiki       | flaggedimages    |        573 |      7 |
| fiwiki       | flaggedrevs      |      21657 |    219 |
| fiwiki       | flaggedtemplates |       1975 |     23 |
| fiwiki       | page_props       |       3385 |     35 |
| idwiki       | flaggedimages    |        193 |      2 |
| idwiki       | flaggedrevs      |      11470 |    116 |
| idwiki       | flaggedtemplates |        230 |      3 |
| idwiki       | geo_tags         |        100 |      1 |
| itwiki       | page_props       |       1978 |     20 |
| itwiki       | wbc_entity_usage |        200 |      2 |
| nlwiki       | archive          |        100 |      1 |
| nlwiki       | page_props       |        297 |      3 |
| nowiki       | archive          |        300 |      3 |
| nowiki       | page_props       |       2519 |     26 |
| plwiki       | flaggedimages    |        390 |      4 |
| plwiki       | flaggedrevs      |      15332 |    155 |
| plwiki       | flaggedtemplates |        702 |      8 |
| plwiki       | logging          |         99 |      1 |
| plwiki       | page_props       |       5873 |     61 |
| plwiki       | spoofuser        |         99 |      1 |
| plwiki       | user             |         99 |      1 |
| ptwiki       | archive          |        100 |      1 |
| ptwiki       | geo_tags         |        100 |      1 |
| ptwiki       | page_props       |        688 |      7 |
| thwiki       | geo_tags         |        100 |      1 |
| thwiki       | page_props       |        692 |      7 |
| trwiki       | flaggedimages    |       1930 |     23 |
| trwiki       | flaggedrevs      |      11182 |    113 |
| trwiki       | flaggedtemplates |       2335 |     33 |
| trwiki       | page_props       |        983 |     10 |
| zhwiki       | page_props       |        875 |      9 |
+--------------+------------------+------------+--------+
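For reference, the WHERE clause used in the query above can be read as the following predicate. This is a plain Python sketch of the SQL semantics (a comparison involving NULL is neither true nor false), not code taken from pt-table-checksum; `None` stands in for SQL NULL.

```python
def sql_ne(a, b):
    """SQL-style <>: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def chunk_differs(master_cnt, this_cnt, master_crc, this_crc):
    """True if the chunk would be returned by the WHERE clause above."""
    conditions = [
        sql_ne(master_cnt, this_cnt),                # master_cnt <> this_cnt
        sql_ne(master_crc, this_crc),                # master_crc <> this_crc
        (master_crc is None) != (this_crc is None),  # ISNULL(...) <> ISNULL(...)
    ]
    # WHERE keeps a row only if the OR of the conditions is definitely true.
    return any(condition is True for condition in conditions)


print(chunk_differs(100, 100, "abc", "abc"))
print(chunk_differs(100, 99, "abc", "abd"))
print(chunk_differs(100, 100, None, "abc"))
```

The third condition is what catches chunks where only one of the two CRC values is NULL, which the plain `<>` comparison would silently skip.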

dbstore1001 - it is 24 hours behind, so it has to be checked tomorrow again:

root@dbstore1001[nlwiki]> SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
+--------+----------+------------+--------+
| db     | tbl      | total_rows | chunks |
+--------+----------+------------+--------+
| cswiki | geo_tags |          2 |      1 |
| idwiki | geo_tags |        100 |      1 |
| nlwiki | page     |        100 |      1 |
| nlwiki | revision |         99 |      1 |
| nlwiki | user     |        100 |      1 |
| ptwiki | geo_tags |        100 |      1 |
| thwiki | geo_tags |        100 |      1 |
+--------+----------+------------+--------+
7 rows in set (6 min 9.99 sec)

db1021 and db1024- no changes.

db1036.eqiad.wmnet:

$ mysql -h $slave nlwiki -e "SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl"
+--------+---------+------------+--------+
| db     | tbl     | total_rows | chunks |
+--------+---------+------------+--------+
| nlwiki | archive |        800 |      8 |
+--------+---------+------------+--------+

I will focus on these so they can be decommissioned ASAP.

I finished dbstore1002 too (again, only for the tables with PKs) for s2, I will fix dbstore1001 now.

jcrespo claimed this task.Thu, Mar 16, 2:26 PM

dbstore1001 is fixed for s2, again with the exception of tables without PKs and zhwiki, which is finishing at this point. Going with db1036.

db1036 is now ok. dbstore2001.codfw.wmnet was all ok, but it is missing several rows from bv20**_edits tables (they are empty there).

I have checked all s2 slaves, including codfw and those not part of the specific lists. All either didn't have differences or have been corrected. The following things are to be taken into account: db1036 is not fully checked because it is a "special slave". The labs ones (including db1069) are in a bad state, even taking into account the filter differences, but they are going to be decommissioned soon. The codfw slaves are all ok, but MIXED replication on the codfw master could kick in (the master is good, however). The new labs hosts use row-based replication, so they cannot be checked reliably.

I will now try to check some extra important tables on s2, at least:

oldimage
text
user_props

Most of the rest are cache/summary tables/derived data from parsing, etc.

Marostegui changed the title from "run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038" to "run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)".Fri, Mar 24, 9:03 AM
Marostegui edited the task description.