Page MenuHomePhabricator

Reimage labsdb1011 to Buster and MariaDB 10.4
Open, HighPublic

Description

Recently, we have seen issues with Quarry and labsdb1011 (T247978 T246970)
Labsdb1011 is always very overloaded (https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=labsdb1011&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=now-24h&to=now&fullscreen&panelId=3) and we've seen in s8 that 10.4 and Buster seems to be performing better, CPU-wise.
Let's try to reimage this host with Buster and see what happens.

/srv won't be formatted.

The rollback plan would be to reimage to stretch and reclone from labsdb1012.

Once the host has no traffic, let's do a disk performance benchmarking to make sure its IOPS are similar to labsdb1012 ones.

Details

ProjectBranchLines +/-Subject
operations/puppetproduction+4 -4
operations/puppetproduction+4 -0
operations/puppetproduction+2 -1
operations/puppetproduction+4 -4
operations/puppetproduction+10 -7
operations/puppetproduction+7 -7
operations/puppetproduction+4 -4
operations/puppetproduction+4 -4
operations/puppetproduction+4 -4
operations/puppetproduction+1 -1
operations/puppetproduction+2 -2
operations/puppetproduction+6 -4
operations/puppetproduction+1 -1
operations/puppetproduction+4 -3
operations/puppetproduction+1 -1
operations/puppetproduction+1 -1
operations/puppetproduction+2 -0
operations/puppetproduction+1 -1
operations/puppetproduction+6 -4
operations/puppetproduction+0 -1
operations/puppetproduction+7 -5
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

The copy to backup1002 is done. I have started mysql again on labsdb1011, once it has caught up, I will pool it back.
Fingers crossed.

Change 597098 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] wikireplica_analytics: Query killer decreased to 7200

https://gerrit.wikimedia.org/r/597098

Change 597098 merged by Marostegui:
[operations/puppet@production] wikireplica_analytics: Query killer decreased to 7200

https://gerrit.wikimedia.org/r/597098

labsdb1011 has caught up. I am waiting for @Bstorm to drop the wb_terms views (T251598) before repooling this host back finally!
If the host replicates well during the week, I will increase the innodb purge threads from 1 to 4 to see if we can keep that lagging under control - that is being tracked at T251719

bd808 added a comment.Mon, May 18, 7:27 PM

Mentioned in SAL (#wikimedia-operations) [2020-05-18T16:45:55Z] <bstorm_> updated views on labsdb1011 for the wb_terms changes T251598

Mentioned in SAL (#wikimedia-operations) [2020-05-19T04:27:52Z] <marostegui> Repool labsdb1011 T249188

I have repooled labsdb1011 back \o/
In the end, I have gone for innodb_purge_threads = 4, given that we have a super recent backup of this host. Let's see if that helps the performance (or to keep the purge lag under control).

@Bstorm I have depooled labsdb1011 again as I was getting permission denied from quarry, however, the grants for my user are on labsdb1011, maybe some grants are missing or some of the maintain users script need to run?
I was also seeing that from tools bastion:

marostegui@tools-sgebastion-07:~$ sql --cluster analytics enwiki_p
ERROR 1044 (42000): Access denied for user 'u15343'@'%' to database 'enwiki_p'

However:

root@labsdb1011.eqiad.wmnet[(none)]> show grants for 'u15343'@'%';
+------------------------------------------------------------------------------------------------------------------------------------+
| Grants for u15343@%                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------+
| GRANT labsdbuser TO 'u15343'@'%'                                                                                                   |
| GRANT USAGE ON *.* TO 'u15343'@'%' IDENTIFIED BY PASSWORD '*xx' WITH MAX_USER_CONNECTIONS 10 |
+------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

root@labsdb1011.eqiad.wmnet[(none)]> show grants for labsdbuser;
+--------------------------------------------------------------------------+
| Grants for labsdbuser                                                    |
+--------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'labsdbuser'                                       |
| GRANT SELECT, SHOW VIEW ON `sahwiki\_p`.* TO 'labsdbuser'                |
...

root@labsdb1011.eqiad.wmnet[(none)]> show grants for labsdbuser;
| GRANT SELECT, SHOW VIEW ON `enwiki\_p`.* TO 'labsdbuser'                 |
| GRANT SELECT, SHOW VIEW ON `enwikiquote\_p`.* TO 'labsdbuser'            |
| GRANT SELECT, SHOW VIEW ON `tenwiki\_p`.* TO 'labsdbuser'                |
| GRANT SELECT, SHOW VIEW ON `enwikivoyage\_p`.* TO 'labsdbuser'           |
| GRANT SELECT, SHOW VIEW ON `enwikiversity\_p`.* TO 'labsdbuser'          |
| GRANT SELECT, SHOW VIEW ON `enwikisource\_p`.* TO 'labsdbuser'           |
| GRANT SELECT, SHOW VIEW ON `enwikinews\_p`.* TO 'labsdbuser'             |
| GRANT SELECT, SHOW VIEW ON `enwikibooks\_p`.* TO 'labsdbuser'            |

From quarry:

Error
Access denied for user 's52788'@'%' to database 'enwiki_p'
root@labsdb1011.eqiad.wmnet[mysql]> show grants for 's52788'@'%';
+------------------------------------------------------------------------------------------------------------------------------------+
| Grants for s52788@%                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------+
| GRANT labsdbuser TO 's52788'@'%'                                                                                                   |
| GRANT USAGE ON *.* TO 's52788'@'%' IDENTIFIED BY PASSWORD '*xx' WITH MAX_USER_CONNECTIONS 48 |
+------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
root@cumin1001:/home/marostegui# mysql.py -hlabsdb1010 mysql -e "select count(*) from user"
+----------+
| count(*) |
+----------+
|     4048 |
+----------+
root@cumin1001:/home/marostegui# mysql.py -hlabsdb1011 mysql -e "select count(*) from user"
+----------+
| count(*) |
+----------+
|     4049 |
+----------+

@Bstorm I guess we need maintain-meta_p or similar scripts to be run?

jcrespo added a comment.EditedTue, May 19, 6:25 AM

One thing I can see is that labsdb1011 uses the new mysql authentication format, meaning:

10.1 host:

                  Host: %
                  User: sXXXXX
              Password: <hashed password>
<list of global grants>
               plugin: 
 authentication_string:

10.4 host:

                  Host: %
                  User: sXXXXX
              Password: <hashed password>
<list of global grants>
                plugin: mysql_native_password
 authentication_string: <hashed password>

So the upgrade looks correct. I wonder, though, if some of cloud tools are obsolete due to the new user authentication format (relying on a specific mysql.user structure, rather than the generic grant system). I don't think that should break authentication, but it is worth checking.

If I create a new user on my test 10.4 installation, it has the same "duplicate hash" format, so that should be good.

Grants for labsdbuser, which is the default role on both servers for cloud users are also (almost) the same:

$ diff <(mysql.py -h labsdb1010 -e "show grants for labsdbuser" | sort) <(mysql.py -h labsdb1011 -e "show grants for labsdbuser" | sort)
290a291
> GRANT SELECT, SHOW VIEW ON `grwikiimedia\\_p`.* TO 'labsdbuser'

More food for thought, this also happens on the CLI:

root@labsdb1011:~# mysql --skip-ssl -uu15343 -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8146
Server version: 10.4.12-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:u15343@localhost [(none)]> Ctrl-C -- exit!
Aborted
root@labsdb1011:~# mysql --skip-ssl -uu15343 -p enwiki_p
Enter password:
ERROR 1044 (42000): Access denied for user 'u15343'@'%' to database 'enwiki_p'

I have dropped and recreated my user but still having that issue.

More tests:

I have dropped the labsdbuser role, and recreated it. Assigned it to my user but still no luck.

mysql:root@localhost [(none)]> show grants for labsdbuser;
+----------------------------------------------------------+
| Grants for labsdbuser                                    |
+----------------------------------------------------------+
| GRANT USAGE ON *.* TO 'labsdbuser'                       |
| GRANT SELECT, SHOW VIEW ON `enwiki\_p`.* TO 'labsdbuser' |
+----------------------------------------------------------+
2 rows in set (0.000 sec)

mysql:root@localhost [(none)]> show grants for 'u15343'@'%';
+------------------------------------------------------------------------------------------------------------------------------------+
| Grants for u15343@%                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------+
| GRANT labsdbuser TO 'u15343'@'%'                                                                                                   |
| GRANT USAGE ON *.* TO 'u15343'@'%' IDENTIFIED BY PASSWORD '*xx' WITH MAX_USER_CONNECTIONS 10 |
+------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.000 sec)

mysql:root@localhost [(none)]> Ctrl-C -- exit!
Aborted
root@labsdb1011:~# mysql --skip-ssl -uu15343 -p  enwiki
Enter password:
ERROR 1044 (42000): Access denied for user 'u15343'@'%' to database 'enwiki'
root@labsdb1011:~# mysql --skip-ssl -uu15343 -p  enwiki_p
Enter password:
ERROR 1044 (42000): Access denied for user 'u15343'@'%' to database 'enwiki_p'

I tried a few more things:

mysql:root@localhost [(none)]> GRANT SELECT, SHOW VIEW ON `enwiki_p`.* TO 'labsdbuser'
    -> ;
Query OK, 0 rows affected (0.001 sec)

mysql:root@localhost [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.082 sec)

mysql:root@localhost [(none)]> show grants for labsdbuser;
+----------------------------------------------------------+
| Grants for labsdbuser                                    |
+----------------------------------------------------------+
| GRANT USAGE ON *.* TO 'labsdbuser'                       |
| GRANT SELECT, SHOW VIEW ON `enwiki\_p`.* TO 'labsdbuser' |
| GRANT SELECT, SHOW VIEW ON `enwiki_p`.* TO 'labsdbuser'  |
+----------------------------------------------------------+
3 rows in set (0.000 sec)

mysql:root@localhost [(none)]> Ctrl-C -- exit!
Aborted
root@labsdb1011:~# mysql --skip-ssl -uu15343 -p  enwiki_p
Enter password:
ERROR 1044 (42000): Access denied for user 'u15343'@'%' to database 'enwiki_p'
root@labsdb1011:~# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 15932
Server version: 10.4.12-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:root@localhost [(none)]> GRANT SELECT, SHOW VIEW ON `enwiki`.* TO 'labsdbuser';
Query OK, 0 rows affected (0.001 sec)

mysql:root@localhost [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.112 sec)

mysql:root@localhost [(none)]> Ctrl-C -- exit!
Aborted
root@labsdb1011:~# mysql --skip-ssl -uu15343 -p  enwiki
Enter password:
ERROR 1044 (42000): Access denied for user 'u15343'@'%' to database 'enwiki'
root@labsdb1011:~#

Could this be a mariadb 10.4 bug?

The issue with the roles is found:

mysql:u15343@localhost [(none)]> SELECT CURRENT_ROLE;
+--------------+
| CURRENT_ROLE |
+--------------+
| NULL         |
+--------------+
1 row in set (0.000 sec)

mysql:u15343@localhost [(none)]> SET ROLE labsdbuser;
Query OK, 0 rows affected (0.000 sec)

mysql:u15343@localhost [(none)]> use enwiki_p
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql:u15343@localhost [enwiki_p]>

mysql:root@localhost [(none)]> grant labsdbuser to u15343;
Query OK, 0 rows affected (0.000 sec)

mysql:root@localhost [(none)]> set default role labsdbuser for u15343;
Query OK, 0 rows affected (0.001 sec)

mysql:root@localhost [(none)]> Ctrl-C -- exit!
Aborted
root@labsdb1011:~# mysql --skip-ssl -uu15343 -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 17259
Server version: 10.4.12-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:u15343@localhost [(none)]> SELECT CURRENT_ROLE;
+--------------+
| CURRENT_ROLE |
+--------------+
| labsdbuser   |
+--------------+
1 row in set (0.000 sec)

mysql:u15343@localhost [(none)]> use enwiki_p
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql:u15343@localhost [enwiki_p]>

The fix is to grant the role again to all the users and make it default.

On the bad news side, the server has crashed, and it did a bit after I set innodb_purge_lag = 4 on T249188#6147299.
I am going to copy the backup back to the host and start it again. W

The action plan for now is:

  • Rebuild labsdb1011 from the backup we made yesterday.
    • Let it start with mariadb 10.4 default options (replication stopped)
    • Drop all the cl_sortkey indexes from the categorylinks tables (that index appears all the time on the crashes traces)
 [ERROR] Master 's2': InnoDB: Record in index `cl_sortkey` of table `zhwiki`.`categorylinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[27]                           (0xE6A183E59C92E5B882E9AB98E7B49AE4B8ADE7AD89E5ADB8E6A0A1),[1] (0x01),[39]                                       (0xE6A183E59C92E5B882E68890E58A9FE9A
	InnoDB: Record in index `cl_sortkey` of table `enwiktionary`.`categorylinks` was not found on
  • Create that index back
  • Start replication
  • Pray
  • If none of this happens and we run out of other ideas: reimage back to stretch and mariadb 10.1 and copy the content from labsdb1012 again.

At the same time, in order to discard HW issues (specially memory related) we are going to copy the same backup to a different host and let it replicate and see if it crashes.

Change 597241 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Use db1141 as test for labsdb1011 issues.

https://gerrit.wikimedia.org/r/597241

Change 597241 merged by Kormat:
[operations/puppet@production] mariadb: Use db1141 as test for labsdb1011 issues.

https://gerrit.wikimedia.org/r/597241

These are the positions where labsdb1011 stopped before sending the "up-to-date" binary log to backups1002:

May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 406 [Note] Master 's1': Slave SQL thread exiting, replication stopped in log 'db1124-bin.002644' at position 756734814
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 405 [Note] Master 's1': Slave I/O thread exiting, read up to log 'db1124-bin.002644', position 756743357
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 408 [Note] Master 's5': Slave SQL thread exiting, replication stopped in log 'db1124-bin.001209' at position 661970737
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 407 [Note] Master 's5': Slave I/O thread exiting, read up to log 'db1124-bin.001209', position 661970737
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 410 [Note] Master 's2': Slave SQL thread exiting, replication stopped in log 'db1125-bin.002350' at position 916087667
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 409 [Note] Master 's2': Slave I/O thread exiting, read up to log 'db1125-bin.002350', position 916156546
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 412 [Note] Master 's3': Slave SQL thread exiting, replication stopped in log 'db1124-bin.002036' at position 394796134
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 411 [Note] Master 's3': Slave I/O thread exiting, read up to log 'db1124-bin.002036', position 394796134
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 414 [Note] Master 's4': Slave SQL thread exiting, replication stopped in log 'db1125-bin.003546' at position 936049523
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 413 [Note] Master 's4': Slave I/O thread exiting, read up to log 'db1125-bin.003546', position 936049523
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 416 [Note] Master 's8': Slave SQL thread exiting, replication stopped in log 'db1124-bin.003661' at position 860109621
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 415 [Note] Master 's8': Slave I/O thread exiting, read up to log 'db1124-bin.003661', position 860109621
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 418 [Note] Master 's6': Slave SQL thread exiting, replication stopped in log 'db1125-bin.001487' at position 587410550
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 417 [Note] Master 's6': Slave I/O thread exiting, read up to log 'db1125-bin.001487', position 587410550
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 420 [Note] Master 's7': Slave SQL thread exiting, replication stopped in log 'db1125-bin.002147' at position 10141175
May 18 05:01:04 labsdb1011 mysqld[13194]: 2020-05-18  5:01:04 419 [Note] Master 's7': Slave I/O thread exiting, read up to log 'db1125-bin.002147', position 10141667

The alter table to regenerate categorylinks indexes:

for i in `mysql information_schema -e "select table_schema from tables where table_name='categorylinks' and table_type='BASE TABLE'"  -BN`; do echo $i; mysql $i -e "set session sql_log_bin=0; alter table categorylinks drop key if exists cl_sortkey, add key cl_sortkey (cl_to,cl_type,cl_sortkey,cl_from)" ; done

Mentioned in SAL (#wikimedia-operations) [2020-05-20T07:41:03Z] <marostegui> alter table categorylinks engine=Innodb ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8,force on all labsdb1011 wikis - T249188

@Kormat and @jcrespo have finished the cloning from the backup1001 (the backup that contains the data right after the original import from the mydumper files) to db1141 (as a temporary test host).
db1141 is now catching up:

root@db1141:~# mysql -e "show all slaves status\G" | grep Seconds
         Seconds_Behind_Master: 1730158
         Seconds_Behind_Master: 1730752
         Seconds_Behind_Master: 1729120
         Seconds_Behind_Master: 1731311
         Seconds_Behind_Master: 1722920
         Seconds_Behind_Master: 1728384
         Seconds_Behind_Master: 1728070
         Seconds_Behind_Master: 1728692

labsdb1011 has finished dropping and creating all cl_sortkey index on categorylinks table across all the wikis.
Right now I am rebuilding categorylinks everywhere on labsdb1011 too before starting replication.

Change 597476 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Temporarilly add db1141 to the list of special hosts

https://gerrit.wikimedia.org/r/597476

Change 597476 merged by Jcrespo:
[operations/puppet@production] mariadb: Temporarilly add db1141 to the list of special hosts

https://gerrit.wikimedia.org/r/597476

We have seen more weird things with the grants, for that I have created: https://jira.mariadb.org/browse/MDEV-22645

The alter table on all the categorylinks tables finished. I have now started replication.

root@labsdb1011:~# mysql -e "show all slaves status\G" | grep Seconds
         Seconds_Behind_Master: 216483
         Seconds_Behind_Master: 215119
         Seconds_Behind_Master: 214472
         Seconds_Behind_Master: 216846
         Seconds_Behind_Master: 218187
         Seconds_Behind_Master: 213065
         Seconds_Behind_Master: 213982
         Seconds_Behind_Master: 215624
root@labsdb1011:~#

labsdb1011 keeps catching up nicely:

root@labsdb1011:~# mysql -e "show all slaves status\G" | grep Seconds
         Seconds_Behind_Master: 115303
         Seconds_Behind_Master: 9768
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 139544
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 48016

labsdb1011 is up-to-date:

#  mysql.py -hlabsdb1011 -e "show all slaves status\G" | grep Seconds
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0

Change 597762 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/597762

Change 597762 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/597762

Mentioned in SAL (#wikimedia-operations) [2020-05-21T12:12:51Z] <marostegui> Repool labsdb1011 into the analytics role 🤞- T249188

labsdb1011 is now serving queries.
Quarry seems to be working fine too: https://quarry.wmflabs.org/query/45075

labsdb1011 seems to be working fine.
Good news is that 10.4+Buster seems to confirm that the CPU usage is a lot better and the host isn't having almost 100% usage as it used to.

So labsdb1011 looks stable. CPU seems to be stable at around 30% usage (which is a big improvement compared to the previous values).

Lag grows, but that is something not uncommon for the analytics role.
This is how labsdb1010 behaved for the weeks it's been serving the analytics role:

I am going to leave labsdb1011 working for the weekend, but on Monday we should do the last test to see if the restart is what causes the corruption.
My idea is to:

  • Depool labsdb1011
  • Let all the running queries finish
  • Let the slave lag catch up on all the threads
  • Stop mysql
  • Start mysql
  • Monitoring log
  • Repool the host back.
  • Check again for a few days if crashes re-appear.

Labsdb1011 has been working fine since Thursday, the lag though keeps growing now less fast, but still there.
I am going to go ahead and do the above test:

Depool labsdb1011
Let all the running queries finish
Let the slave lag catch up on all the threads
Stop mysql
Start mysql
Monitoring log
Repool the host back.
Check again for a few days if crashes re-appear.

If all this goes fine and labsdb1011 keeps working fine for a few more days, maybe we should pool labsdb1010 to balance queries for the analytics role rather than the web role, which seems to have no load/lag issues.

Change 598294 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598294

Change 598294 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598294

Mentioned in SAL (#wikimedia-operations) [2020-05-25T04:54:54Z] <marostegui> Depool labsdb1011 - T249188

Mentioned in SAL (#wikimedia-operations) [2020-05-25T11:21:47Z] <marostegui> Extend db1141's (temporary labsdb test host) /srv 1TB extra - T249188

Mentioned in SAL (#wikimedia-operations) [2020-05-26T04:13:58Z] <marostegui> Stop slaves and stop mysql on labsdb1011 T249188

So looks like the restart is the one causing issues.
The stop slaves reported no issues, neither did the stop mysql.
Starting mysql brought no issues, but as soon as I started the replication threads....bad news:

May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 31 [Note] Master 's6': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3316' in log 'db1125-bin.001516' at position 150259934
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 33 [Note] Master 's2': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3312' in log 'db1125-bin.002403' at position 440364672
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 35 [Note] Master 's3': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3313' in log 'db1124-bin.002091' at position 844723555
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 37 [Note] Master 's4': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3314' in log 'db1125-bin.003581' at position 680368537
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 39 [Note] Master 's8': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3318' in log 'db1124-bin.003703' at position 26857432
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 41 [Note] Master 's5': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3315' in log 'db1124-bin.001226' at position 889729958
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 43 [Note] Master 's7': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3317' in log 'db1125-bin.002189' at position 486105063
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 45 [Note] Master 's1': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3311' in log 'db1124-bin.002696' at position 813770887
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 37 [Note] Master 's4': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3314',replication started in log 'db1125-bin.003581' at position 680368537
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 35 [Note] Master 's3': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3313',replication started in log 'db1124-bin.002091' at position 844723555
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 43 [Note] Master 's7': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3317',replication started in log 'db1125-bin.002189' at position 486105063
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 45 [Note] Master 's1': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3311',replication started in log 'db1124-bin.002696' at position 813770887
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 33 [Note] Master 's2': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3312',replication started in log 'db1125-bin.002403' at position 440364672
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 31 [Note] Master 's6': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3316',replication started in log 'db1125-bin.001516' at position 150259934
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 41 [Note] Master 's5': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3315',replication started in log 'db1124-bin.001226' at position 889729958
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 39 [Note] Master 's8': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3318',replication started in log 'db1124-bin.003703' at position 26857432
May 26 04:20:44 labsdb1011 mysqld[19806]: 2020-05-26  4:20:44 4 [ERROR] InnoDB: Unable to find a record to delete-mark
May 26 04:20:44 labsdb1011 mysqld[19806]: InnoDB: tuple DATA TUPLE: 4 fields;
May 26 04:20:44 labsdb1011 mysqld[19806]:  0: len 4; hex 80000004; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  1: len 4; hex 80000000; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  2: len 22; hex 5468655f47797073795f616e645f7468655f4b696e67; asc The_Gypsy_and_the_King;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  3: len 4; hex 00109a35; asc    5;;
May 26 04:20:44 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 4; compact format; info bits 0
May 26 04:20:44 labsdb1011 mysqld[19806]:  0: len 4; hex 80000004; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  1: len 4; hex 80000000; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  2: len 15; hex 5468655f47797073795f547261696c; asc The_Gypsy_Trail;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  3: len 4; hex 02e4fc4d; asc    M;;
May 26 04:20:44 labsdb1011 mysqld[19806]: 2020-05-26  4:20:44 4 [ERROR] InnoDB: page [page id: space=94970, page number=4807528] (305 records, index id 334204).
May 26 04:20:44 labsdb1011 mysqld[19806]: 2020-05-26  4:20:44 4 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 26 04:20:45 labsdb1011 mysqld[19806]: 2020-05-26  4:20:45 4 [ERROR] InnoDB: Unable to find a record to delete-mark
May 26 04:20:45 labsdb1011 mysqld[19806]: InnoDB: tuple DATA TUPLE: 3 fields;
May 26 04:20:45 labsdb1011 mysqld[19806]:  0: len 4; hex 80000000; asc     ;;
May 26 04:20:45 labsdb1011 mysqld[19806]:  1: len 19; hex 484d535f5265736f6c7574655f283138353029; asc HMS_Resolute_(1850);;
May 26 04:20:45 labsdb1011 mysqld[19806]:  2: len 4; hex 02ea5d4b; asc   ]K;;
May 26 04:20:45 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
May 26 04:20:45 labsdb1011 mysqld[19806]:  0: len 4; hex 80000000; asc     ;;
May 26 04:20:45 labsdb1011 mysqld[19806]:  1: len 19; hex 484d535f5265736f6c7574655f283138353029; asc HMS_Resolute_(1850);;
May 26 04:20:45 labsdb1011 mysqld[19806]:  2: len 4; hex 02e860c8; asc   ` ;;
May 26 04:20:45 labsdb1011 mysqld[19806]: 2020-05-26  4:20:45 4 [ERROR] InnoDB: page [page id: space=94970, page number=1579497] (421 records, index id 334203).
May 26 04:20:45 labsdb1011 mysqld[19806]: 2020-05-26  4:20:45 4 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 26 04:20:47 labsdb1011 mysqld[19806]: 2020-05-26  4:20:47 46 [ERROR] Master 's1': InnoDB: Record in index `pl_backlinks_namespace` of table `enwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[22]The_Gypsy_and_the_King(0x5468655F47797073795F616E645F7468655F4B696E67),[4]  e (0x000665A4)} at: COMPACT RECORD(info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[15]The_Gypsy_Trail(0x5468655F47797073795F547261696C),[4]   M(0x02E4FC4D)}
May 26 04:20:53 labsdb1011 mysqld[19806]: 2020-05-26  4:20:53 3 [ERROR] InnoDB: Unable to find a record to delete-mark
May 26 04:20:53 labsdb1011 mysqld[19806]: InnoDB: tuple DATA TUPLE: 3 fields;

May 26 04:21:51 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 4; compact format; info bits 0
May 26 04:21:51 labsdb1011 mysqld[19806]:  0: len 4; hex 80000002; asc     ;;
May 26 04:21:51 labsdb1011 mysqld[19806]:  1: len 4; hex 80000000; asc     ;;
May 26 04:21:51 labsdb1011 mysqld[19806]:  2: len 10; hex 43616c696d6572697573; asc Calimerius;;
May 26 04:21:51 labsdb1011 mysqld[19806]:  3: len 4; hex 00dac0fe; asc     ;;
May 26 04:21:51 labsdb1011 mysqld[19806]: 2020-05-26  4:21:51 4 [ERROR] InnoDB: page [page id: space=94970, page number=10530772] (382 records, index id 334204).
May 26 04:21:51 labsdb1011 mysqld[19806]: 2020-05-26  4:21:51 4 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 26 04:22:01 labsdb1011 mysqld[19806]: 2020-05-26  4:22:01 46 [ERROR] Master 's1': InnoDB: Record in index `pl_backlinks_namespace` of table `enwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[22]The_Gypsy_and_the_King(0x5468655F47797073795F616E645F7468655F4B696E67),[4]    (0x00100CFD)} at: COMPACT RECORD(info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[15]The_Gypsy_Trail(0x5468655F47797073795F547261696C),[4]   M(0x02E4FC4D)}

I am going to leave this running for a few minutes, but I want to pool this back to see if it really crashes, as this hasn't crashed yet.

Mentioned in SAL (#wikimedia-operations) [2020-05-26T04:35:38Z] <marostegui> Repool labsdb1011 - T249188

I have repooled the host and the queries are arriving.

The errors stopped and they are definitely not happening as fast as they used to.
This was the last one:

May 26 04:34:18 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
May 26 04:34:18 labsdb1011 mysqld[19806]:  0: len 4; hex 006ff6ff; asc  o  ;;
May 26 04:34:18 labsdb1011 mysqld[19806]:  1: len 30; hex 687474703a2f2f676f762e6e6173612e6a706c2e7373642e2f736264622e; asc http://gov.nasa.jpl.ssd./sbdb.; (total 54 bytes);
May 26 04:34:18 labsdb1011 mysqld[19806]:  2: len 4; hex 0106a2c5; asc     ;;
May 26 04:34:18 labsdb1011 mysqld[19806]: space 98868 offset 965562 (183 records, index id 347862)

Still no more errors, the only new message was, interestingly:

May 26 04:41:29 labsdb1011 mysqld[19806]: 2020-05-26  4:41:29 0 [Note] InnoDB: Buffer pool(s) load completed at 200526  4:41:28

Maybe all those above are related to the load process of the buffer pool?

Interesting, I just saw this:
https://jira.mariadb.org/browse/MDEV-22497 which was closed a few days ago

This is exactly the error we are seeing and seems to be fixed on 10.4.13 (we are running 10.4.12 on labsdb1011, even though we do have .13 on the repo).
I am going to upgrade this version and "reclone" this host again from the backup1002 files.

Interesting, I just saw this:
https://jira.mariadb.org/browse/MDEV-22497 which was closed a few days ago

This is exactly the error we are seeing and seems to be fixed on 10.4.13 (we are running 10.4.12 on labsdb1011, even though we do have .13 on the repo).
I am going to upgrade this version and "reclone" this host again from the backup1002 files.

I just realised that this host does run 10.4.13 already :(

Change 598571 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/598571

Change 598571 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/598571

Situation as of now: I have repooled labsdb1011, it keeps having some of those errors, but it is not crashing. I want to see what happens once it gets load and normal usage.

The host keeps performing fine, serving queries and having no lag -it has had no crashes despite the fact that it keeps logging those errors from time to time.

Change 598691 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Add labsdb1010 with reduced weight

https://gerrit.wikimedia.org/r/598691

And the host finally crashed:

May 27 02:11:02 labsdb1011 mysqld[23527]: 2020-05-27 02:11:02 0x7fc65c207700  InnoDB: Assertion failure in file /root/mariadb-10.4.13/storage/innobase/row/row0ins.cc line 231
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: Failing assertion: !cursor->index->is_committed()
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: We intentionally generate a memory trap.
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: If you get repeated assertion failures or crashes, even
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: immediately after the mysqld startup, there may be
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: corruption in the InnoDB tablespace. Please refer to
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: about forcing recovery.
May 27 02:11:02 labsdb1011 mysqld[23527]: 200527  2:11:02 [ERROR] mysqld got signal 6 ;
May 27 02:11:02 labsdb1011 mysqld[23527]: This could be because you hit a bug. It is also possible that this binary
May 27 02:11:02 labsdb1011 mysqld[23527]: or one of the libraries it was linked against is corrupt, improperly built,
May 27 02:11:02 labsdb1011 mysqld[23527]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 27 02:11:02 labsdb1011 mysqld[23527]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
May 27 02:11:02 labsdb1011 mysqld[23527]: We will try our best to scrape up some info that will hopefully help
May 27 02:11:02 labsdb1011 mysqld[23527]: diagnose the problem, but since we have already crashed,
May 27 02:11:02 labsdb1011 mysqld[23527]: something is definitely wrong and this may fail.
May 27 02:11:02 labsdb1011 mysqld[23527]: Server version: 10.4.13-MariaDB
May 27 02:11:02 labsdb1011 mysqld[23527]: key_buffer_size=134217728
May 27 02:11:02 labsdb1011 mysqld[23527]: read_buffer_size=131072
May 27 02:11:02 labsdb1011 mysqld[23527]: max_used_connections=125
May 27 02:11:02 labsdb1011 mysqld[23527]: max_threads=1026
May 27 02:11:02 labsdb1011 mysqld[23527]: thread_count=148
May 27 02:11:02 labsdb1011 mysqld[23527]: It is possible that mysqld could use up to
May 27 02:11:02 labsdb1011 mysqld[23527]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2388976 K  bytes of memory
May 27 02:11:02 labsdb1011 mysqld[23527]: Hope that's ok; if not, decrease some variables in the equation.
May 27 02:11:02 labsdb1011 mysqld[23527]: Thread pointer: 0x7f5f1c0014f8
May 27 02:11:02 labsdb1011 mysqld[23527]: Attempting backtrace. You can use the following information to find out
May 27 02:11:02 labsdb1011 mysqld[23527]: where mysqld died. If you see no messages after this, something went
May 27 02:11:02 labsdb1011 mysqld[23527]: terribly wrong...
May 27 02:11:02 labsdb1011 mysqld[23527]: stack_bottom = 0x7fc65c206698 thread_stack 0x30000
May 27 02:11:04 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(my_print_stacktrace+0x2e)[0x556d7c07f7de]
May 27 02:11:06 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(handle_fatal_signal+0x54d)[0x556d7bb77b4d]
May 27 02:11:08 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fc67eb8e730]
May 27 02:11:10 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fc67e1f67bb]
May 27 02:11:11 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7fc67e1e1535]
May 27 02:11:13 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0x5a19d5)[0x556d7b8739d5]
May 27 02:11:15 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0x5907c1)[0x556d7b8627c1]
May 27 02:11:17 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xaeee0e)[0x556d7bdc0e0e]
May 27 02:11:19 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xaef4bd)[0x556d7bdc14bd]
May 27 02:11:21 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xaff23a)[0x556d7bdd123a]
May 27 02:11:22 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xa4dbd5)[0x556d7bd1fbd5]
May 27 02:11:24 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN7handler12ha_write_rowEPKh+0x14d)[0x556d7bb837ed]
May 27 02:11:26 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN14Rows_log_event9write_rowEP14rpl_group_infob+0x174)[0x556d7bc755d4]
May 27 02:11:28 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN20Write_rows_log_event11do_exec_rowEP14rpl_group_info+0x7d)[0x556d7bc75b6d]
May 27 02:11:30 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x23c)[0x556d7bc69f8c]
May 27 02:11:31 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0x5faf02)[0x556d7b8ccf02]
May 27 02:11:33 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(handle_slave_sql+0x12e2)[0x556d7b8d5ef2]
May 27 02:11:35 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xd5e28b)[0x556d7c03028b]
May 27 02:11:37 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3)[0x7fc67eb83fa3]
May 27 02:11:39 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fc67e2b84cf]
May 27 02:11:39 labsdb1011 mysqld[23527]: Trying to get some variables.
May 27 02:11:39 labsdb1011 mysqld[23527]: Some pointers may be invalid and cause the dump to abort.
May 27 02:11:39 labsdb1011 mysqld[23527]: Query (0x0):
May 27 02:11:39 labsdb1011 mysqld[23527]: Connection ID (thread ID): 88
May 27 02:11:39 labsdb1011 mysqld[23527]: Status: NOT_KILLED
May 27 02:11:39 labsdb1011 mysqld[23527]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
May 27 02:11:39 labsdb1011 mysqld[23527]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
May 27 02:11:39 labsdb1011 mysqld[23527]: information that should help you find out what is causing the crash.
May 27 02:11:39 labsdb1011 mysqld[23527]: Writing a core file...
May 27 02:11:39 labsdb1011 mysqld[23527]: Working directory at /srv/sqldata
May 27 02:11:39 labsdb1011 mysqld[23527]: Resource Limits:
May 27 02:11:39 labsdb1011 mysqld[23527]: Limit                     Soft Limit           Hard Limit           Units
May 27 02:11:39 labsdb1011 mysqld[23527]: Max cpu time              unlimited            unlimited            seconds
May 27 02:11:39 labsdb1011 mysqld[23527]: Max file size             unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max data size             unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max stack size            8388608              unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max core file size        0                    0                    bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max resident set          unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max processes             2063395              2063395              processes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max open files            200001               200001               files
May 27 02:11:39 labsdb1011 mysqld[23527]: Max locked memory         65536                65536                bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max address space         unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max file locks            unlimited            unlimited            locks
May 27 02:11:39 labsdb1011 mysqld[23527]: Max pending signals       2063395              2063395              signals
May 27 02:11:39 labsdb1011 mysqld[23527]: Max msgqueue size         819200               819200               bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max nice priority         0                    0
May 27 02:11:39 labsdb1011 mysqld[23527]: Max realtime priority     0                    0
May 27 02:11:39 labsdb1011 mysqld[23527]: Max realtime timeout      unlimited            unlimited            us
May 27 02:11:39 labsdb1011 mysqld[23527]: Core pattern: /var/tmp/core/core.%h.%e.%p....
May 27 02:12:13 labsdb1011 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT

Change 598909 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598909

@Kormat @jcrespo db1141 has caught up with replication, let's get mysqldump from it and attempt a restart?

I am thinking to reimage labsdb1011 back to Stretch and reimport a mysqldump, so maybe we can use db1141 to extract it and then see if the reimage ends up with the host crashing too?

Change 598909 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598909

Mentioned in SAL (#wikimedia-operations) [2020-05-27T04:20:19Z] <marostegui> Depool labsdb1011 - T249188

Taking a mydumper from db1141 using:

root@db1141:/srv# /usr/bin/mydumper --compress --events --triggers --routines --logfile /srv/backups/dumps/ongoing/db1141.log --outputdir /srv/backups/dumps/ongoing/dump.db1141_27_05_2020 --rows 20000000 --threads 18 --host localhost --user root

host localhost

Will it fit?

Yep, it has 2.1T available and the dump on labsdb1009 is around 880GB

Mentioned in SAL (#wikimedia-operations) [2020-05-28T04:22:13Z] <marostegui> Stop MySQL on db1141 - T249188

Positions where db1141 was stopped for restart:

1May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4488 [Note] Master 's2': Slave SQL thread exiting, replication stopped in log 'db1125-bin.002418' at position 403740127
2May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4432 [Note] Master 's2': Slave I/O thread exiting, read up to log 'db1125-bin.002418', position 403740127
3May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4489 [Note] Master 's6': Slave SQL thread exiting, replication stopped in log 'db1125-bin.001525' at position 513418261
4May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4433 [Note] Master 's6': Slave I/O thread exiting, read up to log 'db1125-bin.001525', position 513418261
5May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4490 [Note] Master 's3': Slave SQL thread exiting, replication stopped in log 'db1124-bin.002105' at position 885891780
6May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4434 [Note] Master 's3': Slave I/O thread exiting, read up to log 'db1124-bin.002105', position 885891780
7May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4491 [Note] Master 's4': Slave SQL thread exiting, replication stopped in log 'db1125-bin.003591' at position 728051471
8May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4435 [Note] Master 's4': Slave I/O thread exiting, read up to log 'db1125-bin.003591', position 728051471
9May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4492 [Note] Master 's5': Slave SQL thread exiting, replication stopped in log 'db1124-bin.001231' at position 63949745
10May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4436 [Note] Master 's5': Slave I/O thread exiting, read up to log 'db1124-bin.001231', position 63949745
11May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4493 [Note] Master 's8': Slave SQL thread exiting, replication stopped in log 'db1124-bin.003713' at position 937829874
12May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4437 [Note] Master 's8': Slave I/O thread exiting, read up to log 'db1124-bin.003713', position 937829874
13May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4494 [Note] Master 's7': Slave SQL thread exiting, replication stopped in log 'db1125-bin.002201' at position 788792501
14May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4438 [Note] Master 's7': Slave I/O thread exiting, read up to log 'db1125-bin.002201', position 788792501
15May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4495 [Note] Master 's1': Slave SQL thread exiting, replication stopped in log 'db1124-bin.002711' at position 475182352
16May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4439 [Note] Master 's1': Slave I/O thread exiting, read up to log 'db1124-bin.002711', position 475182352
17
18
19
20
21
22root@db1141:~# mysql -e "show all slaves status\G"
23*************************** 1. row ***************************
24 Connection_name: s1
25 Slave_SQL_State:
26 Slave_IO_State:
27 Master_Host: db1124.eqiad.wmnet
28 Master_User: repl
29 Master_Port: 3311
30 Connect_Retry: 60
31 Master_Log_File: db1124-bin.002711
32 Read_Master_Log_Pos: 475182352
33 Relay_Log_File: db1141-relay-bin-s1.000321
34 Relay_Log_Pos: 475182648
35 Relay_Master_Log_File: db1124-bin.002711
36 Slave_IO_Running: No
37 Slave_SQL_Running: No
38 Replicate_Do_DB:
39 Replicate_Ignore_DB:
40 Replicate_Do_Table:
41 Replicate_Ignore_Table:
42 Replicate_Wild_Do_Table:
43 Replicate_Wild_Ignore_Table:
44 Last_Errno: 0
45 Last_Error:
46 Skip_Counter: 0
47 Exec_Master_Log_Pos: 475182352
48 Relay_Log_Space: 475183276
49 Until_Condition: None
50 Until_Log_File:
51 Until_Log_Pos: 0
52 Master_SSL_Allowed: Yes
53 Master_SSL_CA_File:
54 Master_SSL_CA_Path:
55 Master_SSL_Cert:
56 Master_SSL_Cipher:
57 Master_SSL_Key:
58 Seconds_Behind_Master: NULL
59 Master_SSL_Verify_Server_Cert: No
60 Last_IO_Errno: 0
61 Last_IO_Error:
62 Last_SQL_Errno: 0
63 Last_SQL_Error:
64 Replicate_Ignore_Server_Ids:
65 Master_Server_Id: 0
66 Master_SSL_Crl:
67 Master_SSL_Crlpath:
68 Using_Gtid: No
69 Gtid_IO_Pos:
70 Replicate_Do_Domain_Ids:
71 Replicate_Ignore_Domain_Ids:
72 Parallel_Mode: conservative
73 SQL_Delay: 0
74 SQL_Remaining_Delay: NULL
75 Slave_SQL_Running_State:
76 Slave_DDL_Groups: 0
77Slave_Non_Transactional_Groups: 0
78 Slave_Transactional_Groups: 0
79 Retried_transactions: 0
80 Max_relay_log_size: 1073741824
81 Executed_log_entries: 0
82 Slave_received_heartbeats: 0
83 Slave_heartbeat_period: 30.000
84 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
85*************************** 2. row ***************************
86 Connection_name: s2
87 Slave_SQL_State:
88 Slave_IO_State:
89 Master_Host: db1125.eqiad.wmnet
90 Master_User: repl
91 Master_Port: 3312
92 Connect_Retry: 60
93 Master_Log_File: db1125-bin.002418
94 Read_Master_Log_Pos: 403740127
95 Relay_Log_File: db1141-relay-bin-s2.000305
96 Relay_Log_Pos: 403740423
97 Relay_Master_Log_File: db1125-bin.002418
98 Slave_IO_Running: No
99 Slave_SQL_Running: No
100 Replicate_Do_DB:
101 Replicate_Ignore_DB:
102 Replicate_Do_Table:
103 Replicate_Ignore_Table:
104 Replicate_Wild_Do_Table:
105 Replicate_Wild_Ignore_Table:
106 Last_Errno: 0
107 Last_Error:
108 Skip_Counter: 0
109 Exec_Master_Log_Pos: 403740127
110 Relay_Log_Space: 403741051
111 Until_Condition: None
112 Until_Log_File:
113 Until_Log_Pos: 0
114 Master_SSL_Allowed: Yes
115 Master_SSL_CA_File:
116 Master_SSL_CA_Path:
117 Master_SSL_Cert:
118 Master_SSL_Cipher:
119 Master_SSL_Key:
120 Seconds_Behind_Master: NULL
121 Master_SSL_Verify_Server_Cert: No
122 Last_IO_Errno: 0
123 Last_IO_Error:
124 Last_SQL_Errno: 0
125 Last_SQL_Error:
126 Replicate_Ignore_Server_Ids:
127 Master_Server_Id: 0
128 Master_SSL_Crl:
129 Master_SSL_Crlpath:
130 Using_Gtid: No
131 Gtid_IO_Pos:
132 Replicate_Do_Domain_Ids:
133 Replicate_Ignore_Domain_Ids:
134 Parallel_Mode: conservative
135 SQL_Delay: 0
136 SQL_Remaining_Delay: NULL
137 Slave_SQL_Running_State:
138 Slave_DDL_Groups: 0
139Slave_Non_Transactional_Groups: 0
140 Slave_Transactional_Groups: 0
141 Retried_transactions: 0
142 Max_relay_log_size: 1073741824
143 Executed_log_entries: 0
144 Slave_received_heartbeats: 0
145 Slave_heartbeat_period: 30.000
146 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
147*************************** 3. row ***************************
148 Connection_name: s3
149 Slave_SQL_State:
150 Slave_IO_State:
151 Master_Host: db1124.eqiad.wmnet
152 Master_User: repl
153 Master_Port: 3313
154 Connect_Retry: 60
155 Master_Log_File: db1124-bin.002105
156 Read_Master_Log_Pos: 885891780
157 Relay_Log_File: db1141-relay-bin-s3.000341
158 Relay_Log_Pos: 885892076
159 Relay_Master_Log_File: db1124-bin.002105
160 Slave_IO_Running: No
161 Slave_SQL_Running: No
162 Replicate_Do_DB:
163 Replicate_Ignore_DB:
164 Replicate_Do_Table:
165 Replicate_Ignore_Table:
166 Replicate_Wild_Do_Table:
167 Replicate_Wild_Ignore_Table:
168 Last_Errno: 0
169 Last_Error:
170 Skip_Counter: 0
171 Exec_Master_Log_Pos: 885891780
172 Relay_Log_Space: 885892704
173 Until_Condition: None
174 Until_Log_File:
175 Until_Log_Pos: 0
176 Master_SSL_Allowed: Yes
177 Master_SSL_CA_File:
178 Master_SSL_CA_Path:
179 Master_SSL_Cert:
180 Master_SSL_Cipher:
181 Master_SSL_Key:
182 Seconds_Behind_Master: NULL
183 Master_SSL_Verify_Server_Cert: No
184 Last_IO_Errno: 0
185 Last_IO_Error:
186 Last_SQL_Errno: 0
187 Last_SQL_Error:
188 Replicate_Ignore_Server_Ids:
189 Master_Server_Id: 0
190 Master_SSL_Crl:
191 Master_SSL_Crlpath:
192 Using_Gtid: No
193 Gtid_IO_Pos:
194 Replicate_Do_Domain_Ids:
195 Replicate_Ignore_Domain_Ids:
196 Parallel_Mode: conservative
197 SQL_Delay: 0
198 SQL_Remaining_Delay: NULL
199 Slave_SQL_Running_State:
200 Slave_DDL_Groups: 0
201Slave_Non_Transactional_Groups: 0
202 Slave_Transactional_Groups: 0
203 Retried_transactions: 0
204 Max_relay_log_size: 1073741824
205 Executed_log_entries: 0
206 Slave_received_heartbeats: 0
207 Slave_heartbeat_period: 30.000
208 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
209*************************** 4. row ***************************
210 Connection_name: s4
211 Slave_SQL_State:
212 Slave_IO_State:
213 Master_Host: db1125.eqiad.wmnet
214 Master_User: repl
215 Master_Port: 3314
216 Connect_Retry: 60
217 Master_Log_File: db1125-bin.003591
218 Read_Master_Log_Pos: 728051471
219 Relay_Log_File: db1141-relay-bin-s4.000281
220 Relay_Log_Pos: 728051767
221 Relay_Master_Log_File: db1125-bin.003591
222 Slave_IO_Running: No
223 Slave_SQL_Running: No
224 Replicate_Do_DB:
225 Replicate_Ignore_DB:
226 Replicate_Do_Table:
227 Replicate_Ignore_Table:
228 Replicate_Wild_Do_Table:
229 Replicate_Wild_Ignore_Table:
230 Last_Errno: 0
231 Last_Error:
232 Skip_Counter: 0
233 Exec_Master_Log_Pos: 728051471
234 Relay_Log_Space: 728052395
235 Until_Condition: None
236 Until_Log_File:
237 Until_Log_Pos: 0
238 Master_SSL_Allowed: Yes
239 Master_SSL_CA_File:
240 Master_SSL_CA_Path:
241 Master_SSL_Cert:
242 Master_SSL_Cipher:
243 Master_SSL_Key:
244 Seconds_Behind_Master: NULL
245 Master_SSL_Verify_Server_Cert: No
246 Last_IO_Errno: 0
247 Last_IO_Error:
248 Last_SQL_Errno: 0
249 Last_SQL_Error:
250 Replicate_Ignore_Server_Ids:
251 Master_Server_Id: 0
252 Master_SSL_Crl:
253 Master_SSL_Crlpath:
254 Using_Gtid: No
255 Gtid_IO_Pos:
256 Replicate_Do_Domain_Ids:
257 Replicate_Ignore_Domain_Ids:
258 Parallel_Mode: conservative
259 SQL_Delay: 0
260 SQL_Remaining_Delay: NULL
261 Slave_SQL_Running_State:
262 Slave_DDL_Groups: 0
263Slave_Non_Transactional_Groups: 0
264 Slave_Transactional_Groups: 0
265 Retried_transactions: 0
266 Max_relay_log_size: 1073741824
267 Executed_log_entries: 0
268 Slave_received_heartbeats: 0
269 Slave_heartbeat_period: 30.000
270 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
271*************************** 5. row ***************************
272 Connection_name: s5
273 Slave_SQL_State:
274 Slave_IO_State:
275 Master_Host: db1124.eqiad.wmnet
276 Master_User: repl
277 Master_Port: 3315
278 Connect_Retry: 60
279 Master_Log_File: db1124-bin.001231
280 Read_Master_Log_Pos: 63949745
281 Relay_Log_File: db1141-relay-bin-s5.000111
282 Relay_Log_Pos: 63950041
283 Relay_Master_Log_File: db1124-bin.001231
284 Slave_IO_Running: No
285 Slave_SQL_Running: No
286 Replicate_Do_DB:
287 Replicate_Ignore_DB:
288 Replicate_Do_Table:
289 Replicate_Ignore_Table:
290 Replicate_Wild_Do_Table:
291 Replicate_Wild_Ignore_Table:
292 Last_Errno: 0
293 Last_Error:
294 Skip_Counter: 0
295 Exec_Master_Log_Pos: 63949745
296 Relay_Log_Space: 63950669
297 Until_Condition: None
298 Until_Log_File:
299 Until_Log_Pos: 0
300 Master_SSL_Allowed: Yes
301 Master_SSL_CA_File:
302 Master_SSL_CA_Path:
303 Master_SSL_Cert:
304 Master_SSL_Cipher:
305 Master_SSL_Key:
306 Seconds_Behind_Master: NULL
307 Master_SSL_Verify_Server_Cert: No
308 Last_IO_Errno: 0
309 Last_IO_Error:
310 Last_SQL_Errno: 0
311 Last_SQL_Error:
312 Replicate_Ignore_Server_Ids:
313 Master_Server_Id: 0
314 Master_SSL_Crl:
315 Master_SSL_Crlpath:
316 Using_Gtid: No
317 Gtid_IO_Pos:
318 Replicate_Do_Domain_Ids:
319 Replicate_Ignore_Domain_Ids:
320 Parallel_Mode: conservative
321 SQL_Delay: 0
322 SQL_Remaining_Delay: NULL
323 Slave_SQL_Running_State:
324 Slave_DDL_Groups: 0
325Slave_Non_Transactional_Groups: 0
326 Slave_Transactional_Groups: 0
327 Retried_transactions: 0
328 Max_relay_log_size: 1073741824
329 Executed_log_entries: 0
330 Slave_received_heartbeats: 0
331 Slave_heartbeat_period: 30.000
332 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
333*************************** 6. row ***************************
334 Connection_name: s6
335 Slave_SQL_State:
336 Slave_IO_State:
337 Master_Host: db1125.eqiad.wmnet
338 Master_User: repl
339 Master_Port: 3316
340 Connect_Retry: 60
341 Master_Log_File: db1125-bin.001525
342 Read_Master_Log_Pos: 513418261
343 Relay_Log_File: db1141-relay-bin-s6.000209
344 Relay_Log_Pos: 513418557
345 Relay_Master_Log_File: db1125-bin.001525
346 Slave_IO_Running: No
347 Slave_SQL_Running: No
348 Replicate_Do_DB:
349 Replicate_Ignore_DB:
350 Replicate_Do_Table:
351 Replicate_Ignore_Table:
352 Replicate_Wild_Do_Table:
353 Replicate_Wild_Ignore_Table:
354 Last_Errno: 0
355 Last_Error:
356 Skip_Counter: 0
357 Exec_Master_Log_Pos: 513418261
358 Relay_Log_Space: 513419185
359 Until_Condition: None
360 Until_Log_File:
361 Until_Log_Pos: 0
362 Master_SSL_Allowed: Yes
363 Master_SSL_CA_File:
364 Master_SSL_CA_Path:
365 Master_SSL_Cert:
366 Master_SSL_Cipher:
367 Master_SSL_Key:
368 Seconds_Behind_Master: NULL
369 Master_SSL_Verify_Server_Cert: No
370 Last_IO_Errno: 0
371 Last_IO_Error:
372 Last_SQL_Errno: 0
373 Last_SQL_Error:
374 Replicate_Ignore_Server_Ids:
375 Master_Server_Id: 0
376 Master_SSL_Crl:
377 Master_SSL_Crlpath:
378 Using_Gtid: No
379 Gtid_IO_Pos:
380 Replicate_Do_Domain_Ids:
381 Replicate_Ignore_Domain_Ids:
382 Parallel_Mode: conservative
383 SQL_Delay: 0
384 SQL_Remaining_Delay: NULL
385 Slave_SQL_Running_State:
386 Slave_DDL_Groups: 0
387Slave_Non_Transactional_Groups: 0
388 Slave_Transactional_Groups: 0
389 Retried_transactions: 0
390 Max_relay_log_size: 1073741824
391 Executed_log_entries: 0
392 Slave_received_heartbeats: 0
393 Slave_heartbeat_period: 30.000
394 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
395*************************** 7. row ***************************
396 Connection_name: s7
397 Slave_SQL_State:
398 Slave_IO_State:
399 Master_Host: db1125.eqiad.wmnet
400 Master_User: repl
401 Master_Port: 3317
402 Connect_Retry: 60
403 Master_Log_File: db1125-bin.002201
404 Read_Master_Log_Pos: 788792501
405 Relay_Log_File: db1141-relay-bin-s7.000297
406 Relay_Log_Pos: 788792797
407 Relay_Master_Log_File: db1125-bin.002201
408 Slave_IO_Running: No
409 Slave_SQL_Running: No
410 Replicate_Do_DB:
411 Replicate_Ignore_DB:
412 Replicate_Do_Table:
413 Replicate_Ignore_Table:
414 Replicate_Wild_Do_Table:
415 Replicate_Wild_Ignore_Table:
416 Last_Errno: 0
417 Last_Error:
418 Skip_Counter: 0
419 Exec_Master_Log_Pos: 788792501
420 Relay_Log_Space: 788793425
421 Until_Condition: None
422 Until_Log_File:
423 Until_Log_Pos: 0
424 Master_SSL_Allowed: Yes
425 Master_SSL_CA_File:
426 Master_SSL_CA_Path:
427 Master_SSL_Cert:
428 Master_SSL_Cipher:
429 Master_SSL_Key:
430 Seconds_Behind_Master: NULL
431 Master_SSL_Verify_Server_Cert: No
432 Last_IO_Errno: 0
433 Last_IO_Error:
434 Last_SQL_Errno: 0
435 Last_SQL_Error:
436 Replicate_Ignore_Server_Ids:
437 Master_Server_Id: 0
438 Master_SSL_Crl:
439 Master_SSL_Crlpath:
440 Using_Gtid: No
441 Gtid_IO_Pos:
442 Replicate_Do_Domain_Ids:
443 Replicate_Ignore_Domain_Ids:
444 Parallel_Mode: conservative
445 SQL_Delay: 0
446 SQL_Remaining_Delay: NULL
447 Slave_SQL_Running_State:
448 Slave_DDL_Groups: 0
449Slave_Non_Transactional_Groups: 0
450 Slave_Transactional_Groups: 0
451 Retried_transactions: 0
452 Max_relay_log_size: 1073741824
453 Executed_log_entries: 0
454 Slave_received_heartbeats: 0
455 Slave_heartbeat_period: 30.000
456 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
457*************************** 8. row ***************************
458 Connection_name: s8
459 Slave_SQL_State:
460 Slave_IO_State:
461 Master_Host: db1124.eqiad.wmnet
462 Master_User: repl
463 Master_Port: 3318
464 Connect_Retry: 60
465 Master_Log_File: db1124-bin.003713
466 Read_Master_Log_Pos: 937829874
467 Relay_Log_File: db1141-relay-bin-s8.000281
468 Relay_Log_Pos: 937830170
469 Relay_Master_Log_File: db1124-bin.003713
470 Slave_IO_Running: No
471 Slave_SQL_Running: No
472 Replicate_Do_DB:
473 Replicate_Ignore_DB:
474 Replicate_Do_Table:
475 Replicate_Ignore_Table:
476 Replicate_Wild_Do_Table:
477 Replicate_Wild_Ignore_Table:
478 Last_Errno: 0
479 Last_Error:
480 Skip_Counter: 0
481 Exec_Master_Log_Pos: 937829874
482 Relay_Log_Space: 937830798
483 Until_Condition: None
484 Until_Log_File:
485 Until_Log_Pos: 0
486 Master_SSL_Allowed: Yes
487 Master_SSL_CA_File:
488 Master_SSL_CA_Path:
489 Master_SSL_Cert:
490 Master_SSL_Cipher:
491 Master_SSL_Key:
492 Seconds_Behind_Master: NULL
493 Master_SSL_Verify_Server_Cert: No
494 Last_IO_Errno: 0
495 Last_IO_Error:
496 Last_SQL_Errno: 0
497 Last_SQL_Error:
498 Replicate_Ignore_Server_Ids:
499 Master_Server_Id: 0
500 Master_SSL_Crl:
501 Master_SSL_Crlpath:
502 Using_Gtid: No
503 Gtid_IO_Pos:
504 Replicate_Do_Domain_Ids:
505 Replicate_Ignore_Domain_Ids:
506 Parallel_Mode: conservative
507 SQL_Delay: 0
508 SQL_Remaining_Delay: NULL
509 Slave_SQL_Running_State:
510 Slave_DDL_Groups: 0
511Slave_Non_Transactional_Groups: 0
512 Slave_Transactional_Groups: 0
513 Retried_transactions: 0
514 Max_relay_log_size: 1073741824
515 Executed_log_entries: 0
516 Slave_received_heartbeats: 0
517 Slave_heartbeat_period: 30.000
518 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046

Good news, db1141 has been restarted 10 minutes ago, and so far no crashes and no errors on the log. All clean.
I am going to wait a few more minutes, but then I would like to pool it on analytics role, to see if that triggers something.

Change 599153 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Pool db1141 into analytics role

https://gerrit.wikimedia.org/r/599153

Mentioned in SAL (#wikimedia-operations) [2020-05-28T04:44:11Z] <marostegui> Run check_private data on db1141 - T249188

db1141 check data finished up clean, I have checked grants, roles and query killer, and the host is ready to be pooled.

Change 599153 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Pool db1141 into analytics role

https://gerrit.wikimedia.org/r/599153

Mentioned in SAL (#wikimedia-operations) [2020-05-28T08:13:56Z] <marostegui> Pool db1141 into labsdb analytics role - T249188

db1141 is now serving the analytics role. I can query it finely and I can see other connections arriving from the proxy:

marostegui@tools-sgebastion-07:~$ sql --cluster analytics enwiki_p
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 13325
Server version: 10.4.13-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [enwiki_p]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1141     |
+------------+
1 row in set (0.00 sec)

MariaDB [enwiki_p]>

From quarry it also works fine: https://quarry.wmflabs.org/query/45259

And from the proxy:

|  6497 | s52927          | 10.64.37.27:37470 | commonswiki_p      | Sleep     |   110 |                                                                             | NULL                                                                                                 |    0.000 |
|  7662 | s51295          | 10.64.37.27:39798 | commonswiki_p      | Sleep     |    33 |                                                                             | NULL                                                                                                 |    0.000 |
|  9985 | s51295          | 10.64.37.27:44440 | commonswiki_p      | Sleep     |    32 |                                                                             | NULL                                                                                                 |    0.000 |
|  9997 | s52741          | 10.64.37.27:44464 | cswiki_p           | Sleep     |    88 |                                                                             | NULL                                                                                                 |    0.000 |
| 10078 | s52788          | 10.64.37.27:44626 | enwiki_p           | Sleep     |    88 |                                                                             | NULL                                                                                                 |    0.000 |
| 11166 | s52788          | 10.64.37.27:46806 | enwiki_p           | Sleep     |    81 |                                                                             | NULL                                                                                                 |    0.000 |
| 12155 | s54113          | 10.64.37.27:48784 | commonswiki_p      | Sleep     |    75 |                                                                             | NULL                                                                                                 |    0.000 |
| 12728 | s54113          | 10.64.37.27:49932 | commonswiki_p      | Sleep     |    71 |                                                                             | NULL                                                                                                 |    0.000 |
| 13325 | u15343          | 10.64.37.27:51124 | enwiki_p           | Sleep     |    62 |                                                                             | NULL                                                                                                 |    0.000 |
| 14469 | s54209          | 10.64.37.27:53414 | dewiki_p           | Query     |    59 | Creating sort index                                                         | select page_namespace, page_title from page, templatelinks where tl_from = page_id and tl_title = 'K |    0.000 |
| 16902 | s51592          | 10.64.37.27:58288 | svwiki_p           | Sleep     |    43 |                                                                             | NULL                                                                                                 |    0.000 |
| 17202 | s52927          | 10.64.37.27:58890 | commonswiki_p      | Sleep     |    41 |                                                                             | NULL                                                                                                 |    0.000 |
| 20995 | s52927          | 10.64.37.27:38242 | commonswiki_p      | Sleep     |    18 |                                                                             | NULL                                                                                                 |    0.000 |
| 21001 | s52927          | 10.64.37.27:38254 | commonswiki_p      | Sleep     |    18 |                                                                             | NULL                                                                                                 |    0.000 |
| 21009 | s52835          | 10.64.37.27:38270 | enwiki_p           | Query     |    18 | Sending data                                                                | select page_namespace, ll_title, pa_importance, pa_class
    from page, langlinks, page_assessments, |    0.000 |
| 21235 | s51223          | 10.64.37.27:38722 | itwiki_p           | Query     |    17 | Sending data                                                                | SELECT CONCAT('# [[', page_title, ']]')
FROM page
WHERE page_namespace = 0
AND page_is_redirect = 0
 |    0.000 |
| 22399 | s54113          | 10.64.37.27:41054 | commonswiki_p      | Sleep     |    11 |                                                                             | NULL                                                                                                 |    0.000 |
| 23541 | s51469          | 10.64.37.27:43338 | enwiki_p           | Query     |     4 | Sending data                                                                | SELECT rev_id, rev_len, page_title, page_namespace, rev_timestamp, com.comment_text, page_is_redirec |    0.000 |
| 23543 | s51469          | 10.64.37.27:43342 | enwiki_p           | Query     |     4 | Sending data                                                                | SELECT COUNT(*)
FROM revision_userindex JOIN page ON page_id=rev_page
INNER JOIN comment com ON rev_ |    0.000 |
| 23644 | u12903          | 10.64.37.27:43542 | hewiki_p           | Query     |     0 | Sending data                                                                | select /*SLOW_OK updatestats*/ r.*, page_id, page_title, page_namespace, page_is_redirect, rp.rev_le |    0.000 |
| 23781 | u12903          | 10.64.37.27:43818 | lijwiki_p          | Sleep     |     1 |                                                                             | NULL                                                                                                 |    0.000 |
| 23939 | u12903          | 10.64.37.27:44134 | plwikinews_p       | Sleep     |     0 |                                                                             | NULL                                                                                                 |    0.000 |
| 24050 | s52835          | 10.64.37.27:44356 | frwiki_p           | Query     |     1 | Sending data                                                                | select page_namespace, ll_title, pa_importance, pa_class
    from page, langlinks, page_assessments, |    0.000 |
| 24250 | s52835          | 10.64.37.27:44766 | frwiki_p           | Query     |     0 | Sending data                                                                | select page_namespace, ll_title, pa_importance, pa_class
    from page, langlinks, page_assessments, |    0.000 |
| 24268 | root            | localhost         | NULL               | Query     |     0 | Init                                                                        | show processlist                                                                                     |    0.000 |
+-------+-----------------+-------------------+--------------------+-----------+-------+-----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
root@db1141:~#

Change 599282 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1141: Enable notifications

https://gerrit.wikimedia.org/r/599282

Change 599282 merged by Marostegui:
[operations/puppet@production] db1141: Enable notifications

https://gerrit.wikimedia.org/r/599282

Removed the binary backup from labsdb1011 from /srv/production/labsdb1011 which was around 6.6T
On the other hand, I am copying the logical dump made on db1141 BEFORE it was restarted once it was in sync with the master (859G) to: backup1001:/srv/production/db1141_logical_once_replication_caught_up (19TB free there).

db1141 keeps serving traffic without any errors or issues. Those are good news.

I am going to leave it running like this till Monday.
If things continue to be fine on Monday, I think I will just do a last test with:

  • Stop MySQL on db1141
  • Copy all the data to labsdb1011
  • Start both hosts, so we can check two things
    • if db1141 keeps working fine after a second restart
    • see if labsdb1011 crashes again with a different set of binary data (coming directly from an already migrated 10.4 host)
  • If db1141 keeps working fine for a few more days after that third restart, I will probably:
    • Put it down, and copy a binary backup from it
    • Attempt to do an in-place upgrade on labsdb1010.

Change 599466 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wmcs: Add db1141.eqiad.wmnet to maintain-dbusers

https://gerrit.wikimedia.org/r/599466

Change 599466 merged by Bstorm:
[operations/puppet@production] wmcs: Add db1141.eqiad.wmnet to maintain-dbusers

https://gerrit.wikimedia.org/r/599466

Change 601138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Depool db1141 to take a binary dump

https://gerrit.wikimedia.org/r/601138

db1141 has been working fine during the weekend - no trace of possible crashes or the error that precedes the crashes on the logs.
Going to depool lit, take a binary dump and then attempt to upgrade labsdb1010.

If all this works fine, we need to pool labsdb1010 on Analytics role with some weight, otherwise it seems that db1141 cannot serve that role on its own and keeps lagging forever (sort of what we already saw with labsdb1011)

Change 601138 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Depool db1141 to take a binary dump

https://gerrit.wikimedia.org/r/601138

Mentioned in SAL (#wikimedia-operations) [2020-06-01T04:44:08Z] <marostegui> Depool db1141 from Analytics role - T249188

The logical backup mentioned at T249188#6171874, which used to be at backup1001:/srv/production/db1141_logical_once_replication_caught_up has been moved to backup1002:/srv/backups/T249188/ongoing