
Reimage labsdb1011 to Buster and MariaDB 10.4
Closed, Declined | Public

Description

Recently, we have seen issues with Quarry and labsdb1011 (T247978 T246970)
Labsdb1011 is always very overloaded (https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=labsdb1011&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=now-24h&to=now&fullscreen&panelId=3) and we've seen in s8 that 10.4 and Buster seem to be performing better, CPU-wise.
Let's try to reimage this host with Buster and see what happens.

/srv won't be formatted.

The rollback plan would be to reimage to stretch and reclone from labsdb1012.

Once the host has no traffic, let's benchmark its disk performance to make sure its IOPS are similar to those of labsdb1012.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +13 -10
operations/puppet  production  +10 -7
operations/puppet  production  +0 -1
operations/puppet  production  +2 -1
operations/puppet  production  +4 -4
operations/puppet  production  +4 -0
operations/puppet  production  +2 -1
operations/puppet  production  +4 -4
operations/puppet  production  +7 -7
operations/puppet  production  +4 -4
operations/puppet  production  +4 -4
operations/puppet  production  +4 -4
operations/puppet  production  +1 -1
operations/puppet  production  +2 -2
operations/puppet  production  +6 -4
operations/puppet  production  +1 -1
operations/puppet  production  +4 -3
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +1 -1
operations/puppet  production  +6 -4
operations/puppet  production  +0 -1
operations/puppet  production  +7 -5

Event Timeline

labsdb1011 keeps catching up nicely:

root@labsdb1011:~# mysql -e "show all slaves status\G" | grep Seconds
         Seconds_Behind_Master: 115303
         Seconds_Behind_Master: 9768
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 139544
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 48016

labsdb1011 is up-to-date:

#  mysql.py -hlabsdb1011 -e "show all slaves status\G" | grep Seconds
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
         Seconds_Behind_Master: 0
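
For anyone who wants to script the "has it caught up yet?" check rather than eyeballing the grep output, here is a minimal Python sketch. It only parses text in the format shown above; the function names `parse_lags` and `caught_up` are made up for illustration and are not part of any existing tooling.

```python
import re

# Matches lines such as "         Seconds_Behind_Master: 115303"
LAG_RE = re.compile(r"^[ \t]*Seconds_Behind_Master:[ \t]*(\S+)", re.MULTILINE)


def parse_lags(status_output):
    """Return one lag value per replication connection (None when the value is NULL)."""
    lags = []
    for value in LAG_RE.findall(status_output):
        lags.append(None if value == "NULL" else int(value))
    return lags


def caught_up(lags):
    """True only if there is at least one connection and every one reports 0 seconds."""
    return bool(lags) and all(lag == 0 for lag in lags)


sample = """\
         Seconds_Behind_Master: 115303
         Seconds_Behind_Master: 9768
         Seconds_Behind_Master: 0
"""

lags = parse_lags(sample)
print(lags, caught_up(lags))
```

A NULL value (replication thread stopped) is treated as "not caught up" here, which seems like the safer default before repooling.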

Change 597762 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/597762

Change 597762 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/597762

Mentioned in SAL (#wikimedia-operations) [2020-05-21T12:12:51Z] <marostegui> Repool labsdb1011 into the analytics role 🤞- T249188

labsdb1011 is now serving queries.
Quarry seems to be working fine too: https://quarry.wmflabs.org/query/45075

labsdb1011 seems to be working fine.
The good news is that 10.4 + Buster seems to confirm that CPU usage is a lot better: the host no longer sits at almost 100% usage as it used to.

Captura de pantalla 2020-05-22 a las 6.45.45.png (766×1 px, 235 KB)

So labsdb1011 looks stable. CPU seems to be stable at around 30% usage (which is a big improvement compared to the previous values).

Lag grows, but that is not uncommon for the analytics role.
This is how labsdb1010 behaved for the weeks it's been serving the analytics role:

Captura de pantalla 2020-05-22 a las 13.31.20.png (759×1 px, 501 KB)

I am going to leave labsdb1011 working for the weekend, but on Monday we should do the last test to see if the restart is what causes the corruption.
My idea is to:

  • Depool labsdb1011
  • Let all the running queries finish
  • Let the slave lag catch up on all the threads
  • Stop mysql
  • Start mysql
  • Monitor the log
  • Repool the host back.
  • Check again for a few days if crashes re-appear.

Labsdb1011 has been working fine since Thursday. The lag keeps growing, although more slowly than before.
I am going to go ahead and do the above test:

  • Depool labsdb1011
  • Let all the running queries finish
  • Let the slave lag catch up on all the threads
  • Stop mysql
  • Start mysql
  • Monitor the log
  • Repool the host back.
  • Check again for a few days if crashes re-appear.

If all this goes fine and labsdb1011 keeps working fine for a few more days, maybe we should pool labsdb1010 to balance queries for the analytics role rather than the web role, which seems to have no load/lag issues.

Change 598294 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598294

Change 598294 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598294

Mentioned in SAL (#wikimedia-operations) [2020-05-25T11:21:47Z] <marostegui> Extend db1141's (temporary labsdb test host) /srv 1TB extra - T249188

Mentioned in SAL (#wikimedia-operations) [2020-05-26T04:13:58Z] <marostegui> Stop slaves and stop mysql on labsdb1011 T249188

So it looks like the restart is what causes the issues.
Stopping the slaves reported no issues, and neither did stopping mysql.
Starting mysql brought no issues, but as soon as I started the replication threads... bad news:

May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 31 [Note] Master 's6': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3316' in log 'db1125-bin.001516' at position 150259934
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 33 [Note] Master 's2': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3312' in log 'db1125-bin.002403' at position 440364672
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 35 [Note] Master 's3': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3313' in log 'db1124-bin.002091' at position 844723555
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 37 [Note] Master 's4': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3314' in log 'db1125-bin.003581' at position 680368537
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 39 [Note] Master 's8': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3318' in log 'db1124-bin.003703' at position 26857432
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 41 [Note] Master 's5': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3315' in log 'db1124-bin.001226' at position 889729958
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 43 [Note] Master 's7': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3317' in log 'db1125-bin.002189' at position 486105063
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 45 [Note] Master 's1': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3311' in log 'db1124-bin.002696' at position 813770887
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 37 [Note] Master 's4': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3314',replication started in log 'db1125-bin.003581' at position 680368537
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 35 [Note] Master 's3': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3313',replication started in log 'db1124-bin.002091' at position 844723555
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 43 [Note] Master 's7': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3317',replication started in log 'db1125-bin.002189' at position 486105063
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 45 [Note] Master 's1': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3311',replication started in log 'db1124-bin.002696' at position 813770887
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 33 [Note] Master 's2': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3312',replication started in log 'db1125-bin.002403' at position 440364672
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 31 [Note] Master 's6': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3316',replication started in log 'db1125-bin.001516' at position 150259934
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 41 [Note] Master 's5': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3315',replication started in log 'db1124-bin.001226' at position 889729958
May 26 04:20:39 labsdb1011 mysqld[19806]: 2020-05-26  4:20:39 39 [Note] Master 's8': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3318',replication started in log 'db1124-bin.003703' at position 26857432
May 26 04:20:44 labsdb1011 mysqld[19806]: 2020-05-26  4:20:44 4 [ERROR] InnoDB: Unable to find a record to delete-mark
May 26 04:20:44 labsdb1011 mysqld[19806]: InnoDB: tuple DATA TUPLE: 4 fields;
May 26 04:20:44 labsdb1011 mysqld[19806]:  0: len 4; hex 80000004; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  1: len 4; hex 80000000; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  2: len 22; hex 5468655f47797073795f616e645f7468655f4b696e67; asc The_Gypsy_and_the_King;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  3: len 4; hex 00109a35; asc    5;;
May 26 04:20:44 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 4; compact format; info bits 0
May 26 04:20:44 labsdb1011 mysqld[19806]:  0: len 4; hex 80000004; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  1: len 4; hex 80000000; asc     ;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  2: len 15; hex 5468655f47797073795f547261696c; asc The_Gypsy_Trail;;
May 26 04:20:44 labsdb1011 mysqld[19806]:  3: len 4; hex 02e4fc4d; asc    M;;
May 26 04:20:44 labsdb1011 mysqld[19806]: 2020-05-26  4:20:44 4 [ERROR] InnoDB: page [page id: space=94970, page number=4807528] (305 records, index id 334204).
May 26 04:20:44 labsdb1011 mysqld[19806]: 2020-05-26  4:20:44 4 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 26 04:20:45 labsdb1011 mysqld[19806]: 2020-05-26  4:20:45 4 [ERROR] InnoDB: Unable to find a record to delete-mark
May 26 04:20:45 labsdb1011 mysqld[19806]: InnoDB: tuple DATA TUPLE: 3 fields;
May 26 04:20:45 labsdb1011 mysqld[19806]:  0: len 4; hex 80000000; asc     ;;
May 26 04:20:45 labsdb1011 mysqld[19806]:  1: len 19; hex 484d535f5265736f6c7574655f283138353029; asc HMS_Resolute_(1850);;
May 26 04:20:45 labsdb1011 mysqld[19806]:  2: len 4; hex 02ea5d4b; asc   ]K;;
May 26 04:20:45 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
May 26 04:20:45 labsdb1011 mysqld[19806]:  0: len 4; hex 80000000; asc     ;;
May 26 04:20:45 labsdb1011 mysqld[19806]:  1: len 19; hex 484d535f5265736f6c7574655f283138353029; asc HMS_Resolute_(1850);;
May 26 04:20:45 labsdb1011 mysqld[19806]:  2: len 4; hex 02e860c8; asc   ` ;;
May 26 04:20:45 labsdb1011 mysqld[19806]: 2020-05-26  4:20:45 4 [ERROR] InnoDB: page [page id: space=94970, page number=1579497] (421 records, index id 334203).
May 26 04:20:45 labsdb1011 mysqld[19806]: 2020-05-26  4:20:45 4 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 26 04:20:47 labsdb1011 mysqld[19806]: 2020-05-26  4:20:47 46 [ERROR] Master 's1': InnoDB: Record in index `pl_backlinks_namespace` of table `enwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[22]The_Gypsy_and_the_King(0x5468655F47797073795F616E645F7468655F4B696E67),[4]  e (0x000665A4)} at: COMPACT RECORD(info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[15]The_Gypsy_Trail(0x5468655F47797073795F547261696C),[4]   M(0x02E4FC4D)}
May 26 04:20:53 labsdb1011 mysqld[19806]: 2020-05-26  4:20:53 3 [ERROR] InnoDB: Unable to find a record to delete-mark
May 26 04:20:53 labsdb1011 mysqld[19806]: InnoDB: tuple DATA TUPLE: 3 fields;

May 26 04:21:51 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 4; compact format; info bits 0
May 26 04:21:51 labsdb1011 mysqld[19806]:  0: len 4; hex 80000002; asc     ;;
May 26 04:21:51 labsdb1011 mysqld[19806]:  1: len 4; hex 80000000; asc     ;;
May 26 04:21:51 labsdb1011 mysqld[19806]:  2: len 10; hex 43616c696d6572697573; asc Calimerius;;
May 26 04:21:51 labsdb1011 mysqld[19806]:  3: len 4; hex 00dac0fe; asc     ;;
May 26 04:21:51 labsdb1011 mysqld[19806]: 2020-05-26  4:21:51 4 [ERROR] InnoDB: page [page id: space=94970, page number=10530772] (382 records, index id 334204).
May 26 04:21:51 labsdb1011 mysqld[19806]: 2020-05-26  4:21:51 4 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 26 04:22:01 labsdb1011 mysqld[19806]: 2020-05-26  4:22:01 46 [ERROR] Master 's1': InnoDB: Record in index `pl_backlinks_namespace` of table `enwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[22]The_Gypsy_and_the_King(0x5468655F47797073795F616E645F7468655F4B696E67),[4]    (0x00100CFD)} at: COMPACT RECORD(info_bits=0, 4 fields): {[4]    (0x80000004),[4]    (0x80000000),[15]The_Gypsy_Trail(0x5468655F47797073795F547261696C),[4]   M(0x02E4FC4D)}
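
To make the tuple dumps above easier to read, here is a small Python sketch that decodes the `N: len L; hex H;` fields. It assumes (this is my reading of the `pl_backlinks_namespace` index named in the log, not something the log states) that the first two fields are signed integers and the last one is an unsigned page id. InnoDB stores signed integers big-endian with the sign bit flipped, which is why namespace 4 shows up as `80000004`. The helper names are hypothetical.

```python
import re

# One field of an InnoDB tuple dump, e.g. " 0: len 4; hex 80000004; asc     ;;"
FIELD_RE = re.compile(r"\d+: len \d+; hex ([0-9a-f]+);")


def parse_fields(log_text):
    """Return the raw bytes of every field found in the dump, in order."""
    return [bytes.fromhex(hex_value) for hex_value in FIELD_RE.findall(log_text)]


def signed_int(raw):
    """Decode a signed integer stored big-endian with the sign bit flipped."""
    return int.from_bytes(raw, "big") - (1 << (8 * len(raw) - 1))


sample = """\
 0: len 4; hex 80000004; asc     ;;
 1: len 4; hex 80000000; asc     ;;
 2: len 22; hex 5468655f47797073795f616e645f7468655f4b696e67; asc The_Gypsy_and_the_King;;
 3: len 4; hex 00109a35; asc    5;;
"""

fields = parse_fields(sample)
# Assumed layout: (signed int, signed int, title bytes, unsigned page id)
print(signed_int(fields[0]), signed_int(fields[1]),
      fields[2].decode("utf-8"), int.from_bytes(fields[3], "big"))
```

Fields that the log truncates (the ones marked "total N bytes") only decode to the prefix that was printed.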

I am going to leave this running for a few minutes, but I want to pool this back to see if it really crashes, as this hasn't crashed yet.

I have repooled the host and the queries are arriving.

The errors stopped and they are definitely not happening as fast as they used to.
This was the last one:

May 26 04:34:18 labsdb1011 mysqld[19806]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
May 26 04:34:18 labsdb1011 mysqld[19806]:  0: len 4; hex 006ff6ff; asc  o  ;;
May 26 04:34:18 labsdb1011 mysqld[19806]:  1: len 30; hex 687474703a2f2f676f762e6e6173612e6a706c2e7373642e2f736264622e; asc http://gov.nasa.jpl.ssd./sbdb.; (total 54 bytes);
May 26 04:34:18 labsdb1011 mysqld[19806]:  2: len 4; hex 0106a2c5; asc     ;;
May 26 04:34:18 labsdb1011 mysqld[19806]: space 98868 offset 965562 (183 records, index id 347862)

Still no more errors, the only new message was, interestingly:

May 26 04:41:29 labsdb1011 mysqld[19806]: 2020-05-26  4:41:29 0 [Note] InnoDB: Buffer pool(s) load completed at 200526  4:41:28

Maybe all those above are related to the load process of the buffer pool?

Interesting, I just saw this:
https://jira.mariadb.org/browse/MDEV-22497 which was closed a few days ago

This is exactly the error we are seeing and it seems to be fixed in 10.4.13 (we are running 10.4.12 on labsdb1011, even though we do have .13 in the repo).
I am going to upgrade this version and "reclone" this host again from the backup1002 files.

I just realised that this host does run 10.4.13 already :(
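
A quick way to avoid this kind of mix-up is to compare the version string the server reports against the release that carries the fix. A minimal Python sketch follows; the `10.4.13` threshold is taken from the comment above, the helper names are hypothetical, and the comparison is only meaningful within the same release series.

```python
import re

# Per the comment above, MDEV-22497 is reported as fixed in 10.4.13.
FIXED_IN = (10, 4, 13)


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '10.4.13-MariaDB'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def has_fix(version_text, fixed_in=FIXED_IN):
    """True if the reported version is at or above the release with the fix."""
    return parse_version(version_text) >= fixed_in


print(has_fix("Server version: 10.4.13-MariaDB"))
print(has_fix("10.4.12-MariaDB"))
```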

Change 598571 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/598571

Change 598571 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Repool labsdb1011

https://gerrit.wikimedia.org/r/598571

Situation as of now: I have repooled labsdb1011. It keeps logging some of those errors, but it is not crashing. I want to see what happens once it gets load and normal usage.

The host keeps performing fine, serving queries with no lag. It has had no crashes, even though it keeps logging those errors from time to time.

Change 598691 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Add labsdb1010 with reduced weight

https://gerrit.wikimedia.org/r/598691

And the host finally crashed:

May 27 02:11:02 labsdb1011 mysqld[23527]: 2020-05-27 02:11:02 0x7fc65c207700  InnoDB: Assertion failure in file /root/mariadb-10.4.13/storage/innobase/row/row0ins.cc line 231
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: Failing assertion: !cursor->index->is_committed()
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: We intentionally generate a memory trap.
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: If you get repeated assertion failures or crashes, even
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: immediately after the mysqld startup, there may be
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: corruption in the InnoDB tablespace. Please refer to
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
May 27 02:11:02 labsdb1011 mysqld[23527]: InnoDB: about forcing recovery.
May 27 02:11:02 labsdb1011 mysqld[23527]: 200527  2:11:02 [ERROR] mysqld got signal 6 ;
May 27 02:11:02 labsdb1011 mysqld[23527]: This could be because you hit a bug. It is also possible that this binary
May 27 02:11:02 labsdb1011 mysqld[23527]: or one of the libraries it was linked against is corrupt, improperly built,
May 27 02:11:02 labsdb1011 mysqld[23527]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 27 02:11:02 labsdb1011 mysqld[23527]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
May 27 02:11:02 labsdb1011 mysqld[23527]: We will try our best to scrape up some info that will hopefully help
May 27 02:11:02 labsdb1011 mysqld[23527]: diagnose the problem, but since we have already crashed,
May 27 02:11:02 labsdb1011 mysqld[23527]: something is definitely wrong and this may fail.
May 27 02:11:02 labsdb1011 mysqld[23527]: Server version: 10.4.13-MariaDB
May 27 02:11:02 labsdb1011 mysqld[23527]: key_buffer_size=134217728
May 27 02:11:02 labsdb1011 mysqld[23527]: read_buffer_size=131072
May 27 02:11:02 labsdb1011 mysqld[23527]: max_used_connections=125
May 27 02:11:02 labsdb1011 mysqld[23527]: max_threads=1026
May 27 02:11:02 labsdb1011 mysqld[23527]: thread_count=148
May 27 02:11:02 labsdb1011 mysqld[23527]: It is possible that mysqld could use up to
May 27 02:11:02 labsdb1011 mysqld[23527]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2388976 K  bytes of memory
May 27 02:11:02 labsdb1011 mysqld[23527]: Hope that's ok; if not, decrease some variables in the equation.
May 27 02:11:02 labsdb1011 mysqld[23527]: Thread pointer: 0x7f5f1c0014f8
May 27 02:11:02 labsdb1011 mysqld[23527]: Attempting backtrace. You can use the following information to find out
May 27 02:11:02 labsdb1011 mysqld[23527]: where mysqld died. If you see no messages after this, something went
May 27 02:11:02 labsdb1011 mysqld[23527]: terribly wrong...
May 27 02:11:02 labsdb1011 mysqld[23527]: stack_bottom = 0x7fc65c206698 thread_stack 0x30000
May 27 02:11:04 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(my_print_stacktrace+0x2e)[0x556d7c07f7de]
May 27 02:11:06 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(handle_fatal_signal+0x54d)[0x556d7bb77b4d]
May 27 02:11:08 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7fc67eb8e730]
May 27 02:11:10 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7fc67e1f67bb]
May 27 02:11:11 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7fc67e1e1535]
May 27 02:11:13 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0x5a19d5)[0x556d7b8739d5]
May 27 02:11:15 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0x5907c1)[0x556d7b8627c1]
May 27 02:11:17 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xaeee0e)[0x556d7bdc0e0e]
May 27 02:11:19 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xaef4bd)[0x556d7bdc14bd]
May 27 02:11:21 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xaff23a)[0x556d7bdd123a]
May 27 02:11:22 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xa4dbd5)[0x556d7bd1fbd5]
May 27 02:11:24 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN7handler12ha_write_rowEPKh+0x14d)[0x556d7bb837ed]
May 27 02:11:26 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN14Rows_log_event9write_rowEP14rpl_group_infob+0x174)[0x556d7bc755d4]
May 27 02:11:28 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN20Write_rows_log_event11do_exec_rowEP14rpl_group_info+0x7d)[0x556d7bc75b6d]
May 27 02:11:30 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x23c)[0x556d7bc69f8c]
May 27 02:11:31 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0x5faf02)[0x556d7b8ccf02]
May 27 02:11:33 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(handle_slave_sql+0x12e2)[0x556d7b8d5ef2]
May 27 02:11:35 labsdb1011 mysqld[23527]: /opt/wmf-mariadb104/bin/mysqld(+0xd5e28b)[0x556d7c03028b]
May 27 02:11:37 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3)[0x7fc67eb83fa3]
May 27 02:11:39 labsdb1011 mysqld[23527]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fc67e2b84cf]
May 27 02:11:39 labsdb1011 mysqld[23527]: Trying to get some variables.
May 27 02:11:39 labsdb1011 mysqld[23527]: Some pointers may be invalid and cause the dump to abort.
May 27 02:11:39 labsdb1011 mysqld[23527]: Query (0x0):
May 27 02:11:39 labsdb1011 mysqld[23527]: Connection ID (thread ID): 88
May 27 02:11:39 labsdb1011 mysqld[23527]: Status: NOT_KILLED
May 27 02:11:39 labsdb1011 mysqld[23527]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
May 27 02:11:39 labsdb1011 mysqld[23527]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
May 27 02:11:39 labsdb1011 mysqld[23527]: information that should help you find out what is causing the crash.
May 27 02:11:39 labsdb1011 mysqld[23527]: Writing a core file...
May 27 02:11:39 labsdb1011 mysqld[23527]: Working directory at /srv/sqldata
May 27 02:11:39 labsdb1011 mysqld[23527]: Resource Limits:
May 27 02:11:39 labsdb1011 mysqld[23527]: Limit                     Soft Limit           Hard Limit           Units
May 27 02:11:39 labsdb1011 mysqld[23527]: Max cpu time              unlimited            unlimited            seconds
May 27 02:11:39 labsdb1011 mysqld[23527]: Max file size             unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max data size             unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max stack size            8388608              unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max core file size        0                    0                    bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max resident set          unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max processes             2063395              2063395              processes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max open files            200001               200001               files
May 27 02:11:39 labsdb1011 mysqld[23527]: Max locked memory         65536                65536                bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max address space         unlimited            unlimited            bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max file locks            unlimited            unlimited            locks
May 27 02:11:39 labsdb1011 mysqld[23527]: Max pending signals       2063395              2063395              signals
May 27 02:11:39 labsdb1011 mysqld[23527]: Max msgqueue size         819200               819200               bytes
May 27 02:11:39 labsdb1011 mysqld[23527]: Max nice priority         0                    0
May 27 02:11:39 labsdb1011 mysqld[23527]: Max realtime priority     0                    0
May 27 02:11:39 labsdb1011 mysqld[23527]: Max realtime timeout      unlimited            unlimited            us
May 27 02:11:39 labsdb1011 mysqld[23527]: Core pattern: /var/tmp/core/core.%h.%e.%p....
May 27 02:12:13 labsdb1011 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
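
To see at a glance which code path crashed, the resolved frames in the backtrace can be pulled out and the simple C++ names un-mangled. The Python sketch below is only a rough aid: it handles plain nested names of the `_ZN<len><name>...E` form and nothing else (no templates, no argument types), so `c++filt` remains the right tool for anything serious. The function names are made up for illustration.

```python
import re

# Matches the "(symbol+0xoffset)[0xaddress]" part of a backtrace line.
FRAME_RE = re.compile(r"\((?P<symbol>[^()+]*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]")


def nested_name(mangled):
    """Turn '_ZN7handler12ha_write_rowEPKh' into 'handler::ha_write_row'.

    Only length-prefixed nested names are handled; other symbols are returned as-is.
    """
    if not mangled.startswith("_ZN"):
        return mangled
    parts = []
    i = 3
    while i < len(mangled) and mangled[i].isdigit():
        j = i
        while j < len(mangled) and mangled[j].isdigit():
            j += 1
        length = int(mangled[i:j])
        parts.append(mangled[j:j + length])
        i = j + length
    return "::".join(parts)


def frame_symbol(line):
    """Return the readable symbol of a backtrace line, or None if it is unresolved."""
    match = FRAME_RE.search(line)
    if match is None or not match["symbol"]:
        return None
    return nested_name(match["symbol"])


line = "/opt/wmf-mariadb104/bin/mysqld(_ZN7handler12ha_write_rowEPKh+0x14d)[0x556d7bb837ed]"
print(frame_symbol(line))
```

Run over the trace above, the resolved frames show the crash happening in the replication SQL thread (`handle_slave_sql`) while applying a row write event.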

Change 598909 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598909

@Kormat @jcrespo db1141 has caught up with replication, so let's take a mysqldump from it and attempt a restart?

I am thinking of reimaging labsdb1011 back to Stretch and reimporting a mysqldump, so maybe we can use db1141 to extract it and then see whether the reimaged host ends up crashing too?

Change 598909 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Depool labsdb1011

https://gerrit.wikimedia.org/r/598909

Taking a mydumper dump from db1141 using:

root@db1141:/srv# /usr/bin/mydumper --compress --events --triggers --routines --logfile /srv/backups/dumps/ongoing/db1141.log --outputdir /srv/backups/dumps/ongoing/dump.db1141_27_05_2020 --rows 20000000 --threads 18 --host localhost --user root

Yep, it has 2.1T available and the dump on labsdb1009 is around 880GB

Positions where db1141 was stopped for restart:

May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4488 [Note] Master 's2': Slave SQL thread exiting, replication stopped in log 'db1125-bin.002418' at position 403740127
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4432 [Note] Master 's2': Slave I/O thread exiting, read up to log 'db1125-bin.002418', position 403740127
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4489 [Note] Master 's6': Slave SQL thread exiting, replication stopped in log 'db1125-bin.001525' at position 513418261
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4433 [Note] Master 's6': Slave I/O thread exiting, read up to log 'db1125-bin.001525', position 513418261
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4490 [Note] Master 's3': Slave SQL thread exiting, replication stopped in log 'db1124-bin.002105' at position 885891780
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4434 [Note] Master 's3': Slave I/O thread exiting, read up to log 'db1124-bin.002105', position 885891780
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4491 [Note] Master 's4': Slave SQL thread exiting, replication stopped in log 'db1125-bin.003591' at position 728051471
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4435 [Note] Master 's4': Slave I/O thread exiting, read up to log 'db1125-bin.003591', position 728051471
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4492 [Note] Master 's5': Slave SQL thread exiting, replication stopped in log 'db1124-bin.001231' at position 63949745
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4436 [Note] Master 's5': Slave I/O thread exiting, read up to log 'db1124-bin.001231', position 63949745
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4493 [Note] Master 's8': Slave SQL thread exiting, replication stopped in log 'db1124-bin.003713' at position 937829874
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4437 [Note] Master 's8': Slave I/O thread exiting, read up to log 'db1124-bin.003713', position 937829874
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4494 [Note] Master 's7': Slave SQL thread exiting, replication stopped in log 'db1125-bin.002201' at position 788792501
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4438 [Note] Master 's7': Slave I/O thread exiting, read up to log 'db1125-bin.002201', position 788792501
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4495 [Note] Master 's1': Slave SQL thread exiting, replication stopped in log 'db1124-bin.002711' at position 475182352
May 28 04:21:09 db1141 mysqld[23899]: 2020-05-28 4:21:09 4439 [Note] Master 's1': Slave I/O thread exiting, read up to log 'db1124-bin.002711', position 475182352

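In the stop messages logged on db1141, the SQL thread and the I/O thread of each section report the same binlog file and position, i.e. everything that was read had also been applied before the stop. A minimal Python sketch to verify that mechanically from such log lines (the function names are hypothetical; only the two message formats that appear in this log are handled):

```python
import re

# Handles both message shapes seen in the log:
#   "... Slave SQL thread exiting, replication stopped in log 'X' at position N"
#   "... Slave I/O thread exiting, read up to log 'X', position N"
STOP_RE = re.compile(
    r"Master '(?P<conn>[^']+)': Slave (?P<thread>SQL|I/O) thread exiting, "
    r"(?:replication stopped in|read up to) log '(?P<log>[^']+)',? (?:at )?position (?P<pos>\d+)"
)


def stop_positions(log_text):
    """Map connection name -> {'SQL': (binlog, pos), 'I/O': (binlog, pos)}."""
    positions = {}
    for match in STOP_RE.finditer(log_text):
        thread_positions = positions.setdefault(match["conn"], {})
        thread_positions[match["thread"]] = (match["log"], int(match["pos"]))
    return positions


def fully_applied(positions):
    """Per connection: True if the SQL thread stopped exactly where the I/O thread did."""
    return {
        conn: "SQL" in threads and threads.get("SQL") == threads.get("I/O")
        for conn, threads in positions.items()
    }


sample = (
    "[Note] Master 's2': Slave SQL thread exiting, replication stopped in log "
    "'db1125-bin.002418' at position 403740127\n"
    "[Note] Master 's2': Slave I/O thread exiting, read up to log "
    "'db1125-bin.002418', position 403740127\n"
)

print(fully_applied(stop_positions(sample)))
```
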
22root@db1141:~# mysql -e "show all slaves status\G"
23*************************** 1. row ***************************
24 Connection_name: s1
25 Slave_SQL_State:
26 Slave_IO_State:
27 Master_Host: db1124.eqiad.wmnet
28 Master_User: repl
29 Master_Port: 3311
30 Connect_Retry: 60
31 Master_Log_File: db1124-bin.002711
32 Read_Master_Log_Pos: 475182352
33 Relay_Log_File: db1141-relay-bin-s1.000321
34 Relay_Log_Pos: 475182648
35 Relay_Master_Log_File: db1124-bin.002711
36 Slave_IO_Running: No
37 Slave_SQL_Running: No
38 Replicate_Do_DB:
39 Replicate_Ignore_DB:
40 Replicate_Do_Table:
41 Replicate_Ignore_Table:
42 Replicate_Wild_Do_Table:
43 Replicate_Wild_Ignore_Table:
44 Last_Errno: 0
45 Last_Error:
46 Skip_Counter: 0
47 Exec_Master_Log_Pos: 475182352
48 Relay_Log_Space: 475183276
49 Until_Condition: None
50 Until_Log_File:
51 Until_Log_Pos: 0
52 Master_SSL_Allowed: Yes
53 Master_SSL_CA_File:
54 Master_SSL_CA_Path:
55 Master_SSL_Cert:
56 Master_SSL_Cipher:
57 Master_SSL_Key:
58 Seconds_Behind_Master: NULL
59 Master_SSL_Verify_Server_Cert: No
60 Last_IO_Errno: 0
61 Last_IO_Error:
62 Last_SQL_Errno: 0
63 Last_SQL_Error:
64 Replicate_Ignore_Server_Ids:
65 Master_Server_Id: 0
66 Master_SSL_Crl:
67 Master_SSL_Crlpath:
68 Using_Gtid: No
69 Gtid_IO_Pos:
70 Replicate_Do_Domain_Ids:
71 Replicate_Ignore_Domain_Ids:
72 Parallel_Mode: conservative
73 SQL_Delay: 0
74 SQL_Remaining_Delay: NULL
75 Slave_SQL_Running_State:
76 Slave_DDL_Groups: 0
77Slave_Non_Transactional_Groups: 0
78 Slave_Transactional_Groups: 0
79 Retried_transactions: 0
80 Max_relay_log_size: 1073741824
81 Executed_log_entries: 0
82 Slave_received_heartbeats: 0
83 Slave_heartbeat_period: 30.000
84 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
85 *************************** 2. row ***************************
86 Connection_name: s2
87 Slave_SQL_State:
88 Slave_IO_State:
89 Master_Host: db1125.eqiad.wmnet
90 Master_User: repl
91 Master_Port: 3312
92 Connect_Retry: 60
93 Master_Log_File: db1125-bin.002418
94 Read_Master_Log_Pos: 403740127
95 Relay_Log_File: db1141-relay-bin-s2.000305
96 Relay_Log_Pos: 403740423
97 Relay_Master_Log_File: db1125-bin.002418
98 Slave_IO_Running: No
99 Slave_SQL_Running: No
100 Replicate_Do_DB:
101 Replicate_Ignore_DB:
102 Replicate_Do_Table:
103 Replicate_Ignore_Table:
104 Replicate_Wild_Do_Table:
105 Replicate_Wild_Ignore_Table:
106 Last_Errno: 0
107 Last_Error:
108 Skip_Counter: 0
109 Exec_Master_Log_Pos: 403740127
110 Relay_Log_Space: 403741051
111 Until_Condition: None
112 Until_Log_File:
113 Until_Log_Pos: 0
114 Master_SSL_Allowed: Yes
115 Master_SSL_CA_File:
116 Master_SSL_CA_Path:
117 Master_SSL_Cert:
118 Master_SSL_Cipher:
119 Master_SSL_Key:
120 Seconds_Behind_Master: NULL
121 Master_SSL_Verify_Server_Cert: No
122 Last_IO_Errno: 0
123 Last_IO_Error:
124 Last_SQL_Errno: 0
125 Last_SQL_Error:
126 Replicate_Ignore_Server_Ids:
127 Master_Server_Id: 0
128 Master_SSL_Crl:
129 Master_SSL_Crlpath:
130 Using_Gtid: No
131 Gtid_IO_Pos:
132 Replicate_Do_Domain_Ids:
133 Replicate_Ignore_Domain_Ids:
134 Parallel_Mode: conservative
135 SQL_Delay: 0
136 SQL_Remaining_Delay: NULL
137 Slave_SQL_Running_State:
138 Slave_DDL_Groups: 0
139 Slave_Non_Transactional_Groups: 0
140 Slave_Transactional_Groups: 0
141 Retried_transactions: 0
142 Max_relay_log_size: 1073741824
143 Executed_log_entries: 0
144 Slave_received_heartbeats: 0
145 Slave_heartbeat_period: 30.000
146 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
147 *************************** 3. row ***************************
148 Connection_name: s3
149 Slave_SQL_State:
150 Slave_IO_State:
151 Master_Host: db1124.eqiad.wmnet
152 Master_User: repl
153 Master_Port: 3313
154 Connect_Retry: 60
155 Master_Log_File: db1124-bin.002105
156 Read_Master_Log_Pos: 885891780
157 Relay_Log_File: db1141-relay-bin-s3.000341
158 Relay_Log_Pos: 885892076
159 Relay_Master_Log_File: db1124-bin.002105
160 Slave_IO_Running: No
161 Slave_SQL_Running: No
162 Replicate_Do_DB:
163 Replicate_Ignore_DB:
164 Replicate_Do_Table:
165 Replicate_Ignore_Table:
166 Replicate_Wild_Do_Table:
167 Replicate_Wild_Ignore_Table:
168 Last_Errno: 0
169 Last_Error:
170 Skip_Counter: 0
171 Exec_Master_Log_Pos: 885891780
172 Relay_Log_Space: 885892704
173 Until_Condition: None
174 Until_Log_File:
175 Until_Log_Pos: 0
176 Master_SSL_Allowed: Yes
177 Master_SSL_CA_File:
178 Master_SSL_CA_Path:
179 Master_SSL_Cert:
180 Master_SSL_Cipher:
181 Master_SSL_Key:
182 Seconds_Behind_Master: NULL
183 Master_SSL_Verify_Server_Cert: No
184 Last_IO_Errno: 0
185 Last_IO_Error:
186 Last_SQL_Errno: 0
187 Last_SQL_Error:
188 Replicate_Ignore_Server_Ids:
189 Master_Server_Id: 0
190 Master_SSL_Crl:
191 Master_SSL_Crlpath:
192 Using_Gtid: No
193 Gtid_IO_Pos:
194 Replicate_Do_Domain_Ids:
195 Replicate_Ignore_Domain_Ids:
196 Parallel_Mode: conservative
197 SQL_Delay: 0
198 SQL_Remaining_Delay: NULL
199 Slave_SQL_Running_State:
200 Slave_DDL_Groups: 0
201 Slave_Non_Transactional_Groups: 0
202 Slave_Transactional_Groups: 0
203 Retried_transactions: 0
204 Max_relay_log_size: 1073741824
205 Executed_log_entries: 0
206 Slave_received_heartbeats: 0
207 Slave_heartbeat_period: 30.000
208 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
209 *************************** 4. row ***************************
210 Connection_name: s4
211 Slave_SQL_State:
212 Slave_IO_State:
213 Master_Host: db1125.eqiad.wmnet
214 Master_User: repl
215 Master_Port: 3314
216 Connect_Retry: 60
217 Master_Log_File: db1125-bin.003591
218 Read_Master_Log_Pos: 728051471
219 Relay_Log_File: db1141-relay-bin-s4.000281
220 Relay_Log_Pos: 728051767
221 Relay_Master_Log_File: db1125-bin.003591
222 Slave_IO_Running: No
223 Slave_SQL_Running: No
224 Replicate_Do_DB:
225 Replicate_Ignore_DB:
226 Replicate_Do_Table:
227 Replicate_Ignore_Table:
228 Replicate_Wild_Do_Table:
229 Replicate_Wild_Ignore_Table:
230 Last_Errno: 0
231 Last_Error:
232 Skip_Counter: 0
233 Exec_Master_Log_Pos: 728051471
234 Relay_Log_Space: 728052395
235 Until_Condition: None
236 Until_Log_File:
237 Until_Log_Pos: 0
238 Master_SSL_Allowed: Yes
239 Master_SSL_CA_File:
240 Master_SSL_CA_Path:
241 Master_SSL_Cert:
242 Master_SSL_Cipher:
243 Master_SSL_Key:
244 Seconds_Behind_Master: NULL
245 Master_SSL_Verify_Server_Cert: No
246 Last_IO_Errno: 0
247 Last_IO_Error:
248 Last_SQL_Errno: 0
249 Last_SQL_Error:
250 Replicate_Ignore_Server_Ids:
251 Master_Server_Id: 0
252 Master_SSL_Crl:
253 Master_SSL_Crlpath:
254 Using_Gtid: No
255 Gtid_IO_Pos:
256 Replicate_Do_Domain_Ids:
257 Replicate_Ignore_Domain_Ids:
258 Parallel_Mode: conservative
259 SQL_Delay: 0
260 SQL_Remaining_Delay: NULL
261 Slave_SQL_Running_State:
262 Slave_DDL_Groups: 0
263 Slave_Non_Transactional_Groups: 0
264 Slave_Transactional_Groups: 0
265 Retried_transactions: 0
266 Max_relay_log_size: 1073741824
267 Executed_log_entries: 0
268 Slave_received_heartbeats: 0
269 Slave_heartbeat_period: 30.000
270 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
271 *************************** 5. row ***************************
272 Connection_name: s5
273 Slave_SQL_State:
274 Slave_IO_State:
275 Master_Host: db1124.eqiad.wmnet
276 Master_User: repl
277 Master_Port: 3315
278 Connect_Retry: 60
279 Master_Log_File: db1124-bin.001231
280 Read_Master_Log_Pos: 63949745
281 Relay_Log_File: db1141-relay-bin-s5.000111
282 Relay_Log_Pos: 63950041
283 Relay_Master_Log_File: db1124-bin.001231
284 Slave_IO_Running: No
285 Slave_SQL_Running: No
286 Replicate_Do_DB:
287 Replicate_Ignore_DB:
288 Replicate_Do_Table:
289 Replicate_Ignore_Table:
290 Replicate_Wild_Do_Table:
291 Replicate_Wild_Ignore_Table:
292 Last_Errno: 0
293 Last_Error:
294 Skip_Counter: 0
295 Exec_Master_Log_Pos: 63949745
296 Relay_Log_Space: 63950669
297 Until_Condition: None
298 Until_Log_File:
299 Until_Log_Pos: 0
300 Master_SSL_Allowed: Yes
301 Master_SSL_CA_File:
302 Master_SSL_CA_Path:
303 Master_SSL_Cert:
304 Master_SSL_Cipher:
305 Master_SSL_Key:
306 Seconds_Behind_Master: NULL
307 Master_SSL_Verify_Server_Cert: No
308 Last_IO_Errno: 0
309 Last_IO_Error:
310 Last_SQL_Errno: 0
311 Last_SQL_Error:
312 Replicate_Ignore_Server_Ids:
313 Master_Server_Id: 0
314 Master_SSL_Crl:
315 Master_SSL_Crlpath:
316 Using_Gtid: No
317 Gtid_IO_Pos:
318 Replicate_Do_Domain_Ids:
319 Replicate_Ignore_Domain_Ids:
320 Parallel_Mode: conservative
321 SQL_Delay: 0
322 SQL_Remaining_Delay: NULL
323 Slave_SQL_Running_State:
324 Slave_DDL_Groups: 0
325 Slave_Non_Transactional_Groups: 0
326 Slave_Transactional_Groups: 0
327 Retried_transactions: 0
328 Max_relay_log_size: 1073741824
329 Executed_log_entries: 0
330 Slave_received_heartbeats: 0
331 Slave_heartbeat_period: 30.000
332 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
333 *************************** 6. row ***************************
334 Connection_name: s6
335 Slave_SQL_State:
336 Slave_IO_State:
337 Master_Host: db1125.eqiad.wmnet
338 Master_User: repl
339 Master_Port: 3316
340 Connect_Retry: 60
341 Master_Log_File: db1125-bin.001525
342 Read_Master_Log_Pos: 513418261
343 Relay_Log_File: db1141-relay-bin-s6.000209
344 Relay_Log_Pos: 513418557
345 Relay_Master_Log_File: db1125-bin.001525
346 Slave_IO_Running: No
347 Slave_SQL_Running: No
348 Replicate_Do_DB:
349 Replicate_Ignore_DB:
350 Replicate_Do_Table:
351 Replicate_Ignore_Table:
352 Replicate_Wild_Do_Table:
353 Replicate_Wild_Ignore_Table:
354 Last_Errno: 0
355 Last_Error:
356 Skip_Counter: 0
357 Exec_Master_Log_Pos: 513418261
358 Relay_Log_Space: 513419185
359 Until_Condition: None
360 Until_Log_File:
361 Until_Log_Pos: 0
362 Master_SSL_Allowed: Yes
363 Master_SSL_CA_File:
364 Master_SSL_CA_Path:
365 Master_SSL_Cert:
366 Master_SSL_Cipher:
367 Master_SSL_Key:
368 Seconds_Behind_Master: NULL
369 Master_SSL_Verify_Server_Cert: No
370 Last_IO_Errno: 0
371 Last_IO_Error:
372 Last_SQL_Errno: 0
373 Last_SQL_Error:
374 Replicate_Ignore_Server_Ids:
375 Master_Server_Id: 0
376 Master_SSL_Crl:
377 Master_SSL_Crlpath:
378 Using_Gtid: No
379 Gtid_IO_Pos:
380 Replicate_Do_Domain_Ids:
381 Replicate_Ignore_Domain_Ids:
382 Parallel_Mode: conservative
383 SQL_Delay: 0
384 SQL_Remaining_Delay: NULL
385 Slave_SQL_Running_State:
386 Slave_DDL_Groups: 0
387 Slave_Non_Transactional_Groups: 0
388 Slave_Transactional_Groups: 0
389 Retried_transactions: 0
390 Max_relay_log_size: 1073741824
391 Executed_log_entries: 0
392 Slave_received_heartbeats: 0
393 Slave_heartbeat_period: 30.000
394 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
395 *************************** 7. row ***************************
396 Connection_name: s7
397 Slave_SQL_State:
398 Slave_IO_State:
399 Master_Host: db1125.eqiad.wmnet
400 Master_User: repl
401 Master_Port: 3317
402 Connect_Retry: 60
403 Master_Log_File: db1125-bin.002201
404 Read_Master_Log_Pos: 788792501
405 Relay_Log_File: db1141-relay-bin-s7.000297
406 Relay_Log_Pos: 788792797
407 Relay_Master_Log_File: db1125-bin.002201
408 Slave_IO_Running: No
409 Slave_SQL_Running: No
410 Replicate_Do_DB:
411 Replicate_Ignore_DB:
412 Replicate_Do_Table:
413 Replicate_Ignore_Table:
414 Replicate_Wild_Do_Table:
415 Replicate_Wild_Ignore_Table:
416 Last_Errno: 0
417 Last_Error:
418 Skip_Counter: 0
419 Exec_Master_Log_Pos: 788792501
420 Relay_Log_Space: 788793425
421 Until_Condition: None
422 Until_Log_File:
423 Until_Log_Pos: 0
424 Master_SSL_Allowed: Yes
425 Master_SSL_CA_File:
426 Master_SSL_CA_Path:
427 Master_SSL_Cert:
428 Master_SSL_Cipher:
429 Master_SSL_Key:
430 Seconds_Behind_Master: NULL
431 Master_SSL_Verify_Server_Cert: No
432 Last_IO_Errno: 0
433 Last_IO_Error:
434 Last_SQL_Errno: 0
435 Last_SQL_Error:
436 Replicate_Ignore_Server_Ids:
437 Master_Server_Id: 0
438 Master_SSL_Crl:
439 Master_SSL_Crlpath:
440 Using_Gtid: No
441 Gtid_IO_Pos:
442 Replicate_Do_Domain_Ids:
443 Replicate_Ignore_Domain_Ids:
444 Parallel_Mode: conservative
445 SQL_Delay: 0
446 SQL_Remaining_Delay: NULL
447 Slave_SQL_Running_State:
448 Slave_DDL_Groups: 0
449 Slave_Non_Transactional_Groups: 0
450 Slave_Transactional_Groups: 0
451 Retried_transactions: 0
452 Max_relay_log_size: 1073741824
453 Executed_log_entries: 0
454 Slave_received_heartbeats: 0
455 Slave_heartbeat_period: 30.000
456 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
457 *************************** 8. row ***************************
458 Connection_name: s8
459 Slave_SQL_State:
460 Slave_IO_State:
461 Master_Host: db1124.eqiad.wmnet
462 Master_User: repl
463 Master_Port: 3318
464 Connect_Retry: 60
465 Master_Log_File: db1124-bin.003713
466 Read_Master_Log_Pos: 937829874
467 Relay_Log_File: db1141-relay-bin-s8.000281
468 Relay_Log_Pos: 937830170
469 Relay_Master_Log_File: db1124-bin.003713
470 Slave_IO_Running: No
471 Slave_SQL_Running: No
472 Replicate_Do_DB:
473 Replicate_Ignore_DB:
474 Replicate_Do_Table:
475 Replicate_Ignore_Table:
476 Replicate_Wild_Do_Table:
477 Replicate_Wild_Ignore_Table:
478 Last_Errno: 0
479 Last_Error:
480 Skip_Counter: 0
481 Exec_Master_Log_Pos: 937829874
482 Relay_Log_Space: 937830798
483 Until_Condition: None
484 Until_Log_File:
485 Until_Log_Pos: 0
486 Master_SSL_Allowed: Yes
487 Master_SSL_CA_File:
488 Master_SSL_CA_Path:
489 Master_SSL_Cert:
490 Master_SSL_Cipher:
491 Master_SSL_Key:
492 Seconds_Behind_Master: NULL
493 Master_SSL_Verify_Server_Cert: No
494 Last_IO_Errno: 0
495 Last_IO_Error:
496 Last_SQL_Errno: 0
497 Last_SQL_Error:
498 Replicate_Ignore_Server_Ids:
499 Master_Server_Id: 0
500 Master_SSL_Crl:
501 Master_SSL_Crlpath:
502 Using_Gtid: No
503 Gtid_IO_Pos:
504 Replicate_Do_Domain_Ids:
505 Replicate_Ignore_Domain_Ids:
506 Parallel_Mode: conservative
507 SQL_Delay: 0
508 SQL_Remaining_Delay: NULL
509 Slave_SQL_Running_State:
510 Slave_DDL_Groups: 0
511 Slave_Non_Transactional_Groups: 0
512 Slave_Transactional_Groups: 0
513 Retried_transactions: 0
514 Max_relay_log_size: 1073741824
515 Executed_log_entries: 0
516 Slave_received_heartbeats: 0
517 Slave_heartbeat_period: 30.000
518 Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-578966402,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59100998,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-852264897,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1439343414,171970663-171970663-274,171970664-171970664-1115858508,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-699493107,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1176472,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1586617107,171978787-171978787-1411881839,171978876-171978876-1966860935,171978924-171978924-2620293088,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
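
Since the output above is long, a small script can help confirm at a glance which connections are stopped. This is only a minimal sketch, not tooling that exists on the hosts: the function names `parse_slaves_status` and `stopped_connections` are made up for illustration, and it assumes the text is the vertical output of "show all slaves status", optionally prefixed with paste line numbers as in this ticket.

```python
import re


def parse_slaves_status(text):
    """Split vertical-format SHOW ALL SLAVES STATUS output into one dict per connection."""
    rows = []
    current = None
    for raw in text.splitlines():
        # Drop an optional leading paste line number ("86 Connection_name: s2", "85****...").
        line = re.sub(r"^\s*\d+\s*(?=[A-Za-z_*])", "", raw).strip()
        if not line:
            continue
        if re.match(r"\*+ \d+\. row \*+", line):
            # A "N. row" separator starts a new connection.
            current = {}
            rows.append(current)
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        if current is None:
            # Text started in the middle of a row; collect it anyway.
            current = {}
            rows.append(current)
        current[key.strip()] = value.strip()
    return rows


def stopped_connections(rows):
    """Return the names of connections whose IO or SQL thread is not running."""
    return [
        row.get("Connection_name", "?")
        for row in rows
        if row.get("Slave_IO_Running") != "Yes" or row.get("Slave_SQL_Running") != "Yes"
    ]


# A few fields copied from the s2 row above.
SAMPLE = """\
85*************************** 2. row ***************************
86 Connection_name: s2
89 Master_Host: db1125.eqiad.wmnet
98 Slave_IO_Running: No
99 Slave_SQL_Running: No
130 Using_Gtid: No
"""

if __name__ == "__main__":
    print(stopped_connections(parse_slaves_status(SAMPLE)))
```

Run against the full paste, this would list every connection shown above, since all of them report Slave_IO_Running: No and Slave_SQL_Running: No.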

Good news: db1141 was restarted 10 minutes ago, and so far there have been no crashes and no errors in the log. All clean.
I am going to wait a few more minutes, and then I would like to pool it in the analytics role to see if that triggers anything.

Change 599153 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Pool db1141 into analytics role

https://gerrit.wikimedia.org/r/599153

Mentioned in SAL (#wikimedia-operations) [2020-05-28T04:44:11Z] <marostegui> Run check_private data on db1141 - T249188

The private data check on db1141 finished clean. I have checked grants, roles and the query killer, and the host is ready to be pooled.

Change 599153 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Pool db1141 into analytics role

https://gerrit.wikimedia.org/r/599153

Mentioned in SAL (#wikimedia-operations) [2020-05-28T08:13:56Z] <marostegui> Pool db1141 into labsdb analytics role - T249188

db1141 is now serving the analytics role. I can query it fine and I can see other connections arriving from the proxy:

marostegui@tools-sgebastion-07:~$ sql --cluster analytics enwiki_p
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 13325
Server version: 10.4.13-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [enwiki_p]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1141     |
+------------+
1 row in set (0.00 sec)

MariaDB [enwiki_p]>

From Quarry it also works fine: https://quarry.wmflabs.org/query/45259

And the connections arriving from the proxy, as seen in the processlist on db1141:

|  6497 | s52927          | 10.64.37.27:37470 | commonswiki_p      | Sleep     |   110 |                                                                             | NULL                                                                                                 |    0.000 |
|  7662 | s51295          | 10.64.37.27:39798 | commonswiki_p      | Sleep     |    33 |                                                                             | NULL                                                                                                 |    0.000 |
|  9985 | s51295          | 10.64.37.27:44440 | commonswiki_p      | Sleep     |    32 |                                                                             | NULL                                                                                                 |    0.000 |
|  9997 | s52741          | 10.64.37.27:44464 | cswiki_p           | Sleep     |    88 |                                                                             | NULL                                                                                                 |    0.000 |
| 10078 | s52788          | 10.64.37.27:44626 | enwiki_p           | Sleep     |    88 |                                                                             | NULL                                                                                                 |    0.000 |
| 11166 | s52788          | 10.64.37.27:46806 | enwiki_p           | Sleep     |    81 |                                                                             | NULL                                                                                                 |    0.000 |
| 12155 | s54113          | 10.64.37.27:48784 | commonswiki_p      | Sleep     |    75 |                                                                             | NULL                                                                                                 |    0.000 |
| 12728 | s54113          | 10.64.37.27:49932 | commonswiki_p      | Sleep     |    71 |                                                                             | NULL                                                                                                 |    0.000 |
| 13325 | u15343          | 10.64.37.27:51124 | enwiki_p           | Sleep     |    62 |                                                                             | NULL                                                                                                 |    0.000 |
| 14469 | s54209          | 10.64.37.27:53414 | dewiki_p           | Query     |    59 | Creating sort index                                                         | select page_namespace, page_title from page, templatelinks where tl_from = page_id and tl_title = 'K |    0.000 |
| 16902 | s51592          | 10.64.37.27:58288 | svwiki_p           | Sleep     |    43 |                                                                             | NULL                                                                                                 |    0.000 |
| 17202 | s52927          | 10.64.37.27:58890 | commonswiki_p      | Sleep     |    41 |                                                                             | NULL                                                                                                 |    0.000 |
| 20995 | s52927          | 10.64.37.27:38242 | commonswiki_p      | Sleep     |    18 |                                                                             | NULL                                                                                                 |    0.000 |
| 21001 | s52927          | 10.64.37.27:38254 | commonswiki_p      | Sleep     |    18 |                                                                             | NULL                                                                                                 |    0.000 |
| 21009 | s52835          | 10.64.37.27:38270 | enwiki_p           | Query     |    18 | Sending data                                                                | select page_namespace, ll_title, pa_importance, pa_class
    from page, langlinks, page_assessments, |    0.000 |
| 21235 | s51223          | 10.64.37.27:38722 | itwiki_p           | Query     |    17 | Sending data                                                                | SELECT CONCAT('# [[', page_title, ']]')
FROM page
WHERE page_namespace = 0
AND page_is_redirect = 0
 |    0.000 |
| 22399 | s54113          | 10.64.37.27:41054 | commonswiki_p      | Sleep     |    11 |                                                                             | NULL                                                                                                 |    0.000 |
| 23541 | s51469          | 10.64.37.27:43338 | enwiki_p           | Query     |     4 | Sending data                                                                | SELECT rev_id, rev_len, page_title, page_namespace, rev_timestamp, com.comment_text, page_is_redirec |    0.000 |
| 23543 | s51469          | 10.64.37.27:43342 | enwiki_p           | Query     |     4 | Sending data                                                                | SELECT COUNT(*)
FROM revision_userindex JOIN page ON page_id=rev_page
INNER JOIN comment com ON rev_ |    0.000 |
| 23644 | u12903          | 10.64.37.27:43542 | hewiki_p           | Query     |     0 | Sending data                                                                | select /*SLOW_OK updatestats*/ r.*, page_id, page_title, page_namespace, page_is_redirect, rp.rev_le |    0.000 |
| 23781 | u12903          | 10.64.37.27:43818 | lijwiki_p          | Sleep     |     1 |                                                                             | NULL                                                                                                 |    0.000 |
| 23939 | u12903          | 10.64.37.27:44134 | plwikinews_p       | Sleep     |     0 |                                                                             | NULL                                                                                                 |    0.000 |
| 24050 | s52835          | 10.64.37.27:44356 | frwiki_p           | Query     |     1 | Sending data                                                                | select page_namespace, ll_title, pa_importance, pa_class
    from page, langlinks, page_assessments, |    0.000 |
| 24250 | s52835          | 10.64.37.27:44766 | frwiki_p           | Query     |     0 | Sending data                                                                | select page_namespace, ll_title, pa_importance, pa_class
    from page, langlinks, page_assessments, |    0.000 |
| 24268 | root            | localhost         | NULL               | Query     |     0 | Init                                                                        | show processlist                                                                                     |    0.000 |
+-------+-----------------+-------------------+--------------------+-----------+-------+-----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
root@db1141:~#

Change 599282 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1141: Enable notifications

https://gerrit.wikimedia.org/r/599282

Change 599282 merged by Marostegui:
[operations/puppet@production] db1141: Enable notifications

https://gerrit.wikimedia.org/r/599282

Removed the binary backup of labsdb1011 from /srv/production/labsdb1011, which was around 6.6T.
On the other hand, I am copying the logical dump made on db1141 BEFORE it was restarted, once it was in sync with the master (859G), to: backup1001:/srv/production/db1141_logical_once_replication_caught_up (19TB free there).

db1141 keeps serving traffic without any errors or issues. That is good news.

I am going to leave it running like this till Monday.
If things continue to be fine on Monday, I think I will just do a last test with:

  • Stop MySQL on db1141
  • Copy all the data to labsdb1011
  • Start both hosts, so we can check two things:
    • if db1141 keeps working fine after a second restart
    • see if labsdb1011 crashes again with a different set of binary data (coming directly from an already migrated 10.4 host)
  • If db1141 keeps working fine for a few more days after that third restart, I will probably:
    • Put it down, and copy a binary backup from it
    • Attempt to do an in-place upgrade on labsdb1010.

Change 599466 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wmcs: Add db1141.eqiad.wmnet to maintain-dbusers

https://gerrit.wikimedia.org/r/599466

Change 599466 merged by Bstorm:
[operations/puppet@production] wmcs: Add db1141.eqiad.wmnet to maintain-dbusers

https://gerrit.wikimedia.org/r/599466

Change 601138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Depool db1141 to take a binary dump

https://gerrit.wikimedia.org/r/601138

db1141 has been working fine during the weekend: no trace of crashes, or of the error that precedes the crashes, in the logs.
Going to depool it, take a binary dump and then attempt to upgrade labsdb1010.

If all this works fine, we need to pool labsdb1010 in the Analytics role with some weight; otherwise it seems that db1141 cannot serve that role on its own and keeps lagging forever (sort of what we already saw with labsdb1011).

Change 601138 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Depool db1141 to take a binary dump

https://gerrit.wikimedia.org/r/601138

Mentioned in SAL (#wikimedia-operations) [2020-06-01T04:44:08Z] <marostegui> Depool db1141 from Analytics role - T249188

The logical backup mentioned at T249188#6171874, which used to be at backup1001:/srv/production/db1141_logical_once_replication_caught_up has been moved to backup1002:/srv/backups/T249188/ongoing

Mentioned in SAL (#wikimedia-operations) [2020-06-02T05:01:02Z] <marostegui> Stop mysql on db1141 to save a binary backup - T249188

I have stopped mysql on db1141 to take a binary backup of it, and it is being copied to: backup1002:/srv/backups/T249188/ongoing/db1141_binary_02_06_2020

These are the positions where replication stopped:

root@db1141:~# mysql -e "show all slaves status\G"
*************************** 1. row ***************************
Connection_name: s1
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1124.eqiad.wmnet
Master_User: repl
Master_Port: 3311
Connect_Retry: 60
Master_Log_File: db1124-bin.002732
Read_Master_Log_Pos: 936776393
Relay_Log_File: db1141-relay-bin-s1.000365
Relay_Log_Pos: 936776689
Relay_Master_Log_File: db1124-bin.002732
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 936776393
Relay_Log_Space: 936777042
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171970577
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 31420623
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 152873876
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 2. row ***************************
Connection_name: s2
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1125.eqiad.wmnet
Master_User: repl
Master_Port: 3312
Connect_Retry: 60
Master_Log_File: db1125-bin.002435
Read_Master_Log_Pos: 361230525
Relay_Log_File: db1141-relay-bin-s2.000341
Relay_Log_Pos: 361230619
Relay_Master_Log_File: db1125-bin.002435
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 361230323
Relay_Log_Space: 361231174
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978766
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 23182267
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 108813458
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 3. row ***************************
Connection_name: s3
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1124.eqiad.wmnet
Master_User: repl
Master_Port: 3313
Connect_Retry: 60
Master_Log_File: db1124-bin.002124
Read_Master_Log_Pos: 101374978
Relay_Log_File: db1141-relay-bin-s3.000381
Relay_Log_Pos: 101375274
Relay_Master_Log_File: db1124-bin.002124
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 101374978
Relay_Log_Space: 101375627
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171970577
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 24092508
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 118525994
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 4. row ***************************
Connection_name: s4
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1125.eqiad.wmnet
Master_User: repl
Master_Port: 3314
Connect_Retry: 60
Master_Log_File: db1125-bin.003609
Read_Master_Log_Pos: 138813494
Relay_Log_File: db1141-relay-bin-s4.000319
Relay_Log_Pos: 138813790
Relay_Master_Log_File: db1125-bin.003609
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 138813494
Relay_Log_Space: 138814143
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978766
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 29535492
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 144726804
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 5. row ***************************
Connection_name: s5
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1124.eqiad.wmnet
Master_User: repl
Master_Port: 3315
Connect_Retry: 60
Master_Log_File: db1124-bin.001240
Read_Master_Log_Pos: 334541028
Relay_Log_File: db1141-relay-bin-s5.000131
Relay_Log_Pos: 334541324
Relay_Master_Log_File: db1124-bin.001240
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 334541028
Relay_Log_Space: 334541677
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171970577
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 15652291
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 70261978
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 6. row ***************************
Connection_name: s6
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1125.eqiad.wmnet
Master_User: repl
Master_Port: 3316
Connect_Retry: 60
Master_Log_File: db1125-bin.001536
Read_Master_Log_Pos: 1041130188
Relay_Log_File: db1141-relay-bin-s6.000233
Relay_Log_Pos: 1041130484
Relay_Master_Log_File: db1125-bin.001536
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1041130188
Relay_Log_Space: 1041130837
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978766
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 17450549
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 81513730
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 7. row ***************************
Connection_name: s7
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1125.eqiad.wmnet
Master_User: repl
Master_Port: 3317
Connect_Retry: 60
Master_Log_File: db1125-bin.002216
Read_Master_Log_Pos: 28113002
Relay_Log_File: db1141-relay-bin-s7.000329
Relay_Log_Pos: 28113298
Relay_Master_Log_File: db1125-bin.002216
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 28113002
Relay_Log_Space: 28113651
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978766
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 22208805
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 107449302
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
*************************** 8. row ***************************
Connection_name: s8
Slave_SQL_State:
Slave_IO_State:
Master_Host: db1124.eqiad.wmnet
Master_User: repl
Master_Port: 3318
Connect_Retry: 60
Master_Log_File: db1124-bin.003740
Read_Master_Log_Pos: 583896666
Relay_Log_File: db1141-relay-bin-s8.000337
Relay_Log_Pos: 583896962
Relay_Master_Log_File: db1124-bin.003740
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 583896666
Relay_Log_Space: 583897315
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171970577
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 45567546
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 268016695
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-171966669-4075108480,171966555-171966555-1275,171966557-171966557-605630174,171966558-171966558-189,171966574-171966574-2221092918,171966668-171966668-2920,171966669-171966669-4196523483,171966670-171966670-2410812544,171970567-171970567-19776,171970577-171970577-59516678,171970589-171970589-201132050,171970593-171970593-3479,171970594-171970594-872187543,171970599-171970599-24483,171970637-171970637-2116621969,171970645-171970645-288070551,171970661-171970661-1482893009,171970663-171970663-274,171970664-171970664-1143253877,171970751-171970751-58940,171974668-171974668-1523,171974686-171974686-867,171974720-171974720-2572451842,171974769-171974769-2185928,171974784-171974784-43861836,171974792-171974792-378345284,171974853-171974853-722444074,171974883-171974883-1921892293,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-106,171978766-171978766-1184632,171978767-171978767-4484858466,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-4822899280,171978777-171978777-2596176831,171978778-171978778-3298185533,171978786-171978786-1614059290,171978787-171978787-1440375763,171978876-171978876-1972981824,171978924-171978924-2669612155,180355078-180355078-26706647,180355111-180355111-131673159,180355171-180355171-148310907,180355173-180355173-62400,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522,180359190-180359190-192195477,180359207-180359207-2,180359241-180359241-121693516,180359242-180359242-170963125,180359271-180359271-6045596,180363367-180363367-134174373,180367364-180367364-74755871,180367474-180367474-91976046
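
As a rough cross-check of saved positions like the ones above, a short Python sketch can parse the `show all slaves status\G` output and flag any connection whose SQL thread stopped short of what the IO thread had already read. This is only an illustration, not something that was run on this task: the function names `parse_slave_status` and `unapplied_connections` are hypothetical, the sample text is trimmed to the relevant fields of the s1 and s2 rows, and it assumes plain `Field: value` lines as the mysql client prints them.

```python
import re

# Trimmed sample in the format printed by: mysql -e "show all slaves status\G"
# (only the fields used below; values taken from the s1 and s2 rows above).
SAMPLE = """\
*************************** 1. row ***************************
Connection_name: s1
Master_Log_File: db1124-bin.002732
Read_Master_Log_Pos: 936776393
Relay_Master_Log_File: db1124-bin.002732
Exec_Master_Log_Pos: 936776393
*************************** 2. row ***************************
Connection_name: s2
Master_Log_File: db1125-bin.002435
Read_Master_Log_Pos: 361230525
Relay_Master_Log_File: db1125-bin.002435
Exec_Master_Log_Pos: 361230323
"""


def parse_slave_status(text):
    """Split the vertical-format status output into one dict per connection."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # A line such as "**** 1. row ****" starts a new connection.
        if re.fullmatch(r"\*+ \d+\. row \*+", line):
            rows.append({})
            continue
        # Anything before the first row marker (e.g. the shell prompt) is skipped.
        if rows and ":" in line:
            key, _, value = line.partition(":")
            rows[-1][key.strip()] = value.strip()
    return rows


def unapplied_connections(rows):
    """Names of connections where the executed position trails the read position."""
    flagged = []
    for row in rows:
        same_file = row.get("Master_Log_File") == row.get("Relay_Master_Log_File")
        same_pos = row.get("Read_Master_Log_Pos") == row.get("Exec_Master_Log_Pos")
        if not (same_file and same_pos):
            flagged.append(row.get("Connection_name"))
    return flagged


print(unapplied_connections(parse_slave_status(SAMPLE)))  # ['s2']
```

In the output above only s2 differs: Exec_Master_Log_Pos (361230323) is slightly behind Read_Master_Log_Pos (361230525). Relay_Master_Log_File together with Exec_Master_Log_Pos describes what has actually been applied, so those are the coordinates to use if replication ever has to be reconfigured from this copy.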

Change 601715 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Reimage labsdb1011 as stretch

https://gerrit.wikimedia.org/r/601715

Change 601715 merged by Marostegui:
[operations/puppet@production] install_server: Reimage labsdb1011 as stretch

https://gerrit.wikimedia.org/r/601715

Bad news: db1141 is showing InnoDB errors after the restart done to copy the data to backup1002 before proceeding with labsdb1010. It hasn't crashed yet, but it is showing all the errors that labsdb1011 showed a few hours before it crashed.
So this confirms the problem is neither host-specific nor data-specific (db1141 was built from a labsdb1009 logical dump).

This means we cannot migrate the labsdb hosts to Buster and MariaDB 10.4, which essentially blocks all the migrations in core, as having a 10.4 master replicating to a 10.1 slave isn't really recommended.
We haven't seen anything like this in any other role we have, only on the multi-source ones (which are heavily used and have lots of lag, purge lag, huge queries and lots of killed queries per day, plus all the instability that multi-source has shown over the years).

I am going to try to reimage labsdb1011 back to Stretch and MariaDB 10.1 using the logical dump we still have, which means it will take around 10 days to get it back online.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['labsdb1011.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006021246_marostegui_6380.log.

Completed auto-reimage of hosts:

['labsdb1011.eqiad.wmnet']

and were ALL successful.

myloader started on labsdb1011 with 18 threads.

Sorry for my impatience: can you say when you will be running at full capacity again? (T252209)

I am going to try to reimage labsdb1011 back to stretch and mariadb 10.1 using the logical dump we still have, it means that it will take around 10 days to be able to get it back online.

At least another week, possibly 2 more weeks.

Okay super! If so it will be very helpful to make a statement about it here.

db1141 finally crashed (it is not pooled)
To keep that host up to date, in case we need to pool it temporarily for a few minutes while doing maintenance on another host, I have run:

set global slave_exec_mode = 'idempotent';

This bypasses a few errors on s8 replication. The host won't be pooled unless we need to temporarily do maintenance on either labsdb1009 or labsdb1010, until labsdb1011 has finished reimporting all the data.
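
For readers unfamiliar with slave_exec_mode, here is a toy Python model of the difference between STRICT (the default) and IDEMPOTENT for row-based events. It is only a sketch: the function, the exception and the event tuples are hypothetical, and it assumes that in IDEMPOTENT mode duplicate-key and key-not-found errors are suppressed rather than stopping the SQL thread. It is not how MariaDB implements it.

```python
class ReplicationStopped(Exception):
    """Raised when the toy SQL thread hits an error it is not allowed to skip."""


def apply_row_events(rows, events, mode="STRICT"):
    """Apply (op, key, value) row events to a dict standing in for a table."""
    for op, key, value in events:
        if op == "insert":
            if key in rows and mode == "STRICT":
                raise ReplicationStopped(f"duplicate key {key!r}")
            # Assumption of this toy model: IDEMPOTENT overwrites the existing row.
            rows[key] = value
        elif op == "delete":
            if key not in rows:
                if mode == "STRICT":
                    raise ReplicationStopped(f"key {key!r} not found")
                continue  # IDEMPOTENT: skip the missing row and keep going
            del rows[key]
    return rows


replica = {1: "a"}
events = [("delete", 2, None), ("insert", 1, "b"), ("insert", 3, "c")]

try:
    apply_row_events(dict(replica), events, mode="STRICT")
    strict_outcome = "applied"
except ReplicationStopped as exc:
    strict_outcome = f"stopped: {exc}"

idempotent_rows = apply_row_events(dict(replica), events, mode="IDEMPOTENT")
print(strict_outcome)
print(idempotent_rows)
```

The trade-off is that the replica keeps replicating but may silently diverge from its master, which is acceptable here only because the host is not pooled.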

After 5 days of importing data, labsdb1011 has imported 2.8TB out of around 6.5TB, so it is going to take ages to import + replicate.
I think I am going to ask @elukey to see if I can stop labsdb1012 tomorrow and do a binary copy of it (which takes around 8h)
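
As a rough check on those numbers, a linear extrapolation (assuming the import rate stays constant, which it may not as larger tables are loaded) can be sketched like this. The function name is made up for illustration.

```python
def remaining_import_days(imported_tb, total_tb, elapsed_days):
    """Linear extrapolation of the remaining import time, in days."""
    rate_tb_per_day = imported_tb / elapsed_days
    return (total_tb - imported_tb) / rate_tb_per_day


# Figures from the comment above: 2.8TB of ~6.5TB after 5 days.
remaining = remaining_import_days(imported_tb=2.8, total_tb=6.5, elapsed_days=5)
binary_copy_days = 8 / 24  # the ~8h binary copy, expressed in days

print(f"remaining logical import: ~{remaining:.1f} days")
print(f"binary copy from labsdb1012: ~{binary_copy_days:.2f} days")
```

That is roughly another week of importing before replication can even start catching up, against about a third of a day for the binary copy.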

Mentioned in SAL (#wikimedia-operations) [2020-06-08T07:05:35Z] <marostegui> Stop MySQL on labsdb1012 to clone labsdb1011 T249188

Replication started on labsdb1011 (Stretch + MariaDB 10.1 again) after copying it from labsdb1012.

Change 603770 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] labsdb1011: Enable notifications

https://gerrit.wikimedia.org/r/603770

Change 603770 merged by Marostegui:
[operations/puppet@production] labsdb1011: Enable notifications

https://gerrit.wikimedia.org/r/603770

Change 603774 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1018: Pool labsdb1011 and add labsdb1010 with reduced weight

https://gerrit.wikimedia.org/r/603774

Change 598691 abandoned by Marostegui:
dbproxy1018: Add labsdb1010 with reduced weight

Reason:
Will be done via 603774

https://gerrit.wikimedia.org/r/598691

Change 603774 merged by Marostegui:
[operations/puppet@production] dbproxy1018: Pool labsdb1011 and add labsdb1010 with reduced weight

https://gerrit.wikimedia.org/r/603774

Mentioned in SAL (#wikimedia-operations) [2020-06-09T05:32:54Z] <marostegui> Switch dbproxy1018 from "master" service to "replicas" - T249188

After a couple of months, I am declining this task because it has been impossible to upgrade labsdb to Buster and MariaDB 10.4.

To sum up:

  • Tried upgrading two different hosts (labsdb1011, db1141): they both crashed after a few hours.
  • Tried different ways of importing data and upgrading: upgrading on the fly, upgrading binary coming from labsdb1012, upgrading from a logical dump from labsdb1009.
  • This might be related to multi-source and/or the fact that these hosts have so much load, concurrency and lag. We've had hosts upgraded in production (not multi-instance) which had no problems at all.
  • Filed https://jira.mariadb.org/browse/MDEV-22373 (there are several bugs somewhat similar to what we've seen, so it might be "known").
  • Labsdb1011 has been reimaged back to stretch and MariaDB 10.1 in order to get it back into the mix.
  • I have changed the way we serve the analytics role, right now we have two hosts instead of 1 (labsdb1010 with a bit less weight and labsdb1011).
  • The future of this infrastructure is being discussed now as multi-source no longer scales (having just a single host replicating all production traffic + heavy queries, unfortunately, won't last long as the only possibility is to keep scaling vertically).
    • Multi-source hosts are hard to maintain, as a single corruption means it corrupts all the data, which in the case of these hosts is more than 6TB, and takes days and weeks to be able to put them back in service.

@Marostegui: After declining this task, have you considered using something better than MariaDB, if possible? Is our DB running at full capacity again?

@doctaxon all the thoughts and future next steps are at T249188#6204681
Our labsdb hosts are running at full capacity now, with two hosts on the Analytics role (we used to have 1). This looks like it has improved performance, as we are balancing the load between two hosts instead of having just one handling everything.

As mentioned at T249188#6204681, the future architecture and topology of this service is being discussed at the moment: we are trying to find a way to scale this up in an easier and more performant way.

I wouldn't be surprised if labsdb1009 crashes sometime "soon". I was checking it for some bad performance lately and I have seen this in the logs:

Jul  9 00:18:00 labsdb1009 mysqld[15890]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
Jul 21 10:51:52 labsdb1009 mysqld[15890]: InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 0
Jul 21 10:51:52 labsdb1009 mysqld[15890]: InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 0
Jul 21 10:51:52 labsdb1009 mysqld[15890]: InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 0

Expanded, they look like this:

Jul  9 00:18:00 labsdb1009 mysqld[15890]: InnoDB: error in sec index entry update in
Jul  9 00:18:00 labsdb1009 mysqld[15890]: InnoDB: index `el_from_index_60` of table `ruwikinews`.`externallinks`
Jul  9 00:18:00 labsdb1009 mysqld[15890]: InnoDB: tuple DATA TUPLE: 3 fields;
Jul  9 00:18:00 labsdb1009 mysqld[15890]:  0: len 4; hex 00262b42; asc  &+B;;
Jul  9 00:18:00 labsdb1009 mysqld[15890]:  1: len 60; hex 68747470733a2f2f6f72672e776d666c6162732e746f6f6c732e2f6d61737376696577732f3f706c6174666f726d3d616c6c2d616363657373266167; asc https://org.wmflabs.tools./massviews/?platform=all-access&ag;;
Jul  9 00:18:00 labsdb1009 mysqld[15890]:  2: len 4; hex 017283f6; asc  r  ;;
Jul  9 00:18:00 labsdb1009 mysqld[15890]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
Jul  9 00:18:00 labsdb1009 mysqld[15890]:  0: len 4; hex 00262b42; asc  &+B;;
Jul  9 00:18:00 labsdb1009 mysqld[15890]:  1: len 30; hex 68747470733a2f2f6f72672e776d666c6162732e746f6f6c732e2f6d6173; asc https://org.wmflabs.tools./mas; (total 60 bytes);
Jul  9 00:18:00 labsdb1009 mysqld[15890]:  2: len 4; hex 017283f5; asc  r  ;;
Jul  9 00:18:00 labsdb1009 mysqld[15890]: TRANSACTION 159440450450, ACTIVE 0 sec updating or deleting
Jul  9 00:18:00 labsdb1009 mysqld[15890]: mysql tables in use 1, locked 1
Jul  9 00:18:00 labsdb1009 mysqld[15890]: 2 lock struct(s), heap size 360, 3 row lock(s), undo log entries 3
Jul  9 00:18:00 labsdb1009 mysqld[15890]: MySQL thread id 19, OS thread handle 0x7f375292c700, query id 26067865130 Delete_rows_log_event::ha_delete_row(-1)
Jul  9 00:18:00 labsdb1009 mysqld[15890]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

These are similar to the errors seen in the other crashes (with 10.4). However, I am not seeing any ERROR entry, which is what we saw before, so that's good so far.
Just leaving this here for the record.
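
To make the mismatch in that log easier to read, the hex fields can be decoded. This is a small sketch that assumes the 4-byte fields are unsigned integers stored big-endian (signed columns would need the sign bit handled); the helper name is made up, and which column each field maps to is not stated in the log.

```python
def decode_uint(hex_field):
    """Decode a hex dump of an unsigned big-endian integer field."""
    return int.from_bytes(bytes.fromhex(hex_field), byteorder="big")


# Fields 0 and 2 of the DATA TUPLE and of the PHYSICAL RECORD from the log.
tuple_first, tuple_last = decode_uint("00262b42"), decode_uint("017283f6")
record_first, record_last = decode_uint("00262b42"), decode_uint("017283f5")

# Field 1 of the PHYSICAL RECORD: the 30 bytes that were printed.
url_prefix = bytes.fromhex(
    "68747470733a2f2f6f72672e776d666c6162732e746f6f6c732e2f6d6173"
).decode("ascii")

print(tuple_first, record_first)  # same value in both
print(tuple_last, record_last)    # differ by one
print(url_prefix)
```

The first field is identical in both, and the last field differs by exactly one (24282102 in the tuple versus 24282101 in the record), so the secondary index entry being updated does not match the record that was found.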

Update on this:
MariaDB has shipped some fixes in the latest release, 10.4.14, that could be somewhat related to this.
They've also been able to reproduce the error(s), or similar ones, so they are investigating. It looks related to the change buffer.

Still, there is no point in trying to get labsdb1011 to 10.4: if it fails again, recovering is extremely time consuming.