
Reimage and reclone db2125
Closed, Resolved (Public)

Description

After the last HW crash, this host wasn't recloned, and MySQL crashed less than 24h after being repooled.
However, I noticed it had also crashed shortly after being restarted following the onsite maintenance:

Sep 17 13:53:07 db2125 mysqld[3189]: 2020-09-17 13:53:07 8 [Note] Slave I/O thread: connected to master 'repl@db2107.codfw.wmnet:3306',replication starts at GTID position '171970567-171970567-390719905,0-180359173-4880477695,180359173-180359173-70825087,171978786-17197878
Sep 17 13:54:16 db2125 mysqld[3189]: 2020-09-17 13:54:16 0 [ERROR] InnoDB: Unable to purge a record
Sep 17 13:54:16 db2125 mysqld[3189]: InnoDB: tuple DATA TUPLE: 2 fields;
Sep 17 13:54:16 db2125 mysqld[3189]:  0: len 4; hex 005c20d2; asc  \  ;;
Sep 17 13:54:16 db2125 mysqld[3189]:  1: len 4; hex 0341d33a; asc  A :;;
Sep 17 13:54:16 db2125 mysqld[3189]: InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 0
Sep 17 13:54:16 db2125 mysqld[3189]:  0: len 4; hex 005c20d2; asc  \  ;;
Sep 17 13:54:16 db2125 mysqld[3189]:  1: len 4; hex 0341d33a; asc  A :;;
Sep 17 13:54:16 db2125 mysqld[3189]: space 7137 offset 4792 (1103 records, index id 25623)
Sep 17 13:54:16 db2125 mysqld[3189]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
Sep 17 13:56:27 db2125 mysqld[3189]: 2020-09-17 13:56:27 0 [ERROR] InnoDB: Unable to purge a record
Sep 17 13:56:27 db2125 mysqld[3189]: InnoDB: tuple DATA TUPLE: 2 fields;
Sep 17 13:56:27 db2125 mysqld[3189]:  0: len 4; hex 002b95d4; asc  +  ;;
Sep 17 13:56:27 db2125 mysqld[3189]:  1: len 4; hex 0061c475; asc  a u;;
Sep 17 13:56:27 db2125 mysqld[3189]: InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 0
Sep 17 13:56:27 db2125 mysqld[3189]:  0: len 4; hex 002b95d4; asc  +  ;;
Sep 17 13:56:27 db2125 mysqld[3189]:  1: len 4; hex 0061c475; asc  a u;;

This error looks similar to the ones experienced on labsdb1011, so I am going to update MariaDB's bug report.

Let's not only reclone, but also reimage, just to be fully sure.

Event Timeline

There was a buffer overflow detected:

Sep 18 11:10:32 db2125 mysqld[3189]: It is possible that mysqld could use up to
Sep 18 11:10:32 db2125 mysqld[3189]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4752266 K  bytes of memory
Sep 18 11:10:32 db2125 mysqld[3189]: Hope that's ok; if not, decrease some variables in the equation.
Sep 18 11:10:32 db2125 mysqld[3189]: Thread pointer: 0x7f7b840014f8
Sep 18 11:10:32 db2125 mysqld[3189]: Attempting backtrace. You can use the following information to find out
Sep 18 11:10:32 db2125 mysqld[3189]: where mysqld died. If you see no messages after this, something went
Sep 18 11:10:32 db2125 mysqld[3189]: terribly wrong...
Sep 18 11:10:32 db2125 mysqld[3189]: stack_bottom = 0x7fdbf8186698 thread_stack 0x30000
Sep 18 11:10:35 db2125 mysqld[3189]: *** buffer overflow detected ***: /opt/wmf-mariadb104/bin/mysqld terminated

Change 628312 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/628312

Change 628312 merged by Marostegui:
[operations/puppet@production] db2125: Disable notifications

https://gerrit.wikimedia.org/r/628312

Marostegui moved this task from Triage to In progress on the DBA board.

I want to try setting this for this host once it is recloned, to see how it goes:

innodb_change_buffering=inserts

MariaDB recommended trying that, as there is a suspicion that the default setting could be playing a role here:

innodb_change_buffering=all
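
For context, innodb_change_buffering controls which secondary-index operations InnoDB may buffer in the change buffer; in MariaDB 10.4 the accepted values are none, inserts, deletes, changes, purges and all (the default). A minimal sketch of inspecting and changing it at runtime, using standard MariaDB SQL rather than anything host-specific:

-- Check which setting is currently active.
SELECT @@GLOBAL.innodb_change_buffering;
-- Buffer inserts only. SET GLOBAL is runtime-only: a restart reverts
-- to whatever my.cnf specifies, hence the my.cnf change planned below.
SET GLOBAL innodb_change_buffering = inserts;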

For the record, this is MariaDB's comment:

This is a known problem also in Oracle MySQL, since at least MySQL 5.5.
It happened very infrequently, and we did not find out a way to trigger it.
Like I wrote in MDEV-9663 some time ago, I suspect that there could be a bug in the InnoDB change buffering that could cause this. I would suggest one of:

SET GLOBAL innodb_change_buffering=inserts;
SET GLOBAL innodb_change_buffering=none;

A more recent comment from them:

We now have a reasonably small test case for reproducing MDEV-22924, and I am working on analyzing it. Currently, it looks like the corruption would not involve the change buffer at all, because there are relatively few pages.

So before starting MySQL for the first time, I want to set the following in my.cnf for this host and let it run with it for a few days:

innodb_change_buffering=inserts

However, it is hard to know whether this was triggered by the two HW crashes this host had (T260670). Let's try changing that setting anyway and see how it goes (and how much it affects performance).
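
For reference, a minimal sketch of what that my.cnf stanza could look like; the [mysqld] section header is standard, but the exact file layout on this host is an assumption, and only the variable itself comes from this task:

[mysqld]
# Buffer only inserts in the InnoDB change buffer (default is 'all'),
# as suggested by MariaDB upstream while MDEV-22924 is investigated.
innodb_change_buffering = inserts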

Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts:

['db2125.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009181242_kormat_3990.log.

Completed auto-reimage of hosts:

['db2125.codfw.wmnet']

and were ALL successful.

Machine has been reimaged, and the db restored from backup. Puppet is disabled, innodb_change_buffering=inserts has been added to /etc/my.cnf, and the machine is currently catching up on replication.
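
For anyone following along, a quick way to watch the catch-up is plain MariaDB replication status, nothing WMF-specific; a sketch:

-- Seconds_Behind_Master should trend towards 0 before the host
-- is considered for repooling.
SHOW SLAVE STATUS\G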

Thank you so much, Stevie, for taking care of this!

And for posterity:

root@db2125.codfw.wmnet[(none)]> show global variables like '%change_buffering';
+-------------------------+---------+
| Variable_name           | Value   |
+-------------------------+---------+
| innodb_change_buffering | inserts |
+-------------------------+---------+
1 row in set (0.034 sec)

I will monitor closely how this affects performance once we repool this host on Monday.
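
One way to watch whether the reduced change buffering is having an effect is InnoDB's own counters; a sketch using standard MariaDB (counter availability can vary by version, so treat this as an assumption to verify):

-- Change buffer activity; ibuf_merges should keep advancing if
-- insert buffering is actually being used.
SELECT name, count
FROM information_schema.innodb_metrics
WHERE name LIKE 'ibuf%';
-- The 'INSERT BUFFER AND ADAPTIVE HASH INDEX' section of
-- SHOW ENGINE INNODB STATUS reports similar information.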

Mentioned in SAL (#wikimedia-operations) [2020-09-21T08:18:34Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12678 and previous config saved to /var/cache/conftool/dbconfig/20200921-081833-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T08:33:37Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12679 and previous config saved to /var/cache/conftool/dbconfig/20200921-083337-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T08:48:40Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12681 and previous config saved to /var/cache/conftool/dbconfig/20200921-084840-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T09:03:44Z] <kormat@cumin1001> dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: reimage+reclone done T263244', diff saved to https://phabricator.wikimedia.org/P12682 and previous config saved to /var/cache/conftool/dbconfig/20200921-090343-kormat.json

Resolving this. Host was reimaged+recloned, but died again due to h/w issues (tracked by T260670: db2125 crashed - mgmt iface also not available)

Tracking for the innodb_change_buffering flag is happening at: T263443