
db2120 & db2121 crashed
Closed, ResolvedPublic

Description

MySQL crashed:

Sep  2 13:58:20 db2120 mysqld[1095]: 2020-09-02 13:58:20 107197705 [ERROR] InnoDB: Database page corruption on disk or a failed file read of tablespace metawiki/content page [page id: space=4936, page number=49350]. You may have to recover from a backup.
Sep  2 13:58:20 db2120 mysqld[1095]: 2020-09-02 13:58:20 107197705 [Note] InnoDB: Page dump in ascii and hex (16384 bytes):
Sep  2 13:58:20 db2120 mysqld[1095]:  len 16384; hex 58252d810000c0c60000c0c50000c0c7000014f062d2a6a045bf00000000000000000000134800323b8c80c9000000003b47000200c600c7000000000000000000000000000000004d7e0000000000000000000000000000000000000000010002001c696e66696d756d0008000
b000073757072656d756d0b1f000010004c0000000000967dde0004634a1a2dff000002a63b8e000003cd646368746a30797a616c74396e3834386d70396b73637130326b6c7074336c000174743a31303136363635310b1f000018004c0000000000967ddf0004634a1a2dff000002a63ba0000003cd646368746a30797a616c74396e3834386d7
0396b73637130326b6c7074336c000174743a31303136363635320b1f000020004c0000000000967de00004634a1a2dff000002a63bb2000003cd646368746a30797a616c74396e3834386d70396b73637130326b6c7074336c000174743a31303136363635330b1f040028004c0000000000967de10004634a1a2dff000002a63bc4000003cd646
368746a30797a616c74396e3834386d70396b73637130326b6c7074336c000174743a31303136363635340b1f000030004c0000000000967de20004634a1a2dff000002a63bd6000003cd646368746a30797a616c74396e3834386d70396b73637130326b6c7074336c000174743a31303136363635350b1f000038004c0000000000967de300046
34a1a2dff000002a63be8000003cd39733962726e7a763063683170306f7a7037656d6b6b706b376a3661356561000174743a31303136363635360b1f000040004c0000000000967de40004634a1a2dff000002a63bfa000003cd39733962726e7a763063683170306f7a7037656d6b6b706b376a3661356561000174743a31303136363635370b1
f040048004c0000000000967de50004634a1a2dff000002a63c0c000003cd39733962726e7a763063683170306f7a7037656d6b6b706b376a3661356561000174743a31303136363635380b1f000050004c0000000000967de60004634a1a2dff000002a63c1e000003cd39733962726e7a763063683170306f7a7037656d6b6b706b376a3661356
561000174743a31303136363635390b1f000058004c0000000000967de70004634a1a2dff000002a63c30000003cd39733962726e7a763063683170306f7a7037656d6b6b706b376a3661356561000174743a31303136363636300b1f000060004c0000000000967de80004634a1a2dff000002a63c42000003cd39733962726e7a7630636831703
06f7a7037656d6b6b706b376a3661356561000174743a31303136363636310b1f040068004c0000000000967de90004634a1a2dff000002a63c54000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363636320b1f000070004c0000000000967dea0004634a1a2dff000002a63c66000
003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363636330b1f000078004c0000000000967deb0004634a1a2dff000002a63c78000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363636340b1f000080004c0000000000967
dec0004634a1a2dff000002a63c8a000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363636350b1f040088004c0000000000967ded0004634a1a2dff000002a63c9c000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363
636360b1f000090004c0000000000967dee0004634a1a2dff000002a63cae000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363636370b1f000098004c0000000000967def0004634a1a2dff000002a63cc0000003cd6a736877786265356c3173723965696c6f33356274623563717
06f6d75746e000174743a31303136363636380b1f0000a0004c0000000000967df00004634a1a2dff000002a63cd2000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363636390b1f0400a8004c0000000000967df10004634a1a2dff000002a63ce4000003cd6a736877786265356c3
173723965696c6f3335627462356371706f6d75746e000174743a31303136363637300b1f0000b0004c0000000000967df20004634a1a2dff000002a63cf6000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363637310b1f0000b8004c0000000000967df30004634a1a2dff000002a
63d08000003cd6a736877786265356c3173723965696c6f3335627462356371706f6d75746e000174743a31303136363637320b1f0000c0004c0000000000967df40004634a1a2dff000002a63d1a000003cd616b3837773335376d6e65623867316174716e32756d736d62646c6e756865000174743a31303136363637330b1f0400c8004c00000
00000967df50004634a1a2dff000002a63d2c000003cd616b3837773335376d6e65623867316174716e32756d736d62646c6e756865000174743a31303136363637340b1f0000d0004c0000000000967df60004634a1a2dff000002a63d3e000003cd616b3837773335376d6e65623867316174716e3275
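For anyone wanting to read the hex dump above: a minimal Python sketch that turns a fragment of it into printable ASCII. The helper name `printable_ascii` is made up for illustration; the two fragments are copied from the dump, and non-printable bytes are shown as dots.

```python
def printable_ascii(hex_fragment):
    """Decode a hex string and replace non-printable bytes with '.'."""
    raw = bytes.fromhex(hex_fragment)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Fragments copied from the page dump above.
print(printable_ascii("696e66696d756d"))          # InnoDB "infimum" pseudo-record
print(printable_ascii("74743a3130313636363531"))  # a "tt:<id>" value stored in the row
```

The second fragment decodes to the same kind of `tt:<number>` value that shows up in the ASCII part of the db2121 dump.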

Let's rebuild it from backups.

Both crashes are on the metawiki.content table.

Event Timeline

Kormat moved this task from Triage to In progress on the DBA board.
Kormat added a project: User-Kormat.
Kormat moved this task from Unsorted 💣 to Active 🚁 on the User-Kormat board.

HW logs look OK, both the controller logs and the iLO ones. This host's history: T236453: Degraded RAID on db2120

Mentioned in SAL (#wikimedia-operations) [2020-09-02T14:18:54Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12434 and previous config saved to /var/cache/conftool/dbconfig/20200902-141854-marostegui.json

Host is downtimed for 24h, and recovery from backups has started.

Marostegui triaged this task as Medium priority. Sep 2 2020, 2:33 PM

Restore completed, packages upgraded, and machine currently rebooting. Will then do mysql_upgrade, and start replication.

jcrespo renamed this task from db2120 crashed to db2120 & db2121 crashed. Sep 3 2020, 6:42 AM

Mentioned in SAL (#wikimedia-operations) [2020-09-03T06:43:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12441 and previous config saved to /var/cache/conftool/dbconfig/20200903-064334-marostegui.json

db2121 crashed with the same database and table:

Sep 03 06:35:19 db2121 mysqld[3365]: 2020-09-03  6:35:19 349115836 [Note] InnoDB: Index 19838 is `PRIMARY` in table `metawiki`.`content`
Sep 03 06:35:19 db2121 mysqld[3365]: 2020-09-03  6:35:19 349115836 [Note] InnoDB: It is also possible that your operating system has corrupted its own file cache and rebooting your computer removes the error. If the corrupt page is an index page. You can also try to fix t
Sep 03 06:35:19 db2121 mysqld[3365]: 2020-09-03  6:35:19 349115836 [ERROR] InnoDB: Database page corruption on disk or a failed file read of tablespace metawiki/content page [page id: space=4936, page number=49350]. You may have to recover from a backup.
Sep 03 06:35:19 db2121 mysqld[3365]: 2020-09-03  6:35:19 349115836 [Note] InnoDB: Page dump in ascii and hex (16384 bytes):
Sep 03 06:35:19 db2121 mysqld[3365]:  len 16384; hex 58252d810000c0c60000c0c50000c0c7000014f062d2a6a045bf00000000000000000000134800323b8c80c9000000003b47000200c600c7000000000000000000000000000000004d7e0000000000000000000000000000000000000000010002001c696e66696d756d0008000
Sep 03 06:35:19 db2121 mysqld[3365]:  nc78cd98pm89zjisagy77u5pu86xu1r  tt:10166777      L      ~]  cJ -           hm24skj7nhhsexi2qty45stkud98vwf  tt:10166778      L      ~^  cJ -           hm24skj7nhhsexi2qty45stkud98vwf  tt:10166779      L      ~_  cJ -           hm24sk
Sep 03 06:35:19 db2121 mysqld[3365]: InnoDB: End of page dump
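To check whether two hosts report the same corrupted page, a rough Python sketch along these lines could be used. The regex and the function name `parse_corruption` are assumptions for illustration, written only against the `[ERROR]` message format seen in the logs on this task.

```python
import re

# Matches the InnoDB corruption message as it appears in the logs on this task.
CORRUPTION_RE = re.compile(
    r"tablespace (?P<tablespace>\S+) page "
    r"\[page id: space=(?P<space>\d+), page number=(?P<page>\d+)\]"
)


def parse_corruption(line):
    """Return (tablespace, space id, page number), or None if the line does not match."""
    m = CORRUPTION_RE.search(line)
    if m is None:
        return None
    return (m.group("tablespace"), int(m.group("space")), int(m.group("page")))


# Message part of the [ERROR] lines from db2120 and db2121 (syslog prefix dropped).
db2120 = ("[ERROR] InnoDB: Database page corruption on disk or a failed file read of "
          "tablespace metawiki/content page [page id: space=4936, page number=49350]. "
          "You may have to recover from a backup.")
db2121 = db2120  # the message text on db2121 is identical

print(parse_corruption(db2120))
print(parse_corruption(db2120) == parse_corruption(db2121))
```

Both hosts report the same tablespace, space id and page number, which is consistent with the crash being tied to the table contents rather than to one machine's hardware.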

Mentioned in SAL (#wikimedia-operations) [2020-09-03T06:46:24Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12442 and previous config saved to /var/cache/conftool/dbconfig/20200903-064623-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T06:48:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12443 and previous config saved to /var/cache/conftool/dbconfig/20200903-064804-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T06:51:06Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12444 and previous config saved to /var/cache/conftool/dbconfig/20200903-065105-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T07:01:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12445 and previous config saved to /var/cache/conftool/dbconfig/20200903-070104-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T07:02:26Z] <marostegui> Stop db2100:3317 and db2121 in sync to reload metawiki.content T261869

Mentioned in SAL (#wikimedia-operations) [2020-09-03T07:27:17Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12446 and previous config saved to /var/cache/conftool/dbconfig/20200903-072716-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T07:31:17Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12447 and previous config saved to /var/cache/conftool/dbconfig/20200903-073116-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T07:37:19Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12448 and previous config saved to /var/cache/conftool/dbconfig/20200903-073718-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T07:44:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool db2121 T261869', diff saved to https://phabricator.wikimedia.org/P12449 and previous config saved to /var/cache/conftool/dbconfig/20200903-074426-marostegui.json

Resolving this as both hosts are back in production.
db2120 was fully rebuilt
db2121 got its table reloaded from the backup source in codfw.
The follow-up is to reload the metawiki.content table on the s7 10.4 hosts: T261917

If there are updates on the MariaDB bugs, I will post them here.