dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller.
Closed, Resolved · Public

Authored By

	• Marostegui
	Feb 6 2018, 12:32 PM

Description

This is what caused the crash:

Feb  6 12:06:54 dbstore1001 kernel: [10982464.366365] megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.
Feb  6 12:06:54 dbstore1001 kernel: [10982464.366371] megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
Feb  6 12:08:45 dbstore1001 kernel: [10982575.581419] megaraid_sas 0000:03:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 2713
Feb  6 12:10:37 dbstore1001 kernel: [10982686.920489] megaraid_sas 0000:03:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 2713
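
The kernel lines above can be summarized programmatically. The following is a minimal sketch, assuming the syslog layout shown in this task; the regular expression and the helper names `parse_events` and `summarize` are illustrative and not part of any existing tooling.

```python
import re

# Matches the syslog-formatted megaraid_sas kernel lines quoted above.
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] megaraid_sas (?P<pci>\S+): (?P<msg>.*)$"
)

LOG = """\
Feb  6 12:06:54 dbstore1001 kernel: [10982464.366365] megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.
Feb  6 12:06:54 dbstore1001 kernel: [10982464.366371] megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
Feb  6 12:08:45 dbstore1001 kernel: [10982575.581419] megaraid_sas 0000:03:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 2713
Feb  6 12:10:37 dbstore1001 kernel: [10982686.920489] megaraid_sas 0000:03:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 2713
"""


def parse_events(text):
    """Return (kernel_uptime_seconds, message) pairs for megaraid_sas lines."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            events.append((float(match.group("uptime")), match.group("msg")))
    return events


def summarize(events):
    """Report when the firmware fault was seen and how many resets failed."""
    fault_at = next((t for t, msg in events if "FAULT state" in msg), None)
    failures = [t for t, msg in events if "never cleared" in msg]
    elapsed = None
    if fault_at is not None and failures:
        elapsed = round(failures[-1] - fault_at, 1)
    return {
        "fault_at": fault_at,
        "failed_resets": len(failures),
        "seconds_from_fault_to_last_failure": elapsed,
    }


summary = summarize(parse_events(LOG))
print(summary)
```

In this excerpt the adapter reset failed twice over roughly 3.7 minutes after the firmware fault was detected.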

On startup, the RAID controller reported:

Memory/battery problems were detected.
The adapter has recovered, but cached data was lost.
Press any key to continue, or 'C' to load the configuration utility.

Multibit ECC errors were detected on the RAID controller.
If you continue, data corruption can occur
Please contact technical support to resolve this issue.  Press 'X' to
continue or else power off the system, replace the controller and reboot.

Details

Subject	Repo	Branch	Lines +/-
dbstore: Reenable alerts for dbstore1001 after reset	operations/puppet	production	+0 -1
dbstore1001: Set puppet role as mariadb-backups	operations/puppet	production	+8 -1
mariadb: Prepare dbstore1001 for stretch reimage	operations/puppet	production	+1 -3
dbstore1001: Disable notifications	operations/puppet	production	+1 -0
dblists: Remove dbstore1001 for the list of hosts	operations/software	master	+0 -10
prometheus-mysql-exporter: Reflect latest m2 changes, remove dbstore1001	operations/puppet	production	+3 -3
mariadb: Remove s3 from dbstore2001 to save space	operations/puppet	production	+3 -4
mariadb: Remove dbstore1001 role	operations/puppet	production	+3 -10

Related Objects

Mentioned In: T159430: convert dbstore1001 to multi-instance + InnoDB compressed by importing db shards to it
Mentioned Here: T184697: Failover existing eqiad database backup system to the new codfw database logical backup system

Event Timeline

• Marostegui created this task. · Feb 6 2018, 12:32 PM

Restricted Application added a project: SRE. · Feb 6 2018, 12:32 PM

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-02-06T12:32:27Z] <marostegui> Power cycled dbstore1001 after it crashed - T186596

• Marostegui updated the task description. · Feb 6 2018, 12:32 PM

• Marostegui renamed this task from "dbstore1001 crashed" to "dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller." · Feb 6 2018, 12:35 PM

Begin: Mounting root file system ... Begin: Running /scripts/loc[    9.644741] device-mapper: uevent: version 1.0.3
al-top ... [    9.650495] device-mapper: ioctl: 4.35.0-ioctl (2016-06-23) initialised: dm-devel@redhat.com
done.
Begin: Running /[    9.668657] PM: Starting manual resume from disk
scripts/local-premount ... done.
Begin: Will now check root file system ... fsck from util-linux 2.25.2
[/sbin/fsck.ext3 (1) -- /dev/sda1] fsck.ext3 -a -C0 /dev/sda1
/dev/sda1: recovering journal
/dev/sda1: clean, 70921/2445984 [    9.944113] EXT4-fs (sda1): mounting ext3 file system using the ext4 subsystem
files, 1936248/9764864 blocks
d[    9.954951] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
one.
done.
Begin: Running /scripts/local-bottom ... done.
Begin: Running /scripts/init-bottom ... done.
[   10.440130] ERST: NVRAM ERST Log Address Range not implemented yet.
[   10.460730] systemd[1]: systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SE)
[   10.476063] systemd[1]: Detected architecture 'x86-64'.

Welcome to Debian GNU/Linux 8 (jessie)!

[   10.658020] systemd[1]: Inserted module 'autofs4'
[   10.670529] systemd[1]: Set hostname to <dbstore1001>.
[   10.850008] systemd-sysv-generator[803]: Overwriting existing symlink /run/systemd/generator.late/mysql.service with real service
[   11.014935] systemd[1]: Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to.
[   11.031925] systemd[1]: Starting Forward Password Requests to Wall Directory Watch.
[   11.040592] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[   11.049066] systemd[1]: Expecting device dev-ttyS1.device...
         Expecting device dev-ttyS1.device...
[   11.068541] systemd[1]: Starting Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems (Pre).
[   11.088489] systemd[1]: Reached target Remote File Systems (Pre).
[  OK  ] Set up automount Arbitrary Executable File Formats F...utomount Point.
         Expecting device dev-sda2.device...
         Expecting device dev-disk-by\x2duuid-a87e908b\x2deaf...da506.device...
         Expecting device dev-mapper-tank\x2ddata.device...
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on Journal Socket (/dev/log).
[  OK  ] Listening on Device-mapper event daemon FIFOs.
[  OK  ] Listening on LVM2 metadata daemon socket.
[  OK  ] Listening on udev Control Socket.
[  OK  ] Listening on udev Kernel Socket.
[  OK  ] Listening on Journal Socket.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-systemd\x2dfsck.slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Created slice system-serial\x2dgetty.slice.
         Starting Increase datagram queue length...
         Starting Load Kernel Modules...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
         Starting Create list of required static device nodes...rrent kernel...
[   11.467840] nf_conntrack version 0.5.0 (32768 buckets, 262144 max)
         Mounting Debug File System...
         Starting udev Coldplug all Devices...
[  OK  ] Reached target Slices.
[   11.509007] ipmi message handler version 39.2
[  OK  [   11.514566] ipmi device interface
] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Started Increase datagram queue length.
[  OK  ] Started Load Kernel Modules.
[  OK  ] Started Create list of required static device nodes ...current kernel.
         Starting Create Static Device Nodes in /dev...
         Starting Apply Kernel Variables...
[  OK  ] Listening on Syslog Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Started udev Coldplug all Devices.
         Starting udev Wait for Complete Device Initialization...
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Started Create Static Device Nodes in /dev.
         Starting udev Kernel Device Manager...
[  OK  ] Started udev Kernel Device Manager.
         Starting LSB: Set preliminary keymap...
[  OK  ] Found device /dev/ttyS1.
[  OK  ] Found device PERC_H710P 2.
[  OK  ] Found device PERC_H710P 2.
         Activating swap Swap Partition...
         Activating swap /dev/disk/by-uuid/a87e908b-eaf4-42eb...6f03b10da506...
[  OK  ] Started LSB: Set preliminary keymap.
[  OK  ] Activated swap Swap Partition.
[  OK  ] Activated swap /dev/disk/by-uuid/a87e908b-eaf4-42eb-8b18-6f03b10da506.
[  OK  ] Created slice system-ifup.slice.
[  OK  ] Reached target Swap.
         Starting Remount Root and Kernel File Systems...
[  OK  ] Created slice system-lvm2\x2dpvscan.slice.
         Starting LVM2 PV scan on device 8:3...
[  OK  ] Started Remount Root and Kernel File Systems.
         Starting Load/Save Random Seed...
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started udev Wait for Complete Device Initialization.
         Starting Activation of LVM2 logical volumes...
         Starting Copy rules generated while the root was ro...
[  OK  ] Started LVM2 PV scan on device 8:3.
[  OK  ] Started Copy rules generated while the root was ro.
[  OK  ] Found device /dev/mapper/tank-data.
[  OK  ] Started Activation of LVM2 logical volumes.
[  OK  ] Reached target Encrypted Volumes.
         Starting Activation of LVM2 logical volumes...
         Starting File System Check on /dev/mapper/tank-data...
[  OK  ] Started Activation of LVM2 logical volumes.
         Starting Monitoring of LVM2 mirrors, snapshots etc. ...ress polling...
[  OK  ] Started Monitoring of LVM2 mirrors, snapshots etc. u...ogress polling.
[   12.650827] systemd-fsck[979]: /sbin/fsck.xfs: XFS file system.
[  OK  ] Started File System Check on /dev/mapper/tank-data.
         Mounting /srv...
[FAILED] Failed to mount /srv.
See 'systemctl status srv.mount' for details.
[DEPEND] Dependency failed for Local File Systems.
[  OK  ] Closed ACPID Listen Socket.
[  OK  ] Stopped Getty on tty1.
[  OK  ] Stopped Serial Getty on ttyS1.
[  OK  ] Stopped getty on tty2-tty6 if dbus and logind are not available.
[  OK  ] Stopped target Graphical Interface.
[  OK  ] Stopped target Multi-User System.
[  OK  ] Stopped Deferred execution scheduler.
[  OK  ] Stopped Prometheus exporter for machine metrics.
[  OK  ] Stopped diamond - A system statistics collector for graphite.
[  OK  ] Stopped Nagios Remote Plugin Executor.
[  OK  ] Stopped OpenBSD Secure Shell server.
[  OK  ] Stopped Prometheus exporter for MySQL server.
[  OK  ] Stopped LLDP daemon.
[  OK  ] Stopped Regular background program processing daemon.
[  OK  ] Stopped /etc/rc.local Compatibility.
[  OK  ] Stopped Login Service.
[  OK  ] Reached target Login Prompts.
[  OK  ] Stopped LSB: exim Mail Transport Agent.
[  OK  ] Stopped LSB: Start Bacula File Daemon at boot time.
[  OK  ] Stopped LSB: Start and stop bmc-watchdog.
[  OK  ] Stopped LSB: Monitor for system resources and process activity.
[  OK  ] Stopped LSB: Start/stop sysstat's sadc.
[  OK  ] Stopped LSB: Start and stop ipmidetectd.
[  OK  ] Stopped LSB: Machine Check Exceptions (MCE) collector & decoder.
[  OK  ] Stopped LSB: process and login accounting.
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Reached target Remote File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped target Basic System.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Timers.
[  OK  ] Stopped target System Initialization.
         Starting Create Volatile Files and Directories...
         Starting LSB: Prepare console...
         Starting LSB: Raise network interfaces....
[  OK  ] Closed Syslog Socket.
[  OK  ] Reached target Sockets.
         Starting Emergency Shell...
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
[  OK  ] Started Create Volatile Files and Directories.
[  OK  ] Started LSB: Prepare console.
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
         Starting LSB: Set console font and keymap...
         Starting Network Time Synchronization...
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[  OK  ] Started Network Time Synchronization.
[  OK  ] Started LSB: Set console font and keymap.
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Started LSB: Raise network interfaces..
         Starting ifup for eth0...
[  OK  ] Started ifup for eth0.
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
         Starting LSB: ferm firewall configuration...
Welcome to emergGive root password for maintenance
(or type Control-D to continue):
root@dbstore1001:~# systemctl -l status srv.mount
● srv.mount - /srv
   Loaded: loaded (/etc/fstab)
   Active: failed (Result: exit-code) since Tue 2018-02-06 12:46:24 UTC; 2min 9s ago
    Where: /srv
     What: /dev/mapper/tank-data
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 987 ExecMount=/bin/mount -n /dev/mapper/tank-data /srv -t xfs (code=exited, status=32)

Feb 06 12:46:24 dbstore1001 mount[987]: mount: mount /dev/mapper/tank-data on /srv failed: Bad message
Feb 06 12:46:24 dbstore1001 systemd[1]: srv.mount mount process exited, code=exited status=32
Feb 06 12:46:24 dbstore1001 systemd[1]: Failed to mount /srv.
Feb 06 12:46:24 dbstore1001 systemd[1]: Unit srv.mount entered failed state.

[  171.926534] XFS (dm-0): Mounting V4 Filesystem
[  172.507461] XFS (dm-0): failed to locate log tail
[  172.507464] XFS (dm-0): log mount/recovery failed: error -74
[  172.507491] XFS (dm-0): log mount failed
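
The "error -74" in the kernel log and the "Bad message" reported by mount are the same failure. Below is a small sketch that maps the two Linux errno values XFS most commonly returns for on-disk problems; the table is hard-coded for Linux and the helper name `explain_xfs_error` is hypothetical.

```python
# Linux errno values that XFS aliases for on-disk problems.
# Hard-coded because the numeric values differ between operating systems.
XFS_ERRNO = {
    74: ("EBADMSG", "Bad message", "aliased by XFS as EFSBADCRC"),
    117: ("EUCLEAN", "Structure needs cleaning", "aliased by XFS as EFSCORRUPTED"),
}


def explain_xfs_error(code):
    """Translate a (possibly negative) kernel error code into readable text."""
    number = abs(code)
    if number not in XFS_ERRNO:
        return f"errno {number}: not in this illustrative table"
    name, text, note = XFS_ERRNO[number]
    return f"errno {number} = {name} ({text}), {note}"


message = explain_xfs_error(-74)
print(message)
```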

The logical drive (LD) is still visible to the controller, although its contents may be corrupted:

root@dbstore1001:~# megacli -LDInfo -L0 -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
Sector Size         : 512
Mirror Data         : 10.913 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
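
The key/value output above is easy to check automatically. This is a minimal sketch that parses a subset of the lines shown; the helper names `parse_ldinfo` and `summarize_ld` are illustrative. Reading "Primary-1" with a span depth of 6 as a RAID 10 style layout of six mirrored pairs is an interpretation, not something the output states directly.

```python
SAMPLE = """\
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
State               : Optimal
Number Of Drives per span:2
Span Depth          : 6
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Bad Blocks Exist: No
"""


def parse_ldinfo(text):
    """Split 'Key : Value' lines into a dict, skipping lines without a value."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def summarize_ld(info):
    """Derive a few health indicators from the parsed fields."""
    drives = int(info["Number Of Drives per span"]) * int(info["Span Depth"])
    return {
        "drives": drives,
        "healthy": info["State"] == "Optimal" and info["Bad Blocks Exist"] == "No",
        "write_back": info["Current Cache Policy"].startswith("WriteBack"),
    }


ld = summarize_ld(parse_ldinfo(SAMPLE))
print(ld)
```

Note that the controller reports the logical drive as Optimal with no bad blocks even though cached writes were lost, so this check alone says nothing about filesystem consistency.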

I have brought the system back up without mounting /srv (commented out in fstab from the emergency console) to at least get the server back with a usable SSH connection.
At first glance, /srv looks corrupted and unusable.

This is what xfs_repair (dry run) shows:

root@dbstore1001:~# xfs_repair -n -v /dev/mapper/tank-data
Phase 1 - find and verify superblock...
        - block cache size set to 12189560 entries
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
agi unlinked bucket 5 is 3205957 in ag 2 (inode=8593140549)
agi unlinked bucket 6 is 3205766 in ag 2 (inode=8593140358)
agi unlinked bucket 31 is 3205919 in ag 2 (inode=8593140511)
agi unlinked bucket 42 is 3205674 in ag 2 (inode=8593140266)
sb_icount 208192, counted 219712
sb_ifree 7249, counted 1968
sb_fdblocks 382288583, counted 344797319
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 9
        - agno = 10
        - agno = 7
        - agno = 8
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 8593140266, would move to lost+found
disconnected inode 8593140357, would move to lost+found
disconnected inode 8593140358, would move to lost+found
disconnected inode 8593140511, would move to lost+found
disconnected inode 8593140549, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 8593140266 nlinks from 0 to 1
would have reset inode 8593140357 nlinks from 0 to 1
would have reset inode 8593140358 nlinks from 0 to 1
would have reset inode 8593140511 nlinks from 0 to 1
would have reset inode 8593140549 nlinks from 0 to 1
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Tue Feb  6 13:08:35 2018

Phase           Start           End             Duration
Phase 1:        02/06 13:08:06  02/06 13:08:06
Phase 2:        02/06 13:08:06  02/06 13:08:08  2 seconds
Phase 3:        02/06 13:08:08  02/06 13:08:34  26 seconds
Phase 4:        02/06 13:08:34  02/06 13:08:34
Phase 5:        Skipped
Phase 6:        02/06 13:08:34  02/06 13:08:35  1 second
Phase 7:        02/06 13:08:35  02/06 13:08:35

Total run time: 29 seconds
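
To gauge how far the superblock counters drifted from what xfs_repair counted, the "sb_* N, counted M" lines can be compared directly. This is a sketch only: the helper names are illustrative, and the 4096-byte block size used for the conversion is an assumption that the output above does not confirm.

```python
import re

# Matches lines such as "sb_icount 208192, counted 219712".
COUNTER_RE = re.compile(r"^(sb_\w+) (\d+), counted (\d+)$")

REPAIR_OUTPUT = """\
sb_icount 208192, counted 219712
sb_ifree 7249, counted 1968
sb_fdblocks 382288583, counted 344797319
"""


def counter_deltas(text):
    """Return counted-minus-recorded for each superblock counter."""
    deltas = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            name, recorded, counted = match.groups()
            deltas[name] = int(counted) - int(recorded)
    return deltas


def blocks_to_gib(blocks, block_size=4096):
    """Convert a block count to GiB; block_size is an assumed value."""
    return blocks * block_size / 2**30


deltas = counter_deltas(REPAIR_OUTPUT)
print(deltas)
print(round(blocks_to_gib(abs(deltas["sb_fdblocks"])), 1), "GiB (assuming 4 KiB blocks)")
```

The negative sb_fdblocks delta means the superblock claimed more free blocks than were actually found, which is consistent with writes that never reached the log.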

The filesystem can be mounted read-only, but only if log recovery is skipped:

root@dbstore1001:~# mount -o ro -o norecovery -n /dev/mapper/tank-data /srv -t xfs
root@dbstore1001:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T  9.6T  1.3T  89% /srv
root@dbstore1001:~# ls -lh /srv/sqldata/ib_logfile0
-rw-rw---- 1 mysql mysql 4.0G Feb  6 12:06 /srv/sqldata/ib_logfile0
root@dbstore1001:~# umount /srv
root@dbstore1001:~#
root@dbstore1001:~# dmesg | tail -n2
[ 1323.022130] XFS (dm-0): Mounting V4 filesystem in no-recovery mode. Filesystem will be inconsistent.
[ 1345.455632] XFS (dm-0): Unmounting Filesystem

• Marostegui updated the task description. · Feb 6 2018, 1:27 PM

xfs_repair was run, and /srv can now be mounted.
A few manual test writes succeeded.
I have started MySQL to see how it copes with recovery...

MySQL won't start:

InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 23926.
InnoDB: You may have to recover from a backup.
2018-02-06 13:51:56 7f58983fc700 InnoDB: Page dump in ascii and hex (16384 bytes):
 len 16384; hex 46ad899500005d7600015c0300009f4e00003c4a726d23dd45bf000000000000000000021379009d3d8f84b5072418122ba70005000002d900000012860e324100000000000000068d7300000000000000000000000000000000000000000100022949696e66696d756d0004000b000073757072656d756d2000101f2e002dd37e12009a2420001827a9002dd9a712dd4ea9200020027d002dd89312e7b724200028307f002dd96812e7c54220003000f7002dde2f12e69a5d2000380068002dd6f712011945200040000d002dd92e120119d020004836b1002dd92e120119d12000500666002dd5e112011a32200058ffcc002dd68c12011a8f2000600333002ddccc12011d14200068000d002ddcde120121dc200070ffe6002ddcde120121dd2000780186002de0ac1201229320008002b1002dd59c12e0e2962000883bec002ddfc812e59371270090000d002dd97912dfac172000983bdf002dd97912e1161d2000a00ddd002dd9a81204c77b2000a81e9f002dd9a712e28ea12400b0027d002dd9a712f98cff0800b808bc002ddb7112e616520000c0028a002dddfa12e614f40000c833b2002dde2f12eced252000d0035a002dd97912ee23312400d82d32002ddb6612ee289c2000e0012b002dd9a712eddb0a2000e832ae002ddc69120a110d2700f0292f002dd93f12e9ee9c2000f8000d002dddb0120d597720010037c2002dddb0120d59780001081ccb002dd69413030b6e200110227b002ddd4f120da61a2001182cfe002ddda812f249f12001203ace002dd006120efb2e200128fff3002dd006120efb892001302698002dd006120f006f200138008f002dd93f12e9ef082001400bef002dd968120f48dc200148000d002dd08a120fa910200150000d002dd08a120fa9112001582eab002dd08a120fa912240160ff8b002ddda812ee495d200168ff71002dd1c31217e1972001700374002dde8912f8ef490001782546002dd74512f95ed9250180370c002dd92d12188d33240188300a002dd9a712f1841124019026a5002dd93f12f24e1a200198ffe6002dd9a712ef94c22001a006a7002dd9f0121b99600001a8ffbf002dd89312f95c972001b032e2002dd0e3121c18ba2001b82365002dd3b9121c77222001c026e



<snip>


 $    -     / $  w -     "'$  @ -     k $    -       $    -     T $    -    =Q $    -    =R$$    -    =S $    -    =T $    -    =U $    -    =V$$    -    =W $    -    =X %    -    =Y %    -    =Z$%    -    =[ %    -    =\ %    -    =] %(   -    =^$%0   -    =_ %8 / -       %@   - E     %H   -     [ %P   -    O  %X 1 -     }$%` F -    b  %h w -    E  %p | - y     %x   -       %    -       %  c -     O %  G -    n  %    -    .  %    -                      p   L       | H       x D       t @       p <       l 8       h 4       d 0       ` ,       \ (       X $       T         P         L       | H       x D       t @       p <       l 8                                                                                                         p/ 7  n -     ":T9 #   9  -  : /51 : :{; '<-   7 " 6^  2 !  Y  6 :: n1J      #     2 +    t)D7   : *       ; 8L   D C    = %  )#S2   -T6D)^  - ;e;>6x {,P1 7  p/ & +    M    :  a,      P  '     ( 3l   2  (   '}6*5  }" , :n  3   =`:  *' ( & 9)  96!$#     ) ; .  6  ;r8 ;1.  . _&8; $  6   H$    < 8  #,! *|&  j8    c   t> ;^;
InnoDB: End of page dump
2018-02-06 13:51:56 7f58983fc700 InnoDB: uncompressed page, stored checksum in field1 1185778069, calculated checksums for field1: crc32 300304464, innodb 2912235645, none 3735928559, stored checksum in field2 3303769716, calculated checksums for field2: crc32 300304464, innodb 2162160544, none 3735928559, page LSN 15434 1919755229, low 4 bytes of LSN at page end 1050032990, page number (if stored to page already) 23926, space id (if created with >= MySQL-4.1.1 and stored already) 136057
InnoDB: Page may be an index page where index id is 429427
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 23926.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.
2018-02-06 13:51:56 7f58983fc700  InnoDB: Assertion failure in thread 140018488166144 in file buf0buf.cc line 4511
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
180206 13:51:56 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.0.32-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=252
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 684402 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
mysys/stacktrace.c:268(my_print_stacktrace)[0xbc6b6e]
sql/signal_handler.cc:159(handle_fatal_signal)[0x73961f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f616fcff890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f616e65c067]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f616e65d448]
buf/buf0buf.cc:4511(buf_page_io_complete(buf_page_t*))[0xaac02b]
fil/fil0fil.cc:5820(fil_aio_wait(unsigned long))[0xaf6de6]
srv/srv0start.cc:536(io_handler_thread)[0xa46055]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8064)[0x7f616fcf8064]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f616e70f62d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
180206 13:51:58 mysqld_safe mysqld from pid file /srv/sqldata/dbstore1001.pid ended
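
The first 38 bytes of the page dump above can be decoded to cross-check the values InnoDB printed. The sketch below assumes the standard InnoDB FIL header layout (checksum, page number, previous page, next page, LSN, page type, flush LSN, space id, all big-endian); the hex string is copied from the start of the dump and grouped by field, and the names `FIL_HEADER` and `decode_fil_header` are illustrative.

```python
import struct

# First 38 bytes of the page dump, grouped by assumed FIL header field.
HEADER_HEX = (
    "46ad8995"          # stored checksum (field1)
    "00005d76"          # page number
    "00015c03"          # previous page
    "00009f4e"          # next page
    "00003c4a726d23dd"  # page LSN
    "45bf"              # page type
    "0000000000000000"  # flush LSN
    "00021379"          # space id
)

# Big-endian, no padding: 4 + 4 + 4 + 4 + 8 + 2 + 8 + 4 = 38 bytes.
FIL_HEADER = struct.Struct(">IIIIQHQI")

# Value reported in the log as "low 4 bytes of LSN at page end".
TRAILER_LSN_LOW = 1050032990


def decode_fil_header(hex_text):
    """Unpack the FIL header fields from a hex string."""
    raw = bytes.fromhex(hex_text)
    checksum, page_no, prev_page, next_page, lsn, page_type, _flush, space_id = (
        FIL_HEADER.unpack(raw[:FIL_HEADER.size])
    )
    return {
        "checksum": checksum,
        "page_number": page_no,
        "prev_page": prev_page,
        "next_page": next_page,
        "lsn_high": lsn >> 32,
        "lsn_low": lsn & 0xFFFFFFFF,
        "page_type": page_type,
        "space_id": space_id,
    }


header = decode_fil_header(HEADER_HEX)
print(header)
print("header/trailer LSN match:", header["lsn_low"] == TRAILER_LSN_LOW)
```

The decoded checksum, page number, LSN and space id agree with the values in the log line, and page type 17855 corresponds to an index page. The low 32 bits of the header LSN differ from the value at the end of the page, which is what one would expect from a partially written page.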

jcrespo subscribed. · Feb 6 2018, 2:35 PM

• Marostegui moved this task from Triage to In progress on the DBA board. · Feb 7 2018, 10:37 AM

Claiming it for cleanup purposes only.

Change 408812 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Remove dbstore1001 role

https://gerrit.wikimedia.org/r/408812

gerritbot added a project: Patch-For-Review. · Feb 7 2018, 2:33 PM

jcrespo mentioned this in T159430: convert dbstore1001 to multi-instance + InnoDB compressed by importing db shards to it. · Feb 7 2018, 2:38 PM

@Marostegui, let me know what you think of the plan:

1. Deploy the above patch
2. Move current backup files to dbstore2001
3. Reimage and format all partitions of dbstore1001, including a stretch upgrade
4. In parallel, work towards the goal in T184697 and its related tickets.

Change 408812 merged by Jcrespo:
[operations/puppet@production] mariadb: Remove dbstore1001 role

https://gerrit.wikimedia.org/r/408812

In T186596#3952677, @jcrespo wrote:

@Marostegui, let me know what you think of the plan:

1. Deploy the above patch
2. Move current backup files to dbstore2001
3. Reimage and format all partitions of dbstore1001, including a stretch upgrade
4. In parallel, try to do the goal: T184697 and its related tickets.

Agreed. The only thing I would add to this plan is to double-check with Chris whether we can upgrade the BIOS and/or the controller's firmware, because I guess it has not been done in a long time. It is not super important, but probably worth doing before this server is back in production.

FYI, I am currently on step 2, moving the current backup files (/srv/backups/_mysqldump) to dbstore2001. I will take a break and wait for the copy to complete tomorrow.

@Cmjohnson Please tell me when you are available for a BIOS/firmware/etc. upgrade. We may want to rebuild the RAID as we did some time ago, although I can handle that myself.

I will not be available this week. Let's circle back to this mid-week next week, please.

• Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Feb 7 2018, 10:35 PM

I'm setting this to normal priority in my dc-ops triaging, as it doesn't seem to be prioritized above other work, but just is in the normal queue. If anyone involved disagrees, feel free to change it, but it seems bad to leave it unprioritized against the rest of the on-site tasks.

jcrespo raised the priority of this task from Medium to High. · Feb 8 2018, 7:13 PM

This is high for us DBAs, not high for dc-ops (but we cannot express that difference).

Change 412938 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Remove s3 from dbstore2001 to save space

https://gerrit.wikimedia.org/r/412938

Change 412938 merged by Jcrespo:
[operations/puppet@production] mariadb: Remove s3 from dbstore2001 to save space

https://gerrit.wikimedia.org/r/412938

Mentioned in SAL (#wikimedia-operations) [2018-02-27T12:35:39Z] <marostegui> Remove /srv/tmp/dbstore1001 files from es1017 to free up space - T186596

• Marostegui moved this task from In progress to Pending comment on the DBA board. · Feb 27 2018, 2:24 PM

jcrespo removed jcrespo as the assignee of this task. · Mar 6 2018, 4:56 PM

We have copied away everything we needed to keep; we are blocked on DC ops to do the full reset (firmware/BIOS upgrade, RAID reset and reimage). The host is set as spare, so it shouldn't alert on anything critical when done.

jcrespo removed a project: Patch-For-Review. · Mar 6 2018, 5:01 PM

• Cmjohnson moved this task from Up next to High Priority Task on the ops-eqiad board. · Mar 14 2018, 7:10 PM

Change 419727 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] prometheus-mysql-exporter: Reflect latest m2 changes, remove dbstore1001

https://gerrit.wikimedia.org/r/419727

gerritbot added a project: Patch-For-Review. · Mar 15 2018, 12:30 PM

Change 419727 merged by Jcrespo:
[operations/puppet@production] prometheus-mysql-exporter: Reflect latest m2 changes, remove dbstore1001

https://gerrit.wikimedia.org/r/419727

Change 420767 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] dblists: Remove dbstore1001 for the list of hosts

https://gerrit.wikimedia.org/r/420767

Change 420767 merged by Jcrespo:
[operations/software@master] dblists: Remove dbstore1001 for the list of hosts

https://gerrit.wikimedia.org/r/420767

Change 423936 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbstore1001: Disable notifications

https://gerrit.wikimedia.org/r/423936

Change 423936 merged by Jcrespo:
[operations/puppet@production] dbstore1001: Disable notifications

https://gerrit.wikimedia.org/r/423936

Change 423942 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Prepare dbstore1001 for stretch reimage

https://gerrit.wikimedia.org/r/423942

Change 423942 merged by Jcrespo:
[operations/puppet@production] mariadb: Prepare dbstore1001 for stretch reimage

https://gerrit.wikimedia.org/r/423942

This is blocked on @Cmjohnson having a gap for the firmware and BIOS upgrade plus RAID rebuild, as requested in T186596#3953599.

The host is fully depooled and empty of useful data, and notifications have been disabled; it will be reimaged when this is done. There is no need to wait for us as with other db hosts: dbstore1001 as it is now can be shut down uncleanly with no danger.

I started the iDRAC firmware update; it is taking 5+ minutes. When done, it should show version 2.52 for the iDRAC firmware.

I neglected to note the old BIOS and iDRAC versions, but both are now updated to their latest versions.

The H710 Mini controller had firmware 21.2.0-0007 and has been upgraded to 21.3.5-0002.

@jcrespo: Did you want to handle the RAID rebuild? I'm not exactly sure what you want (just to wipe it all out and start over?).

The system is now booted back into the OS and is sitting idle. The latest iDRAC, BIOS, and RAID controller firmware updates have been applied.

Don't worry, I can boot into RAID manager and do it myself. Thanks!

I moved you into "Blocked" because I don't see a better option (unlike us, you do not have an "All is done on our side but the task cannot be resolved" column).

Change 424297 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbstore1001: Set puppet role as mariadb-backups

https://gerrit.wikimedia.org/r/424297

Change 424297 merged by Jcrespo:
[operations/puppet@production] dbstore1001: Set puppet role as mariadb-backups

https://gerrit.wikimedia.org/r/424297

dbstore1001 is back in use. Thanks to everyone who helped upgrade and recover it. As a reminder (I made this mistake): every time we destroy or change a RAID, we have to remember to change the default boot device to "C:" rather than network. Happily, this host was scheduled for reimage anyway.

The host will be destined for backups, but right now it will only make them for a few misc hosts. Further setup will be done on separate hosts.

Change 425086 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbstore: Reenable alerts for dbstore1001 after reset

https://gerrit.wikimedia.org/r/425086

Change 425086 merged by Jcrespo:
[operations/puppet@production] dbstore: Reenable alerts for dbstore1001 after reset