
labsdb1001 crashed - storage issue
Closed, Resolved · Public

Description

Hello,

labsdb1001 crashed due to storage malfunction:

[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 03:30:31 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 03:30:31 2017] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Nov  1 03:30:31 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 03:30:31 2017] Sense Key : Medium Error [current]
[Wed Nov  1 03:30:31 2017] Info fld=0x10327bf0
[Wed Nov  1 03:30:31 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 03:30:31 2017] Add. Sense: Unrecovered read error
[Wed Nov  1 03:30:31 2017] sd 7:0:2:0: [sdd] CDB:
[Wed Nov  1 03:30:31 2017] Read(10): 28 00 15 fa 6e 60 00 00 20 00
[Wed Nov  1 03:30:31 2017] end_request: critical medium error, dev sdd, sector 368733792
[Wed Nov  1 06:26:29 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 06:26:29 2017] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Nov  1 06:26:29 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 06:26:29 2017] Sense Key : Medium Error [current]
[Wed Nov  1 06:26:29 2017] Info fld=0x35dff370
[Wed Nov  1 06:26:29 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 06:26:29 2017] Add. Sense: Unrecovered read error
[Wed Nov  1 06:26:29 2017] sd 7:0:2:0: [sdd] CDB:
[Wed Nov  1 06:26:29 2017] Read(10): 28 00 35 df f3 60 00 00 20 00
[Wed Nov  1 06:26:29 2017] end_request: critical medium error, dev sdd, sector 903869296
[Wed Nov  1 06:28:44 2017] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Nov  1 06:28:44 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 06:28:44 2017] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Nov  1 06:28:44 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 06:28:44 2017] Sense Key : Medium Error [current]
[Wed Nov  1 06:28:44 2017] Info fld=0x35dff390
[Wed Nov  1 06:28:44 2017] sd 7:0:2:0: [sdd]
[Wed Nov  1 06:28:44 2017] Add. Sense: Unrecovered read error
[Wed Nov  1 06:28:44 2017] sd 7:0:2:0: [sdd] CDB:
[Wed Nov  1 06:28:44 2017] Read(10): 28 00 35 df f3 60 00 00 20 00
[Wed Nov  1 06:28:44 2017] end_request: critical medium error, dev sdd, sector 903869280

And it crashed MySQL (twice: once at 3:30 with the first storage error and once at 6:26 with the second):

2017-11-01 03:30:23 7f748d115700  InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
171101  3:30:23 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
pure virtual method called
pure virtual method called
pure virtual method called
terminate called recursively
pure virtual method called
terminate called recursively
pure virtual method called
terminate called recursively
terminate called without an active exception
terminate called without an active exception
171101  3:30:23 Aborted
171101 03:30:47 mysqld_safe Number of processes running now: 0
171101 03:30:47 mysqld_safe mysqld restarted
171101  3:30:47 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.22-MariaDB) starting as process 14870 ...




2017-11-01 06:26:21 7fa8297fc700  InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
171101  6:26:21 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
pure virtual method called
terminate called without an active exception
171101  6:26:21 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.0.22-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=67
max_threads=1002
thread_count=47
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2331201 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x7fa85f8e9008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fb3890cae40 thread_stack 0x30000
(my_addr_resolve failure: fork)
/opt/wmf-mariadb10/bin/mysqld(my_print_stacktrace+0x2e) [0xbdd6ee]
/opt/wmf-mariadb10/bin/mysqld(handle_fatal_signal+0x390) [0x73dc40]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fb51589c330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7fb5146b0c37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fb5146b4028]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb514da9535]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6) [0x7fb514da76d6]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703) [0x7fb514da7703]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5f1bf) [0x7fb514da81bf]
/opt/wmf-mariadb10/bin/mysqld(MYSQLparse(THD*)+0x13755) [0x701bf5]
/opt/wmf-mariadb10/bin/mysqld() [0x5c88b9]
/opt/wmf-mariadb10/bin/mysqld(mysql_parse(THD*, char*, unsigned int, Parser_state*)+0x9f) [0x5d992f]
/opt/wmf-mariadb10/bin/mysqld(dispatch_command(enum_server_command, THD*, char*, unsigned int)+0x1864) [0x5db874]
/opt/wmf-mariadb10/bin/mysqld(do_handle_one_connection(THD*)+0x21b) [0x6a585b]
/opt/wmf-mariadb10/bin/mysqld(handle_one_connection+0x39) [0x6a5919]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7fb515894184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb514777ffd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fa83c846020): is an invalid pointer
Connection ID (thread ID): 13
Status: NOT_KILLED

Event Timeline

I am trying to start MySQL, but it is failing on storage, so I think this server is no longer usable; the storage seems really broken:

171101  6:57:13 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Last MySQL binlog file position 0 63622558, file name ./labsdb1002-bin.000368
171101  6:57:59 [Note] InnoDB: 128 rollback segment(s) are active.
171101  6:57:59 [Note] InnoDB: Waiting for purge to start
171101  6:57:59 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.26-74.0 started; log sequence number 87541047055884
Wed Nov  1 06:57:59 2017 TokuFT recovery starting in env /srv/sqldata/
Wed Nov  1 06:57:59 2017 TokuFT recovery scanning backward from 361046384885
Wed Nov  1 06:57:59 2017 TokuFT recovery bw_end_checkpoint at 361046384885 timestamp 1509517547439726 xid 361046376061 (bw_newer)
Wed Nov  1 06:57:59 2017 TokuFT recovery bw_begin_checkpoint at 361046376061 timestamp 1509517546277097 (bw_between)
Wed Nov  1 06:57:59 2017 TokuFT recovery turning around at begin checkpoint 361046376061 time 1162629
Wed Nov  1 06:57:59 2017 TokuFT recovery starts scanning forward to 361046384885 from 361046376061 left 8824 (fw_between)
171101  6:58:01 [ERROR] InnoDB: Tried to read 16384 bytes at offset 416573669376. Was only able to read 0.
2017-11-01 06:58:01 7fcc3a3fc700  InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.

Change 387772 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] Revert "Revert "labsdb: Switch dns for labsdb1001 to labsdb1003""

https://gerrit.wikimedia.org/r/387772

Change 387772 merged by Madhuvishy:
[operations/puppet@production] Revert "Revert "labsdb: Switch dns for labsdb1001 to labsdb1003""

https://gerrit.wikimedia.org/r/387772

btw, the RAID keeps saying Optimal :-)

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
Number of Spans: 6
Span: 0 - Number of PDs: 2
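
For reference, virtual drive output like the above on a PERC H800 is normally pulled with the LSI MegaCLI tool; a hedged example (the exact binary name and path vary by install):

  sudo megacli -LDInfo -Lall -aALL   # virtual drive state (the "Optimal" above) -- this only covers /dev/sda
  sudo megacli -PDList -aALL         # per-physical-disk state and media error counters

Note that the failed device, /dev/sdd, is one of the directly attached SSDs (mptbase/mptscsih in the kernel log), so the controller reporting Optimal is consistent with the RAID volume itself being healthy.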

Mentioned in SAL (#wikimedia-cloud) [2017-11-01T07:11:26Z] <madhuvishy> Clear nscd cache across all projects post labsdb dns switchover T179464
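
For context, the nscd cache clear referenced above is typically a one-liner run on each affected host (how it was fanned out across projects is not shown here):

  sudo nscd -i hosts   # invalidate the cached hosts database so clients resolve the new labsdb target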

I have disabled notifications and downtimed labsdb1001

Disk setup for labsdb1001

  • /dev/sda -> 3.271TB after Hardware RAID 10 (H800, External shelf, 12 Disks 558.911 GB each)
  • /dev/sd[b,c,d,e,f,g,h,i] -> 8 SSDs, 447.1G each (sdb holds the OS; sdc-sdi are LVM PVs in the tank VG)
madhuvishy@labsdb1001:~$ lsblk -io KNAME,TYPE,SIZE,MODEL
KNAME TYPE   SIZE MODEL
sda   disk   3.3T PERC H800
dm-0  lvm    4.5T
sdb   disk 447.1G SanDisk SDSSDX48
sdb1  part 421.1G
sdb2  part   7.5G
sdb3  part  18.6G
sdc   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sdd   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sde   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sdf   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sdg   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sdh   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sdi   disk 447.1G SanDisk SDSSDX48
dm-0  lvm    4.5T
sr0   rom   1024M CDDVDW TS-L633C

LVM Setup:

madhuvishy@labsdb1001:~$ sudo pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda   tank lvm2 a--    3.27t 1.83t
  /dev/sdc   tank lvm2 a--  447.13g    0
  /dev/sdd   tank lvm2 a--  447.13g    0
  /dev/sde   tank lvm2 a--  447.13g    0
  /dev/sdf   tank lvm2 a--  447.13g    0
  /dev/sdg   tank lvm2 a--  447.13g    0
  /dev/sdh   tank lvm2 a--  447.13g    0
  /dev/sdi   tank lvm2 a--  447.13g    0
madhuvishy@labsdb1001:~$ sudo vgs
  VG   #PV #LV #SN Attr   VSize VFree
  tank   8   1   0 wz--a- 6.33t 1.83t
madhuvishy@labsdb1001:~$ sudo lvs
  LV   VG   Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
  data tank -wi-ao--- 4.50t
madhuvishy@labsdb1001:~$ df -Th
Filesystem            Type      Size  Used Avail Use% Mounted on
udev                  devtmpfs   95G   12K   95G   1% /dev
tmpfs                 tmpfs      19G  944K   19G   1% /run
/dev/sdb1             ext4      415G   59G  336G  15% /
none                  tmpfs     4.0K     0  4.0K   0% /sys/fs/cgroup
none                  tmpfs     5.0M     0  5.0M   0% /run/lock
none                  tmpfs      95G     0   95G   0% /run/shm
none                  tmpfs     100M     0  100M   0% /run/user
/dev/sdb3             ext4       19G   44M   18G   1% /tmp
/dev/mapper/tank-data xfs       4.5T  3.9T  683G  86% /srv

Any chance of recovering the data?

We are working right now to see how many bad blocks/sectors there are on the disk that failed. If the error count is reasonable, we will attempt to use ddrescue to copy the disk to another location. There is some chance that this will succeed, but it is not a high probability. We should know by sometime tomorrow one way or the other if any recovery is possible. This server will not be put back into full usage, and we will not be attempting to migrate user created databases from it to c3.labsdb. Our best case scenario is that we will be able to start the MySQL service back up and allow users to connect to c1.labsdb to look for critical data that they had not backed up yet.
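
A minimal sketch of the kind of GNU ddrescue run being considered (the destination device and map file path are hypothetical, not the actual plan):

  # First pass: copy everything readable from the failing SSD, skipping the slow scraping phase.
  sudo ddrescue -n /dev/sdd /dev/sdX /root/sdd-rescue.map
  # Second pass: retry the bad areas a few times, using the map file to resume where the first pass left off.
  sudo ddrescue -r3 /dev/sdd /dev/sdX /root/sdd-rescue.map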

Both labsdb1001 and labsdb1003 suffer from some fatal flaws in their storage configuration. Neither is using RAID to provide resiliency to single or multiple disk hardware failures. This is one of many technical flaws that the DBA team has attempted to address with the new *.{analytics,web}.db.svc.eqiad.wmflabs cluster.

It looks like it may be time to say goodbye to this server. I've spent some time today looking at the state of the storage configuration, the damage, and whether any recovery of the disk might be possible at all.

Our initial instinct yesterday was that this disk was part of a RAID, and that we might be able to fail it and recover. Looking more closely today, I found that the underlying storage for MySQL is composed of a 3 TB RAID 10 volume and 7 SSDs. The disk that failed, /dev/sdd, is one of the SSDs and is not part of the RAID. This makes it really hard to replace or recover the disk in any way that maintains the integrity of the data. (See T179464#3727139 for more info on the setup.)

Looking at the damage to the disk with badblocks, 120 bad blocks surfaced in this test: https://phabricator.wikimedia.org/P6241.
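
For reference, a non-destructive scan of the sort used here looks roughly like this (exact options are an assumption; the linked paste has the real output):

  sudo badblocks -sv /dev/sdd   # read-only by default; -s shows progress, -v reports each unreadable block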

One recommendation was to use ddrescue to recover the drive onto another SSD, but this would require shuffling the physical extents in the LVM around quite a bit: removing an SSD from the LVM, attempting to rescue onto it, and then switching out the bad disk with more shuffling (sketched below). First, ddrescue may not be able to recover the bad blocks at all. Second, we might end up causing more damage in the process, given that these disks are already old and failing. Third, even if it works, the MySQL .ibd files will still be corrupted, and we won't be able to get the MySQL server up and running to actually let users access any data that may have been lost in the user tables.
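
A rough sketch of that LVM shuffling, under the assumption that enough free extents exist to empty one SSD first; this is illustrative, not a tested or recommended procedure:

  sudo pvmove /dev/sdi                            # migrate sdi's extents elsewhere in the tank VG (needs free PEs)
  sudo vgreduce tank /dev/sdi                     # pull the now-empty SSD out of the volume group
  sudo ddrescue /dev/sdd /dev/sdi /root/sdd.map   # attempt to rescue the failing PV onto it
  # ...followed by swapping sdi in for sdd (e.g. pvcreate --uuid plus vgcfgrestore), which is
  # exactly the extra round of risky shuffling mentioned above.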

This server was also already on its way to planned decommissioning in 5 weeks, and as part of our announcements and planning for the reboot we did anticipate that an irrecoverable disk failure might occur. It looks like we will have to decommission it earlier than expected.

The current status is that all traffic to the *.labsdb shards is being directed to labsdb1003. That server is also scheduled for decommissioning in 5 weeks, and it is now under higher load. User tables that only existed on labsdb1001 have been lost.

Hi,

I have managed to bring MySQL up in a read-only state (with innodb_force_recovery=2).
Now I can select stuff and see some data.
If you guys need to recover some data, we can try now to see what we can get, but we shouldn't assume the data is 100% available.
I also don't know how long MySQL will stay up before crashing or dying again.
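
For context, bringing the server up read-only with forced recovery typically looks something like the following; the exact mechanism used here (command-line options vs. a my.cnf change) is an assumption:

  # innodb_force_recovery=2 keeps the master/purge threads from running; read_only blocks writes
  # while data is being salvaged. The mysqld_safe path is inferred from the crash log above.
  sudo /opt/wmf-mariadb10/bin/mysqld_safe --innodb_force_recovery=2 --read_only=1 &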

Mentioned in SAL (#wikimedia-operations) [2017-11-02T07:59:13Z] <marostegui> Stop mysql event scheduler on labsdb1001 - T179464

bd808 assigned this task to Marostegui.
bd808 removed a project: Patch-For-Review.

I announced the current precarious read-only state at https://lists.wikimedia.org/pipermail/cloud-announce/2017-November/000008.html

Thank you @Marostegui and @madhuvishy for your work to give our users another chance to salvage their data. I'm going to close this as "resolved". If the server crashes again we should probably reopen this ticket instead of making another one.

labsdb1001 is no longer available, not even in read-only mode.
The storage has definitely given up.

Follow-up: T142807. There is nothing else we can do.