Degraded RAID on cp1008
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host cp1008. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0
Personalities : [raid1] 
md0 : active raid1 sda1[0](F) sdb1[1]
      9756672 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
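
For readers unfamiliar with the mdstat format: in the snapshot above, `[2/1]` means the array expects two members and has one active, and the `_` in `[_U]` marks the missing member. The sketch below shows one way such a check could be scripted. The function name `degraded_arrays` and the `SAMPLE` string are illustrative assumptions, not the code of the actual Icinga RAID handler, and the regex only handles arrays named `mdN`.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with a missing member.

    Hypothetical helper: assumes the usual /proc/mdstat layout, where an
    "mdN : ..." header line is followed by a status line ending in
    something like "[2/1] [_U]".
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # "[expected/active] [member map]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded[current] = status.group(3)
            current = None
    return degraded


# Snapshot copied from this task's description.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sda1[0](F) sdb1[1]
      9756672 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live host the same function could be fed the contents of `/proc/mdstat` instead of `SAMPLE`.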

Event Timeline

Looks like sda is dead:

[Wed Jul 19 07:51:28 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40780)
[Wed Jul 19 07:51:28 2017] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 00 c8 56 18 00 01 40 00
[Wed Jul 19 07:51:30 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85c40780)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40c00)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 c4 63 d8 00 00 08 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85c40c00)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40d80)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 5e 51 b8 00 00 08 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85c40d80)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85234300)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 04 62 d8 00 00 08 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85234300)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37cedac0)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#100 CDB: Write(10) 2a 00 00 6e 58 58 00 01 20 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37cedac0)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37cedc40)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#99 CDB: Write(10) 2a 00 00 6e 57 b8 00 00 90 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37cedc40)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37ced040)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#16 CDB: Write(10) 2a 00 00 6e 54 00 00 03 b0 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37ced040)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37ceddc0)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#98 CDB: Write(10) 2a 00 00 6e 50 00 00 04 00 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37ceddc0)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37ced4c0)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#97 CDB: Write(10) 2a 00 00 6e 4c 00 00 04 00 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37ced4c0)
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e87721c00)
[Wed Jul 19 07:51:30 2017] sd 0:0:0:0: [sda] tag#96 CDB: Write(10) 2a 00 00 6e 48 00 00 04 00 00
[Wed Jul 19 07:51:30 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e87721c00)
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#12 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
[Wed Jul 19 07:52:11 2017] blk_update_request: I/O error, dev sda, sector 2056
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e87721c00)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 6e 48 00 00 04 00 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e87721c00)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37ced4c0)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 6e 4c 00 00 04 00 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37ced4c0)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37ceddc0)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 6e 50 00 00 04 00 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37ceddc0)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37ced040)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 00 6e 54 00 00 03 b0 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37ced040)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37cedc40)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 00 6e 57 b8 00 00 90 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37cedc40)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1d37cedac0)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#5 CDB: Write(10) 2a 00 00 6e 58 58 00 01 20 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1d37cedac0)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85234300)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 04 62 d8 00 00 08 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85234300)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40d80)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 00 5e 51 b8 00 00 08 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85c40d80)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40c00)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 00 c4 63 d8 00 00 08 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85c40c00)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40780)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 00 c8 56 18 00 01 40 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e85c40780)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e87796540)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 00 85 7b b0 00 00 d0 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e87796540)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e87f47780)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#11 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff9f1e87f47780)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting target reset! (sc=ffff9f1e87721c00)
[Wed Jul 19 07:52:11 2017] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 6e 48 00 00 04 00 00
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: target reset: SUCCESS (sc=ffff9f1e87721c00)
[Wed Jul 19 07:52:11 2017] mptscsih: ioc0: attempting host reset! (sc=ffff9f1e87f47780)
[Wed Jul 19 07:53:08 2017] mptscsih: ioc0: host reset: SUCCESS (sc=ffff9f1e87f47780)
[Wed Jul 19 07:53:08 2017] mptbase: ioc0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501) cb_idx mptbase_reply
[Wed Jul 19 07:53:08 2017]  end_device-0:0: mptsas: ioc0: removing sata device: fw_channel 0, fw_id 0, phy 0,sas_addr 0x1221000000000000
[Wed Jul 19 07:53:08 2017]  phy-0:0: mptsas: ioc0: delete phy 0, phy-obj (0xffff9f1e854fa400)
[Wed Jul 19 07:53:08 2017]  port-0:0: mptsas: ioc0: delete port 0, sas_addr (0x1221000000000000)
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: Device offlined - not ready after error recovery
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 6e 48 00 00 04 00 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 7227392
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 6e 4c 00 00 04 00 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 7228416
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 00 6e 50 00 00 04 00 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 7229440
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 00 6e 54 00 00 03 b0 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 7230464
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 2056
[Wed Jul 19 07:53:18 2017] md: super_written gets error=-5
[Wed Jul 19 07:53:18 2017] md/raid1:md0: Disk failure on sda1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 2128
[Wed Jul 19 07:53:18 2017] md: super_written gets error=-5
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#5 CDB: Write(10) 2a 00 00 6e 57 b8 00 00 90 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 7231416
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 00 6e 58 58 00 01 20 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 7231576
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 00 04 62 d8 00 00 08 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 287448
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 00 5e 51 b8 00 00 08 00
[Wed Jul 19 07:53:18 2017] blk_update_request: I/O error, dev sda, sector 6181304
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 00 c4 63 d8 00 00 08 00
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 00 c8 56 18 00 01 40 00
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[Wed Jul 19 07:53:18 2017] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[Wed Jul 19 07:53:18 2017] scsi target0:0:0: mptsas: ioc0: delete device: fw_channel 0, fw_id 0, phy 0, sas_addr 0x1221000000000000
[Wed Jul 19 07:53:18 2017] RAID1 conf printout:
[Wed Jul 19 07:53:18 2017]  --- wd:1 rd:2
[Wed Jul 19 07:53:18 2017]  disk 0, wo:1, o:0, dev:sda1
[Wed Jul 19 07:53:18 2017]  disk 1, wo:0, o:1, dev:sdb1
[Wed Jul 19 07:53:18 2017] RAID1 conf printout:
[Wed Jul 19 07:53:18 2017]  --- wd:1 rd:2
[Wed Jul 19 07:53:18 2017]  disk 1, wo:0, o:1, dev:sdb1
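
As a reading aid for the log above: each `CDB:` line is the raw 10-byte SCSI command. For Read(10) and Write(10), byte 0 is the opcode (`0x28` or `0x2a`), bytes 2-5 are the big-endian logical block address, and bytes 7-8 are the transfer length in blocks. The decoded address lines up with the `sector` number in the matching `blk_update_request` line, which suggests 512-byte logical blocks on this drive. The following is a minimal sketch; `decode_rw10` is a hypothetical helper name, not an existing tool.

```python
def decode_rw10(cdb_hex):
    """Decode a Read(10)/Write(10) CDB given as space-separated hex bytes.

    Returns (operation, lba, transfer_length_in_blocks).
    Hypothetical helper for reading the dmesg output in this task.
    """
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(raw) != 10 or raw[0] not in (0x28, 0x2A):
        raise ValueError("not a Read(10)/Write(10) CDB")
    op = "Read(10)" if raw[0] == 0x28 else "Write(10)"
    lba = int.from_bytes(raw[2:6], "big")      # bytes 2-5: logical block address
    length = int.from_bytes(raw[7:9], "big")   # bytes 7-8: transfer length
    return op, lba, length


# Two CDBs copied from the log above.
print(decode_rw10("2a 00 00 6e 48 00 00 04 00 00"))  # failed write, logged as sector 7227392
print(decode_rw10("28 00 00 00 08 08 00 00 08 00"))  # failed read, logged as sector 2056
```

The first write decodes to LBA 7227392 with a length of 1024 blocks, so the next queued write starts at 7228416, which is the next failing sector reported in the log.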
ema triaged this task as Medium priority. Jul 19 2017, 8:05 AM
ema added a project: Traffic.

@Cmjohnson please replace the disk (sda) whenever you've got the chance!

@ema is it okay to take this down? Most of the time the server needs a reinstall after swapping /dev/sda. Will this be okay?

@Cmjohnson: yes. The machine does not serve user traffic so there's no need to depool it. Please go ahead any time. Thanks!

@ema can you verify the host name for me, please? cp1008 was decommissioned a long time ago.

It was decommed a long time ago, and then I revived it as a quasi-production testing machine for "temporary" use for a little while, and probably poorly documented that, and now "temporary" has stretched on a really really long time. cp1008 is the correct machine.

Mentioned in SAL (#wikimedia-operations) [2017-07-26T07:14:53Z] <ema> cp1008: use sdb only in varnish.service, waiting for Chris to replace sda T171028

Replaced the SSD; needs a reinstall.

Cmjohnson removed a project: ops-eqiad.

@ema I replaced the SSD and reinstalled. All yours! Please resolve once you've confirmed everything is okay.