Degraded RAID on restbase-dev1006
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host restbase-dev1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] [raid0] 
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks
      
md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]
      
md0 : active raid1 sda1[0] sdd1[3] sdc1[2](F) sdb1[1]
      29279232 blocks super 1.2 [4/3] [UU_U]
      
unused devices: <none>
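
The snapshot above can be read mechanically: a member flagged `(F)` has been failed out, and a status such as `[4/3] [UU_U]` means three of four expected members are up. The sketch below is only an illustration of that reading, not the implementation of `get_raid_status_md` (which is not shown in this task); `parse_mdstat` and its regexes are hypothetical names introduced here.

```python
import re

# mdstat text copied from the snapshot in this task.
SAMPLE = """\
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks

md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[0] sdd1[3] sdc1[2](F) sdb1[1]
      29279232 blocks super 1.2 [4/3] [UU_U]

unused devices: <none>
"""

# "md1 : active (auto-read-only) raid1 sda2[0] ..." -> name, state, level, members
ARRAY_RE = re.compile(r"^(md\d+) : (\S+)(?: \([^)]*\))? (\S+) (.+)$")
# "[4/3] [UU_U]" -> expected members, members currently up
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]")


def parse_mdstat(text):
    """Return {array: {level, members, failed, degraded}} (hypothetical helper)."""
    arrays = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = ARRAY_RE.match(line)
        if header:
            name, _state, level, devs = header.groups()
            members = devs.split()
            current = {
                "level": level,
                # "sdc1[2](F)" -> "sdc1"
                "members": [d.split("[", 1)[0] for d in members],
                "failed": [d.split("[", 1)[0] for d in members if d.endswith("(F)")],
                "degraded": False,
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, up = int(status.group(1)), int(status.group(2))
            current["degraded"] = up < expected
    return arrays


arrays = parse_mdstat(SAMPLE)
for name, info in sorted(arrays.items()):
    state = "degraded" if info["degraded"] else "ok"
    print(name, info["level"], state, info["failed"])
```

Summing the non-failed members across the three arrays gives 11, with one failed member, which is consistent with the `Active: 11 ... Failed: 1` counts in the alert line. Note that raid0 lines carry no `[n/m]` status, so md2 can never be reported as degraded by this reading.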

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 22 2018, 3:46 PM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jan 26 2018, 7:10 PM

@Cmjohnson the machine can be shut down at will since it isn't in production. It looks like the Intel SSD failed to respond at some point, and/or the controller didn't like it:

Jan 22 15:44:07 restbase-dev1006 kernel: [944387.663955] ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Jan 22 15:44:07 restbase-dev1006 kernel: [944387.697827] ata3: irq_stat 0x00400040, connection status changed
Jan 22 15:44:07 restbase-dev1006 kernel: [944387.725371] ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
Jan 22 15:44:07 restbase-dev1006 kernel: [944387.751870] ata3: hard resetting link
Jan 22 15:44:08 restbase-dev1006 kernel: [944388.462757] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 22 15:44:08 restbase-dev1006 kernel: [944388.463525] ata3.00: configured for UDMA/133
Jan 22 15:44:08 restbase-dev1006 kernel: [944388.463531] ata3: EH complete
Jan 22 15:44:08 restbase-dev1006 kernel: [944388.506776] ata3.00: Enabling discard_zeroes_data
Jan 22 15:44:08 restbase-dev1006 kernel: [944388.506838] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.147056] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.177631] ata3.00: irq_stat 0x40000001
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.196496] ata3.00: failed command: WRITE DMA
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.217599] ata3.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 22 dma 512 out
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.217599]          res 53/10:00:00:00:00/00:00:00:00:00/00 Emask 0x80 (invalid argument)
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.289331] ata3.00: status: { DRDY SENSE ERR }
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.310760] ata3.00: error: { IDNF }
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.328424] ata3.00: configured for UDMA/133
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.328431] ata3: EH complete
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.338837] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.368908] ata3.00: irq_stat 0x40000001
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.387668] ata3.00: failed command: WRITE DMA
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.408667] ata3.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 24 dma 512 out
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.408667]          res 51/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.480766] ata3.00: status: { DRDY ERR }
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.499225] ata3.00: error: { IDNF }
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.516528] ata3.00: configured for UDMA/133
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.516537] ata3: EH complete
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.526693] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.557224] ata3.00: irq_stat 0x40000001
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.575847] ata3.00: failed command: WRITE DMA
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.596820] ata3.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 27 dma 512 out
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.596820]          res 51/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.669229] ata3.00: status: { DRDY ERR }
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.688157] ata3.00: error: { IDNF }
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.705558] ata3.00: configured for UDMA/133
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.705564] ata3: EH complete
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.714863] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.744616] ata3.00: irq_stat 0x40000001
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.762877] ata3.00: failed command: WRITE DMA
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.783303] ata3.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 29 dma 512 out
Jan 22 15:44:10 restbase-dev1006 kernel: [944390.783303]          res 51/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.855621] ata3.00: status: { DRDY ERR }
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.874988] ata3.00: error: { IDNF }
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.892757] ata3.00: configured for UDMA/133
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.892767] ata3: EH complete
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.894968] ata3.00: Enabling discard_zeroes_data
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.902670] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.932840] ata3.00: irq_stat 0x40000001
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.951433] ata3.00: failed command: WRITE DMA
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.972069] ata3.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 0 dma 512 out
Jan 22 15:44:11 restbase-dev1006 kernel: [944390.972069]          res 51/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.044031] ata3.00: status: { DRDY ERR }
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.062945] ata3.00: error: { IDNF }
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.080506] ata3.00: configured for UDMA/133
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.080515] ata3: EH complete
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.090835] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.120576] ata3.00: irq_stat 0x40000001
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.138842] ata3.00: failed command: WRITE DMA
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.159601] ata3.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 17 dma 512 out
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.159601]          res 51/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.231622] ata3.00: status: { DRDY ERR }
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.250283] ata3.00: error: { IDNF }
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.267666] ata3.00: configured for UDMA/133
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.267676] sd 2:0:0:0: [sdc] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.267678] sd 2:0:0:0: [sdc] tag#17 Sense Key : Illegal Request [current] 
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.267680] sd 2:0:0:0: [sdc] tag#17 Add. Sense: Logical block address out of range
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.267682] sd 2:0:0:0: [sdc] tag#17 CDB: Write(10) 2a 00 00 00 08 08 00 00 01 00
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.267684] blk_update_request: I/O error, dev sdc, sector 2056
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.295041] blk_update_request: I/O error, dev sdc, sector 2056
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.322144] md: super_written gets error=-5
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.322147] md/raid1:md0: Disk failure on sdc1, disabling device.
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.322147] md/raid1:md0: Operation continuing on 3 devices.
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.376339] ata3: EH complete
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.398620] ata3.00: Enabling discard_zeroes_data
Jan 22 15:44:11 restbase-dev1006 kernel: [944391.414778] RAID1 conf printout:
:
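
For anyone cross-checking the log, the failing `Write(10)` CDB and the `blk_update_request` sector refer to the same block. A minimal sketch of the decode, assuming the standard WRITE(10) layout (opcode 0x2A, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8); `decode_write10` is a hypothetical helper name:

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, transfer_length_in_blocks).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    length = int.from_bytes(cdb[7:9], "big")  # number of blocks to write
    return lba, length


# CDB copied from the kernel log in this task.
lba, length = decode_write10("2a 00 00 00 08 08 00 00 01 00")
print(lba, length)
```

The decoded LBA is 0x808 = 2056, matching `I/O error, dev sdc, sector 2056`, and the transfer is a single block. If sdc1 begins at sector 2048 (an assumption; the partition table is not shown in this task), that sector sits 4 KiB into the partition, which is where a version 1.2 md superblock is stored, and that would be consistent with the `md: super_written gets error=-5` line.
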
Dzahn triaged this task as Normal priority. Feb 1 2018, 11:49 PM

A case has been opened with HPE:

Your case was successfully submitted. Please note your Case ID: 5326748362 for future reference.

Because of the AHCI configuration, the hardware does not show up in the standard log, and HP has no way of confirming the SSD type.

After more review, it turns out these servers are using the old Intel S3610 SSDs. I will need to check with @faidon and @RobH about ordering another one.

faidon assigned this task to RobH. Feb 12 2018, 9:21 AM
RobH reassigned this task from RobH to Cmjohnson. Feb 14 2018, 5:45 PM

So lshw is simply locking up when reading SCSI on the machine, and won't output the disk model/capacity.

Chris, can you pull the defective SSD, sdc (it appears to be the one that failed), and give us the model and capacity info?

I don't want to pull this info off another restbase-dev, since they may be running different SSDs (they aren't standard systems.)

robh: INTEL SSD DC S3610 Series 800GB, model ssdsc2bx800g4

RobH changed the task status from Open to Stalled. Feb 14 2018, 8:02 PM

The replacement SSD order is now pending on sub-task T187369.

This is stalled until the replacement is ordered and onsite.

The SSD for /dev/sdc has been replaced. The RAID needs to be fixed. Resolve the ticket once you're satisfied.

Eevans added a subscriber: Eevans. Mar 6 2018, 8:50 PM

I removed sdc1, sdc2, and sdc3 from md0, md1, and md2 respectively, and rebooted, believing that might be the easiest way to correct the device ordering (the new drive showed as sde). Instead, the machine didn't come back up (and I don't have console access).

Eevans added a comment. Mar 6 2018, 8:52 PM

To be clear: It pings, but SSH is not accessible (connection refused).

Mentioned in SAL (#wikimedia-operations) [2018-03-06T21:22:58Z] <mutante> restbase-dev1006 powercycled via console (T185494)

Dzahn added a subscriber: Dzahn. Mar 6 2018, 9:31 PM

The console was showing "root password for maintenance (or type Control-D to continue): "

I tried one power cycle and I saw:

         Starting Activation of LVM2 logical volumes...
[  OK  ] Started Activation of LVM2 logical volumes.
         Starting Monitoring of LVM2 mirrors, snapshots etc. ...ress polling...
[  OK  ] Started Monitoring of LVM2 mirrors, snapshots etc. u...ogress polling.
[ TIME ] Timed out waiting for device dev-mapper-restbase\x2d...\x2dsrv.device.
[DEPEND] Dependency failed for /srv.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for File System Check on /dev/mapp...ev1006--vg-srv.
[  OK  ] Closed ACPID Listen Socket.
..
[  OK  ] Reached target Network is Online.
         Starting LSB: ferm firewall configuration...
[  OK  ] Started LSB: ferm firewall configuration.
Welcome to emergGive root password for maintenance
(or type Control-D to continue): 
Login incorrect.
Give root password for maintenance
(or type Control-D to continue):
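
A note on reading the `[ TIME ]` line: systemd device unit names are escaped paths, where `/` becomes `-` and a literal `-` becomes `\x2d`. The sketch below reverses that escaping for ASCII names only. The unit name used is a made-up placeholder, since the real one is truncated in the console output above, and `unescape_unit_path` is a hypothetical helper name.

```python
import re


def unescape_unit_path(unit):
    """Turn an escaped systemd unit name back into a filesystem path.

    Handles single-byte (ASCII) \\xNN escapes only.
    """
    name = unit.rsplit(".", 1)[0]  # drop the ".device" suffix
    name = name.replace("-", "/")  # unescaped "-" separates path components
    # Decode \xNN after the step above so decoded hyphens stay hyphens.
    name = re.sub(r"\\x([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)), name)
    return "/" + name


# Placeholder unit name, NOT the one from this host.
print(unescape_unit_path(r"dev-mapper-example\x2dvg\x2ddata.device"))
```

Read this way, the timed-out unit is a device-mapper node under /dev/mapper, the same kind of path that appears in the `File System Check on /dev/mapp...` dependency failure.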

Mentioned in SAL (#wikimedia-operations) [2018-03-13T23:10:25Z] <mutante> restbase-dev1006 - reinstalling, manually skipping " Volume group name already in use" (T185494)

Reinstalled, re-added to Puppet, did the initial Puppet run, and the host recovered in Icinga, including:

19:33 < icinga-wm> RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active

New SSH fingerprints:

+---------+---------+-------------------------------------------------+
| Cipher  | Algo    | Fingerprint                                     |
+---------+---------+-------------------------------------------------+
| RSA     | MD5     | 51:bd:0e:1f:e5:07:20:26:2e:bd:31:c0:43:fa:db:86 |
| RSA     | SHA-256 | EmPT1QP9Pe5Q0VJUZoyL3a+SXyl6MG2kiMKS2ruStsk=    |
+---------+---------+-------------------------------------------------+
| DSA     | MD5     | a0:27:0b:e3:5b:a4:75:fa:27:51:c4:b7:d0:fd:aa:00 |
| DSA     | SHA-256 | /csW1MH/p/DmZx7i0OQtnKdZyq+BzYfDyCtKHI/q+rk=    |
+---------+---------+-------------------------------------------------+
| ECDSA   | MD5     | ca:cc:11:cc:e3:9b:b1:58:01:21:6d:46:8b:e5:18:64 |
| ECDSA   | SHA-256 | 94s1Akk43+LRFMDKImnRfiAlCbSpqr5JOk79llSrvqQ=    |
+---------+---------+-------------------------------------------------+
| ED25519 | MD5     | 51:56:5c:b0:e0:39:5d:fa:a4:a4:6f:52:7f:c2:6c:b6 |
| ED25519 | SHA-256 | bphdriZjOPjVNYR4zW/ke7iia3UwKsKFv2BNwGIKv8I=    |
+---------+---------+-------------------------------------------------+
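
For reference, fingerprints in the two formats listed above can be derived from a public key line by hashing the base64-decoded key blob. This is a minimal sketch, not the tool that produced the table; `fingerprints` is a hypothetical helper and `DEMO_LINE` wraps a placeholder blob, not a real host key.

```python
import base64
import hashlib


def fingerprints(pubkey_line):
    """Return (md5_fp, sha256_fp) for an OpenSSH public key line.

    The line is expected to look like "<type> <base64 blob> [comment]".
    """
    blob = base64.b64decode(pubkey_line.split()[1])
    md5_hex = hashlib.md5(blob).hexdigest()
    # MD5 style: 16 colon-separated hex byte pairs.
    md5_fp = ":".join(md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))
    # SHA-256 style: base64 of the raw digest, "=" padding kept as in the table.
    sha256_fp = base64.b64encode(hashlib.sha256(blob).digest()).decode("ascii")
    return md5_fp, sha256_fp


# Placeholder key line for illustration only (not a valid key).
DEMO_LINE = ("ssh-ed25519 "
             + base64.b64encode(b"placeholder key blob").decode("ascii")
             + " demo@example")

md5_fp, sha256_fp = fingerprints(DEMO_LINE)
print(md5_fp)
print(sha256_fp)
```

When comparing against `ssh-keygen -l` output, note that OpenSSH prints its SHA256 fingerprints without the trailing `=` padding, so the padding needs to be stripped from one side before comparing.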

Dzahn closed this task as Resolved. Mar 13 2018, 11:55 PM
Dzahn claimed this task.
RobH closed subtask Unknown Object (Task) as Resolved. May 31 2018, 4:38 PM