
labstore1007 crashed after storage controller errors--replace disk?
Open, High, Public

Description

Received this alert:

Notification Type: PROBLEM
Host: labstore1007
State: DOWN
Address: 208.80.155.106
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Sat Apr 24 22:12:06 UTC 2021

Rebooting from the console (which is up) after checking that there are no serious errors in iLO and being unable to get a root console to fully log in.

A slow-failing disk could be part of the problem (there was an alert on Tue Apr 27 11:00:28 UTC 2021 for device 1I:1:5, but it recovered quickly on its own). Since resetting the logical volume is the only clearly logged symptom, a flapping drive could be part of it.

Event Timeline

Bstorm triaged this task as Unbreak Now! priority. Sat, Apr 24, 10:26 PM
Bstorm created this task.
Bstorm lowered the priority of this task from Unbreak Now! to High. Sat, Apr 24, 10:30 PM

System is back up after forcing reboot. Checking logs for a root cause.

Found this around the time of the alert:

Apr 24 22:06:51 labstore1007 kernel: [35172822.696908] hpsa 0000:05:00.0: aborted: LUN:000000c000000001 CDB:12010000040000000000000000000000
Apr 24 22:06:51 labstore1007 kernel: [35172822.696917] hpsa 0000:05:00.0: hpsa0: hpsa_update_device_info: can't get device id for host 0:C0:T0:L0       Direct-Access           MB6000GVYYU
Apr 24 22:06:51 labstore1007 kernel: [35172822.873507] hpsa 0000:05:00.0: scsi 0:0:1:0: replaced Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
Apr 24 22:07:00 labstore1007 kernel: [35172831.623472] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
Apr 24 22:07:32 labstore1007 kernel: [35172864.122262] hpsa 0000:05:00.0: Command timed out.
Apr 24 22:07:32 labstore1007 kernel: [35172864.122273] hpsa 0000:05:00.0: hpsa0: hpsa_update_device_info: can't get device id for host 0:C0:T0:L0       Direct-Access           MB6000GVYYU
Apr 24 22:07:34 labstore1007 kernel: [35172865.534256] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1

That's not cheery.

Just before that was:

Apr 24 22:05:36 labstore1007 kernel: [35172747.658395] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
Apr 24 22:05:37 labstore1007 kernel: [35172748.413472] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1

Going to need to check the storage, but in an ideal world that shouldn't crash the system. At root login on the console, it basically hung starting a shell, so the root volumes might be what this is all about. Will know more later.

root@labstore1007:~# /usr/sbin/hpssacli ctrl all show status

Smart Array P441 in Slot 3
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

The controllers are happy enough.

All the disks report they are fine (of course).

root@labstore1007:~# hpssacli ctrl slot=3 pd all show status

   physicaldrive 1E:1:1 (port 1E:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1E:1:2 (port 1E:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1E:1:3 (port 1E:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1E:1:4 (port 1E:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1E:1:5 (port 1E:box 1:bay 5, 6001.1 GB): OK
   physicaldrive 1E:1:6 (port 1E:box 1:bay 6, 6001.1 GB): OK
   physicaldrive 1E:1:7 (port 1E:box 1:bay 7, 6001.1 GB): OK
   physicaldrive 1E:1:8 (port 1E:box 1:bay 8, 6001.1 GB): OK
   physicaldrive 1E:1:9 (port 1E:box 1:bay 9, 6001.1 GB): OK
   physicaldrive 1E:1:10 (port 1E:box 1:bay 10, 6001.1 GB): OK
   physicaldrive 1E:1:11 (port 1E:box 1:bay 11, 6001.1 GB): OK
   physicaldrive 1E:1:12 (port 1E:box 1:bay 12, 6001.1 GB): OK
   physicaldrive 1E:2:1 (port 1E:box 2:bay 1, 6001.1 GB): OK
   physicaldrive 1E:2:2 (port 1E:box 2:bay 2, 6001.1 GB): OK
   physicaldrive 1E:2:3 (port 1E:box 2:bay 3, 6001.1 GB): OK
   physicaldrive 1E:2:4 (port 1E:box 2:bay 4, 6001.1 GB): OK
   physicaldrive 1E:2:5 (port 1E:box 2:bay 5, 6001.1 GB): OK
   physicaldrive 1E:2:6 (port 1E:box 2:bay 6, 6001.1 GB): OK
   physicaldrive 1E:2:7 (port 1E:box 2:bay 7, 6001.1 GB): OK
   physicaldrive 1E:2:8 (port 1E:box 2:bay 8, 6001.1 GB): OK
   physicaldrive 1E:2:9 (port 1E:box 2:bay 9, 6001.1 GB): OK
   physicaldrive 1E:2:10 (port 1E:box 2:bay 10, 6001.1 GB): OK
   physicaldrive 1E:2:11 (port 1E:box 2:bay 11, 6001.1 GB): OK
   physicaldrive 1E:2:12 (port 1E:box 2:bay 12, 6001.1 GB): OK

root@labstore1007:~# hpssacli ctrl slot=3 ld all show status

   logicaldrive 1 (32.7 TB, 1+0): OK
   logicaldrive 2 (32.7 TB, 1+0): OK

root@labstore1007:~# hpssacli ctrl slot=1 pd all show status

   physicaldrive 2I:4:1 (port 2I:box 4:bay 1, 1 TB): OK
   physicaldrive 2I:4:2 (port 2I:box 4:bay 2, 1 TB): OK
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 6001.1 GB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 6001.1 GB): OK
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 6001.1 GB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 6001.1 GB): OK
   physicaldrive 2I:2:1 (port 2I:box 2:bay 1, 6001.1 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 6001.1 GB): OK
   physicaldrive 2I:2:3 (port 2I:box 2:bay 3, 6001.1 GB): OK
   physicaldrive 2I:2:4 (port 2I:box 2:bay 4, 6001.1 GB): OK

root@labstore1007:~# hpssacli ctrl slot=1 ld all show status

   logicaldrive 1 (931.5 GB, 1): OK
   logicaldrive 2 (32.7 TB, 1+0): OK

Ok, the PCI address logging the errors was 0000:05:00.0, which corresponds to the controller in slot 1 from the controller details:

Smart Array P840 in Slot 1
   Bus Interface: PCI
   Slot: 1
   Serial Number: PDNNF0ARH7306N
   Cache Serial Number: PEYFP0BRH7323Q

...

PCI Address (Domain:Bus:Device.Function): 0000:05:00.0

So, that helps a bit.
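
For reference, a quick way to cross-check which controller owns a given PCI address (not from the original investigation, just a sketch assuming hpssacli and lspci are both available):

root@labstore1007:~# hpssacli ctrl all show detail | grep -E 'Smart Array|PCI Address'   # pairs each slot with its PCI address
root@labstore1007:~# lspci -s 05:00.0                                                    # confirms what device sits at that address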

This line from earlier, Apr 24 22:05:36 labstore1007 kernel: [35172747.658395] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap- En- Exp=1, would then refer to the larger data volume from the paste (logical drive 2), since that is the RAID 10 one.

The system thinks that is all good. I'm starting to think the controller flapped.
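
The logical drive detail below came from something along these lines (the exact invocation wasn't captured in the ticket, so treat the command as a guess):

root@labstore1007:~# hpssacli ctrl slot=1 ld 2 show detail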

array B

   Logical Drive: 2
      Size: 32.7 TB
      Fault Tolerance: 1+0
      Heads: 255
      Sectors Per Track: 63
      Cylinders: 65535
      Strip Size: 256 KB
      Full Stripe Size: 1536 KB
      Status: OK
      MultiDomain Status: OK
      Caching:  Enabled
      Unique Identifier: 600508B1001C778B7DE10D3778FE5B36
      Disk Name: /dev/sdb
      Mount Points: None
      Logical Drive Label: 05C50128PDNNF0ARH7306N7C41
      Mirror Group 1:
         physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 6001.1 GB, OK)
      Mirror Group 2:
         physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SATA, 6001.1 GB, OK)
      Drive Type: Data
      LD Acceleration Method: Controller Cache
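
To tie the scsi 0:1:0:1 target from dmesg back to this volume, the block device's sysfs link can be checked; a rough sketch, not something captured in the ticket:

root@labstore1007:~# ls -l /sys/block/sdb/device    # the symlink target ends in the SCSI address, e.g. .../0:1:0:1
root@labstore1007:~# lsscsi | grep sdb              # alternative view, if lsscsi happens to be installed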

Ok, my shell just hung for a bit, and I see this in dmesg :(

[Sat Apr 24 22:31:40 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:32:23 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:32:23 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:05 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:09 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:36:09 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:48 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:59 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:36:59 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:38:23 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:38:26 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:38:26 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:40:37 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:40:48 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:40:48 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:43:09 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:43:10 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:43:10 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:44:34 2021] hpsa 0000:05:00.0: aborted: LUN:000000c000000001 CDB:12000000310000000000000000000000
[Sat Apr 24 22:44:34 2021] hpsa 0000:05:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
[Sat Apr 24 22:44:34 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: removed Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 22:44:42 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:44:57 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:44:57 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:44:58 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: masked Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 22:50:39 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:51:37 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:51:37 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:57:01 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:57:23 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:57:23 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:59:59 2021] hpsa 0000:05:00.0: aborted: LUN:000000c000000001 CDB:12010000090000000000000000000000
[Sat Apr 24 22:59:59 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: replaced Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 23:00:02 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 23:00:38 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 23:00:38 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 23:00:41 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: replaced Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 23:02:05 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 23:02:08 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 23:02:08 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
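
For keeping a rough eye on how often the volume resets while the box is still up, a crude count over dmesg does the job (a convenience one-liner, not part of the original debugging):

root@labstore1007:~# dmesg -T | grep -c 'resetting logical'               # total reset attempts since boot
root@labstore1007:~# dmesg -T | grep 'resetting logical' | tail -n 3      # timestamps of the most recent ones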

Going to fail over to labstore1006 because this looks suspect.

Change 682273 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] dumps-distribution: fail over cloud NFS primary to labstore1006

https://gerrit.wikimedia.org/r/682273

Change 682273 merged by Bstorm:

[operations/puppet@production] dumps-distribution: fail over cloud NFS primary to labstore1006

https://gerrit.wikimedia.org/r/682273

Bstorm renamed this task from labstore1007 not reachable on main interface (ping failure) to labstore1007 crashed after storage controller errors. Sat, Apr 24, 11:18 PM

Ok, dumps is going to be served from labstore1006 for a while. I've downloaded an "adu" report with the hpssaducli tool (hpssaducli -f adu-report.zip as root). I'm pleased that it's readable; I'll go over it Monday to see if there's a useful record of what happened in there, since this looks like hardware.
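
If anyone else wants to poke at the report, it's just a zip; something like this lists and extracts it (assuming the filename above):

root@labstore1007:~# unzip -l adu-report.zip          # list the files inside without extracting
root@labstore1007:~# unzip adu-report.zip -d adu/     # extract into ./adu for browsing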

Unfortunately I'm not seeing anything in there.

Mentioned in SAL (#wikimedia-operations) [2021-04-27T12:00:33Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045

Mentioned in SAL (#wikimedia-operations) [2021-04-27T12:00:43Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045

There was an alert today about device 1I:1:5, but by the time I checked, the device was fine:

aborrero@labstore1007:~ $ sudo hpssacli ctrl slot=3 pd all show status

   physicaldrive 1E:1:1 (port 1E:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1E:1:2 (port 1E:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1E:1:3 (port 1E:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1E:1:4 (port 1E:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1E:1:5 (port 1E:box 1:bay 5, 6001.1 GB): OK    <--------------
[..]
aborrero added a parent task: Unknown Object (Task). Tue, Apr 27, 12:06 PM

That alert may actually be actionable, at least. It would seem surprising for a failing disk to crash a server, but I've seen it happen before when a drive is resetting like mad. Let's see if we can get that replaced.
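
For the replacement request, the drive's model, serial, and firmware can be pulled straight off the controller; a sketch using the bay named in the alert (adjust slot/bay if the alert actually pointed at a different drive):

root@labstore1007:~# hpssacli ctrl slot=1 pd 1I:1:5 show detail   # includes Model, Serial Number, and Firmware Revision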

Bstorm renamed this task from labstore1007 crashed after storage controller errors to labstore1007 crashed after storage controller errors--replace disk?. Tue, Apr 27, 3:17 PM
Bstorm updated the task description. (Show Details)
Bstorm added a project: ops-eqiad.

@wiki_willy @Bstorm the host is out of warranty (expired Jun 1, 2020). We do not have any 6 TB hard drives; how would you like to proceed?

Just to clarify: we (the wmcs team) are in agreement that we should spend the money and buy a new drive. No objections if y'all want to mess with firmware upgrades first.