
labstore1007 crashed after storage controller errors--replace disk?
Closed, Resolved · Public

Description

Received this alert:

Notification Type: PROBLEM
Host: labstore1007
State: DOWN
Address: 208.80.155.106
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Sat Apr 24 22:12:06 UTC 2021

rebooting from the console (which is up) after checking no serious errors in iLO and unable to get a root console to fully log in

A slow-failing disk could be part of the problem (there was an alert on Tue Apr 27 11:00:28 UTC 2021 for device 1I:1:5, but it recovered quickly on its own). Since resetting the logical volume is the only clearly logged symptom, a flapping drive could be part of it.
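
For reference, a quick way to check whether the reset pattern recurs is to grep the kernel log for hpsa events (a minimal sketch; assumes the messages are still in the journal or ring buffer):

dmesg -T | grep -E 'hpsa.*(aborted|resetting logical|reset logical)'
journalctl -k | grep hpsa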

Event Timeline

Bstorm triaged this task as Unbreak Now! priority. Apr 24 2021, 10:26 PM
Bstorm created this task.
Bstorm lowered the priority of this task from Unbreak Now! to High. Apr 24 2021, 10:30 PM

System is back up after forcing reboot. Checking logs for a root cause.

Found this around the time of the alert:

Apr 24 22:06:51 labstore1007 kernel: [35172822.696908] hpsa 0000:05:00.0: aborted: LUN:000000c000000001 CDB:12010000040000000000000000000000
Apr 24 22:06:51 labstore1007 kernel: [35172822.696917] hpsa 0000:05:00.0: hpsa0: hpsa_update_device_info: can't get device id for host 0:C0:T0:L0       Direct-Access           MB6000GVYYU
Apr 24 22:06:51 labstore1007 kernel: [35172822.873507] hpsa 0000:05:00.0: scsi 0:0:1:0: replaced Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
Apr 24 22:07:00 labstore1007 kernel: [35172831.623472] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
Apr 24 22:07:32 labstore1007 kernel: [35172864.122262] hpsa 0000:05:00.0: Command timed out.
Apr 24 22:07:32 labstore1007 kernel: [35172864.122273] hpsa 0000:05:00.0: hpsa0: hpsa_update_device_info: can't get device id for host 0:C0:T0:L0       Direct-Access           MB6000GVYYU
Apr 24 22:07:34 labstore1007 kernel: [35172865.534256] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1

That's not cheery.

Just before that was:

Apr 24 22:05:36 labstore1007 kernel: [35172747.658395] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
Apr 24 22:05:37 labstore1007 kernel: [35172748.413472] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1

Going to need to check the storage, though in an ideal world that shouldn't crash the system. At root login on the console, it basically hung starting a shell, so the root volumes might be what this is all about. Will know more later.

root@labstore1007:~# /usr/sbin/hpssacli ctrl all show status

Smart Array P441 in Slot 3
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

The controllers are happy enough.

All the disks report they are fine (of course).

root@labstore1007:~# hpssacli ctrl slot=3 pd all show status

   physicaldrive 1E:1:1 (port 1E:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1E:1:2 (port 1E:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1E:1:3 (port 1E:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1E:1:4 (port 1E:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1E:1:5 (port 1E:box 1:bay 5, 6001.1 GB): OK
   physicaldrive 1E:1:6 (port 1E:box 1:bay 6, 6001.1 GB): OK
   physicaldrive 1E:1:7 (port 1E:box 1:bay 7, 6001.1 GB): OK
   physicaldrive 1E:1:8 (port 1E:box 1:bay 8, 6001.1 GB): OK
   physicaldrive 1E:1:9 (port 1E:box 1:bay 9, 6001.1 GB): OK
   physicaldrive 1E:1:10 (port 1E:box 1:bay 10, 6001.1 GB): OK
   physicaldrive 1E:1:11 (port 1E:box 1:bay 11, 6001.1 GB): OK
   physicaldrive 1E:1:12 (port 1E:box 1:bay 12, 6001.1 GB): OK
   physicaldrive 1E:2:1 (port 1E:box 2:bay 1, 6001.1 GB): OK
   physicaldrive 1E:2:2 (port 1E:box 2:bay 2, 6001.1 GB): OK
   physicaldrive 1E:2:3 (port 1E:box 2:bay 3, 6001.1 GB): OK
   physicaldrive 1E:2:4 (port 1E:box 2:bay 4, 6001.1 GB): OK
   physicaldrive 1E:2:5 (port 1E:box 2:bay 5, 6001.1 GB): OK
   physicaldrive 1E:2:6 (port 1E:box 2:bay 6, 6001.1 GB): OK
   physicaldrive 1E:2:7 (port 1E:box 2:bay 7, 6001.1 GB): OK
   physicaldrive 1E:2:8 (port 1E:box 2:bay 8, 6001.1 GB): OK
   physicaldrive 1E:2:9 (port 1E:box 2:bay 9, 6001.1 GB): OK
   physicaldrive 1E:2:10 (port 1E:box 2:bay 10, 6001.1 GB): OK
   physicaldrive 1E:2:11 (port 1E:box 2:bay 11, 6001.1 GB): OK
   physicaldrive 1E:2:12 (port 1E:box 2:bay 12, 6001.1 GB): OK

root@labstore1007:~# hpssacli ctrl slot=3 ld all show status

   logicaldrive 1 (32.7 TB, 1+0): OK
   logicaldrive 2 (32.7 TB, 1+0): OK

root@labstore1007:~# hpssacli ctrl slot=1 pd all show status

   physicaldrive 2I:4:1 (port 2I:box 4:bay 1, 1 TB): OK
   physicaldrive 2I:4:2 (port 2I:box 4:bay 2, 1 TB): OK
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 6001.1 GB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 6001.1 GB): OK
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 6001.1 GB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 6001.1 GB): OK
   physicaldrive 2I:2:1 (port 2I:box 2:bay 1, 6001.1 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 6001.1 GB): OK
   physicaldrive 2I:2:3 (port 2I:box 2:bay 3, 6001.1 GB): OK
   physicaldrive 2I:2:4 (port 2I:box 2:bay 4, 6001.1 GB): OK

root@labstore1007:~# hpssacli ctrl slot=1 ld all show status

   logicaldrive 1 (931.5 GB, 1): OK
   logicaldrive 2 (32.7 TB, 1+0): OK

Ok, the PCI address logging the errors was 0000:05:00.0, which, per the controller details, corresponds to the controller in slot 1:

Smart Array P840 in Slot 1
   Bus Interface: PCI
   Slot: 1
   Serial Number: PDNNF0ARH7306N
   Cache Serial Number: PEYFP0BRH7323Q

...

PCI Address (Domain:Bus:Device.Function): 0000:05:00.0

So, that helps a bit.
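
(For reference, a quick way to make that slot-to-PCI mapping without reading the full detail output; a sketch assuming hpssacli and lspci are available on the host:)

root@labstore1007:~# hpssacli ctrl all show detail | grep -E 'Smart Array|PCI Address'
root@labstore1007:~# lspci -s 05:00.0    # confirm what device sits at that address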

This line, "Apr 24 22:05:36 labstore1007 kernel: [35172747.658395] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap- En- Exp=1", would then refer to the larger data volume from the paste (logical drive 2), because that's the RAID 10 one.

The system thinks that is all good. I'm starting to think the controller flapped.
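
(The array detail below can be pulled with something like the following sketch; the exact invocation wasn't captured in the paste:)

root@labstore1007:~# hpssacli ctrl slot=1 ld 2 show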

array B

   Logical Drive: 2
      Size: 32.7 TB
      Fault Tolerance: 1+0
      Heads: 255
      Sectors Per Track: 63
      Cylinders: 65535
      Strip Size: 256 KB
      Full Stripe Size: 1536 KB
      Status: OK
      MultiDomain Status: OK
      Caching:  Enabled
      Unique Identifier: 600508B1001C778B7DE10D3778FE5B36
      Disk Name: /dev/sdb
      Mount Points: None
      Logical Drive Label: 05C50128PDNNF0ARH7306N7C41
      Mirror Group 1:
         physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 6001.1 GB, OK)
      Mirror Group 2:
         physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 6001.1 GB, OK)
         physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SATA, 6001.1 GB, OK)
         physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SATA, 6001.1 GB, OK)
      Drive Type: Data
      LD Acceleration Method: Controller Cache

Ok, my shell just hung for a bit, and I see this in dmesg :(

[Sat Apr 24 22:31:40 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:32:23 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:32:23 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:05 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:09 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:36:09 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:48 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:36:59 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:36:59 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:38:23 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:38:26 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:38:26 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:40:37 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:40:48 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:40:48 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:43:09 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:43:10 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:43:10 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:44:34 2021] hpsa 0000:05:00.0: aborted: LUN:000000c000000001 CDB:12000000310000000000000000000000
[Sat Apr 24 22:44:34 2021] hpsa 0000:05:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
[Sat Apr 24 22:44:34 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: removed Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 22:44:42 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:44:57 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:44:57 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:44:58 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: masked Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 22:50:39 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:51:37 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:51:37 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:57:01 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:57:23 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 22:57:23 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 22:59:59 2021] hpsa 0000:05:00.0: aborted: LUN:000000c000000001 CDB:12010000090000000000000000000000
[Sat Apr 24 22:59:59 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: replaced Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 23:00:02 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 23:00:38 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 23:00:38 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 23:00:41 2021] hpsa 0000:05:00.0: scsi 0:0:1:0: replaced Direct-Access     ATA      MB6000GVYYU      PHYS DRV SSDSmartPathCap- En- Exp=0
[Sat Apr 24 23:02:05 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
[Sat Apr 24 23:02:08 2021] hpsa 0000:05:00.0: device is ready.
[Sat Apr 24 23:02:08 2021] hpsa 0000:05:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
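
For what it's worth, a quick count of these reset cycles (a minimal sketch; assumes the messages are still in the kernel ring buffer):

root@labstore1007:~# dmesg | grep -c 'resetting logical'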

Going to fail over to labstore1006 because this looks suspect.

Change 682273 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] dumps-distribution: fail over cloud NFS primary to labstore1006

https://gerrit.wikimedia.org/r/682273

Change 682273 merged by Bstorm:

[operations/puppet@production] dumps-distribution: fail over cloud NFS primary to labstore1006

https://gerrit.wikimedia.org/r/682273

Bstorm renamed this task from labstore1007 not reachable on main interface (ping failure) to labstore1007 crashed after storage controller errors. Apr 24 2021, 11:18 PM

Ok, dumps is going to be served from labstore1006 for a while. I've downloaded an "adu" report with the hpssaducli tool (hpssaducli -f adu-report.zip as root). I'm pleased that it's readable, and I will go over it on Monday to see if there's a useful record in there of what happened, since this looks like hardware.
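
(Roughly, as commands; the filename is arbitrary and the archive layout is an assumption:)

root@labstore1007:~# hpssaducli -f adu-report.zip
root@labstore1007:~# unzip -l adu-report.zip    # list what the report archive contains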

Unfortunately I'm not seeing anything in there.

Mentioned in SAL (#wikimedia-operations) [2021-04-27T12:00:33Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045

Mentioned in SAL (#wikimedia-operations) [2021-04-27T12:00:43Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045

There was an alert today about device 1I:1:5, but by the time I checked, the device was fine:

aborrero@labstore1007:~ $ sudo hpssacli ctrl slot=3 pd all show status

   physicaldrive 1E:1:1 (port 1E:box 1:bay 1, 6001.1 GB): OK
   physicaldrive 1E:1:2 (port 1E:box 1:bay 2, 6001.1 GB): OK
   physicaldrive 1E:1:3 (port 1E:box 1:bay 3, 6001.1 GB): OK
   physicaldrive 1E:1:4 (port 1E:box 1:bay 4, 6001.1 GB): OK
   physicaldrive 1E:1:5 (port 1E:box 1:bay 5, 6001.1 GB): OK    <--------------
[..]
aborrero added a parent task: Unknown Object (Task). Apr 27 2021, 12:06 PM

That alert may actually be actionable, at least. It would seem surprising for a failing disk to crash a server, but I've seen it happen when a disk is resetting like mad. Let's see if we can get that replaced.

Bstorm renamed this task from labstore1007 crashed after storage controller errors to labstore1007 crashed after storage controller errors--replace disk?. Apr 27 2021, 3:17 PM
Bstorm updated the task description.
Bstorm added a project: ops-eqiad.

@wiki_willy @Bstorm the host is out of warranty (expired Jun 1, 2020), and we do not have any 6 TB hard drives. How would you like to proceed?

Just to clarify: we (the wmcs team) are in agreement that we should spend the money and buy a new drive. No objections if y'all want to mess with firmware upgrades first.

Paged again just now for the same drive (cluster=wmcs device=1I:1:5 instance=labstore1007 job=node site=eqiad). I'd really like to get that replaced; it may resolve the overall problem, since the failure looks like a reset pattern of the logical drive caused by hardware behavior. Can we please order a drive before worrying about the firmware, @Jclark-ctr and @wiki_willy?
This host is out of service, creating a single point of failure for all of https://dumps.wikimedia.org until I can get it back working.

I can manually fail it out and turn on the LED whenever someone is ready to swap it. The array is old enough that it could lose another disk, so I'm holding off just in case.
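
(When someone is ready, lighting the locate LED on that bay should be something like this sketch; syntax can vary between hpssacli/ssacli versions:)

root@labstore1007:~# hpssacli ctrl slot=1 pd 1I:1:5 modify led=on
root@labstore1007:~# hpssacli ctrl slot=1 pd 1I:1:5 modify led=off    # after the swap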

Mentioned in SAL (#wikimedia-cloud) [2021-05-22T02:14:12Z] <bstorm> downtiming SMART alerts on dumps server labstore1007 for the weekend because it has been flapping T281045

Hi @Jclark-ctr - let me know how you want to proceed

@wiki_willy we need to order a drive.

Hi @Jclark-ctr - let me know how you want to proceed

Mentioned in SAL (#wikimedia-operations) [2021-05-25T00:36:58Z] <bstorm> labstore1007 downtimed for maintenance T281045

wiki_willy added a subtask: Unknown Object (Task). May 25 2021, 5:50 PM

T283618 created to order the replacement part. @Bstorm or @aborrero - one of you might need to confirm the drive size though, to make sure we're getting the right one. Thanks, Willy

It's this drive: physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 6001.1 GB).
Is that enough?
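
(If the size alone isn't enough, the full model string for ordering can be pulled with something like:)

root@labstore1007:~# hpssacli ctrl slot=1 pd 1I:1:5 show detail    # includes model, firmware, and interface type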

This may be a silly question, but might it not be a good idea to have a few spare drives on hand for labstore1006/7 until they are refreshed, whenever that may be?

Mentioned in SAL (#wikimedia-operations) [2021-05-26T10:01:50Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045

Mentioned in SAL (#wikimedia-operations) [2021-05-26T10:01:54Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045

physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 6001.1 GB): Rebuilding

Looking good.
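
(Watching the rebuild until it finishes; a sketch with an arbitrary polling interval:)

root@labstore1007:~# watch -n 60 'hpssacli ctrl slot=1 pd 1I:1:5 show status'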

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Jun 29 2021, 10:41 PM

Disk is OK now. I'm going to hope that was it!