
labstore1006 spontaneous reboot
Closed, Resolved (Public)

Description

Something went wrong on labstore1006 at Thu Nov 19 21:03:30 UTC 2020.
It clearly rebooted itself. Nothing in the logs so far points to a cause, but there is a failed drive. If we find a root cause or something actionable for DCops, we can add them to this task.

Event Timeline

Just before it died there were several things like this:

Nov 19 20:59:58 labstore1006 kernel: [21675498.806396] hpsa 0000:08:00.0: Command timed out.
Nov 19 20:59:58 labstore1006 kernel: [21675498.806410] hpsa 0000:08:00.0: hpsa1: hpsa_update_device_info: can't get device id for host 1:C0:T-1:L-1     Direct-Access           MB6000JVYYV

The last gasp:

Nov 19 21:01:15 labstore1006 rsyncd[20596]: rsync on dumpslastfive/ from 96-90-175-164-static.hfc.comcastbusiness.net (96.90.175.164)
Nov 19 21:01:15 labstore1006 rsyncd[20596]: building file list
Nov 19 21:01:28 labstore1006 ulogd[7093]: [fw-in-drop] IN=eth0 OUT= MAC=e0:07:1b:f0:9b:28:5c:5e:ab:3d:87:c1:08:00 SRC=113.53.231.34 DST=208.80.154.7 LEN=40 TOS=00 PREC=0x00 TTL=243 ID=8963 PROTO=TCP SPT=51505 DPT=445 SEQ=3975657657 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Nov 19 21:01:31 labstore1006 kernel: [21675590.965724] hpsa 0000:08:00.0: Command timed out.
Nov 19 21:01:31 labstore1006 kernel: [21675590.965732] hpsa 0000:08:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
Nov 19 21:01:35 labstore1006 ulogd[7093]: [fw-in-drop] IN=eth0 OUT= MAC=e0:07:1b:f0:9b:28:5c:5e:ab:3d:87:c1:08:00 SRC=193.27.229.86 DST=208.80.154.7 LEN=40 TOS=00 PREC=0x00 TTL=249 ID=8466 PROTO=TCP SPT=41214 DPT=5200 SEQ=3399069509 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Nov 19 21:01:48 labstore1006 ulogd[7093]: [fw-in-drop] IN=eth0 OUT= MAC=e0:07:1b:f0:9b:28:5c:5e:ab:3d:87:c1:08:00 SRC=59.127.155.116 DST=208.80.154.7 LEN=40 TOS=0C PREC=0x00 TTL=48 ID=38032 PROTO=TCP SPT=56704 DPT=8080 SEQ=3494943239 ACK=0 WINDOW=353 SYN URGP=0 MARK=0
Nov 19 21:01:54 labstore1006 ulogd[7093]: [fw-in-drop] IN=eth0 OUT= MAC=e0:07:1b:f0:9b:28:5c:5e:ab:3d:87:c1:08:00 SRC=193.27.229.86 DST=208.80.154.7 LEN=40 TOS=00 PREC=0x00 TTL=249 ID=39808 PROTO=TCP SPT=41214 DPT=5124 SEQ=391452035 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Nov 19 21:02:01 labstore1006 CRON[20616]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Nov 19 21:02:01 labstore1006 kernel: [21675621.685517] hpsa 0000:08:00.0: Command timed out.
Nov 19 21:02:18 labstore1006 ulogd[7093]: [fw-in-drop] IN=eth0 OUT= MAC=e0:07:1b:f0:9b:28:5c:5e:ab:3d:87:c1:08:00 SRC=194.26.25.123 DST=208.80.154.7 LEN=40 TOS=00 PREC=0x00 TTL=247 ID=32889 PROTO=TCP SPT=52071 DPT=6040 SEQ=1584926182 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
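For anyone retracing this, the hpsa lines are easy to lose among the firewall drops and rsync noise. Below is a minimal sketch of a filter that keeps only the hpsa kernel messages from syslog-style lines like the ones above. The function name `hpsa_events` and the regular expression are illustrative assumptions based on the line format shown in this task, not an existing tool.

```python
import re

# Assumed line shape, taken from the excerpts in this task:
# "Nov 19 20:59:58 labstore1006 kernel: [21675498.806396] hpsa 0000:08:00.0: Command timed out."
HPSA_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] hpsa (?P<pci>\S+): (?P<msg>.*)$"
)


def hpsa_events(lines):
    """Return (timestamp, pci_address, message) for each hpsa kernel line."""
    events = []
    for line in lines:
        match = HPSA_LINE.match(line.strip())
        if match:
            events.append((match["stamp"], match["pci"], match["msg"]))
    return events
```

Any iterable of lines works as input, so an open syslog file can be passed directly. Lines that do not match the assumed format are skipped.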

Change 642144 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps: fail over dumps web

https://gerrit.wikimedia.org/r/642144

Change 642156 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps-dist: fail over labstore1006 to 1007

https://gerrit.wikimedia.org/r/642156

In case a ticket wasn't auto-created for it: the failed drive is Port: 1E, box: 2, bay: 10 (SAS), according to iLO.

Change 642156 merged by Andrew Bogott:
[operations/puppet@production] dumps-dist: fail over labstore1006 to 1007

https://gerrit.wikimedia.org/r/642156

Change 642144 merged by Andrew Bogott:
[operations/dns@master] dumps: fail over dumps web

https://gerrit.wikimedia.org/r/642144

"an unrecoverable system error (NMI) has occurred. (Service Information: 0x00CC47F0, 0x00CC4AF0)"

Bstorm added a subscriber: Jclark-ctr.

According to T268285: update RAID controller firmware on labstore1006, 1007, we are already on recent firmware with regard to this issue. I'd briefly discussed involving HPE to get a fix with @Jclark-ctr back on that ticket, but I'm not sure whether that was done or whether we have a service agreement/warranty either way.

So at this point, this has been in a failover state for a couple of months. The last time this happened, we gave up and failed back (and it happened again). I believe the warranty expired in 2020, so the opportunity to get this fixed during the last round of sudden reboots is already gone. That might not leave us with much. The system is strained while in a failover state, but it has no automatic HA.

@wiki_willy do you see any way forward here?

labstore1006 is the live web host again. If we need to take it down for troubleshooting, please coordinate with WMCS.

nskaggs claimed this task.