2022-08-20 NodeDown: cloudvirt1023
Closed, Resolved (Public)

Description

Common information

Major outage that requires you to either restore the server or manually evacuate the VMs on it.

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1023:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Major outage that requires you to either restore the server or manually evacuate the VMs on it.

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1023:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Event Timeline

dcaro changed the task status from Open to In Progress. (Aug 20 2022, 7:10 AM)
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:23:36Z] <dcaro_away> cloudvirt1023 seems to have gotten some hardware issue from racadm lclog view "System CPU Resetting.", rebooting and doing memory checks (T315718)

The node is unreachable through ssh:

dcaro@vulcanus$ wm-ssh cloudvirt1023
INFO:wm-ssh:Found full hostname cloudvirt1023.eqiad.wmnet
channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host

And it does not respond to ping from other hosts:

root@cloudvirt1024:~# ping cloudvirt1023
PING cloudvirt1023.eqiad.wmnet (10.64.20.42) 56(84) bytes of data.
From cloudvirt1024.eqiad.wmnet (10.64.20.43) icmp_seq=1 Destination Host Unreachable

Console shows an empty screen:

/admin1->console com2

<blank even when hitting enter>

The lclog shows some interesting stuff though:

/admin1->racadm lclog view
...
--------------------------------------------------------------------------------
SeqNumber       = 1928
Message ID      = CPU9000
Category        = System
AgentID         = SEL
Severity        = Information
Timestamp       = 2022-08-20 06:58:36
Message         = An OEM diagnostic event occurred.
RawEventData    = 0x61,0x00,0x02,0x14,0x86,0x00,0x63,0xB1,0x00,0x04,0xC1,0x28,0x7E,0x00,0x86,0x00
FQDD            = System.Embedded.1
### Many of those
...
--------------------------------------------------------------------------------
SeqNumber       = 1927
Message ID      = NIC101
Category        = System
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2022-08-20 06:58:35
Message         = The NIC Slot 3 Port 2 network link is started.
Message Arg   1 = NIC Slot 3
Message Arg   2 = 2
FQDD            = NIC.Slot.3-2-1
--------------------------------------------------------------------------------
SeqNumber       = 1926
Message ID      = NIC101
Category        = System
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2022-08-20 06:58:35
Message         = The NIC Slot 3 Port 1 network link is started.
Message Arg   1 = NIC Slot 3
Message Arg   2 = 1
FQDD            = NIC.Slot.3-1-1
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
SeqNumber       = 1920
Message ID      = CPU0704
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-08-20 06:58:29
Message         = CPU 2 machine check error detected.
Message Arg   1 = 2
RawEventData    = 0x5B,0x00,0x02,0x13,0x86,0x00,0x63,0xB1,0x00,0x04,0x07,0x0D,0x07,0xA6,0x02,0x60
FQDD            = CPU.Socket.1
--------------------------------------------------------------------------------
SeqNumber       = 1919
Message ID      = UEFI0079
Category        = System
AgentID         = iDRAC
Severity        = Critical
Timestamp       = 2022-08-20 06:58:28
Message         = One or more Uncorrectable Memory errors occurred in the previous boot.
--------------------------------------------------------------------------------
SeqNumber       = 1918
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-08-20 06:58:28
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A9.
Message Arg   1 = DIMM_A9
RawEventData    = 0x5A,0x00,0x02,0x13,0x86,0x00,0x63,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE1,0x01
FQDD            = DIMM.Socket.A9
--------------------------------------------------------------------------------
SeqNumber       = 1917
Message ID      = PST0090
Category        = System
AgentID         = SEL
Severity        = Information
Timestamp       = 2022-08-20 06:58:27
Message         = A problem was detected related to the previous server boot.
RawEventData    = 0x59,0x00,0x02,0x13,0x86,0x00,0x63,0xB1,0x00,0x04,0xC1,0x2E,0x72,0xA2,0x02,0x00
FQDD            = System.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 1916
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2022-08-20 06:55:51
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------

This points to the DIMM in slot A9 having had some issues; will reboot and run some memory checks.
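
For future occurrences, here is a minimal sketch of how the Critical entries could be pulled out of that output instead of scrolling through it by hand. It assumes the `racadm lclog view` output was saved as text in the same `key = value` layout pasted above, with dashed lines between records. The names `parse_lclog` and `critical_events` are made up for illustration and are not part of racadm; the sample records are copied from the log above.

```python
import re

# Assumption: records are delimited by lines made only of dashes,
# and every field is a single "key = value" line, as in the paste above.
SEPARATOR = re.compile(r"^-{10,}\s*$")
FIELD = re.compile(r"^(?P<key>[^=]+?)\s*=\s*(?P<value>.*)$")


def parse_lclog(text):
    """Split saved lclog text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group("key").strip()] = match.group("value").strip()
        # Lines without "=" (for example "...") are ignored.
    if current:
        records.append(current)
    return records


def critical_events(records):
    """Keep only the records whose Severity field is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber       = 1920
Message ID      = CPU0704
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-08-20 06:58:29
Message         = CPU 2 machine check error detected.
Message Arg   1 = 2
FQDD            = CPU.Socket.1
--------------------------------------------------------------------------------
SeqNumber       = 1918
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-08-20 06:58:28
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A9.
Message Arg   1 = DIMM_A9
FQDD            = DIMM.Socket.A9
--------------------------------------------------------------------------------
SeqNumber       = 1916
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2022-08-20 06:55:51
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for event in critical_events(parse_lclog(SAMPLE)):
        print(event["Timestamp"], event["Message ID"], event["Message"])
```

Filtering on the `Severity` field keeps the CPU and memory errors and drops the informational NIC and OEM diagnostic entries, which is the same narrowing done by eye above.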

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:39:56Z] <dcaro_away> cloudvirt1023 is back up, VMs are starting to recover (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:41:41Z] <dcaro_away> cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:43:32Z] <dcaro_away> rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:44:07Z] <dcaro_away> all k8s nodes ready now \o/ (T315718)

dcaro renamed this task from NodeDown to 2022-08-20 NodeDown: cloudvirt1023. (Aug 20 2022, 7:46 AM)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:55:27Z] <dcaro_away> after cloudvirt1023 reboot, the vm irc-buster does not seem to have rebooted correctly (no ssh, no console), rebooting (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T08:04:40Z] <dcaro_away> after cloudvirt1023 reboot, the vm irc-buster shows as running, but even after restart is not responsive through ssh nor console (T315718)

dcaro changed the task status from In Progress to Open. (Aug 23 2022, 8:13 AM)
dcaro moved this task from Doing to Today on the User-dcaro board.
dcaro moved this task from Today to Done on the User-dcaro board.