2022-08-20 NodeDown: cloudvirt1023
Closed, Resolved (Public)

Assigned To

	dcaro

Authored By

	phaultfinder
	Aug 20 2022, 7:02 AM

Description

Common information

dashboard: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1023
description: Cloudvirt node cloudvirt1023 is down.
runbook: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown
summary: #page The cloudvirt node cloudvirt1023 is unreachable. This is a major outage that requires you to either restore the server or evacuate manually the VMs on it.

alertname: NodeDown
cluster: wmcs
instance: cloudvirt1023:9100
job: node
prometheus: ops
severity: page
site: eqiad
source: prometheus
team: wmcs

Firing alerts

dashboard: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1023
description: Cloudvirt node cloudvirt1023 is down.
runbook: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown
summary: #page The cloudvirt node cloudvirt1023 is unreachable. This is a major outage that requires you to either restore the server or evacuate manually the VMs on it.

alertname: NodeDown
cluster: wmcs
instance: cloudvirt1023:9100
job: node
prometheus: ops
severity: page
site: eqiad
source: prometheus
team: wmcs

Related Objects

Status	Assigned	Task
Resolved	dcaro	T315718 2022-08-20 NodeDown: cloudvirt1023
Resolved	Giftpflanze	T315720 DWL irc-buster VM did not come back up from unexpected cloudvirt reboot
Resolved	dcaro	T315961 Restore gifti and taxonbot directories from irc-buster VM from DWL project

Event Timeline

phaultfinder created this task. Aug 20 2022, 7:02 AM

Restricted Application added subscribers: dcaro, Aklapper. Aug 20 2022, 7:02 AM

dcaro changed the task status from Open to In Progress. Aug 20 2022, 7:10 AM

dcaro claimed this task.

dcaro added a project: User-dcaro.

dcaro added a project: Cloud-Services-Worktype-Unplanned.

dcaro added a project: Cloud-Services-Origin-Alert.

dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:23:36Z] <dcaro_away> cloudvirt1023 seems to have gotten some hardware issue from racadm lclog view "System CPU Resetting.", rebooting and doing memory checks (T315718)

The node is unreachable through ssh:

dcaro@vulcanus$ wm-ssh cloudvirt1023
INFO:wm-ssh:Found full hostname cloudvirt1023.eqiad.wmnet
channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host

And unpingable from others:

root@cloudvirt1024:~# ping cloudvirt1023
PING cloudvirt1023.eqiad.wmnet (10.64.20.42) 56(84) bytes of data.
From cloudvirt1024.eqiad.wmnet (10.64.20.43) icmp_seq=1 Destination Host Unreachable

Console shows an empty screen:

/admin1->console com2

<blank even when hitting enter>

The lclog shows some interesting stuff though:

/admin1->racadm lclog view
...
--------------------------------------------------------------------------------
SeqNumber       = 1928
Message ID      = CPU9000
Category        = System
AgentID         = SEL
Severity        = Information
Timestamp       = 2022-08-20 06:58:36
Message         = An OEM diagnostic event occurred.
RawEventData    = 0x61,0x00,0x02,0x14,0x86,0x00,0x63,0xB1,0x00,0x04,0xC1,0x28,0x7E,0x00,0x86,0x00
FQDD            = System.Embedded.1
### Many of those
...
--------------------------------------------------------------------------------
SeqNumber       = 1927
Message ID      = NIC101
Category        = System
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2022-08-20 06:58:35
Message         = The NIC Slot 3 Port 2 network link is started.
Message Arg   1 = NIC Slot 3
Message Arg   2 = 2
FQDD            = NIC.Slot.3-2-1
--------------------------------------------------------------------------------
SeqNumber       = 1926
Message ID      = NIC101
Category        = System
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2022-08-20 06:58:35
Message         = The NIC Slot 3 Port 1 network link is started.
Message Arg   1 = NIC Slot 3
Message Arg   2 = 1
FQDD            = NIC.Slot.3-1-1
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
SeqNumber       = 1920
Message ID      = CPU0704
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-08-20 06:58:29
Message         = CPU 2 machine check error detected.
Message Arg   1 = 2
RawEventData    = 0x5B,0x00,0x02,0x13,0x86,0x00,0x63,0xB1,0x00,0x04,0x07,0x0D,0x07,0xA6,0x02,0x60
FQDD            = CPU.Socket.1
--------------------------------------------------------------------------------
SeqNumber       = 1919
Message ID      = UEFI0079
Category        = System
AgentID         = iDRAC
Severity        = Critical
Timestamp       = 2022-08-20 06:58:28
Message         = One or more Uncorrectable Memory errors occurred in the previous boot.
--------------------------------------------------------------------------------
SeqNumber       = 1918
Message ID      = MEM0001
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2022-08-20 06:58:28
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A9.
Message Arg   1 = DIMM_A9
RawEventData    = 0x5A,0x00,0x02,0x13,0x86,0x00,0x63,0xB1,0x00,0x04,0x0C,0x02,0x6F,0x11,0xE1,0x01
FQDD            = DIMM.Socket.A9
--------------------------------------------------------------------------------
SeqNumber       = 1917
Message ID      = PST0090
Category        = System
AgentID         = SEL
Severity        = Information
Timestamp       = 2022-08-20 06:58:27
Message         = A problem was detected related to the previous server boot.
RawEventData    = 0x59,0x00,0x02,0x13,0x86,0x00,0x63,0xB1,0x00,0x04,0xC1,0x2E,0x72,0xA2,0x02,0x00
FQDD            = System.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 1916
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2022-08-20 06:55:51
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------

This points to the DIMM in slot A9 having had some issues. I will reboot and run some memory checks.
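Scanning the lifecycle log for Critical entries, as done by eye above, can be sketched in Python. This is only a sketch and not something that was run during the incident: `parse_lclog` and `critical_events` are hypothetical helper names, the sample is an abridged excerpt of the records quoted above, and the parsing assumes the `key = value` layout and dashed separators seen in this output.

```python
import re

# Records in the quoted output are separated by long runs of dashes.
RECORD_SEPARATOR = re.compile(r"^-{10,}\s*$", re.MULTILINE)
# Each field looks like "Key name   = value".
FIELD = re.compile(r"^(?P<key>[^=\n]+?)\s*=\s*(?P<value>.*)$")


def parse_lclog(text):
    """Split lclog text into a list of dicts, one per log record."""
    records = []
    for chunk in RECORD_SEPARATOR.split(text):
        record = {}
        for line in chunk.splitlines():
            match = FIELD.match(line.strip())
            if match:
                record[match.group("key").strip()] = match.group("value").strip()
        # Skip chunks that are not real records (empty text, "..." markers).
        if "SeqNumber" in record:
            records.append(record)
    return records


def critical_events(records):
    """Keep only the records whose Severity field is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


# Abridged excerpt of the records quoted in this task.
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber       = 1920
Message ID      = CPU0704
Severity        = Critical
Timestamp       = 2022-08-20 06:58:29
Message         = CPU 2 machine check error detected.
FQDD            = CPU.Socket.1
--------------------------------------------------------------------------------
SeqNumber       = 1918
Message ID      = MEM0001
Severity        = Critical
Timestamp       = 2022-08-20 06:58:28
Message         = Multi-bit memory errors detected on a memory device at location(s) DIMM_A9.
Message Arg   1 = DIMM_A9
FQDD            = DIMM.Socket.A9
--------------------------------------------------------------------------------
SeqNumber       = 1916
Message ID      = SYS1003
Severity        = Information
Timestamp       = 2022-08-20 06:55:51
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for event in critical_events(parse_lclog(SAMPLE)):
        print(event["Timestamp"], event["Message ID"], event["FQDD"], event["Message"])
```

Filtering on Severity surfaces the memory and machine check records without reading through the many Information entries around them.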

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:39:56Z] <dcaro_away> cloudvirt1023 is back up, VMs are starting to recover (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:41:41Z] <dcaro_away> cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:43:32Z] <dcaro_away> rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:44:07Z] <dcaro_away> all k8s nodes ready now \o/ (T315718)
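For a rough idea of how long the outage lasted, the timestamps quoted in this task can be subtracted directly. The following is a minimal sketch with one loud assumption: the lclog timestamp for "System CPU Resetting." (SeqNumber 1916) is treated as UTC, which the log itself does not state. The SAL timestamps are given in UTC. `parse_utc` and `elapsed_since_reset` are hypothetical helper names.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse 'YYYY-MM-DD HH:MM:SS' and mark it as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# lclog SeqNumber 1916, "System CPU Resetting." ASSUMPTION: timestamp is UTC.
RESET = parse_utc("2022-08-20 06:55:51")

# Recovery milestones taken from the SAL entries in this task (UTC).
MILESTONES = {
    "cloudvirt1023 back up": parse_utc("2022-08-20 07:39:56"),
    "all k8s nodes ready": parse_utc("2022-08-20 07:44:07"),
}


def elapsed_since_reset(milestones, reset=RESET):
    """Return the time elapsed between the reset and each milestone."""
    return {name: when - reset for name, when in milestones.items()}


if __name__ == "__main__":
    for name, delta in elapsed_since_reset(MILESTONES).items():
        print(f"{name}: {delta}")
```

Under that assumption the host was back roughly 44 minutes after the reset, and all k8s nodes were ready about 4 minutes later.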

dcaro renamed this task from NodeDown to 2022-08-20 NodeDown: cloudvirt1023. Aug 20 2022, 7:46 AM

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T07:55:27Z] <dcaro_away> after cloudvirt1023 reboot, the vm irc-buster does not seem to have rebooted correctly (no ssh, no console), rebooting (T315718)

Mentioned in SAL (#wikimedia-cloud) [2022-08-20T08:04:40Z] <dcaro_away> after cloudvirt1023 reboot, the vm irc-buster shows as running, but even after restart is not responsive through ssh nor console (T315718)

dcaro added a subtask: T315720: DWL irc-buster VM did not come back up from unexpected cloudvirt reboot. Aug 20 2022, 8:13 AM

RhinosF1 subscribed. Aug 20 2022, 8:14 AM

dcaro changed the task status from In Progress to Open. Aug 23 2022, 8:13 AM

dcaro moved this task from Doing to Today on the User-dcaro board.

Giftpflanze closed subtask T315720: DWL irc-buster VM did not come back up from unexpected cloudvirt reboot as Resolved. Aug 23 2022, 1:34 PM

dcaro closed this task as Resolved. Aug 23 2022, 2:42 PM

dcaro moved this task from Today to Done on the User-dcaro board.
