
2024-08-31 cloudvirt1048 NodeDown because memory hardware error
Closed, Resolved, Public

Description

Common information

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1048:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Related Objects

Status | Subtype | Assigned | Task
Resolved | | VRiley-WMF |

Event Timeline

I rebooted the host via mgmt and it came back up and seems fine.

2024-08-31T13:00:06.579609+00:00 cloudvirt1048 kernel: [6222225.314271] Disabling lock debugging due to kernel taint
2024-08-31T13:00:06.579675+00:00 cloudvirt1048 kernel: [6222225.314304] mce: [Hardware Error]: Machine check events logged
2024-08-31T13:00:06.579677+00:00 cloudvirt1048 kernel: [6222225.314308] mce: [Hardware Error]: CPU 39: Machine Check Exception: 7 Bank 1: bd80000000100134
2024-08-31T13:00:06.579692+00:00 cloudvirt1048 kernel: [6222225.314778] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2024-08-31T13:00:06.582031+00:00 cloudvirt1048 neutron-openvswitch-agent: 2024-08-31 13:00:06.580 629348 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-24be7a57-c2e2-4c4b-ad51-d5431d414bab - - - - - -] Agent rpc_loop - iteration:242064 started
2024-08-31T13:00:06.584353+00:00 cloudvirt1048 neutron-openvswitch-agent: 2024-08-31 13:00:06.583 629348 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-24be7a57-c2e2-4c4b-ad51-d5431d414bab - - - - - -] Agent rpc_loop - iteration:242064 completed. Processed ports statistics: {'regular': {'added': 0, 'updated': 0, 'removed': 0}}. Elapsed:0.003
2024-08-31T13:00:06.587951+00:00 cloudvirt1048 kernel: [6222225.323094] mce: Uncorrected hardware memory error in user-access at 4d47b27e80
2024-08-31T13:00:06.587964+00:00 cloudvirt1048 kernel: [6222225.323108] mce: [Hardware Error]: TSC 395d2e45d242dc ADDR 4d47b27e80 MISC 86 PPIN 2a8f87bc2bdb1510
2024-08-31T13:00:06.587966+00:00 cloudvirt1048 kernel: [6222225.323113] mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1725109206 SOCKET 1 APIC 49 microcode 5003604
2024-08-31T13:00:06.606885+00:00 cloudvirt1048 kernel: [6222225.331523] {1}[Hardware Error]: event severity: recoverable
2024-08-31T13:00:06.606901+00:00 cloudvirt1048 kernel: [6222225.331528] {1}[Hardware Error]:  Error 0, type: recoverable
2024-08-31T13:00:06.606902+00:00 cloudvirt1048 kernel: [6222225.331530] {1}[Hardware Error]:  fru_text: B2
2024-08-31T13:00:06.606904+00:00 cloudvirt1048 kernel: [6222225.331532] {1}[Hardware Error]:   section_type: memory error
2024-08-31T13:00:07.285276+00:00 cloudvirt1048 kernel: [6222225.980983] {1}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
2024-08-31T13:00:07.285299+00:00 cloudvirt1048 kernel: [6222225.990201] {1}[Hardware Error]:   physical_address: 0x0000004d47b27e80
2024-08-31T13:00:07.285302+00:00 cloudvirt1048 kernel: [6222225.996988] {1}[Hardware Error]:   node:2 card:1 module:0 rank:0 bank:2 device:0 row:27756 column:1016 
2024-08-31T13:00:07.285303+00:00 cloudvirt1048 kernel: [6222226.006545] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
2024-08-31T13:00:07.285305+00:00 cloudvirt1048 kernel: [6222226.012638] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
2024-08-31T13:00:07.286912+00:00 cloudvirt1048 kernel: [6222226.021570] mce: [Hardware Error]: Machine check events logged
2024-08-31T13:00:07.286933+00:00 cloudvirt1048 kernel: [6222226.021588] EDAC skx MC2: HANDLING MCE MEMORY ERROR
2024-08-31T13:00:07.286935+00:00 cloudvirt1048 kernel: [6222226.021589] EDAC skx MC2: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
2024-08-31T13:00:07.286937+00:00 cloudvirt1048 kernel: [6222226.021591] EDAC skx MC2: TSC 0x0 
2024-08-31T13:00:07.286939+00:00 cloudvirt1048 kernel: [6222226.021592] EDAC skx MC2: ADDR 0x4d47b27e80 
2024-08-31T13:00:07.286940+00:00 cloudvirt1048 kernel: [6222226.021593] EDAC skx MC2: MISC 0x8c 
2024-08-31T13:00:07.286942+00:00 cloudvirt1048 kernel: [6222226.021593] EDAC skx MC2: PROCESSOR 0:0x50657 TIME 1725109207 SOCKET 0 APIC 0x0
2024-08-31T13:00:07.286948+00:00 cloudvirt1048 kernel: [6222226.021615] EDAC MC2: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4d47b27 offset:0xe80 grain:32 -  err_code:0x0000:0x009f ProcessorSocketId:0x1 MemoryControllerId:0x0 PhysicalRankId:0x0 Row:0x6c6c Column:0x3f8 Bank:0x2 BankGroup:0x2 retry_rd_err_log[0001a80f 00000000 00000002 04fea000 00006c6c] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
2024-08-31T13:00:07.286959+00:00 cloudvirt1048 kernel: [6222226.021760] Memory failure: 0x4d47b27: Sending SIGBUS to CPU 3/KVM:3433839 due to hardware memory corruption
2024-08-31T13:00:07.306679+00:00 cloudvirt1048 kernel: [6222226.031821] Memory failure: 0x4d47b27: Sending SIGBUS to CPU 3/KVM:3433839 due to hardware memory corruption
2024-08-31T13:00:07.306698+00:00 cloudvirt1048 kernel: [6222226.041834] Memory failure: 0x4d47b27: recovery action for dirty LRU page: Recovered
2024-08-31T13:00:08.210868+00:00 cloudvirt1048 kernel: [6222226.936880] Memory failure: 0x4d47b27: already hardware poisoned
2024-08-31T13:00:08.210880+00:00 cloudvirt1048 kernel: [6222226.943938] EDAC skx MC2: HANDLING MCE MEMORY ERROR
2024-08-31T13:00:08.210880+00:00 cloudvirt1048 kernel: [6222226.943952] EDAC skx MC2: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
2024-08-31T13:00:08.210881+00:00 cloudvirt1048 kernel: [6222226.943955] EDAC skx MC2: TSC 0x0
2024-08-31T13:00:08.210881+00:00 cloudvirt1048 kernel: [6222226.943955] EDAC skx MC2: ADDR 0x4d47b27e80
2024-08-31T13:00:08.210882+00:00 cloudvirt1048 kernel: [6222226.943957] EDAC skx MC2: MISC 0x8c
2024-08-31T13:00:08.210882+00:00 cloudvirt1048 kernel: [6222226.943958] EDAC skx MC2: PROCESSOR 0:0x50657 TIME 1725109208 SOCKET 0 APIC 0x0
2024-08-31T13:00:08.210883+00:00 cloudvirt1048 kernel: [6222226.943980] EDAC MC2: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4d47b27 offset:0xe80 grain:32 -  err_code:0x0000:0x009f ProcessorSocketId:0x1 MemoryControllerId:0x0 PhysicalRankId:0x0 Row:0x6c6c Column:0x3f8 Bank:0x2 BankGroup:0x2 retry_rd_err_log[0001a80f 00000000 00000002 04fea000 00006c6c] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
2024-08-31T13:00:09.379153+00:00 cloudvirt1048 neutron-openvswitch-agent: 2024-08-31 13:00:09.375 629348 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-24be7a57-c2e2-4c4b-ad51-d5431d414bab - - - - - -] Agent rpc_loop - iteration:242065 started
2024-08-31T13:00:10.011202+00:00 cloudvirt1048 neutron-openvswitch-agent: 2024-08-31 13:00:10.006 629348 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-24be7a57-c2e2-4c4b-ad51-d5431d414bab - - - - - -] Agent rpc_loop - iteration:242065 - starting polling. Elapsed:0.631
2024-08-31T13:00:10.011535+00:00 cloudvirt1048 kernel: [6222228.110301] Memory failure: 0x4d47b27: already hardware poisoned
2024-08-31T13:00:10.011566+00:00 cloudvirt1048 kernel: [6222228.741373] mce: Uncorrected hardware memory error in user-access at 4f37ae7e80
2024-08-31T13:00:10.011568+00:00 cloudvirt1048 kernel: [6222228.743627] EDAC skx MC2: HANDLING MCE MEMORY ERROR
2024-08-31T13:00:10.016778+00:00 cloudvirt1048 kernel: [6222228.751516] EDAC skx MC2: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
2024-08-31T13:00:10.016798+00:00 cloudvirt1048 kernel: [6222228.751523] EDAC skx MC2: TSC 0x0
2024-08-31T13:00:10.016799+00:00 cloudvirt1048 kernel: [6222228.751524] EDAC skx MC2: ADDR 0x4d47b27e80
2024-08-31T13:00:10.016800+00:00 cloudvirt1048 kernel: [6222228.751525] EDAC skx MC2: MISC 0x8c
2024-08-31T13:00:10.016801+00:00 cloudvirt1048 kernel: [6222228.751527] EDAC skx MC2: PROCESSOR 0:0x50657 TIME 1725109209 SOCKET 0 APIC 0x0
2024-08-31T13:00:10.016801+00:00 cloudvirt1048 kernel: [6222228.751558] EDAC MC2: 1 UE memory read error on CPU_SrcID#1_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4d47b27 offset:0xe80 grain:32 -  err_code:0x0000:0x009f ProcessorSocketId:0x1 MemoryControllerId:0x0 PhysicalRankId:0x0 Row:0x6c6c Column:0x3f8 Bank:0x2 BankGroup:0x2 retry_rd_err_log[0001a80f 00000000 00000002 04fea000 00006c6c] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
2024-08-31T13:00:10.016998+00:00 cloudvirt1048 neutron-openvswitch-agent: 2024-08-31 13:00:10.015 629348 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-24be7a57-c2e2-4c4b-ad51-d5431d414bab - - - - - -] Agent rpc_loop - iteration:242065 - port information retrieved. Elapsed:0.640
2024-08-31T13:00:10.017073+00:00 cloudvirt1048 neutron-openvswitch-agent: 2024-08-31 13:00:10.016 629348 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-24be7a57-c2e2-4c4b-ad51-d5431d414bab - - - - - -] Agent rpc_loop - iteration:242065 completed. Processed ports statistics: {'regular': {'added': 0, 'updated': 0, 'removed': 0}}. Elapsed:0.640
2024-08-31T13:00:10.019287+00:00 cloudvirt1048 kernel: [6222228.752115] Memory failure: 0x4f37ae7: Sending SIGBUS to CPU 1/KVM:417195 due to hardware memory
2024-08-31T13:42:39.962640+00:00 cloudvirt1048 systemd-modules-load[1005]: Inserted module 'br_netfilter'
2024-08-31T13:42:39.962847+00:00 cloudvirt1048 systemd-modules-load[1005]: Inserted module 'ipmi_devintf'
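
For anyone cross-checking the log above, here is a minimal sketch of how the numbers relate to each other. It assumes 4 KiB pages (the x86-64 base page size) and the IA32_MCi_STATUS flag layout from the Intel SDM; the function names are hypothetical helpers, not part of any existing tool.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages

# Flag bit positions in IA32_MCi_STATUS (Intel SDM layout).
MCI_STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60, "MISCV": 59,
    "ADDRV": 58, "PCC": 57, "S": 56, "AR": 55,
}


def split_physical_address(addr: int) -> tuple[int, int]:
    """Split a physical address into (page frame number, offset in page)."""
    return addr >> PAGE_SHIFT, addr & ((1 << PAGE_SHIFT) - 1)


def decode_mci_status(status: int) -> dict[str, bool]:
    """Return the state of each named flag bit in a machine check status word."""
    return {name: bool((status >> bit) & 1) for name, bit in MCI_STATUS_FLAGS.items()}


# ADDR from the mce line vs. "page:0x4d47b27 offset:0xe80" from the EDAC line.
page, offset = split_physical_address(0x4D47B27E80)
print(f"page={page:#x} offset={offset:#x}")

# The second failing address maps to the page named in the last SIGBUS line.
page2, _ = split_physical_address(0x4F37AE7E80)
print(f"second page={page2:#x}")

# APEI reports row/column in decimal, EDAC in hex; they are the same cell.
print(27756 == 0x6C6C, 1016 == 0x3F8)

# Bank 1 status from the first Machine Check Exception line.
flags = decode_mci_status(0xBD80000000100134)
print({name: value for name, value in flags.items() if value})
```

With UC, S and AR set and PCC clear, the status word falls in the class the SDM calls "software recoverable action required", which is consistent with the "event severity: recoverable" line and with the kernel killing only the affected KVM processes rather than panicking.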
dcaro renamed this task from NodeDown to 2024-08-31 cloudvirt1048 NodeDown. Aug 31 2024, 1:49 PM

Mentioned in SAL (#wikimedia-cloud) [2024-08-31T13:55:18Z] <andrewbogott> moving tools-redis-7 off of cloudvirt1048 just in case T373740

aborrero renamed this task from 2024-08-31 cloudvirt1048 NodeDown to 2024-08-31 cloudvirt1048 NodeDown because memory hardware error. Sep 18 2024, 2:12 PM
aborrero added a project: ops-eqiad.
aborrero added subscribers: VRiley-WMF, aborrero.

Hey @VRiley-WMF, could you please advise what we should do about the memory error on this server?

Is there an acceptable time to swap out the DIMM? We can proceed at any time.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-19T15:31:28Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.set_maintenance (T373740)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-19T15:32:05Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=0) (T373740)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-19T15:33:13Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1048.eqiad.wmnet' (T373740)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-19T15:46:48Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1048.eqiad.wmnet' (T373740)

The server has been drained and should be ready to go at any time. Thanks, @VRiley-WMF!

This DIMM (B2) has been swapped out. Please let us know if any other issue crops up.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-20T08:27:03Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T373740)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-20T08:27:12Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T373740)
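
As a rough sketch, the SAL timestamps above can be turned into durations for the drain and for the whole maintenance window. The timestamps are copied from the entries in this task; the helper name `parse_sal_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_sal_timestamp(stamp: str) -> datetime:
    """Parse a SAL-style UTC timestamp such as 2024-09-19T15:33:13Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps taken from the SAL entries in this task.
maintenance_set = parse_sal_timestamp("2024-09-19T15:31:28Z")
drain_start = parse_sal_timestamp("2024-09-19T15:33:13Z")
drain_end = parse_sal_timestamp("2024-09-19T15:46:48Z")
maintenance_unset = parse_sal_timestamp("2024-09-20T08:27:12Z")

drain_duration = drain_end - drain_start
maintenance_window = maintenance_unset - maintenance_set

print(f"drain took {drain_duration}")
print(f"host was in maintenance for {maintenance_window}")
```

The maintenance window here only covers the span between the set_maintenance and unset_maintenance cookbook runs; it says nothing about when the DIMM swap itself happened within that span.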

aborrero added a parent task: Unknown Object (Task). Sep 20 2024, 8:29 AM