
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047
Closed, Invalid (Public)

Description

Alert manager just now reported kernel errors on cloudvirt1047. For once, this is not the result of a recent reboot, as the host has been up for 4 days.

[385127.260440] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[385127.260447] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[385127.260450] {1}[Hardware Error]: event severity: corrected
[385127.260454] {1}[Hardware Error]:  Error 0, type: corrected
[385127.260457] {1}[Hardware Error]:  fru_text: A11
[385127.260460] {1}[Hardware Error]:   section_type: memory error
[385127.260463] {1}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[385127.260469] {1}[Hardware Error]:   physical_address: 0x00000013d001af00
[385127.260478] {1}[Hardware Error]:   node:1 card:1 module:1 rank:1 bank:0 device:16 row:38160 column:176 
[385127.260482] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[385127.260488] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[385127.260505] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[385127.260509] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[385127.260513] {2}[Hardware Error]: event severity: corrected
[385127.260517] {2}[Hardware Error]:  Error 0, type: corrected
[385127.260522] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[385127.260527] {2}[Hardware Error]:   section length: 0x38
[385127.260536] {2}[Hardware Error]:   00000000: 01010001 00000000 d001a000 00000013  ................
[385127.260543] {2}[Hardware Error]:   00000010: 00001000 00000000 d001afff 00000013  ................
[385127.260549] {2}[Hardware Error]:   00000020: 00000080 00000000 00000000 00000000  ................
[385127.260553] {2}[Hardware Error]:   00000030: 00000000 00000000                    ........
[385127.266397] mce: [Hardware Error]: Machine check events logged
[385127.266414] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[385127.266417] EDAC skx MC1: CPU 0: Machine Check Event: 0x0 Bank 255: 0x9c0000000000009f
[385127.266423] EDAC skx MC1: TSC 0x0 
[385127.266426] EDAC skx MC1: ADDR 0x13d001af00 
[385127.266428] EDAC skx MC1: MISC 0x8c 
[385127.266431] EDAC skx MC1: PROCESSOR 0:0x50657 TIME 1739254674 SOCKET 0 APIC 0x0
[385127.266455] EDAC MC1: 0 CE memory read error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0x13d001a offset:0xf00 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f ProcessorSocketId:0x0 MemoryControllerId:0x1 PhysicalRankId:0x1 Row:0x9510 Column:0xb0 Bank:0x0 BankGroup:0x0 retry_rd_err_log[0001a20d 00000000 00000020 042c0150 00009510] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[385559.196269] Process accounting resumed
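The APEI and EDAC records in the log above describe the same event in different notations (decimal vs. hex, full address vs. page and offset). A minimal sketch for cross-checking them follows. All values are copied from the log; the 4 KiB page size and the reading of the unknown section's words as little-endian 64-bit start/end addresses are assumptions, since the ticket does not document that section's layout.

```python
# Values copied from the kernel log in this ticket.
physical_address = 0x00000013D001AF00   # APEI physical_address / EDAC ADDR
apei_row, apei_column = 38160, 176      # APEI prints these in decimal
edac_row, edac_column = 0x9510, 0xB0    # EDAC prints these in hex

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Split the physical address the way the EDAC summary line does.
page = physical_address >> PAGE_SHIFT
offset = physical_address & ((1 << PAGE_SHIFT) - 1)
print(f"page:0x{page:x} offset:0x{offset:x}")  # page:0x13d001a offset:0xf00

# Both records should name the same DRAM row and column.
same_cell = (apei_row == edac_row) and (apei_column == edac_column)
print("row/column match:", same_cell)

# First two dump lines of the unknown section, as printed (32-bit words).
words = [
    0x01010001, 0x00000000, 0xD001A000, 0x00000013,
    0x00001000, 0x00000000, 0xD001AFFF, 0x00000013,
]
# Assumption: consecutive words form little-endian 64-bit values (low word first).
quads = [lo | (hi << 32) for lo, hi in zip(words[0::2], words[1::2])]
range_start, range_end = quads[1], quads[3]
in_range = range_start <= physical_address <= range_end
print(f"0x{range_start:x}..0x{range_end:x} contains address:", in_range)
```

Under those assumptions, the second record appears to describe the 4 KiB page that contains the faulting address, which is consistent with both records belonging to a single corrected error rather than two separate faults.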

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-11T11:24:03Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1047' (T386083)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-11T11:24:31Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1047' (T386083)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-11T11:32:21Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1047.eqiad.wmnet' (T386083)

It has been corrected by h/w and requires no further action

Does this mean we should not worry about it? Or does it still indicate some underlying problem?

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-11T11:47:27Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1047.eqiad.wmnet' (T386083)

It has been corrected by h/w and requires no further action

Does this mean we should not worry about it? Or does it still indicate some underlying problem?

I don't know. It sounds to me like the sign of an underlying/developing issue but I'll try to get a second opinion.

Browsing Stack Overflow suggests that this is likely a sign of an impending hardware issue.
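One way to tell a one-off glitch from a developing fault is to count how often each DIMM shows up in corrected-error lines over time. The sketch below does that for kernel log text; the function name `count_corrected_errors` and the regular expression are illustrative, written against the one EDAC summary line quoted in this ticket, and may need adjusting for other drivers or kernel versions.

```python
import re
from collections import Counter

# Matches EDAC summary lines shaped like the one in this ticket, e.g.
# "EDAC MC1: 0 CE memory read error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 ..."
CE_LINE = re.compile(r"EDAC MC(?P<mc>\d+): \d+ CE .*? on (?P<label>\S+)")


def count_corrected_errors(log_text: str) -> Counter:
    """Count corrected-error (CE) log lines per DIMM label."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CE_LINE.search(line)
        if match:
            counts[match.group("label")] += 1
    return counts


# Two lines taken from the log above (the second one shortened).
sample = (
    "[385127.266431] EDAC skx MC1: PROCESSOR 0:0x50657 TIME 1739254674 SOCKET 0 APIC 0x0\n"
    "[385127.266455] EDAC MC1: 0 CE memory read error on "
    "CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0x13d001a offset:0xf00)\n"
)

ce_counts = count_corrected_errors(sample)
for label, count in ce_counts.items():
    print(label, count)
```

Fed the output of `dmesg` or `journalctl -k`, a count that keeps growing for the same DIMM label would support the "developing issue" reading; a single event, as here, does not settle it either way.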

Change #1119123 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] kernel-messages: add category=keyword_error

https://gerrit.wikimedia.org/r/1119123

Change #1119123 merged by FNegri:

[operations/puppet@production] kernel-messages: add category=keyword_error

https://gerrit.wikimedia.org/r/1119123

After looking into this, it seems it was a small glitch with the memory that has been corrected by the ECC. I logged into the unit and everything seemed fine. Closing this for now, but if it comes up again, we will swap out the DIMM.
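To check whether the error "comes up again" without waiting for an alert, the EDAC counters that the kernel exposes under sysfs can be read directly. This is a minimal sketch, assuming the EDAC driver is loaded on the host and exposes the usual `ce_count` and `ue_count` files per memory controller; the function name `read_edac_counts` is illustrative, and the counters are not expected to persist across reboots.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    The root is a parameter so the function can be pointed at a copy of
    the tree for testing.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts
```

Calling `read_edac_counts()` on the host and comparing the `ce_count` for `mc1` (the controller named in the log) between two points in time would show whether corrected errors are still accumulating.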

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-20T17:25:16Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T386083)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-20T17:25:25Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T386083)

Andrew changed the task status from Resolved to Invalid. Feb 20 2025, 5:25 PM

I'm putting this host back in service and closing the ticket for now. Will follow up if it continues to alert.