elastic2054 is having H/W issues
Closed, Resolved (Public)

Description

Seen in dmesg:

[Wed May 18 06:44:27 2022] {21}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Wed May 18 06:44:27 2022] {21}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed May 18 06:44:27 2022] {21}[Hardware Error]: event severity: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 0, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 1, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 2, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 3, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 4, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 5, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 6, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 7, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 8, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 9, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:  Error 10, type: corrected
[Wed May 18 06:44:27 2022] {21}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Wed May 18 06:44:27 2022] {22}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Wed May 18 06:44:27 2022] {22}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed May 18 06:44:27 2022] {22}[Hardware Error]: event severity: corrected
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:  Error 0, type: corrected
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:  fru_text: A2
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   section_type: memory error
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   error_status: 0x0000000000000400
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   physical_address: 0x000000093e57f3c0
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   node: 0 card: 1 module: 0 rank: 1 bank: 0 row: 7963 column: 1000 
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   error_type: 2, single-bit ECC
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Wed May 18 06:44:28 2022] MCE: Killing elasticsearch[e:187641 due to hardware memory corruption fault at 7fc60e3c9000
[Wed May 18 06:46:24 2022] mce: [Hardware Error]: Machine check events logged
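To pull the affected DIMM out of output like the above without reading every line, the records can be grouped by the `{N}` sequence number. The following is only a rough sketch: the function name `parse_memory_errors`, the `FIELDS` set, and the assumption that one sequence number corresponds to one event within the scanned text are illustrative choices, not part of any existing tool.

```python
import re

# Matches lines such as "[...] {22}[Hardware Error]:   fru_text: A2".
LINE_RE = re.compile(r"\{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<body>.*)$")

# Keys kept from each record (an illustrative selection, not exhaustive).
FIELDS = {"event severity", "fru_text", "section_type", "physical_address", "error_type"}


def parse_memory_errors(dmesg_text):
    """Group [Hardware Error] lines by sequence number; return memory-error records."""
    records = {}
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # not a numbered hardware-error line
        record = records.setdefault(int(match.group("seq")), {})
        key, sep, value = match.group("body").strip().partition(": ")
        if sep and key in FIELDS:
            record[key] = value.strip()
    # Sections the kernel could not decode ("section type: unknown") are dropped here.
    return [r for r in records.values() if r.get("section_type") == "memory error"]


SAMPLE = """\
[Wed May 18 06:44:27 2022] {22}[Hardware Error]: event severity: corrected
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:  fru_text: A2
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   section_type: memory error
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   physical_address: 0x000000093e57f3c0
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:   error_type: 2, single-bit ECC
"""

for rec in parse_memory_errors(SAMPLE):
    print(rec.get("fru_text"), rec.get("physical_address"), rec.get("error_type"))
```

Counting how often the same `fru_text` value turns up over time would be one way to tell a one-off corrected error from a DIMM that is degrading.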

Event Timeline

dcausse renamed this task from elasticsearch2054 is having H/W issues to elastic2054 is having H/W issues. May 18 2022, 9:36 AM
dcausse added a project: ops-codfw.

Mentioned in SAL (#wikimedia-operations) [2022-05-18T09:46:43Z] <dcausse> T308647: banning elastic2054 from production-search-psi-codfw and elastic2054-production-search-codfw

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Backlog to Acknowledged on the SRE board.
Marostegui added subscribers: Papaul, Marostegui.

The iDRAC log doesn't show anything relevant:

/admin1/system1/logs1/log1-> show record1

	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record2

	properties
		CreationTimestamp = 20190813151903.000000-300
		ElementName = System Event Log Entry
		RecordData = Log cleared.
		RecordFormat = string Description
		RecordID = 1
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record3

	properties
		CreationTimestamp = 20210223145620.000000-360
		ElementName = System Event Log Entry
		RecordData = Drive 0 is removed from disk drive bay 1.
		RecordFormat = string Description
		RecordID = 2
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1->

But dmesg suggests there may be memory issues on that host.
@Papaul, thoughts? Maybe DIMM A2 is about to fail?
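For reference, the `CreationTimestamp` values in the iDRAC records above appear to use the CIM datetime layout (`yyyymmddHHMMSS.mmmmmm` followed by a signed UTC offset in minutes), which makes it easy to confirm that both entries predate the May 2022 errors. A minimal sketch under that assumption; the helper name `parse_cim_datetime` is hypothetical:

```python
from datetime import datetime, timedelta, timezone


def parse_cim_datetime(value):
    """Parse a timestamp like '20190813151903.000000-300' into an aware datetime.

    Assumes the CIM datetime layout: 21 characters of date/time, one sign
    character, then a three-digit offset from UTC expressed in minutes.
    """
    stamp, sign, minutes = value[:21], value[21], int(value[22:25])
    offset = timedelta(minutes=minutes if sign == "+" else -minutes)
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    return naive.replace(tzinfo=timezone(offset))


for raw in ("20190813151903.000000-300", "20210223145620.000000-360"):
    print(raw, "->", parse_cim_datetime(raw).astimezone(timezone.utc).isoformat())
```

Under this reading, the two records date from August 2019 and February 2021, so neither relates to the memory errors logged on May 18 2022.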

@Marostegui I don't see anything on my end either. Maybe it was just a temporary memory issue. The server is showing all 128G of RAM. We can close the task; if we see the issue again, we can try swapping that DIMM with another one.
Thanks.

[Wed May 18 06:44:27 2022] {22}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed May 18 06:44:27 2022] {22}[Hardware Error]: event severity: corrected
[Wed May 18 06:44:27 2022] {22}[Hardware Error]:  Error 0, type: corrected

Mentioned in SAL (#wikimedia-operations) [2022-05-23T18:25:16Z] <ryankemper> T308647 Bringing elastic2054 back into service: ryankemper@elastic2054:~$ sudo pool (it's not currently banned from cluster so nothing to do there)

Thanks for looking into this, all. I've brought the host back into service and will reopen the ticket if problems re-surface, but for now things look good.

Mentioned in SAL (#wikimedia-operations) [2022-06-02T19:08:26Z] <ryankemper> T305646 T308647 Unbanned elastic2033 and elastic2054 from clusters; also pooled elastic2033