
Memory issue on elastic1063 caused elasticsearch to be killed
Open, High, Public

Description

Elasticsearch was killed due to a hardware error:

[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]: event severity: corrected
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:  fru_text: A8
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   section_type: memory error
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   physical_address: 0x0000001c789ba600
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 656 
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]: event severity: corrected
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]:  Error 0, type: corrected
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Fri Oct  9 06:55:00 2020] MCE: Killing elasticsearch[e:3361 due to hardware memory corruption fault at 7ff100d68000

@elukey checked Dell's DRAC and didn't see any pressing error reports, then restarted elasticsearch there to see if the problem occurs again. A memory test might still be worth running.

Event Timeline

dcausse created this task. Oct 9 2020, 7:24 AM
Restricted Application added a subscriber: Aklapper. Oct 9 2020, 7:24 AM
Gehel triaged this task as High priority. Oct 12 2020, 3:26 PM
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

It happened again today:

[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: event severity: corrected
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:  Error 0, type: corrected
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:  fru_text: A8
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   section_type: memory error
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   physical_address: 0x0000001c789bd200
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 832 
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]: event severity: corrected
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:  Error 0, type: corrected
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:  Error 1, type: corrected
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Tue Oct 13 18:15:27 2020] mce: [Hardware Error]: Machine check events logged
[Tue Oct 13 19:09:26 2020] MCE: Killing elasticsearch[e:194372 due to hardware memory corruption fault at 7ed5bbb8c000
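
To check whether the two corrected errors point at the same spot in memory, the location lines from the two excerpts above can be compared field by field. This is only a minimal sketch: the helper names `parse_location` and `differing_fields` are made up for illustration, and the input lines are copied by hand from the logs rather than read from the host.

```python
import re

# Location lines copied verbatim from the two kernel log excerpts above.
LOCATION_LINES = [
    "[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 656 ",
    "[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 832 ",
]

# Matches each "name: number" pair of the memory error location.
FIELD_RE = re.compile(r"(node|card|module|rank|bank|device|row|column): (\d+)")


def parse_location(line):
    """Return the location fields of one log line as a dict of ints (hypothetical helper)."""
    return {name: int(value) for name, value in FIELD_RE.findall(line)}


def differing_fields(locations):
    """Return the names of the fields that are not identical across all locations."""
    keys = locations[0].keys()
    return sorted(k for k in keys if len({loc[k] for loc in locations}) > 1)


locations = [parse_location(line) for line in LOCATION_LINES]
print(differing_fields(locations))  # prints ['column']
```

Only the column differs. Both events report the same node, card, module, rank, bank, device and row (41350), with fru_text A8 in both cases, which would be consistent with a single faulty DIMM rather than unrelated errors.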

Mentioned in SAL (#wikimedia-operations) [2020-11-03T17:45:44Z] <cmjohnson1> shutting elastic1063 down to reseat DIMM T265113

Cmjohnson closed this task as Resolved. Tue, Nov 3, 5:57 PM

I reseated all the DIMMs, and there were several. I am not getting any Dell h/w errors. Hopefully, the reseat and flea power drain will correct the issue. I am resolving this task. If the problem persists, please re-open and tag me.

dcausse reopened this task as Open. Fri, Nov 6, 10:05 AM

@Cmjohnson thanks for the intervention!
But sadly it happened again today:

[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]: event severity: corrected
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:  fru_text: A8
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   section_type: memory error
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   physical_address: 0x0000001c789b9680
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 600 
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]: event severity: corrected
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]:  Error 0, type: corrected
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Fri Nov  6 09:49:51 2020] MCE: Killing elasticsearch[e:29690 due to hardware memory corruption fault at 7f24a45e6000
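
For reference, the gap between each corrected ECC report and the following "MCE: Killing" line can be computed from the timestamps in the three excerpts in this task. This is a rough sketch: the timestamp pairs are copied by hand from the logs, `delay_seconds` is a made-up helper name, and the timestamps are taken as printed, so they are only as accurate as the kernel log's wall-clock conversion.

```python
from datetime import datetime

# (corrected-error timestamp, kill timestamp) pairs copied from the three excerpts.
EVENTS = [
    ("Fri Oct  9 06:54:30 2020", "Fri Oct  9 06:55:00 2020"),
    ("Tue Oct 13 18:14:55 2020", "Tue Oct 13 19:09:26 2020"),
    ("Fri Nov  6 09:49:40 2020", "Fri Nov  6 09:49:51 2020"),
]

# Whitespace in a strptime format matches one or more spaces,
# so the padded day in "Oct  9" parses without special handling.
FMT = "%a %b %d %H:%M:%S %Y"


def delay_seconds(corrected, killed):
    """Seconds between the corrected-error report and the kill (hypothetical helper)."""
    delta = datetime.strptime(killed, FMT) - datetime.strptime(corrected, FMT)
    return int(delta.total_seconds())


delays = [delay_seconds(corrected, killed) for corrected, killed in EVENTS]
print(delays)  # prints [30, 3271, 11]
```

In two of the three events elasticsearch was killed within 30 seconds of the corrected error. In the Oct 13 event the kill came about 54 minutes later.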

Mentioned in SAL (#wikimedia-operations) [2020-11-06T10:06:40Z] <dcausse> restarted elastic on elastic1063 (T265113)

Thanks, @dcausse. Still no h/w error in iDRAC. A ticket with Dell will need to be created; the server is under warranty.