Memory issue on elastic1063 caused elasticsearch to be killed
Closed, Resolved (Public)

Description

Elasticsearch has been killed due to a hardware error:

[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]: event severity: corrected
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:  fru_text: A8
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   section_type: memory error
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   physical_address: 0x0000001c789ba600
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 656 
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Fri Oct  9 06:54:30 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]: event severity: corrected
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]:  Error 0, type: corrected
[Fri Oct  9 06:54:30 2020] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Fri Oct  9 06:55:00 2020] MCE: Killing elasticsearch[e:3361 due to hardware memory corruption fault at 7ff100d68000

@elukey checked Dell's DRAC and didn't see any pressing error reports, then restarted elasticsearch there to see if the problem occurs again, but a memory test might be worth running.

Event Timeline

Gehel triaged this task as High priority. Oct 12 2020, 3:26 PM
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

happened again today:

[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: event severity: corrected
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:  Error 0, type: corrected
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:  fru_text: A8
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   section_type: memory error
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   physical_address: 0x0000001c789bd200
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 832 
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[Tue Oct 13 18:14:55 2020] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]: event severity: corrected
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:  Error 0, type: corrected
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:  Error 1, type: corrected
[Tue Oct 13 18:14:55 2020] {4}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Tue Oct 13 18:15:27 2020] mce: [Hardware Error]: Machine check events logged
[Tue Oct 13 19:09:26 2020] MCE: Killing elasticsearch[e:194372 due to hardware memory corruption fault at 7ed5bbb8c000

Mentioned in SAL (#wikimedia-operations) [2020-11-03T17:45:44Z] <cmjohnson1> shutting elastic1063 down to reseat DIMM T265113

I reseated all the DIMMs (there were several). I am not getting any Dell h/w errors. Hopefully, the reseat and flea power drain will correct the issue. I am resolving this task. If the problem persists, please re-open and tag me.

@Cmjohnson thanks for the intervention!
But sadly it happened again today:

[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]: event severity: corrected
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:  fru_text: A8
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   section_type: memory error
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   physical_address: 0x0000001c789b9680
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 600 
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]: event severity: corrected
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]:  Error 0, type: corrected
[Fri Nov  6 09:49:40 2020] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Fri Nov  6 09:49:51 2020] MCE: Killing elasticsearch[e:29690 due to hardware memory corruption fault at 7f24a45e6000

Mentioned in SAL (#wikimedia-operations) [2020-11-06T10:06:40Z] <dcausse> restarted elastic on elastic1063 (T265113)

Thanks, @dcausse. Still no h/w error in iDRAC. A ticket with Dell will need to be created; the server is under warranty.

Gehel added a subscriber: Gehel.

@Cmjohnson did you receive any news from Dell?

@dcausse I am sorry, no. I forgot to put a ticket in with them. I will do that today. Thanks.

@Cmjohnson Just checking in here - I think when we left off, a ticket was going to be created with Dell for the hardware memory corruption issues w/ the DIMMs since the server is under warranty. Any news on that front?

The issue that Dell has with this is that we cannot determine which DIMM has failed. The hardware logs all look good and do not indicate an error. Their question to me is: which DIMM are you going to replace? I do not have an answer for that. The output you posted earlier doesn't specify a DIMM number. Is there a way to determine which DIMM location is the one that is failing?

I am attaching the TSR report so you will see none of the h/w logs suggest there is an issue.

As far as I understand it, it's not possible for the Linux kernel to map a physical address back to a single DIMM. It just doesn't have the information. memtest86 has the same issue. Quoting the memtest86 docs:

The memory controller then decodes this memory address to identify the specific channel, DIMM, rank, DRAM chip, bank, row and column in DRAM using a chipset-specific address decoding scheme.

The short of it is, I think the only way to know which DIMM is bad is for Dell to give us software that knows whatever chipset-specific thing their hardware is doing.

I suppose an alternate method would be a binary search: remove half the memory and let it burn memtest86 for 24h or some such. If it fails, remove half again. If it succeeds, try the other half. Sounds tedious, and may take significant time to actually get an answer. I wouldn't really suggest it.
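For illustration only, the halving procedure described above can be sketched in Python. The function name find_bad_dimm and the fails_with callback are hypothetical; fails_with stands in for "boot with only these DIMMs installed and run a long memory test". The sketch assumes exactly one faulty DIMM, a test that reliably catches it, and that the platform allows booting with arbitrary subsets of slots populated, none of which is guaranteed here given how far apart the errors were.

```python
from typing import Callable, List, Sequence


def find_bad_dimm(slots: Sequence[str],
                  fails_with: Callable[[Sequence[str]], bool]) -> str:
    """Bisect a list of DIMM slots, assuming exactly one is faulty.

    fails_with(subset) is a placeholder for a (slow) memory test run with
    only `subset` installed; it returns True if errors were observed.
    """
    candidates: List[str] = list(slots)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # One test run per iteration: keep whichever half holds the fault.
        candidates = half if fails_with(half) else candidates[len(half):]
    return candidates[0]


# Slot labels as used in the TSR: A1-10 and B1-6.
slots = [f"A{i}" for i in range(1, 11)] + [f"B{i}" for i in range(1, 7)]

runs = []

def simulated_test(subset):
    runs.append(list(subset))
    return "A8" in subset  # pretend A8 is the faulty module


print(find_bad_dimm(slots, simulated_test), "found after", len(runs), "test runs")
```

With 16 slots this needs four test runs, so at 24h per run it is roughly four days of downtime in the best case.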

Good news and bad news: Dell dispatched a new DIMM. The bad news is that we do not know which one to replace, and it could take some time to figure that out.

@EBernhardson The DIMM arrived. Let me know when it's safe to take the server down. I am going to have to reduce the memory to the minimum needed to run the server and work our way up from there.

Mentioned in SAL (#wikimedia-operations) [2021-01-28T17:28:53Z] <ebernhardson> ban elastic1063 from production-search-omega-eqiad and production-search-eqiad T265113

The node is safe to take down any time. The ban I've put in place will also prevent it from attempting to re-join the cluster.

@EBernhardson we do not have utilities such as dmidecode or edac-util installed. Having a list of the physical addresses of the installed DIMMs might help narrow down which one has an error. We have three error addresses, most likely on the same DIMM:
physical_address: 0x0000001c789b9680
physical_address: 0x0000001c789bd200
physical_address: 0x0000001c789ba600
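As a rough check of the "most likely same DIMM" guess, a few lines of Python show how close together these three addresses are. This is only arithmetic on the values above; it does not map an address to a slot, since that depends on the memory controller's address decoding.

```python
# Physical addresses copied from the three error reports above.
addresses = [0x0000001c789b9680, 0x0000001c789bd200, 0x0000001c789ba600]

# Distance between the lowest and highest failing address.
span = max(addresses) - min(addresses)

# Collect every bit position in which any address differs from the first.
differing_bits = 0
for addr in addresses:
    differing_bits |= addr ^ addresses[0]

# Everything above the highest differing bit is shared by all addresses.
window_bits = differing_bits.bit_length()
common_base = addresses[0] & ~((1 << window_bits) - 1)

print(f"span between lowest and highest: {span} bytes")
print(f"all addresses share base {common_base:#x} "
      f"within a {2 ** window_bits // 1024} KiB aligned window")
```

All three fall within about 15 KiB of each other, which fits with the kernel reporting the same rank, bank, device and row (41350) each time and only the column changing.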

we do not have utilities installed like dmidecode or edac-util

@Jclark-ctr dmidecode is installed, it's in sbin so only visible through sudo.

sudo dmidecode -t memory output for elastic1063: P14042

If I had to guess, the following bit repeated in each failure is probably telling us exactly which module, but I'm not sure how to translate this back into the physical world. dmidecode is giving us information that probably does allow alignment to physical chips, but I'm not sure how to align this error with dmidecode output either.

node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9

Change 659451 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] elasticsearch: include edac-util (ecc userland)

https://gerrit.wikimedia.org/r/659451

Change 659451 merged by Ryan Kemper:
[operations/puppet@production] elasticsearch: include edac-util (ecc userland)

https://gerrit.wikimedia.org/r/659451

Change 659455 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] elasticsearch: missing s at end of edac-utils

https://gerrit.wikimedia.org/r/659455

Change 659455 merged by Ryan Kemper:
[operations/puppet@production] elasticsearch: missing s at end of edac-utils

https://gerrit.wikimedia.org/r/659455

@Jclark-ctr In addition to Erik's point above about dmidecode being installed, we just deployed a patch to install edac-util on all Elasticsearch systems (this includes logstash*, cloudelastic* btw). So edac-util is now available for use

Hi @Jclark-ctr - can you confirm the firmware/BIOS/iDRAC is all updated? I have an email queued up to send to our technical Dell rep on this, but we should make sure it's all updated on the host first, to see if it fixes the issue or if it helps show anything new on the TSR report. Thanks, Willy

@RKemper

dmidecode -t 20 would be very useful for tracing the physical address of memory, but we are unsure why it does not return any information. Would you be able to assist?

Regarding -t 20, dmidecode reports SMBIOS 3.2 present. Per the spec in section 6.2 they list the required structures, and Type 20 is not among them. It suggests our bios decided not to implement this feature.

Additionally, it seems this has happened again today and yet edac-utils claims no errors. dmesg looks fairly complete, this is the first error of this type it contains since Nov 25.

ebernhardson@elastic1063:~$ sudo dmesg -T | tail -n 25
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]: event severity: corrected
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:  Error 0, type: corrected
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:  fru_text: A8
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   section_type: memory error
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   physical_address: 0x0000001c789bca00
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 800
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]: event severity: corrected
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:  Error 0, type: corrected
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:  Error 1, type: corrected
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:  Error 2, type: corrected
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:  Error 3, type: corrected
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:  Error 4, type: corrected
[Thu Feb  4 15:06:31 2021] {10}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Thu Feb  4 15:07:39 2021] mce: [Hardware Error]: Machine check events logged
ebernhardson@elastic1063:~$ sudo edac-util -r
edac-util: No errors to report.

The logical conclusion seems to be that these errors are coming from something other than EDAC (ECC). Poking around at suspicious things, all of these errors have fru_text: A8. I spent some time digging around, and my best guess is that FRU is the Field Replaceable Unit Information Storage for IPMI. The TSR posted earlier labels the RAM as A1-10 and B1-6. Would it be crazy to guess A8, serial 43104508, is our culprit?
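To make that comparison less manual, here is a small Python sketch that tallies fru_text values and pulls the node/card/module/... fields out of dmesg output. The sample lines are copied from the logs in this task; the regexes and the summarize helper are my own and only cover the line formats seen here.

```python
import re
from collections import Counter

# A few lines copied from the dmesg output in this task.
SAMPLE = """\
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:  fru_text: A8
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   physical_address: 0x0000001c789bca00
[Thu Feb  4 15:06:31 2021] {9}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 800
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:  fru_text: A8
[Fri Nov  6 09:49:40 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 1 rank: 1 bank: 2 device: 9 row: 41350 column: 600
"""

FRU_RE = re.compile(r"\[Hardware Error\]:\s+fru_text:\s*(\S+)")
LOC_RE = re.compile(r"(node|card|module|rank|bank|device|row|column):\s*(\d+)")


def summarize(dmesg_text):
    """Return (Counter of fru_text values, list of location dicts)."""
    frus = Counter(FRU_RE.findall(dmesg_text))
    locations = []
    for line in dmesg_text.splitlines():
        if "[Hardware Error]" in line and "node:" in line:
            locations.append({k: int(v) for k, v in LOC_RE.findall(line)})
    return frus, locations


frus, locations = summarize(SAMPLE)
print(frus.most_common())
for loc in locations:
    print(loc)
```

On a live host the input would come from `sudo dmesg -T` instead of SAMPLE. If every report carries the same fru_text and the same module/rank/bank/device, that supports treating it as a single module, though it still doesn't prove fru_text maps to the slot label.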

@EBernhardson would you be available Monday to swap the RAM? Since it has been over 90 days since the error, we can change the memory.

@Jclark-ctr yes, I'm available any time after 11 AM PST (19:00 UTC) Monday.

Forgot Monday is a holiday. Tuesday at 11 AM PST?

Ooh, holiday! I forgot about that too. Yeah, Tuesday will work.

Replaced DIMM A8. If the error returns, I recommend running:

edac-util -r

Mentioned in SAL (#wikimedia-operations) [2021-02-24T22:09:16Z] <ryankemper> T265113 Unbanned elastic1063 from both Elasticsearch clusters (production-search-eqiad and production-search-omega-eqiad)

Commands used to unban elastic1063:

curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": null,"_name": null}}}'
curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": null,"_name": null}}}'
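For reference, the same request body can be built from Python. This is a sketch, not what was actually run: exclusion_payload and put_cluster_settings are hypothetical helpers, and the "elastic1063*" pattern in the ban example is an assumption, since the exact ban command isn't recorded in this task. The setting keys and ports are the ones from the curl commands above.

```python
import json
import urllib.request


def exclusion_payload(host=None, name=None):
    """Build the transient allocation-exclude body; None serializes to
    JSON null, which clears the setting (i.e. unbans)."""
    return json.dumps({
        "transient": {
            "cluster.routing.allocation.exclude": {"_host": host, "_name": name}
        }
    })


def put_cluster_settings(base_url, body):
    """PUT the body to <base_url>/_cluster/settings and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Unban: equivalent to the body sent by the curl commands above.
print(exclusion_payload())

# Ban (assumed pattern, for illustration only).
print(exclusion_payload(host="elastic1063*"))

# To actually apply it, once per cluster (ports as in the curl commands):
# put_cluster_settings("http://localhost:9200", exclusion_payload())
# put_cluster_settings("http://localhost:9400", exclusion_payload())
```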