db2127 memory errors
Closed, Resolved, Public

Description

db2127 logged some memory errors:

[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]: event severity: corrected
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:  fru_text: A2
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   section_type: memory error
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   physical_address: 0x0000000c41947c00
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 0 rank: 0 bank: 2 device: 2 row: 13969 column: 904
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[Sun Aug 30 17:03:20 2020] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
[Sun Aug 30 17:03:20 2020] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Aug 30 17:03:20 2020] {2}[Hardware Error]: event severity: corrected
[Sun Aug 30 17:03:20 2020] {2}[Hardware Error]:  Error 0, type: corrected
[Sun Aug 30 17:03:20 2020] {2}[Hardware Error]:   section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
[Sun Aug 30 17:04:55 2020] mce: [Hardware Error]: Machine check events logged

There are no HW logs on the iDRAC, so I am not sure if we can get a replacement for this DIMM with only this syslog entry.
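
For reference, a minimal sketch of how the fields in the kernel log above could be pulled out when comparing them against what the controller reports. It assumes the log lines are already available as plain strings. The helper name `parse_ghes_records` and the `sample` list are hypothetical, made up for illustration, and not part of any tooling used on this host.

```python
import re

# Matches the "{N}[Hardware Error]:" prefix the kernel puts on each record line.
GHES_LINE = re.compile(r"\{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<body>.*)$")


def parse_ghes_records(lines):
    """Group "key: value" fields by their {N} record number (hypothetical helper)."""
    records = {}
    for line in lines:
        match = GHES_LINE.search(line)
        if not match:
            continue  # e.g. the trailing "mce: [Hardware Error]" line has no {N}
        body = match.group("body").strip()
        key, sep, value = body.partition(": ")
        if sep:  # lines without "key: value" (free-text notices) are skipped
            records.setdefault(int(match.group("seq")), {})[key] = value
    return records


# A few lines copied from the log in the description.
sample = [
    "[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]: event severity: corrected",
    "[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:  fru_text: A2",
    "[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   physical_address: 0x0000000c41947c00",
    "[Sun Aug 30 17:03:20 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC",
    "[Sun Aug 30 17:04:55 2020] mce: [Hardware Error]: Machine check events logged",
]

records = parse_ghes_records(sample)
print(records[1]["fru_text"], records[1]["error_type"])
```

With the sample lines this yields a single record (number 1) whose `fru_text` is `A2`, the slot named in the log.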

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui moved this task from Triage to In progress on the DBA board.

The log says "It has been corrected by h/w and requires no further action", so I don't think this will be enough to get the memory replaced: it is not reporting an active error, only that there was an error and it was corrected. I recommend first upgrading the firmware and seeing whether the new firmware catches the DIMM error and logs it.

Excellent, makes sense @Papaul
Right now it is not a good moment to depool an s3 host due to some on-going investigations. I will ping you once we are ready to depool this host and get it upgraded.
Thank you

@Papaul let me know which day next week you'd like me to put this host down so you can proceed with the firmware/bios upgrades
Thank you!

@Marostegui any day that works for you works for me as well

Thank you @Papaul - I will have it ready by Monday

Mentioned in SAL (#wikimedia-operations) [2020-09-21T08:47:30Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2127 T262247', diff saved to https://phabricator.wikimedia.org/P12680 and previous config saved to /var/cache/conftool/dbconfig/20200921-084730-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T08:47:59Z] <marostegui> Stop MySQL on db2127 for on-site maintenance - T262247

@Papaul db2127 is now off, you can proceed with the upgrades whenever you want

Mentioned in SAL (#wikimedia-operations) [2020-09-21T15:39:23Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12693 and previous config saved to /var/cache/conftool/dbconfig/20200921-153923-root.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T15:54:26Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2127 (re)pooling @ 50%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12694 and previous config saved to /var/cache/conftool/dbconfig/20200921-155426-root.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T16:09:30Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12695 and previous config saved to /var/cache/conftool/dbconfig/20200921-160929-root.json

Mentioned in SAL (#wikimedia-operations) [2020-09-21T16:24:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Slowly repool after on-site maintenance T262247 ', diff saved to https://phabricator.wikimedia.org/P12696 and previous config saved to /var/cache/conftool/dbconfig/20200921-162433-root.json

Host was repooled
Thank you Papaul!