
db2084 temporary correctable hardware errors
Closed, Resolved (Public)

Description

Creating this ticket for the record, as this was a self-correcting error that recovered by itself, similar to T222050 (db1107 (eventlogging db master) possibly memory issues):

[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]: event severity: corrected
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:  Error 0, type: corrected
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:  fru_text: A1
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:   section_type: memory error
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:   physical_address: 0x0000003cb28d3cc0
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 240
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Jun 16 09:05:25 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 09:05:25 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 09:05:25 2019] EDAC sbridge MC0: TSC 110e41a6643fb9
[Sun Jun 16 09:05:25 2019] EDAC sbridge MC0: ADDR 3cb28d3cc0
[Sun Jun 16 09:05:25 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 09:05:25 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560675971 SOCKET 0 APIC 0
[Sun Jun 16 09:05:25 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28d3 offset:0xcc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 09:05:55 2019] mce: [Hardware Error]: Machine check events logged
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]: event severity: corrected
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:  Error 0, type: corrected
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:  fru_text: A1
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:   section_type: memory error
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:   physical_address: 0x0000003cb28d5340
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 328
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Jun 16 09:21:13 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 09:21:13 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 09:21:13 2019] EDAC sbridge MC0: TSC 1111462c90940d
[Sun Jun 16 09:21:13 2019] EDAC sbridge MC0: ADDR 3cb28d5340
[Sun Jun 16 09:21:13 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 09:21:13 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560676919 SOCKET 0 APIC 0
[Sun Jun 16 09:21:13 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28d5 offset:0x340 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 09:21:29 2019] mce: [Hardware Error]: Machine check events logged
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]: event severity: corrected
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:  Error 0, type: corrected
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:  fru_text: A1
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:   section_type: memory error
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:   physical_address: 0x0000003cb28f0040
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 512
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Jun 16 10:48:17 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 10:48:17 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 10:48:17 2019] EDAC sbridge MC0: TSC 1121e7a5975499
[Sun Jun 16 10:48:17 2019] EDAC sbridge MC0: ADDR 3cb28f0040
[Sun Jun 16 10:48:17 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 10:48:17 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560682143 SOCKET 0 APIC 0
[Sun Jun 16 10:48:17 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28f0 offset:0x40 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 10:49:41 2019] mce: [Hardware Error]: Machine check events logged
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]: event severity: corrected
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:  Error 0, type: corrected
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:  fru_text: A1
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:   section_type: memory error
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:   physical_address: 0x0000003cb28f69c0
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 928
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Jun 16 10:52:33 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 10:52:33 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 10:52:33 2019] EDAC sbridge MC0: TSC 1122b7fde5f06d
[Sun Jun 16 10:52:33 2019] EDAC sbridge MC0: ADDR 3cb28f69c0
[Sun Jun 16 10:52:33 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 10:52:33 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560682399 SOCKET 0 APIC 0
[Sun Jun 16 10:52:33 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28f6 offset:0x9c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 10:54:52 2019] mce: [Hardware Error]: Machine check events logged

Nothing is logged in the hardware logs either.
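
All four records above report the same node, card, module, rank, bank and row (row 58147); only the column differs. A minimal sketch of how that grouping could be checked on a longer log follows. The function name `count_by_row` and the `SAMPLE` constant are hypothetical, and the regular expression assumes the exact `node: ... column: ...` layout seen in this ticket.

```python
import re
from collections import Counter

# Location lines copied from the kernel log in this ticket.
SAMPLE = """\
[Sun Jun 16 09:05:25 2019] {3}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 240
[Sun Jun 16 09:21:13 2019] {4}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 328
[Sun Jun 16 10:48:17 2019] {5}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 512
[Sun Jun 16 10:52:33 2019] {6}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 928
"""

# Assumption: the field order matches the lines shown in this ticket.
LOCATION = re.compile(
    r"node: (?P<node>\d+) card: (?P<card>\d+) module: (?P<module>\d+) "
    r"rank: (?P<rank>\d+) bank: (?P<bank>\d+) row: (?P<row>\d+) column: (?P<column>\d+)"
)

KEY_FIELDS = ("node", "card", "module", "rank", "bank", "row")


def count_by_row(log_text):
    """Count error records per (node, card, module, rank, bank, row)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOCATION.search(line)
        if match:
            key = tuple(int(match[name]) for name in KEY_FIELDS)
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for location, hits in count_by_row(SAMPLE).items():
        print(dict(zip(KEY_FIELDS, location)), "records:", hits)
```

On the sample this yields a single location with four records, which is consistent with one DIMM row being affected rather than errors spread across the module.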

Event Timeline

Some more errors from yesterday evening:

[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]: event severity: corrected
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:  Error 0, type: corrected
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:  fru_text: A1
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   section_type: memory error
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   error_status: 0x0000000000000400
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   physical_address: 0x0000003cb28d7e40
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 58147 column: 504
[Sun Jun 16 21:33:58 2019] {7}[Hardware Error]:   error_type: 2, single-bit ECC
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: TSC 119d3ab72dafe6
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: ADDR 3cb28d7e40
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: MISC 0
[Sun Jun 16 21:33:58 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1560720886 SOCKET 0 APIC 0
[Sun Jun 16 21:33:58 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3cb28d7 offset:0xe40 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
[Sun Jun 16 21:35:44 2019] mce: [Hardware Error]: Machine check events logged
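
As a reading aid for the EDAC lines above: the `page:` and `offset:` fields are the reported physical address (`ADDR`) split at the page boundary. The sketch below assumes 4 KiB pages (a 12-bit shift); `split_physical_address` is a hypothetical helper name, and the addresses are the ones logged in this ticket.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def split_physical_address(addr):
    """Return (page frame number, offset within the page) for a physical address."""
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset


# ADDR values from the five events logged in this ticket.
ADDRESSES = [
    0x3CB28D3CC0,
    0x3CB28D5340,
    0x3CB28F0040,
    0x3CB28F69C0,
    0x3CB28D7E40,
]

if __name__ == "__main__":
    for addr in ADDRESSES:
        page, offset = split_physical_address(addr)
        print(f"ADDR {addr:x} -> page:{page:#x} offset:{offset:#x}")
```

For the most recent event, `ADDR 3cb28d7e40` splits into `page:0x3cb28d7 offset:0xe40`, matching the EDAC MC0 line.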

Still no errors logged in the hardware logs.
I am going to reboot this host to see if anything else shows up.

Mentioned in SAL (#wikimedia-operations) [2019-06-17T06:04:45Z] <marostegui> Stop MySQL on db2084 to reboot the host T225884

Change 517364 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2084

https://gerrit.wikimedia.org/r/517364

Change 517364 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2084

https://gerrit.wikimedia.org/r/517364

Host rebooted. No new logs on HW side.

jijiki triaged this task as Medium priority. Jun 18 2019, 9:56 AM
jijiki added a subscriber: jijiki.

@Marostegui are we good to mark this as resolved?

Not yet. I haven't seen any more errors, but I want to wait until the Icinga alert clears. Let's give it another 24h.

And it finally cleared up:

23:38:30 <+icinga-wm> RECOVERY - EDAC syslog messages on db2084 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db2084&var-datasource=codfw+prometheus/ops