db2085 crashed - memory issues
Closed, Resolved, Public

Description

Apparently db2085 has crashed.
Nothing in the HW logs, though:

racadm>>getsel
racadm getsel
Record:      1
Date/Time:   02/14/2019 15:13:00
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------

At the time of the crash, the host was running a heavy ALTER TABLE on enwiki.revision (T239453).

OS logs suggest that DIMM A3 is having issues:

Jan 19 07:11:51 db2085 kernel: [3258047.293317] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0xec2237 offset:0x980 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
Jan 19 07:12:38 db2085 kernel: [3258094.261835] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jan 19 07:12:38 db2085 kernel: [3258094.261840] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 19 07:12:38 db2085 kernel: [3258094.261843] {12}[Hardware Error]: event severity: corrected
Jan 19 07:12:38 db2085 kernel: [3258094.261846] {12}[Hardware Error]:  Error 0, type: corrected
Jan 19 07:12:38 db2085 kernel: [3258094.261848] {12}[Hardware Error]:  fru_text: A3
Jan 19 07:12:38 db2085 kernel: [3258094.261851] {12}[Hardware Error]:   section_type: memory error
Jan 19 07:12:38 db2085 kernel: [3258094.261854] {12}[Hardware Error]:   error_status: 0x0000000000000400
Jan 19 07:12:38 db2085 kernel: [3258094.261856] {12}[Hardware Error]:   physical_address: 0x0000002ed2272100
Jan 19 07:12:38 db2085 kernel: [3258094.261863] {12}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 0 row: 29985 column: 640
Jan 19 07:12:38 db2085 kernel: [3258094.261866] {12}[Hardware Error]:   error_type: 2, single-bit ECC
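
For reference, the `page` and `offset` fields in an EDAC line can be combined into the physical address of the failing read. A minimal Python sketch, assuming 4 KiB pages (the shift of 12 is an assumption, and `edac_physical_address` is a hypothetical helper name, not part of any tool used on this host):

```python
import re

# Sample line copied from the OS log above.
LINE = (
    "Jan 19 07:11:51 db2085 kernel: [3258047.293317] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0xec2237 offset:0x980 grain:32 "
    "syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def edac_physical_address(line):
    """Return the physical address encoded by page/offset, or None if absent."""
    match = re.search(r"page:0x([0-9a-fA-F]+) offset:0x([0-9a-fA-F]+)", line)
    if match is None:
        return None
    page, offset = (int(group, 16) for group in match.groups())
    return (page << PAGE_SHIFT) | offset


print(hex(edac_physical_address(LINE)))
```

For the 07:11:51 line this gives 0xec2237980, which is the same address the kernel logs as `physical_address` and `ADDR` for that event.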

Can you upgrade the BIOS and firmware?

Event Timeline

Restricted Application added a subscriber: Aklapper.
racadm>>serveraction powerstatus

racadm serveraction powerstatus
Server power status: OFF

racadm>>

Nothing relevant on centrallog1001 from db2085.

I have powered it back on - no errors on boot.
MySQL hasn't been started.

Mentioned in SAL (#wikimedia-operations) [2020-01-19T12:02:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2085:3311, db2085:3318 T243148', diff saved to https://phabricator.wikimedia.org/P10210 and previous config saved to /var/cache/conftool/dbconfig/20200119-120236-marostegui.json

@Papaul errors in the OS logs:

Jan 19 07:06:56 db2085 kernel: [3257752.197344] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jan 19 07:06:56 db2085 kernel: [3257752.197346] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 19 07:06:56 db2085 kernel: [3257752.197347] {1}[Hardware Error]: event severity: corrected
Jan 19 07:06:56 db2085 kernel: [3257752.197348] {1}[Hardware Error]:  Error 0, type: corrected
Jan 19 07:06:56 db2085 kernel: [3257752.197349] {1}[Hardware Error]:  fru_text: A3
Jan 19 07:06:56 db2085 kernel: [3257752.197350] {1}[Hardware Error]:   section_type: memory error
Jan 19 07:06:56 db2085 kernel: [3257752.197351] {1}[Hardware Error]:   error_status: 0x0000000000000400
Jan 19 07:06:56 db2085 kernel: [3257752.197352] {1}[Hardware Error]:   physical_address: 0x0000002ed2212e80
Jan 19 07:06:56 db2085 kernel: [3257752.197354] {1}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 0 row: 29984 column: 184
Jan 19 07:06:56 db2085 kernel: [3257752.197355] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 19 07:06:56 db2085 kernel: [3257752.197369] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 19 07:06:56 db2085 kernel: [3257752.197371] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:06:56 db2085 kernel: [3257752.197373] EDAC sbridge MC0: TSC 16bdae1cf12f75c
Jan 19 07:06:56 db2085 kernel: [3257752.197375] EDAC sbridge MC0: ADDR 2ed2212e80
Jan 19 07:06:56 db2085 kernel: [3257752.197375] EDAC sbridge MC0: MISC 0
Jan 19 07:06:56 db2085 kernel: [3257752.197376] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417616 SOCKET 0 APIC 0
Jan 19 07:06:56 db2085 kernel: [3257752.197400] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2ed2212 offset:0xe80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)

Jan 19 07:08:50 db2085 kernel: [3257866.371144] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jan 19 07:08:50 db2085 kernel: [3257866.371147] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 19 07:08:50 db2085 kernel: [3257866.371149] {2}[Hardware Error]: event severity: corrected
Jan 19 07:08:50 db2085 kernel: [3257866.371152] {2}[Hardware Error]:  Error 0, type: corrected
Jan 19 07:08:50 db2085 kernel: [3257866.371154] {2}[Hardware Error]:  fru_text: A3
Jan 19 07:08:50 db2085 kernel: [3257866.371155] {2}[Hardware Error]:   section_type: memory error
Jan 19 07:08:50 db2085 kernel: [3257866.371157] {2}[Hardware Error]:   error_status: 0x0000000000000400
Jan 19 07:08:50 db2085 kernel: [3257866.371159] {2}[Hardware Error]:   physical_address: 0x0000002ec2277700
Jan 19 07:08:50 db2085 kernel: [3257866.371164] {2}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 0 row: 29729 column: 984
Jan 19 07:08:50 db2085 kernel: [3257866.371165] {2}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 19 07:08:50 db2085 kernel: [3257866.371199] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 19 07:08:50 db2085 kernel: [3257866.371208] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:08:50 db2085 kernel: [3257866.371214] EDAC sbridge MC0: TSC 16bdb3ed85b06ec
Jan 19 07:08:50 db2085 kernel: [3257866.371220] EDAC sbridge MC0: ADDR 2ec2277700
Jan 19 07:08:50 db2085 kernel: [3257866.371225] EDAC sbridge MC0: MISC 0
Jan 19 07:08:50 db2085 kernel: [3257866.371233] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417730 SOCKET 0 APIC 0
Jan 19 07:08:50 db2085 kernel: [3257866.371262] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2ec2277 offset:0x700 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
Jan 19 07:10:46 db2085 kernel: [3257981.841567] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:10:46 db2085 kernel: [3257981.841577] EDAC sbridge MC0: TSC 16bdb9cf01a8042
Jan 19 07:10:46 db2085 kernel: [3257981.841585] EDAC sbridge MC0: ADDR 1ed2252000
Jan 19 07:10:46 db2085 kernel: [3257981.841590] EDAC sbridge MC0: MISC 0
Jan 19 07:10:46 db2085 kernel: [3257981.841599] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417846 SOCKET 0 APIC 0
Jan 19 07:10:46 db2085 kernel: [3257981.841635] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x1ed2252 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
Jan 19 07:10:46 db2085 mcelog: warning: 32 bytes ignored in each record
Jan 19 07:10:46 db2085 mcelog: consider an update
Jan 19 07:10:46 db2085 kernel: [3257982.214500] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 19 07:10:46 db2085 kernel: [3257982.214503] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:10:46 db2085 kernel: [3257982.214505] EDAC sbridge MC0: TSC 16bdb9d3de79c72
Jan 19 07:10:46 db2085 kernel: [3257982.214506] EDAC sbridge MC0: ADDR ed2251980
Jan 19 07:10:46 db2085 kernel: [3257982.214506] EDAC sbridge MC0: MISC 0
Jan 19 07:10:46 db2085 kernel: [3257982.214508] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417846 SOCKET 0 APIC 0
Jan 19 07:10:46 db2085 kernel: [3257982.214529] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0xed2251 offset:0x980 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
Jan 19 07:10:46 db2085 mcelog: warning: 32 bytes ignored in each record
Jan 19 07:10:46 db2085 mcelog: consider an update
Jan 19 07:11:14 db2085 kernel: [3258010.101022] ghes_print_estatus: 1 callbacks suppressed
Jan 19 07:11:14 db2085 kernel: [3258010.101026] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jan 19 07:11:14 db2085 kernel: [3258010.101030] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 19 07:11:14 db2085 kernel: [3258010.101033] {10}[Hardware Error]: event severity: corrected
Jan 19 07:11:14 db2085 kernel: [3258010.101036] {10}[Hardware Error]:  Error 0, type: corrected
Jan 19 07:11:14 db2085 kernel: [3258010.101038] {10}[Hardware Error]:  fru_text: A3
Jan 19 07:11:14 db2085 kernel: [3258010.101041] {10}[Hardware Error]:   section_type: memory error
Jan 19 07:11:14 db2085 kernel: [3258010.101044] {10}[Hardware Error]:   error_status: 0x0000000000000400
Jan 19 07:11:14 db2085 kernel: [3258010.101046] {10}[Hardware Error]:   physical_address: 0x0000002ed2251580
Jan 19 07:11:14 db2085 kernel: [3258010.101052] {10}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 0 row: 29985 column: 80
Jan 19 07:11:14 db2085 kernel: [3258010.101054] {10}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 19 07:11:14 db2085 kernel: [3258010.101097] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 19 07:11:14 db2085 kernel: [3258010.101102] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:11:14 db2085 kernel: [3258010.101104] EDAC sbridge MC0: TSC 16bdbb3f73254a2
Jan 19 07:11:14 db2085 kernel: [3258010.101106] EDAC sbridge MC0: ADDR 2ed2251580
Jan 19 07:11:14 db2085 kernel: [3258010.101108] EDAC sbridge MC0: MISC 0
Jan 19 07:11:14 db2085 kernel: [3258010.101112] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417874 SOCKET 0 APIC 0
Jan 19 07:11:14 db2085 kernel: [3258010.101142] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2ed2251 offset:0x580 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
Jan 19 07:11:14 db2085 mcelog: warning: 32 bytes ignored in each record
Jan 19 07:11:14 db2085 mcelog: consider an update
Jan 19 07:11:51 db2085 kernel: [3258047.293247] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jan 19 07:11:51 db2085 kernel: [3258047.293250] {11}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 19 07:11:51 db2085 kernel: [3258047.293251] {11}[Hardware Error]: event severity: corrected
Jan 19 07:11:51 db2085 kernel: [3258047.293253] {11}[Hardware Error]:  Error 0, type: corrected
Jan 19 07:11:51 db2085 kernel: [3258047.293255] {11}[Hardware Error]:  fru_text: A3
Jan 19 07:11:51 db2085 kernel: [3258047.293256] {11}[Hardware Error]:   section_type: memory error
Jan 19 07:11:51 db2085 kernel: [3258047.293257] {11}[Hardware Error]:   error_status: 0x0000000000000400
Jan 19 07:11:51 db2085 kernel: [3258047.293258] {11}[Hardware Error]:   physical_address: 0x0000000ec2237980
Jan 19 07:11:51 db2085 kernel: [3258047.293262] {11}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 0 row: 29728 column: 992
Jan 19 07:11:51 db2085 kernel: [3258047.293263] {11}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 19 07:11:51 db2085 kernel: [3258047.293282] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 19 07:11:51 db2085 kernel: [3258047.293284] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:11:51 db2085 kernel: [3258047.293285] EDAC sbridge MC0: TSC 16bdbd245b1c2b8
Jan 19 07:11:51 db2085 kernel: [3258047.293286] EDAC sbridge MC0: ADDR ec2237980
Jan 19 07:11:51 db2085 kernel: [3258047.293287] EDAC sbridge MC0: MISC 0
Jan 19 07:11:51 db2085 kernel: [3258047.293294] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417911 SOCKET 0 APIC 0
Jan 19 07:11:51 db2085 kernel: [3258047.293317] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0xec2237 offset:0x980 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
Jan 19 07:12:38 db2085 kernel: [3258094.261835] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jan 19 07:12:38 db2085 kernel: [3258094.261840] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 19 07:12:38 db2085 kernel: [3258094.261843] {12}[Hardware Error]: event severity: corrected
Jan 19 07:12:38 db2085 kernel: [3258094.261846] {12}[Hardware Error]:  Error 0, type: corrected
Jan 19 07:12:38 db2085 kernel: [3258094.261848] {12}[Hardware Error]:  fru_text: A3
Jan 19 07:12:38 db2085 kernel: [3258094.261851] {12}[Hardware Error]:   section_type: memory error
Jan 19 07:12:38 db2085 kernel: [3258094.261854] {12}[Hardware Error]:   error_status: 0x0000000000000400
Jan 19 07:12:38 db2085 kernel: [3258094.261856] {12}[Hardware Error]:   physical_address: 0x0000002ed2272100
Jan 19 07:12:38 db2085 kernel: [3258094.261863] {12}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 0 row: 29985 column: 640
Jan 19 07:12:38 db2085 kernel: [3258094.261866] {12}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 19 07:12:38 db2085 kernel: [3258094.261883] mce_notify_irq: 6 callbacks suppressed
Jan 19 07:12:38 db2085 kernel: [3258094.261884] mce: [Hardware Error]: Machine check events logged
Jan 19 07:12:38 db2085 kernel: [3258094.261905] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 19 07:12:38 db2085 kernel: [3258094.261916] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jan 19 07:12:38 db2085 kernel: [3258094.261927] EDAC sbridge MC0: TSC 16bdbf88b9a21a9
Jan 19 07:12:38 db2085 kernel: [3258094.261938] EDAC sbridge MC0: ADDR 2ed2272100
Jan 19 07:12:38 db2085 kernel: [3258094.261945] EDAC sbridge MC0: MISC 0
Jan 19 07:12:38 db2085 kernel: [3258094.261960] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1579417958 SOCKET 0 APIC 0
Jan 19 07:12:38 db2085 kernel: [3258094.262001] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2ed2272 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
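
To put a number on how often each DIMM is affected, the corrected-error (CE) lines can be tallied per DIMM label. A minimal Python sketch, assuming the log format shown above; `count_corrected_errors` and the two-line sample input are illustrative only and not an existing tool:

```python
import re
from collections import Counter

# Matches EDAC corrected-error lines and captures the DIMM label.
CE_RE = re.compile(r"EDAC MC\d+: \d+ CE memory read error on (\S+)")


def count_corrected_errors(lines):
    """Count corrected memory read errors per DIMM label."""
    counts = Counter()
    for line in lines:
        match = CE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two sample lines taken from the log above; other kernel lines are ignored.
SAMPLE = [
    "Jan 19 07:06:56 db2085 kernel: [3257752.197400] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2ed2212 offset:0xe80 grain:32)",
    "Jan 19 07:06:56 db2085 kernel: [3257752.197375] EDAC sbridge MC0: MISC 0",
    "Jan 19 07:08:50 db2085 kernel: [3257866.371262] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2ec2277 offset:0x700 grain:32)",
]

print(count_corrected_errors(SAMPLE))
```

In practice the input would be the lines of the kernel log file rather than a hard-coded list. Every CE line in the excerpt above carries the same label (channel 2, DIMM 0), which the kernel reports as `fru_text: A3`.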

So it looks like a memory issue with DIMM A3?
@Papaul this host is under warranty until March 2020, so let's try to order a new DIMM before we actually get an "uncorrectable" crash. I guess this module is about to fail.

I have started MySQL to run a data check, so we'd need to depool this host before we replace the DIMM.

Marostegui renamed this task from db2085 crashed to db2085 crashed - memory issues. Jan 20 2020, 6:19 AM
Marostegui updated the task description.

Change 565808 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2085: Disable notifications

https://gerrit.wikimedia.org/r/565808

Change 565808 merged by Marostegui:
[operations/puppet@production] db2085: Disable notifications

https://gerrit.wikimedia.org/r/565808

Mentioned in SAL (#wikimedia-operations) [2020-01-20T08:10:38Z] <marostegui> Compare data on db2085:3318 - T243148

enwiki data check finished without any issues.
wikidatawiki check is still ongoing.

wikidatawiki data check finished without any drifts.

Papaul triaged this task as Medium priority. Jan 21 2020, 3:32 PM

Mentioned in SAL (#wikimedia-operations) [2020-01-22T14:39:58Z] <marostegui> Stop MySQL on db2085:3311 and db2085:3318 for onsite maintenance - T243148

Papaul added a subscriber: Papaul.

Before
BIOS Version: 2.9.1
Firmware Version: 2.61.60.60

After
BIOS Version: 2.11.0
Firmware Version: 2.70.70.70

@Marostegui FW upgrade complete

For the record, we are also going to try to contact Dell with the OS logs to see if we can get a new DIMM before A3 goes from correctable to uncorrectable.

Change 566544 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2085: Enable notifications

https://gerrit.wikimedia.org/r/566544

Change 566544 merged by Marostegui:
[operations/puppet@production] db2085: Enable notifications

https://gerrit.wikimedia.org/r/566544

Create Dispatch: Success
You have successfully submitted request SR1011684731.

Thank you! Let's see what they say

Mentioned in SAL (#wikimedia-operations) [2020-01-23T17:33:04Z] <marostegui> Poweroff db2085:3311 and db2085:3318 for maintenance - T243148

For the record: @Papaul replaced DIMM A3 with a new one sent by Dell

Mentioned in SAL (#wikimedia-operations) [2020-01-24T06:12:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2085 after memory replacement T243148', diff saved to https://phabricator.wikimedia.org/P10256 and previous config saved to /var/cache/conftool/dbconfig/20200124-061228-marostegui.json

Marostegui reassigned this task from Marostegui to Papaul.