Page MenuHomePhabricator

db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08
Closed, ResolvedPublic

Description

mariadb@s6 log at db2097:

It seems it was restarted normally on the 6th, upgraded from 10.1.38 and mysql_upgrade was run:

Jun 06 06:04:47 db2097 systemd[1]: Stopping mariadb database server
[...]
Jun 06 06:05:29 db2097 mysqld[3188]: 2019-06-06  6:05:29 139893604645632 [Note] /opt/wmf-mariadb101/bin/mysqld: Shutdown complete
Jun 06 06:05:29 db2097 systemd[1]: Stopped mariadb database server.
Jun 06 06:05:42 db2097 systemd[1]: Starting mariadb database server...
[...]
Jun 06 06:05:55 db2097 mysqld[20741]: Version: '10.1.39-MariaDB'  socket: '/run/mysqld/my
sqld.s6.sock'  port: 3316  MariaDB Server

Lots of grants skips, but nothing unexpected.

A master change happened around that time:

Jun 06 06:53:03 db2097 mysqld[20741]: 2019-06-06  6:53:03 140306785355520 [Note] 'CHANGE 
MASTER TO executed'

Skipping details because I don't see anything relevant.

Then, 2 days later:

Jun 06 11:43:15 db2097 mysqld[20741]: 2019-06-06 11:43:15 7f67747f7700 InnoDB: Buffer poo
l(s) load completed at 190606 11:43:15
Jun 08 11:15:14 db2097 mysqld[20741]: 190608 11:15:14 [ERROR] mysqld got signal 7 ;
Jun 08 11:15:14 db2097 mysqld[20741]: This could be because you hit a bug. It is also pos
sible that this binary
...
Jun 08 11:15:14 db2097 mysqld[20741]: stack_bottom = 0x0 thread_stack 0x49000
Jun 08 11:15:15 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e)[0x5645f0cd702e]
Jun 08 11:15:15 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd)[0x5645f081817d]
Jun 08 11:15:15 db2097 mysqld[20741]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7f9bd09b50e0]
Jun 08 11:15:15 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(adler32_z+0x2ab)[0x5645f0cf093b]
Jun 08 11:15:15 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(+0x9a45d8)[0x5645f0c0a5d8]
Jun 08 11:15:16 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(+0x99549b)[0x5645f0bfb49b]
Jun 08 11:15:16 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(+0x99d54f)[0x5645f0c0354f]
Jun 08 11:15:16 db2097 mysqld[20741]: /opt/wmf-mariadb101/bin/mysqld(+0x99dfb5)[0x5645f0c03fb5]
Jun 08 11:15:16 db2097 mysqld[20741]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7f9bd09ab4a4]
Jun 08 11:15:17 db2097 mysqld[20741]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f9bcf14ad0f]

Ending up in a restart by systemd:

Jun 08 11:15:33 db2097 systemd[1]: Stopped mariadb database server.
Jun 08 11:15:33 db2097 systemd[1]: Starting mariadb database server...

Event Timeline

jcrespo created this task.Jun 9 2019, 11:07 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 9 2019, 11:07 AM
jcrespo updated the task description. (Show Details)Jun 9 2019, 11:15 AM

No backup was running at that time.

I've restarted the slave, just to see if there is corruption; although I would like to redo the replica from 0.

Marostegui added a subscriber: Papaul.

Looks like it might be related to a HW memory issue that has been going on for a few days:

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=06/08/2019
    time=06:50:14
    description=Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000002, Bank 0x00000010, Status 0xFD000780'001000C1, Address 0x0000000D'42ADCF80, Misc 0x09084000'80000086).
  Verbs
    cd version exit show

/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Critical
    date=06/08/2019
    time=06:50:15
    description=DIMM Failure - Uncorrectable Memory Error (Processor 1, DIMM 3)
  Verbs
    cd version exit show

/system1/log1/record18
  Targets
  Properties
    number=18
    severity=Critical
    date=06/08/2019
    time=11:14:19
    description=Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000022, Bank 0x00000001, Status 0xBD800000'00100134, Address 0x0000000D'42ADCF80, Misc 0x00000000'00000086).
  Verbs

/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Critical
    date=06/09/2019
    time=06:45:15
    description=Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000002, Bank 0x00000010, Status 0xFD000580'001000C1, Address 0x0000000D'C2A9CF80, Misc 0x09084001'00000086).
  Verbs
    cd version exit show


/system1/log1/record20
  Targets
  Properties
    number=20
    severity=Critical
    date=06/09/2019
    time=06:45:16
    description=DIMM Failure - Uncorrectable Memory Error (Processor 1, DIMM 3)
  Verbs
    cd version exit show

Probably also worth contacting HP? @Papaul ?
@jcrespo +1 to rebuild the host (maybe even re-image it)

Restricted Application added a project: Operations. · View Herald TranscriptJun 9 2019, 2:36 PM
Marostegui moved this task from Triage to In progress on the DBA board.Jun 9 2019, 2:47 PM
ArielGlenn triaged this task as Normal priority.Jun 11 2019, 8:04 AM

I am going to run a data check, no matter what we do in the end with the instance.

Mentioned in SAL (#wikimedia-operations) [2019-06-17T16:27:41Z] <jynus> starting data check on db2097+db2046, expect increase in read row rate T225378

Check finished, no differences found.

Great!
So what do you have in mind?

jcrespo claimed this task.Jun 26 2019, 12:06 PM
[6498437.928368] mce: [Hardware Error]: Machine check events logged
[6498437.928393] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[6498437.928395] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd004780001000c1
[6498437.928396] EDAC skx MC1: TSC 5342fa81aa1c04 
[6498437.928396] EDAC skx MC1: ADDR d8229cf80 
[6498437.928399] EDAC skx MC1: MISC 908400100000086 
[6498437.928402] EDAC skx MC1: PROCESSOR 0:50654 TIME 1561517951 SOCKET 0 APIC 4
[6498437.928408] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xd8229c offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:6432 col:138)
[6510762.575543] Process accounting resumed
[6596958.282298] Process accounting resumed
[6683155.141534] Process accounting resumed
[6769351.182001] Process accounting resumed
[6855547.716695] Process accounting resumed
[6941743.671751] Process accounting resumed
[7022624.571104] mce: [Hardware Error]: Machine check events logged
[7022624.571161] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[7022624.571163] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd004940001000c1
[7022624.571164] EDAC skx MC1: TSC 59f742acdaddae 
[7022624.571165] EDAC skx MC1: ADDR dfa29cf80 
[7022624.571165] EDAC skx MC1: MISC 908400200000086 
[7022624.571167] EDAC skx MC1: PROCESSOR 0:50654 TIME 1562043375 SOCKET 0 APIC 4
[7022624.571175] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdfa29c offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:6bb2 col:138)
[7027940.094961] Process accounting resumed
[7107442.018032] mce: [Hardware Error]: Machine check events logged
[7107442.018086] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[7107442.018087] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd000c40001000c1
[7107442.018088] EDAC skx MC1: TSC 5b0cf7d688dc44 
[7107442.018089] EDAC skx MC1: ADDR daa29cf80 
[7107442.018090] EDAC skx MC1: MISC 908400100000086 
[7107442.018090] EDAC skx MC1: PROCESSOR 0:50654 TIME 1562128393 SOCKET 0 APIC 4
[7107442.018097] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdaa29c offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:66b2 col:138)
[7107691.803948] mce: [Hardware Error]: Machine check events logged
[7107691.804005] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[7107691.804007] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd000500001000c1
[7107691.804008] EDAC skx MC1: TSC 5b0dc934c6be34 
[7107691.804009] EDAC skx MC1: ADDR dda2dcf80 
[7107691.804009] EDAC skx MC1: MISC 908400800000086 
[7107691.804010] EDAC skx MC1: PROCESSOR 0:50654 TIME 1562128643 SOCKET 0 APIC 4
[7107691.804019] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdda2dc offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:69b3 col:138)
[7114136.812186] Process accounting resumed

Mentioned in SAL (#wikimedia-operations) [2019-07-03T13:24:29Z] <jynus> upgrade and restart db2097 T225378

462 - Uncorrectable Memory Error Threshold Exceeded (Processor 1, DIMM 3).  The
DIMM is mapped out and is currently not available.
Action: Take corrective action for the failing DIMM. Re-map all DIMMs back into
the memory map in RBSU. If the issue persists, contact support.


315 - An uncorrectable memory error was detected prior to this system boot.
Action: Check the Integrated Management Log (IML) for additional information.


511 - One or more DIMMs have been mapped out due to a memory error, resulting
in an unbalanced memory configuration across memory controllers. This may
result in non-optimal memory performance.
Action: See the Integrated Management Log (IML) for information on the memory
error.  Consult documentation for memory population guidelines
jcrespo reassigned this task from jcrespo to Papaul.Jul 3 2019, 1:44 PM

a memory stick of db2097 is literally broken:

root@db2097:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:         483434       34788      447537           9        1108      446146
Swap:          7628           0        7628
root@db2098:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:         515690      434175       58717        2009       22797       76167
Swap:          7628           0        7628

@Papaul please see previous messages and tell support

Papaul added a comment.Jul 5 2019, 3:07 PM

Your request is being worked on under reference number 5339905554
Status: Case is generated and in Progress

Product description: HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server
Product number: 867959-B21
Serial number:
Subject: DL360 Gen10 -Dimm Failure

Papaul added a comment.Jul 5 2019, 4:06 PM

Will be receiving the DIMM on Monday

Papaul added a comment.Jul 5 2019, 4:47 PM

Dear Mr Papaul Tshibamba,

Hewlett Packard Enterprise Reference Number: 5339905554

STATUS: Customer Self Repair Part has been shipped

Part/s shipped: 850881-001
Part description: SPS-DIMM 32GB PC4-2666V-R 2Gx4
Carrier Name: UPSN
Tracking Number: 1Z7557870100523464

Product description: HPE ProLiant DL360 Gen10 Server Models
Product number: 867959-B21
Serial number:
Problem description:

Dimm Failure
Issue exists 5/7/19
AHS Logs attached

Mentioned in SAL (#wikimedia-operations) [2019-07-08T15:15:32Z] <jynus> shutting down db2097 T225378 T216240

Papaul closed this task as Resolved.Jul 8 2019, 3:39 PM

Before


After

This is complete

return tracking information

jcrespo reopened this task as Open.Jul 8 2019, 3:39 PM
jcrespo claimed this task.
Mem:         515690

HW seems to be fixed, owning for the followup (software) steps.

jcrespo reassigned this task from jcrespo to Marostegui.Jul 19 2019, 5:27 PM
Marostegui closed this task as Resolved.Jul 22 2019, 5:04 AM
Marostegui reassigned this task from Marostegui to jcrespo.

As spoken, I am going to close this as the scope of the ticket is done.
I will create a new one to re-image this host with buster+10.3 and rebuild its data