db2118 crashed and rebooted due to HW
Closed, Resolved (Public)

Description

-------------------------------------------------------------------------------
Record:      27
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------

Event Timeline

taavi triaged this task as Unbreak Now! priority. Feb 24 2024, 10:18 AM
taavi created this task.

Started it - InnoDB doing recovery, leaving it on RO. Once it's caught up I am switching it

Even though MariaDB is up, everything is in RO. I don't want to risk the data.

-------------------------------------------------------------------------------
Record:      26
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      39
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      40
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      41
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      42
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      45
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      46
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      47
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      48
Date/Time:   02/24/2024 10:08:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      49
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      50
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      51
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      55
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      56
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      57
Date/Time:   02/24/2024 10:08:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   02/24/2024 10:08:22
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   02/24/2024 10:08:22
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
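For anyone triaging a similar dump, the log above can be filtered down to its critical entries with a short script. The following is only a minimal sketch: it assumes the log is available as plain text in exactly the layout pasted here (dashed separator lines, `Field: value` pairs), and the names `parse_sel` and `critical_records` are hypothetical helpers, not part of any existing tool.

```python
import re
from typing import Dict, List

# Assumption: records are delimited by lines made only of dashes,
# and every field appears on its own line as "Name: value".
SEPARATOR = re.compile(r"^-{5,}\s*$")
FIELD = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split a pasted event log into one dict per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if SEPARATOR.match(line):
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = match.group(2).strip()
    if current:
        records.append(current)
    return records


def critical_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose Severity is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


# Three records copied from the log in this task, used as sample input.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      26
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   02/24/2024 10:08:19
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for rec in critical_records(parse_sel(SAMPLE)):
        print(rec["Record"], rec["Date/Time"], rec["Description"])
```

Run against the full log in this task, a filter like this would surface only records 27 and 32, the two machine check errors on CPU 1 and CPU 2.
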
Marostegui added a subscriber: wiki_willy.

@wiki_willy can we contact the vendor about this issue which caused a reboot?

Record:      27
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
Marostegui renamed this task from db2118 crashed to db2118 crashed and rebooted due to HW. Feb 24 2024, 10:29 AM
Marostegui updated the task description.
Marostegui lowered the priority of this task from Unbreak Now! to High. Feb 24 2024, 10:48 AM

Everything should be back to normal now.

++ @VRiley-WMF & @Jclark-ctr

This server is in codfw. I'll get a report sent to Dell ASAP to get a replacement CPU.

Actually, this server is not in warranty. I will try to find a viable replacement from the decommissioned inventory in the morning.

Thanks for picking this up @Jhancock.wm. @Marostegui - since this host looks like it's close to being refreshed in T355350, do you want to just wait for the refreshed server to be set up instead of fixing this one? Thanks, Willy

Ah yes. I was not aware of this. We can wait for its new replacement server. Thanks!

Change 1006750 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2118: Notes about its crash

https://gerrit.wikimedia.org/r/1006750

Change 1006750 merged by Marostegui:

[operations/puppet@production] db2118: Notes about its crash

https://gerrit.wikimedia.org/r/1006750

The data looks correct. I am not going to repool this host for now. I am going to wait until its replacement in T355350 gets installed, then simply clone that one and decommission this one.

Marostegui lowered the priority of this task from High to Medium. Feb 28 2024, 7:11 AM

Mentioned in SAL (#wikimedia-operations) [2024-02-29T06:26:01Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Pool db2218 with 1% weight only T358421 T355422', diff saved to https://phabricator.wikimedia.org/P58171 and previous config saved to /var/cache/conftool/dbconfig/20240229-062601-marostegui.json

Change 1007481 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] instances.yaml: Remove db2118 from dbctl

https://gerrit.wikimedia.org/r/1007481

Change 1007481 merged by Marostegui:

[operations/puppet@production] instances.yaml: Remove db2118 from dbctl

https://gerrit.wikimedia.org/r/1007481

Marostegui claimed this task.

This host will no longer come back to production. I will decommission it in a couple of days.