------------------------------------------------------------------------------- Record: 27 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Critical Description: CPU 1 machine check error detected. -------------------------------------------------------------------------------
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Unknown Object (Task) | | |
Resolved | | Jhancock.wm | T355350 Q#:rack/setup/install db2196-db2220
Resolved | | ABran-WMF | T355422 Productionize db2196-db2220
Resolved | | Marostegui | T358421 db2118 crashed and rebooted due to HW
Resolved | | Marostegui | T358423 Switchover s7 master (db2118 -> db2121)
Event Timeline
Started it - InnoDB doing recovery, leaving it on RO. Once it's caught up I am switching it
-------------------------------------------------------------------------------
Record: 26 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Ok Description: A problem was detected related to the previous server boot.
Record: 27 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Critical Description: CPU 1 machine check error detected.
Record: 28 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 29 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 30 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 31 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 32 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Critical Description: CPU 2 machine check error detected.
Record: 33 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 34 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 35 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 36 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 37 Date/Time: 02/24/2024 10:08:19 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 38 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 39 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 40 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 41 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 42 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 43 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 44 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 45 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 46 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 47 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 48 Date/Time: 02/24/2024 10:08:20 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 49 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 50 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 51 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 52 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 53 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 54 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 55 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 56 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 57 Date/Time: 02/24/2024 10:08:21 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 58 Date/Time: 02/24/2024 10:08:22 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
Record: 59 Date/Time: 02/24/2024 10:08:22 Source: system Severity: Ok Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
@wiki_willy can we contact the vendor about this issue which caused a reboot?
Record: 27 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Critical Description: CPU 1 machine check error detected.
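For anyone triaging a similar dump, here is a minimal sketch of how the critical entries could be picked out of an event log pasted in this shape. It assumes the field order seen in this task (`Record`, `Date/Time`, `Source`, `Severity`, `Description`); the names `SelRecord`, `parse_sel` and `critical` are made up for illustration and are not part of any existing tooling.

```python
import re
from typing import NamedTuple


class SelRecord(NamedTuple):
    """One event log entry (hypothetical container, for illustration only)."""
    record: int
    timestamp: str
    source: str
    severity: str
    description: str


# Assumption: fields always appear in this order, as in the log pasted in this task.
# A description ends at a dashed separator, at the next "Record:", or at end of input.
RECORD_RE = re.compile(
    r"Record:\s*(\d+)\s+"
    r"Date/Time:\s*(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"Source:\s*(\S+)\s+"
    r"Severity:\s*(\S+)\s+"
    r"Description:\s*(.*?)\s*(?=-{5,}|Record:|$)",
    re.DOTALL,
)


def parse_sel(text: str) -> list[SelRecord]:
    """Parse every record found in the pasted log text."""
    return [
        SelRecord(int(num), ts, src, sev, desc)
        for num, ts, src, sev, desc in RECORD_RE.findall(text)
    ]


def critical(records: list[SelRecord]) -> list[SelRecord]:
    """Keep only entries whose severity is 'Critical' (case-insensitive)."""
    return [r for r in records if r.severity.lower() == "critical"]


# Two records copied from the log in this task, with shortened separators.
sample = (
    "------- Record: 26 Date/Time: 02/24/2024 10:08:18 Source: system "
    "Severity: Ok Description: A problem was detected related to the "
    "previous server boot. "
    "------- Record: 27 Date/Time: 02/24/2024 10:08:18 Source: system "
    "Severity: Critical Description: CPU 1 machine check error detected. -------"
)

for entry in critical(parse_sel(sample)):
    print(entry.record, entry.timestamp, entry.description)
```

Run against the full log from this task, a filter like this would be expected to surface only the two machine check entries (records 27 and 32) among the OEM diagnostic events.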
This server is in codfw. I'll get a report sent to Dell ASAP to get a replacement CPU.
Actually, this server is not in warranty. I will try to find a viable replacement from the decommissioned inventory in the morning.
Thanks for picking this up @Jhancock.wm. @Marostegui - since this host looks like it's close to being refreshed in T355350, do you want to just wait for the refreshed server to be set up instead of fixing this one? Thanks, Willy
Change 1006750 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db2118: Notes about its crash
Change 1006750 merged by Marostegui:
[operations/puppet@production] db2118: Notes about its crash
The data looks correct. I am not going to repool this host for now. I am going to wait until its replacement in T355350 gets installed, then simply clone that one and decommission this one.
Mentioned in SAL (#wikimedia-operations) [2024-02-29T06:26:01Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Pool db2218 with 1% weight only T358421 T355422', diff saved to https://phabricator.wikimedia.org/P58171 and previous config saved to /var/cache/conftool/dbconfig/20240229-062601-marostegui.json
Change 1007481 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] instances.yaml: Remove db2118 from dbctl
Change 1007481 merged by Marostegui:
[operations/puppet@production] instances.yaml: Remove db2118 from dbctl
This host will no longer come back to production. I will decommission it in a couple of days.