mw2394 is inaccessible and there's a CPU error logged in System Event Log:
Date/Time: 12/31/2023 19:43:14 Source: system Severity: Critical Description: CPU 1 machine check error detected.
The host also broke during the MediaWiki train:
04:55:49 Started sync_wikiversions
04:55:49 sync_wikiversions: 0% (ok: 0; fail: 0; left: 374)
04:58:04 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment.codfw.wmnet::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@mw2394.codfw.wmnet) returned [255]: ssh: connect to host mw2394.codfw.wmnet port 22: Connection timed out
04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)
04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)
04:58:04 Finished sync_wikiversions (duration: 02m 15s)
04:58:04 1 hosts had sync_wikiversions errors
Mentioned in SAL (#wikimedia-operations) [2024-01-02T18:29:18Z] <mutante> confctl select 'name=mw2394.codfw.wmnet' set/pooled=inactive | T354193#9430654 - seems like 2396 was previously depooled instead of this 2394
Depooled 2394; 2396 had previously been depooled instead, per https://sal.toolforge.org/log/vbyWyowBxE1_1c7szGCe
Thanks, @Dzahn. After looking a bit more, I don't think the presence in scap_targets should affect train, so I'm deescalating this. Whether or not depooled hosts should still be present in scap_targets is up for debate.
I agree the train should be unblocked and lowering it from UBN to High seems correct.
Also that scap_targets should only influence scap deployment.
edit: well, High or Medium :)
Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
Sun 31 Dec 2023 19:43:14
Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
Sun 31 Dec 2023 19:43:14
CPU 1 machine check error detected.
Sun 31 Dec 2023 19:43:14
CPU 1 machine check error detected.
We are seeing errors on DIMM B1 and CPU1.
I am going to swap CPU1 with CPU2 and DIMM B1 with DIMM A1 to see whether the error moves to CPU2 or DIMM A1.
After swapping the CPUs and DIMMs, I am now getting:
CPU 2 MEM012 VPP PG voltage is outside of range.
Wed 03 Jan 2024 17:43:07
CPU 1 MEM012 VPP PG voltage is outside of range.
and the server no longer powers up.
I will put in an order for Dell to send us a main board.
@Jhancock.wm what I did to get the provision cookbook to pass was to reset the iDRAC password and re-run the cookbook.
@Dzahn the host is back up.
Icinga downtime and Alertmanager silence (ID=87150827-7740-4075-ada6-a08469c8b7f6) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bad DIMM
mw2394.codfw.wmnet
Reopening this task, since the hardware failures on this server happened very close together. mw2394 crashed this morning due to a DIMM error:
-------------------------------------------------------------------------------
Record: 1022
Date/Time: 01/17/2024 11:46:21
Source: system
Severity: Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record: 1023
Date/Time: 01/17/2024 11:46:22
Source: system
Severity: Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record: 1024
Date/Time: 01/17/2024 11:46:22
Source: system
Severity: Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
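For anyone triaging similar dumps, here is a minimal sketch of pulling the affected DIMM slots out of System Event Log text like the records above. The field layout (Record, Date/Time, Source, Severity, Description, separated by dashed lines) is taken from this paste only; the function names `parse_sel` and `failing_dimms` are hypothetical helpers, not part of any existing tooling, and other iDRAC firmware versions may format records differently.

```python
import re

# Sample condensed from the three records pasted above (separators shortened).
SEL_TEXT = """\
----------
Record: 1022
Date/Time: 01/17/2024 11:46:21
Source: system
Severity: Ok
Description: A problem was detected in Memory Reference Code (MRC).
----------
Record: 1023
Date/Time: 01/17/2024 11:46:22
Source: system
Severity: Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
----------
Record: 1024
Date/Time: 01/17/2024 11:46:22
Source: system
Severity: Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
----------
"""

# Fields are matched with \s+ between them, so the pattern accepts both the
# one-field-per-line layout and a dump flattened onto a single line.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.*?)\s*(?=-{5,}|\Z)",
    re.DOTALL,
)


def parse_sel(text):
    """Return one dict per SEL record found in the text."""
    return [match.groupdict() for match in RECORD_RE.finditer(text)]


def failing_dimms(records):
    """Return the sorted DIMM slots named in Critical records."""
    slots = set()
    for record in records:
        if record["severity"].lower() == "critical":
            slots.update(re.findall(r"DIMM_[A-Z]\d+", record["description"]))
    return sorted(slots)


records = parse_sel(SEL_TEXT)
print(failing_dimms(records))
```

Run against the three records above, this reports only DIMM_B3, since record 1022 has severity Ok and names no slot.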
Downtimed, removed from pool and marked as Failed in Netbox.
11:59 <+logmsgbot> !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=mw2394.codfw.wmnet
I moved the DIMM to a different slot and the error moved with it. I've put in a dispatch with Dell: SR183504113.
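The reasoning behind the slot move can be written down as a small decision rule. This is only an illustrative sketch: `diagnose_swap` is a hypothetical helper, and the destination slot DIMM_A3 in the example is made up, since the comment above does not say which slot the DIMM was moved to.

```python
from typing import Optional


def diagnose_swap(original_slot: str, new_slot: str,
                  error_slot_after: Optional[str]) -> str:
    """Classify a swap test for one suspect component.

    original_slot:    slot where the error was first reported
    new_slot:         slot the suspect component was moved to
    error_slot_after: slot reporting the error after the move, or None
                      if no error has recurred
    """
    if error_slot_after is None:
        # Nothing recurred, so the test has not told us anything yet.
        return "inconclusive"
    if error_slot_after == new_slot:
        # The error followed the component: replace the component.
        return "component"
    if error_slot_after == original_slot:
        # The error stayed put: suspect the slot or the main board.
        return "slot"
    # Error showed up somewhere unrelated to the swap.
    return "inconclusive"


# Hypothetical example: DIMM moved from B3 to A3, error now reported on A3.
print(diagnose_swap("DIMM_B3", "DIMM_A3", "DIMM_A3"))
```

A result of "component" matches what happened here, which is why the DIMM itself was dispatched rather than another main board.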
@Clement_Goubert I replaced the DIMM and the error has cleared. You should be able to add it back.