
Broken CPU on mw2394
Closed, ResolvedPublic

Description

mw2394 is inaccessible and there's a CPU error logged in System Event Log:

Date/Time:   12/31/2023 19:43:14
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.

Event Timeline

The host also broke during the MediaWiki train:

04:55:49 Started sync_wikiversions
04:55:49 sync_wikiversions:   0% (ok: 0; fail: 0; left: 374)                    
04:58:04 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment.codfw.wmnet::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@mw2394.codfw.wmnet) returned [255]: ssh: connect to host mw2394.codfw.wmnet port 22: Connection timed out

04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)      
04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)      

04:58:04 Finished sync_wikiversions (duration: 02m 15s)
04:58:04 1 hosts had sync_wikiversions errors
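The progress lines above carry the ok/fail/left counters, so a failed host can be spotted without reading the whole log. A minimal sketch, assuming the line layout is exactly as pasted here; `parse_progress` and the regular expression are illustrative helpers, not part of scap:

```python
import re

# Matches pasted progress lines such as:
#   04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)
# The "in-flight" counter is optional because the first line in the log omits it.
PROGRESS_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) (?P<stage>\w+):\s+(?P<pct>\d+)% "
    r"\((?:in-flight: \d+; )?ok: (?P<ok>\d+); fail: (?P<fail>\d+); left: (?P<left>\d+)\)"
)


def parse_progress(line):
    """Return the counters from one progress line, or None for any other line."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": match["time"],
        "stage": match["stage"],
        "percent": int(match["pct"]),
        "ok": int(match["ok"]),
        "fail": int(match["fail"]),
        "left": int(match["left"]),
    }


if __name__ == "__main__":
    line = "04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)"
    print(parse_progress(line))
```

Non-progress lines (for example the "Finished sync_wikiversions" line) return `None`, so the helper can be mapped over a whole log.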
dduvall triaged this task as Unbreak Now! priority. (Edited) Jan 2 2024, 6:18 PM
dduvall subscribed.

This is a blocker until the host is removed from /etc/dsh/group/scap_targets.

Mentioned in SAL (#wikimedia-operations) [2024-01-02T18:29:18Z] <mutante> confctl select 'name=mw2394.codfw.wmnet' set/pooled=inactive | T354193#9430654 - seems like 2396 was previously depooled instead of this 2394

Thanks, @Dzahn. After looking a bit more, I don't think the presence in scap_targets should affect train, so I'm deescalating this. Whether or not depooled hosts should still be present in scap_targets is up for debate.

dduvall lowered the priority of this task from Unbreak Now! to Medium. Jan 2 2024, 6:35 PM
Dzahn raised the priority of this task from Medium to High. (Edited) Jan 2 2024, 6:36 PM

I agree the train should be unblocked and lowering it from UBN to High seems correct.

Also that scap_targets should only influence scap deployment.

edit: well, High or Medium :)

Dzahn lowered the priority of this task from High to Medium. Jan 2 2024, 6:36 PM
Multi-bit memory errors detected on a memory device at location(s) DIMM_B1. 	Sun 31 Dec 2023 19:43:14
	Multi-bit memory errors detected on a memory device at location(s) DIMM_B1. 	Sun 31 Dec 2023 19:43:14
	CPU 1 machine check error detected. 	Sun 31 Dec 2023 19:43:14
	CPU 1 machine check error detected.

We are seeing errors on DIMM B1 and CPU1.
I am going to swap CPU1 with CPU2 and DIMM B1 with DIMM A1 to see whether the errors follow the parts and show up on CPU2 or DIMM A1.

After swapping the CPU and DIMM, I am now getting:

	CPU 2 MEM012 VPP PG voltage is outside of range. 	Wed 03 Jan 2024 17:43:07
	CPU 1 MEM012 VPP PG voltage is outside of range.

and the server is no longer powering up.
I will put in an order for Dell to send us a mainboard.

Create Dispatch: Success
You have successfully submitted request SR182660280.
Your dispatch shipped on 1/3/2024 4:20 PM

Mainboard replaced by @Jhancock.wm. She is running the provision cookbook now.

Papaul claimed this task.

@Jhancock.wm what I did to get the provision cookbook to pass was to reset the iDRAC password and re-run the cookbook.
@Dzahn the host is back up.

mw2394 squared up and repooled, set back to Active in Netbox.

Icinga downtime and Alertmanager silence (ID=87150827-7740-4075-ada6-a08469c8b7f6) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bad DIMM

mw2394.codfw.wmnet

Reopening this task since the hardware failures for this server happened very close to each other. mw2394 crashed this morning due to a DIMM error:

-------------------------------------------------------------------------------
Record:      1022
Date/Time:   01/17/2024 11:46:21
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   01/17/2024 11:46:22
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      1024
Date/Time:   01/17/2024 11:46:22
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
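The System Event Log excerpt above is a series of `Key: value` records separated by dashed lines, which makes it easy to filter by severity. A minimal sketch, assuming the pasted layout holds for the whole log; `parse_sel` and `critical_records` are hypothetical helper names, not part of any Dell or Wikimedia tooling:

```python
def parse_sel(text):
    """Split a pasted System Event Log dump into one dict per record."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("---"):
            # A dashed line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_records(text):
    """Return only the records whose Severity field is Critical."""
    return [r for r in parse_sel(text) if r.get("Severity") == "Critical"]
```

Applied to the three records above, `critical_records` would keep records 1023 and 1024 (the DIMM_B3 errors) and drop record 1022, whose severity is Ok.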

Downtimed, removed from pool and marked as Failed in Netbox.

11:59 <+logmsgbot> !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=mw2394.codfw.wmnet

I will check on this this morning. Thank you for depooling.

I moved the DIMM to a different slot and the error moved with it. I've put in a dispatch with Dell: SR183504113.
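The swap test used here follows a simple rule: if the error follows the part to its new slot, the part is bad; if the error stays on the original slot, the slot or board is suspect. A rough sketch of that rule; `diagnose_swap` is an illustrative helper and the destination slot `DIMM_A3` in the usage line is a hypothetical name, since the task does not say which slot the DIMM was moved to:

```python
def diagnose_swap(original_slot, new_slot, error_slot_after):
    """Interpret a swap test.

    original_slot:    slot that reported the error before the swap
    new_slot:         slot the suspect part was moved to
    error_slot_after: slot reporting the error after the swap, or None
    """
    if error_slot_after == new_slot:
        # The error followed the component, so the component is faulty.
        return "part"
    if error_slot_after == original_slot:
        # The error stayed put, pointing at the slot or the board.
        return "slot"
    # No error, or an error somewhere unrelated: the test settles nothing.
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical slot names for illustration only.
    print(diagnose_swap("DIMM_B3", "DIMM_A3", "DIMM_A3"))
```

The "inconclusive" branch matters in practice: the earlier CPU and DIMM swap on this host produced a new voltage error instead of either expected outcome, which led to the mainboard replacement.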

@Clement_Goubert I replaced the DIMM and the error has cleared. You should be able to add the host back.