Broken CPU on mw2394
Closed, Resolved, Public

Authored By

	• MoritzMuehlenhoff
	Jan 2 2024, 2:34 PM

Description

mw2394 is inaccessible and there's a CPU error logged in the System Event Log:

Date/Time:   12/31/2023 19:43:14
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.

Event Timeline

• MoritzMuehlenhoff created this task. Jan 2 2024, 2:34 PM

Restricted Application added a subscriber: Aklapper. Jan 2 2024, 2:34 PM

Maintenance_bot added a project: SRE. Jan 2 2024, 3:29 PM

The host also broke during the MediaWiki train:

04:55:49 Started sync_wikiversions
04:55:49 sync_wikiversions:   0% (ok: 0; fail: 0; left: 374)                    
04:58:04 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment.codfw.wmnet::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@mw2394.codfw.wmnet) returned [255]: ssh: connect to host mw2394.codfw.wmnet port 22: Connection timed out

04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)      
04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)      

04:58:04 Finished sync_wikiversions (duration: 02m 15s)
04:58:04 1 hosts had sync_wikiversions errors

This is a blocker until the host is removed from /etc/dsh/group/scap_targets.
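
For anyone triaging similar output, here is a minimal Python sketch that pulls the failing host and exit code out of a sync log like the one above. The regular expression is an assumption derived only from the single failure line quoted here, and `failed_hosts` is a hypothetical helper, not part of scap.

```python
import re

# Assumed pattern, inferred from the one failure line quoted in this task:
# "... (ran as <user>@<host>) returned [<exit code>]: ..."
FAILED_HOST = re.compile(
    r"\(ran as \S+@(?P<host>[\w.-]+)\) returned \[(?P<code>\d+)\]"
)

def failed_hosts(log_text):
    """Return (host, exit_code) pairs for each failed remote command in the log."""
    return [
        (m.group("host"), int(m.group("code")))
        for m in FAILED_HOST.finditer(log_text)
    ]

SAMPLE = (
    "04:58:04 sudo -u mwdeploy -n -- /usr/bin/rsync -l "
    "deployment.codfw.wmnet::common/wikiversions*.{json,php} /srv/mediawiki "
    "(ran as mwdeploy@mw2394.codfw.wmnet) returned [255]: ssh: connect to host "
    "mw2394.codfw.wmnet port 22: Connection timed out\n"
    "04:58:04 sync_wikiversions: 100% (in-flight: 0; ok: 373; fail: 1; left: 0)\n"
)

print(failed_hosts(SAMPLE))
```

Exit code 255 together with "Connection timed out" indicates that ssh itself could not reach the host, which matches the host being down rather than rsync failing.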

Mentioned in SAL (#wikimedia-operations) [2024-01-02T18:29:18Z] <mutante> confctl select 'name=mw2394.codfw.wmnet' set/pooled=inactive | T354193#9430654 - seems like 2396 was previously depooled instead of this 2394

Depooled 2394. Per https://sal.toolforge.org/log/vbyWyowBxE1_1c7szGCe it seems 2396 was previously depooled instead.

Thanks, @Dzahn. After looking a bit more, I don't think the presence in scap_targets should affect train, so I'm deescalating this. Whether or not depooled hosts should still be present in scap_targets is up for debate.

dduvall lowered the priority of this task from Unbreak Now! to Medium. Jan 2 2024, 6:35 PM

dduvall removed a parent task: T350088: 1.42.0-wmf.12 deployment blockers.

I agree the train should be unblocked and lowering it from UBN to High seems correct.

I also agree that scap_targets should only influence scap deployment.

edit: well, High or Medium :)

Dzahn lowered the priority of this task from High to Medium. Jan 2 2024, 6:36 PM

Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.	Sun 31 Dec 2023 19:43:14
Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.	Sun 31 Dec 2023 19:43:14
CPU 1 machine check error detected.	Sun 31 Dec 2023 19:43:14
CPU 1 machine check error detected.

We are seeing errors on DIMM B1 and CPU1.
I am going to swap CPU1 with CPU2 and DIMM B1 with DIMM A1 to see whether the error then shows up on CPU2 or DIMM A1.

Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. Jan 3 2024, 5:23 PM

After swapping the CPU and DIMM, I am now getting:

CPU 2 MEM012 VPP PG voltage is outside of range.	Wed 03 Jan 2024 17:43:07
CPU 1 MEM012 VPP PG voltage is outside of range.

and the server no longer powers up.
I will put in an order for Dell to send us a mainboard.

Create Dispatch: Success
You have successfully submitted request SR182660280.

hashar unsubscribed. Jan 3 2024, 7:48 PM

Your dispatch shipped on 1/3/2024 4:20 PM

Mainboard replaced by @Jhancock.wm. She is running the provision cookbook now.

@Jhancock.wm what I did to get the provision cookbook to PASS was to reset the iDRAC password and re-run the cookbook.
@Dzahn the host is back up.

mw2394 squared up and repooled, set back to active in Netbox

Icinga downtime and Alertmanager silence (ID=87150827-7740-4075-ada6-a08469c8b7f6) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bad DIMM

mw2394.codfw.wmnet

Reopening this task since the hardware failures for this server happened very close to each other. mw2394 crashed this morning due to a DIMM error:

-------------------------------------------------------------------------------
Record:      1022
Date/Time:   01/17/2024 11:46:21
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   01/17/2024 11:46:22
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      1024
Date/Time:   01/17/2024 11:46:22
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
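
As a rough aid for reading dumps like this, the following Python sketch splits a System Event Log paste into records and lists the DIMM locations named in Critical entries. It assumes the exact layout shown above (dashed separators, `Key: value` lines); `parse_sel` and `critical_dimms` are hypothetical helper names, not an existing tool.

```python
import re

def parse_sel(text):
    """Split a dashed-line-delimited SEL dump into dicts of field name -> value."""
    records = []
    # Assumption: records are separated by lines made only of dashes.
    for chunk in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.strip().splitlines():
            # Split at the first colon only, so timestamps keep their colons.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records

def critical_dimms(records):
    """Return the sorted set of DIMM locations mentioned in Critical records."""
    locations = set()
    for rec in records:
        if rec.get("Severity") == "Critical":
            locations.update(re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")))
    return sorted(locations)

SAMPLE = """\
-------------------------------------------------------------------------------
Record:      1022
Date/Time:   01/17/2024 11:46:21
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   01/17/2024 11:46:22
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_B3. Immediately replace the DIMM.
-------------------------------------------------------------------------------
"""

records = parse_sel(SAMPLE)
print(critical_dimms(records))
```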

Downtimed, removed from pool and marked as Failed in Netbox.

11:59 <+logmsgbot> !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=mw2394.codfw.wmnet

I will check on this this morning. Thank you for depooling.

I moved the DIMM to a different slot and the error moved with it. I've put in a dispatch with Dell (SR183504113).
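
The reasoning behind the swap test can be written down as a small Python sketch. `interpret_swap` is a hypothetical helper, and the slot names in the usage line are placeholders, since the task does not say which slot the DIMM was moved to.

```python
def interpret_swap(original_slot, new_slot, error_slot_after_swap):
    """Interpret a swap test: a part is moved from original_slot to new_slot.

    Returns "component" if the error followed the part to its new slot,
    "slot" if the error stayed at the original slot (pointing at the
    slot or board), and "inconclusive" otherwise.
    """
    if error_slot_after_swap == new_slot:
        return "component"
    if error_slot_after_swap == original_slot:
        return "slot"
    return "inconclusive"

# Placeholder slot names: "B3" is from the log above, "A3" is made up.
print(interpret_swap("B3", "A3", "A3"))
```

An error that follows the part points at the DIMM itself, which is why a replacement DIMM was requested here rather than another mainboard.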

Thank you @Jhancock.wm

@Clement_Goubert I replaced the DIMM and the error has cleared. You should be able to add it back.

Repooled, thank you @Jhancock.wm
