mw2286 stuck after reboot
Closed, Resolved (Public)

Description

Host mw2286 (codfw row D / D4) is stuck after a reboot. It did not recover properly: there is no SSH or network connectivity on the main interface. The host is responding on mw2286.mgmt.codfw.wmnet. A racadm serveraction powercycle did not bring the server back up.

racadm lists some critical errors:

/admin1-> racadm getsel
Record:      1
Date/Time:   02/19/2018 17:02:49
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   03/29/2018 20:01:21
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   03/29/2018 20:01:21
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   08/17/2018 10:27:03
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/17/2018 10:27:03
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/01/2021 20:26:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/01/2021 20:26:06
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/25/2022 14:30:42
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/25/2022 14:30:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/25/2022 15:04:47
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      11
Date/Time:   04/25/2022 15:04:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   04/25/2022 15:42:32
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      13
Date/Time:   04/25/2022 15:42:32
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
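
To make the pattern in this log easier to see, here is a minimal sketch that parses `racadm getsel` text and counts Critical records per DIMM. It assumes the exact "Key: value" layout and dashed separators shown above. The names `parse_sel`, `critical_dimms` and `SAMPLE` are hypothetical helpers for illustration, not part of racadm or any Wikimedia tooling.

```python
import re
from collections import Counter

# Two records copied from the log above, used as sample input.
SAMPLE = """\
Record:      2
Date/Time:   03/29/2018 20:01:21
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/17/2018 10:27:03
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
"""

# Assumption: DIMM slots are named like DIMM_A1, as in the log above.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split getsel output into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A dashed line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_dimms(records):
    """Count how often each DIMM appears in a Critical record."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") == "Critical":
            counts.update(DIMM_RE.findall(rec.get("Description", "")))
    return counts


if __name__ == "__main__":
    print(critical_dimms(parse_sel(SAMPLE)))
```

Run against the full log above, a count like this would point at DIMM_A1 as the only slot with Critical entries.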

Event Timeline

20:27 <RoanKattouw> mw2286 timed out during my deployments BTW, is that a known issue?

^ to avoid these we would also have to remove them from scap ("dsh") groups by setting them to "pooled=inactive"


<+logmsgbot> !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: dc=codfw,name=mw2286.codfw.wmnet
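
For anyone auditing which hosts were set inactive, the log line above can be split into its action and selector. This is only a sketch that assumes the exact "conftool action : ...; selector: ..." layout of that one line. `parse_conftool_log` is a hypothetical helper, not part of conftool.

```python
def parse_conftool_log(line):
    """Extract verb, key, value and selector from a conftool log line."""
    _, found, rest = line.partition("conftool action")
    if not found:
        raise ValueError("not a conftool log line")
    # Drop the " : " that follows the marker.
    rest = rest.lstrip(" :")
    action_part, _, selector_part = rest.partition(";")
    selector_part = selector_part.strip()
    prefix = "selector:"
    if selector_part.startswith(prefix):
        selector_part = selector_part[len(prefix):]
    # The selector is a comma-separated list of key=value pairs.
    selector = dict(
        item.strip().split("=", 1)
        for item in selector_part.split(",")
        if item.strip()
    )
    # The action looks like "set/pooled=inactive".
    verb, _, assignment = action_part.strip().partition("/")
    key, _, value = assignment.partition("=")
    return {"verb": verb, "key": key, "value": value, "selector": selector}


if __name__ == "__main__":
    line = (
        "<+logmsgbot> !log dzahn@cumin2002 conftool action : "
        "set/pooled=inactive; selector: dc=codfw,name=mw2286.codfw.wmnet"
    )
    print(parse_conftool_log(line))
```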

dcops: The host is depooled, you can work on this any time you want to.

Dzahn triaged this task as Medium priority. Apr 25 2022, 11:02 PM

Mon Apr 25 2022 15:42:32  Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
Mon Apr 25 2022 15:42:32  A problem was detected in Memory Reference Code (MRC).
Mon Apr 25 2022 15:04:47  Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
Mon Apr 25 2022 15:04:47  A problem was detected in Memory Reference Code (MRC).
Mon Apr 25 2022 14:30:42  Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
Mon Apr 25 2022 14:30:42  A problem was detected in Memory Reference Code (MRC).
Mon Feb 01 2021 20:26:06  Correctable memory error rate exceeded for DIMM_A1.
Mon Feb 01 2021 20:26:06  Correctable memory error rate exceeded for DIMM_A1.
Fri Aug 17 2018 10:27:03

Papaul claimed this task.

Replaced DIMM_A1; the server is back up.

Thank you!

The server is still depooled, though. As with racking tasks, this needs some agreement on the workflow: either the tickets should come back to us, or we need to create new tickets for follow-up, or you would have to do the repooling. With the current process it is very easy for us to forget, because the ticket is already closed.

Mentioned in SAL (#wikimedia-operations) [2022-05-02T18:06:46Z] <mutante> repooling mw2286 after hardware repair - T306823