
decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset)
Closed, Resolved (Public)

Description

Hi!

I found mw2280 among the Icinga alarms this morning. The host was stuck (no tty in the mgmt console, unresponsive to ssh, etc.), so I tried a powercycle (the command timed out) and a hardreset (it succeeded but had no effect).
I didn't find anything useful in racadm getsel, but it seems that the host is in trouble.

Event Timeline

The host currently has status inactive, so we can do maintenance anytime!

I was able to ssh to mgmt and I got this from racadm getsel:

cmdstat
	status       : 2
	status_tag   : COMMAND PROCESSING FAILED
	error        : 253
	error_tag    : COMMAND NOT RECOGNIZED
/admin1-> racadm getsel
Record:      1
Date/Time:   02/19/2018 17:32:22
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/10/2021 07:26:20
Source:      system
Severity:    Critical
Description: CPU 1 M01 VPP PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/10/2021 07:26:21
Source:      system
Severity:    Critical
Description: CPU 1 M23 VPP PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   09/10/2021 07:26:21
Source:      system
Severity:    Critical
Description: The system board 5V SWITCH PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   09/10/2021 07:26:21
Source:      system
Severity:    Critical
Description: The system board 1.05V PG voltage is outside of range.
-------------------------------------------------------------------------------
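For anyone triaging a similar log, here is a minimal sketch (not part of any existing tooling) of how the `racadm getsel` text above could be split into records and filtered down to the Critical ones. It assumes the output keeps the `Field: value` layout and dashed separators shown here; the names `parse_sel`, `critical` and `SAMPLE` are made up for illustration.

```python
import re
from typing import Dict, List

# Assumption: records are separated by a line made only of dashes,
# and each field looks like "Name:   value", as in the paste above.
SEPARATOR = re.compile(r"^-{5,}\s*$")
FIELD = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split racadm getsel text output into one dict per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if SEPARATOR.match(line):
            # A dashed line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = match.group(2).strip()
    if current:
        # Keep a trailing record that has no closing separator.
        records.append(current)
    return records


def critical(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the records whose Severity is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


# Two records copied from the log above, used as sample input.
SAMPLE = """\
Record:      1
Date/Time:   02/19/2018 17:32:22
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/10/2021 07:26:20
Source:      system
Severity:    Critical
Description: CPU 1 M01 VPP PG voltage is outside of range.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for rec in critical(parse_sel(SAMPLE)):
        print(rec["Date/Time"], rec["Description"])
```

Fed the full log, this would list the four voltage events from 09/10/2021 and skip the "Log cleared" entry. Lines that do not match a known field, such as the `cmdstat` block, are ignored.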
wiki_willy subscribed.

Just a heads up - Papaul is on paternity leave for a couple weeks, but let me know if this becomes urgent and we need to involve smart hands on anything. Thanks, Willy

Papaul triaged this task as Medium priority. Sep 27 2021, 3:05 AM
	 	Mon Sep 27 2021 14:13:45	The system board 5V SWITCH PG voltage is outside of range.	
	 	Mon Sep 27 2021 14:13:37	The system board fail-safe voltage is outside of range.	
	 	Mon Sep 27 2021 14:13:30	The system board 5V SWITCH PG voltage is outside of range.	
	 	Fri Sep 10 2021 07:26:21	The system board 1.05V PG voltage is outside of range.	
	 	Fri Sep 10 2021 07:26:21	The system board 5V SWITCH PG voltage is outside of range.	
	 	Fri Sep 10 2021 07:26:21	CPU 1 M23 VPP PG voltage is outside of range.

I looked at this server today and also did a power drain. It looks like a main board issue to me.
@wiki_willy the server has been out of warranty since 02/2021.

Hi @Papaul - do you have any decom'd servers around to replace this one? If not, we can either see if Service Ops would ok decommissioning it a couple years before its next refresh, or we can purchase a new board to keep this one running. Thanks, Willy

@wiki_willy yes we do have 10 decom'd servers of the same type.

@wiki_willy bad news: mw2280 and the 10 decom'd servers all have the same main board, but the connector on the cable that connects the power button to the main board is different, so we cannot use any of the decom'd servers.

Ok, thanks for checking on that @Papaul. Before ordering a new motherboard, maybe the next step is to check with Service-Ops if they can afford to decom this server. Thanks, Willy

@Joe this server is out of warranty and it has a main board problem. Do you think we can decom this server or do we have to buy a new main board to keep the server in production?

Thanks

Papaul lowered the priority of this task from Medium to Lowest. Oct 21 2021, 2:24 PM
Dzahn raised the priority of this task from Lowest to Low. Oct 21 2021, 6:00 PM

I think we can live without it and it's right to lower prio. ACK

Change 732756 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] conftool-data: remove mw2280.codfw.wmnet

https://gerrit.wikimedia.org/r/732756

Dzahn removed Dzahn as the assignee of this task. Oct 26 2021, 5:51 PM

Chatted about this a bit. While it's low prio, we would like the server back eventually, whether through buying a mainboard or just replacing it with a new server.

Change 732756 abandoned by Dzahn:

[operations/puppet@production] conftool-data: remove mw2280.codfw.wmnet

Reason:

we do want it back, though with low priority, in about 3 months or so

https://gerrit.wikimedia.org/r/732756

Just a quick update - Wolfgang requested that we buy a replacement server for this, so currently working on getting the budget approval to get it procured. Thanks, Willy

I set this device's status to "Failed" in Netbox as it was triggering alerts.

But if we're replacing this with a new server, then let's decom the broken mw2280?

Yea, so, this pretty much changed back to "as long as we are buying it again next time we are refreshing". So we can just turn this into decom either way. ACK

Dzahn renamed this task from mw2280 unresponsive to powercycle and hardreset to decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset). Nov 18 2021, 5:51 PM
Dzahn removed Papaul as the assignee of this task.
Dzahn raised the priority of this task from Low to Medium.
Dzahn added a project: serviceops.

correct me if i'm wrong @wiki_willy

Hi @Dzahn - there's an email thread with Alex, Lukasz, Faidon, and Mark around whether or not it's worth the extra cycles needed to replace mw2280. I don't think they've come to a final conclusion yet, but we're basically waiting on that, before proceeding forward.

Thanks Willy! Alright, will wait for that.

Regardless of the outcome we would remove the existing broken one, I think.

Yeah, independent of a replacement the server should be decommed if broken without a path towards fixing it.

Change 740240 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] conftool-data: remove mw2280

https://gerrit.wikimedia.org/r/740240

The server was in a zombie state: somehow already removed from puppetdb and icinga, but still "found physical host" when I ran the decom cookbook just in case (there was also nothing logged about it being run before, at least not in here).

Disable and reset vlan on asw-d3-codfw:ge-3/0/9 for local eno1
Delete IP 10.192.48.102/22 on eno1
Delete IP 2620:0:860:104:10:192:48:102/64 on eno1
Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
Sleeping for 20s to avoid race conditions...
Removed host mw2280.codfw.wmnet from Debmonitor
Removed from DebMonitor
----- OUTPUT of 'puppet node clea...2280.codfw.wmnet' -----
Notice: Revoked certificate with serial 7499
Notice: Removing file Puppet::SSL::Certificate mw2280.codfw.wmnet at '/var/lib/puppet/server/ssl/ca/signed/mw2280.codfw.wmnet.pem'
mw2280.codfw.wmnet
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...2280.codfw.wmnet'.
----- OUTPUT of 'puppet node deac...2280.codfw.wmnet' -----
Submitted 'deactivate node' for mw2280.codfw.wmnet with UUID 84b7df31-6ebe-4df6-9cb2-a3a772336aa6

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: mw2280.codfw.wmnet

  • mw2280.codfw.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for mw2280.mgmt.codfw.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
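Since the bolding mentioned in the last line does not survive a plain-text copy, here is a rough sketch of how the steps needing attention could be picked out of the summary by keyword instead. The keyword list is a guess based on the wording above, not something the cookbook defines, and `problem_steps` and `SUMMARY` are hypothetical names.

```python
import re
from typing import List

# Assumption: problem steps contain one of these words. This is a
# heuristic based on the output above, not a cookbook convention.
PROBLEM = re.compile(r"\b(failed|unable|not found)\b", re.IGNORECASE)


def problem_steps(summary: str) -> List[str]:
    """Return the summary steps that look like they need manual follow-up."""
    steps: List[str] = []
    for raw in summary.splitlines():
        # Drop indentation and the leading bullet character.
        step = raw.strip().lstrip("•").strip()
        if step and PROBLEM.search(step):
            steps.append(step)
    return steps


# The per-host summary copied from the cookbook output above.
SUMMARY = """\
  • mw2280.codfw.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for mw2280.mgmt.codfw.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
"""

if __name__ == "__main__":
    for step in problem_steps(SUMMARY):
        print(step)
```

On this summary the sketch flags the two Icinga lookups, the failed connection, and the failed power-off, and leaves the Netbox, DebMonitor and Puppet steps alone.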

Change 740240 merged by Dzahn:

[operations/puppet@production] site/conftool-data: remove mw2280

https://gerrit.wikimedia.org/r/740240

As you can see above, I ran the decom cookbook. Some things had somehow already been removed; others had not. The cookbook has now removed the host from debmonitor and DNS. It had already been dropped from puppetdb. The power-off failed because the host is already down.

From our side this host is gone now. It can be physically removed from the rack.

We've discussed this in last week's meeting. For now, it looks like we will not be replacing this hardware, as it is not worth it. We will, however, need to make sure we account for its capacity when the time comes for the refresh of the batch it belongs to.

Decommission complete

@akosiaris @Papaul thanks! ACK, we are done here then :))