
hw troubleshooting: system right cp board missing in new host backup1006
Closed, Resolved · Public · Request

Description

  • - Provide FQDN of system.

backup1006.eqiad.wmnet

  • - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.

new host, set role(insetup)

  • - Put system into a failed state in Netbox.
  • - Provide urgency of request, along with justification (redundancy, dependencies, etc)

New host in a group of 4 new hosts, so this does not need to be fixed ahead of issues in in-service groups, but it blocks deploying the host with full confidence.

Date/Time:   06/23/2021 15:57:13
Source:      system
Severity:    Critical
Description: The System Board CP Right is absent.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   06/25/2021 13:07:15
Source:      system
Severity:    Critical
Description: The System Board CP Right is absent.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/02/2021 14:15:57
Source:      system
Severity:    Critical
Description: The System Board CP Right is absent.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   06/08/2021 00:19:19
Source:      system
Severity:    Critical
Description: The System Board CP Right is absent.
-------------------------------------------------------------------------------

This is likely due to a cable or board being unseated during shipment. An onsite will need to pop this chassis open and check the connector.
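To check whether the same event keeps recurring, the log excerpt above can be tallied with a short script. The following is only a minimal sketch: it assumes the log text has exactly the `Key: value` layout and dashed separators shown above, and the names `SAMPLE`, `parse_records`, and `count_critical` are hypothetical helpers, not part of any existing tooling.

```python
import re
from collections import Counter

# Sample text copied from the log excerpt above (records 3 and 4 only).
SAMPLE = """\
Record:      3
Date/Time:   06/25/2021 13:07:15
Source:      system
Severity:    Critical
Description: The System Board CP Right is absent.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/02/2021 14:15:57
Source:      system
Severity:    Critical
Description: The System Board CP Right is absent.
-------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split the text on dashed separator lines; return one dict per record."""
    records = []
    for chunk in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip():
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def count_critical(records):
    """Count how often each description appears with Severity 'Critical'."""
    return Counter(
        r.get("Description", "")
        for r in records
        if r.get("Severity") == "Critical"
    )


if __name__ == "__main__":
    for description, count in count_critical(parse_records(SAMPLE)).items():
        print(f"{count}x {description}")
```

Running the same tally again after the log has been cleared and the host power cycled would show whether the "CP Right is absent" event has come back.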

  • - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
  • - once the error has been cleared, set the host to 'staged' in netbox and resolve this task so the Data-Persistence-Backup folks know they can have this host.

Event Timeline

Assigning this to Chris for him to pop this chassis open and investigate if everything is seated. He returns from his vacation before John returns from his, but either of them could investigate this.

Since this is a new host, I suspect this is just unseated due to shipment.

LSobanski added a subscriber: jcrespo.
LSobanski subscribed.

Adding @jcrespo for visibility.

I will be able to take a look at this later today or first thing tomorrow. I briefly looked yesterday, but it's one of the new servers that have a front bank of disks and a back bank, and figuring out how to open it up will take a little time.

I am shutting it down and downtiming it until Monday just in case.

I am extending the downtime for a week from now so it doesn't alert while shutdown.

I am extending the downtime for an extra week.

I just noticed backup1006.mgmt has not been responding to ping since yesterday. Did some maintenance start, or is this a real connectivity issue (so I can acknowledge it on Icinga)?

I’m sorry, I took it down to work on the server and was sidetracked with
immediate network repairs and forgot to get back to it. I will take care
of it today.

I pulled the system out of the rack and checked the seating of all connections, then cleared the log and power cycled. Let's see if the error returns. I have never seen this message and I am not 100% sure what it means. If it comes back I will need to open a ticket with Dell.

The error has not returned. If it appears again, please re-open and ping me.