
cp5006 unresponsive
Closed, Resolved, Public

Description

According to racadm, cp5006 is powered off, but commands to power it on fail with a timeout. Already tried "racadm racreset". There's a raclog entry that mentions communication failure to the IME, and lookup on that message suggests pulling power cords to fix.

Event Timeline

This server is dead; we need to open a case with Dell for a mainboard replacement. Nothing happens on the server when the power button is pushed.

All the normal troubleshooting was done on the server.

  • Unplugging the power
  • Removing the PSUs for 15 minutes while working on the router

Server will not power on.

There was a bit of confusion on this. I've opened a self dispatch (SR961821650) to try to get the part sent out. However, as it's the first Singapore dispatch, it will likely fail. The Netherlands dispatch failed until Dell tweaked our account, so I'm waiting for them to deny the request and reply on SR961821650.

I did request that a Dell tech go onsite, so if the part arrives AFTER papaul has left, the Dell tech will have to be supervised by smarthands while they perform the mainboard swap and set the drac up again (with a temp password).

Reminder: after hardware level is fixed and the host is installed, we'll need to uncomment its entry in hieradata/common/cache/upload.yaml before it will successfully puppetize and join the cluster.

Ok, picking this back up today, and attempting to move it along.

I didn't attach any failure logs to this task before. Attempting to log in to the mgmt interface on cp5006.mgmt.eqsin.wmnet, I get no response via ssh or https. I can log in to other cp500X systems via ssh, but cannot pull up their https mgmt interface.

Since I cannot log in to cp5006, getting Dell to send replacements without troubleshooting may be problematic.

cp5006.mgmt.eqsin.wmnet is unresponsive to ping/ssh/https requests, while others (like cp5005) work just fine. This is now at a state where I cannot troubleshoot it further without remote/smarthands assistance. I'll detail the steps for smarthands to take in my next comment, which @faidon can review and approve for the smarthands order.
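For the record, the ssh/https part of that comparison can be repeated with a small script. This is only a sketch, not the tooling actually used on this task: the function names `tcp_port_open` and `probe` are made up for illustration, it assumes the standard ports 22 and 443, and it checks TCP reachability only (it does not send ICMP pings or attempt a login).

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def probe(host, ports=(22, 443), timeout=3.0):
    """Map each port (assumed: 22 = ssh, 443 = https) to its reachability."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example usage (hostname taken from this task):
#   print(probe("cp5006.mgmt.eqsin.wmnet"))
```

Running `probe` against a known-good mgmt host alongside cp5006 gives a quick side-by-side of which services answer.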

I'd like someone else to give my directions a once-over and ensure I'm not missing anything. The cable info is documented on the eqsin connection tracking google sheet (not linked here since it's not a public document).

Equinix Support,

We're having issues connecting to one of the servers in 06:040020:0604. The server in question is labeled 'cp5006' and is located in U21 of 06:040020:0604.

We would like a few things tested/confirmed. Please review the following list of steps.

  1. Confirm cp5006 has a green network cable plugged into the drac enterprise/ilom network port on the back. This cable should be numbered 1054 on both ends, and plugged into msw2-eqsin:port4.
  2. Confirm cable 1054 shows a link light on both the switch msw2-eqsin and the drac port on cp5006. If it does, please move down to step 4.
  3. If there is no link light, please replace the cable with another green cable (there should be spare green network cables not in use in our rack, at either the very top or bottom). Please check if a link light then shows up for port4. If not, try another open port. If that doesn't solve it, move back to port4 for additional troubleshooting on the system itself.
  4. Connect a crash cart to cp5006; we will need to confirm a few settings on the system.
  5. Boot into the BIOS by power cycling the system and pressing F2 (when prompted during POST). The BIOS screen also lets you modify the DRAC/ilom settings.
  6. Once in the BIOS, select the second option, 'iDRAC Settings', to confirm/edit the idrac settings.
  7. Once in the idrac settings, go down to the 'Network' option and enter it. In the next steps (8 and 9), we'll be confirming settings. If any setting doesn't match the steps below, fix it so it does and let us know which settings had to be changed.
  8. On 'iDRAC Settings > Network', ensure 'Enable NIC' is Enabled.
  9. Scroll down to 'IPV4 SETTINGS' and confirm the following: Enable IPv4 Enabled, Enable DHCP Disabled, Static IP address of 10.132.129.106, Static Gateway of 10.132.128.1, & Static Subnet Mask of 255.255.128.0.
  10. Once all the above is confirmed, there should be a link light for the drac (green) network cable on the back. Please confirm all steps taken above and detail any changes that had to be made.
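As a sanity check on the static IPv4 values in the list above, the address, gateway, and mask can be tested for consistency with Python's standard `ipaddress` module. This is an illustrative sketch only; the helper name `check_static_config` is hypothetical and not part of any existing tooling.

```python
import ipaddress


def check_static_config(ip, gateway, netmask):
    """Check that a static IP and gateway are consistent with a subnet mask."""
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    gw = ipaddress.ip_address(gateway)
    net = iface.network
    return {
        "network": str(net),
        # The gateway must sit inside the same subnet as the host address.
        "gateway_in_subnet": gw in net,
        # The host must not be the network or broadcast address.
        "ip_is_usable_host": iface.ip not in (net.network_address, net.broadcast_address),
        "ip_differs_from_gateway": iface.ip != gw,
    }


# Values from the IPv4 settings step in the list above.
result = check_static_config("10.132.129.106", "10.132.128.1", "255.255.128.0")
print(result["network"])  # 10.132.128.0/17
```

With the values from the list, the address and the gateway both fall inside 10.132.128.0/17, so the settings are at least self-consistent.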

After discussion with @Cmjohnson, it's been decided we'll go ahead and attempt to get the mainboard replaced before doing the smarthands work I suggested above. @Papaul was onsite and did the steps:

All the normal troubleshooting was done on the server.

  • Unplugging the power
  • Removing the PSUs for 15 minutes while working on the router

Server will not power on.

I'll open a case to get a new mainboard sent out with a Dell Tech as well. The notes will state they need to schedule the Dell Tech with me directly.

{F23557620}

Self dispatch SR971650695 scheduled, including a request for an onsite technician.

Once they send me the shipping info, I'll open an inbound shipment ticket with eqsin. I'll then schedule/coordinate with the Dell tech, since we'll have to pay smarthands to escort the tech to our cage and such.

BBlack raised the priority of this task from Medium to High. Jul 16 2018, 3:35 PM

Turning priority to "high" for this and the 5001 ticket ( T199675 ), as between the two of them they leave upload@eqsin at its design limit of 4 reliable nodes.

I put in the self dispatch last week, but have not gotten a reply on it. I'll fall back to simply calling into technical support daily until this gets a resolution.

They kicked back a denial, and now I'm on with support to file a new request, without self dispatch. Request 975216005 filed!

Scheduling the Dell technician visit is happening via email, as it involves Dell support selecting and dispatching a tech. (I have to have the tech's name 48 hours before they arrive at the site to open the smarthands escort task.) This is progressing.

Email from Dell:

Hi Rob,

The part dispatched is done and the reference number for this dispatch is DPS 91911999981.

As such, our onsite engineer will email you the security information needed to you before onsite.

Thanks.

Regards,
Isaac Khoo

So now just awaiting their local tech to email me his contact info for inclusion on the on-site/EQ ticket.

They provided me with a list of 7 names; I asked them to specify which one (or two) are going onsite. It takes them 24 hours to get back to me on any reply.

Got it down to two names and submitting a smarthands ticket for escort on Wednesday, July 25th.

Site Visit Ticket #: 1-162553077672
SmartHands Escort Ticket #: 1-162554266089

Emailed info over to the Dell tech and it's scheduled for 9am this Wednesday. (They may show up later; 9am is the earliest.)

Email sent to the team list so all other SRE team members are aware of this work next Wednesday (2018-07-25).

I think they did something, as the password for mgmt ssh appears to be reset (can't get in anymore).

Ok, so the email back from them when I woke up this AM was a bit confusing, but boils down to this:

  • They seem to have replaced the mainboard, and set the temp drac password as requested.
  • @RobH logged in via mgmt and temp pass successfully.
  • @RobH will login and re-setup and deploy this system with an OS, after auditing the system remotely.

Change 447849 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp5006: uncomment in upload hieradata list

https://gerrit.wikimedia.org/r/447849

Change 447849 merged by BBlack:
[operations/puppet@production] cp5006: uncomment in upload hieradata list

https://gerrit.wikimedia.org/r/447849

cp5006 is now installed, puppetized, and in service. It should be all fixed up, assuming nothing bursts into flames in the near future.