
cloudvirt1038: PCIe error
Open, High, Public

Description

I detected this when rebooting the server:

Enumerating Boot options...
Enumerating Boot options... Done

UEFI0067: A PCIe link training failure is observed in Slot1 and the link is
disabled.
Do one of the following: 1) Turn off the input power to the system and turn on
again. 2) Update the PCIe device firmware. If the issue persists, contact your
service provider.
 

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for Lifecycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

The server won't boot.
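The console message above has a fixed shape (a message ID followed by a slot number), so captured console output can be checked for it mechanically. The following is a minimal Python sketch, assuming the wording of the message stays as shown here; the function name `find_link_training_failures` is hypothetical and not part of any Dell or Wikimedia tooling.

```python
import re

# Matches the message ID and slot number seen in the console output above.
# Assumption: the phrase "link training failure is observed in Slot<N>"
# is stable across firmware versions; this has not been verified.
_PATTERN = re.compile(
    r"(UEFI\d+):\s+A PCIe link training failure is observed in Slot\s*(\d+)"
)


def find_link_training_failures(console_text: str) -> list[tuple[str, int]]:
    """Return (message_id, slot) pairs found in captured console output."""
    # The console wraps long lines, so collapse all whitespace runs first.
    flattened = " ".join(console_text.split())
    return [(m.group(1), int(m.group(2))) for m in _PATTERN.finditer(flattened)]
```

Collapsing whitespace before matching keeps the pattern simple even when the message is wrapped mid-sentence, as it is in the output pasted above.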

DC Ops troubleshooting

I've not seen this, but unless it happens twice it doesn't count! So, steps to fix:

cloudvirt1038 is located in D5:U14

  • onsite unplugs all power, fully depowering the system. Then plug back in after 15-30 seconds and attempt to boot the system. - did not fix error
  • update firmware: iDRAC, BIOS, network card (only item on PCIe bus) - did not fix error
  • open task with Dell support, have them dispatch a tech (see note below)

tech note: This error will require some troubleshooting with the network card, the PCIe riser, and the mainboard. This can be done by our own on-sites, or we can leverage our warranty coverage to have Dell send out a technician. Trying to do this via self-dispatch will be cumbersome, as it's unclear where to start. The options are to have an on-site self-dispatch parts and work with Dell support via email to determine which parts to replace and how to fix the problem, or to have an on-site call Dell support and schedule a Dell technician to come out and work on this in-warranty host to determine which part is faulty.

When the system is booting successfully again, please change the 'failed' state in netbox back to 'active'.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T12:52:32Z] <arturo> cloudvirt1038 hard powerdown / powerup for T276922

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T13:31:44Z] <arturo> hard-reboot tools-docker-registry-04 because issues related to T276922

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T13:32:03Z] <arturo> hard-reboot deployment-db05 because issues related to T276922

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T13:35:02Z] <arturo> icinga-downtime cloudvirt1038 for 30 days for T276922

aborrero triaged this task as High priority.
aborrero moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.
aborrero added projects: DC-Ops, ops-eqiad.
aborrero added a subscriber: RobH.

Hey @RobH does this sound like something we have seen before in our fleet?

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

I've not seen this, but unless it happens twice it doesn't count! So, steps to fix:

cloudvirt1038 is located in D5:U14

  • onsite unplugs all power, fully depowering the system. Then plug back in after 15-30 seconds and attempt to boot the system.
  • If that works, resolve this task as it only happened once and thus isn't worth spending more time on.
  • If that does not work, you can reassign this task back to Rob and he'll flash firmware updates across the host remotely.
  • firmware update

When the system is booting successfully again, please change the 'failed' state in netbox back to 'active'.

Cmjohnson added a subscriber: Cmjohnson.

Robh: didn't work, try updating f/w please.

Ok, Updating the firmware in this order: idrac, network card (only pcie device installed), bios.

RobH reassigned this task from RobH to Cmjohnson. Edited Mar 9 2021, 8:13 PM
RobH updated the task description.
RobH removed a subscriber: RobH.

Updated firmware, but error persists.

BIOS now 2.10.0
iDRAC now 4.40.00
NIC now 21.60.22.11 & 21.60.16

I've updated the checklist on the task description, along with suggested next steps:

  • open task with Dell support, have them dispatch a tech (see note below)

tech note: This error will require some troubleshooting with the network card, the PCIe riser, and the mainboard. This can be done by our own on-sites, or we can leverage our warranty coverage to have Dell send out a technician. Trying to do this via self-dispatch will be cumbersome, as it's unclear where to start. The options are to have an on-site self-dispatch parts and work with Dell support via email to determine which parts to replace and how to fix the problem, or to have an on-site call Dell support and schedule a Dell technician to come out and work on this in-warranty host to determine which part is faulty.

I'm assigning this back to Chris since the next steps all require on-site coordination (direct troubleshooting or scheduling a dell tech).

I left the host powered off.

I see this server is down and will need to go through the process of removing all the components and adding them back 1 by 1 until I can figure out what is causing the error. This will happen this week and I will update the task once I know more.
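The one-by-one procedure described above can be written down as a short loop. This is only a sketch of the reasoning in Python, not a tool anyone ran on this host: `boots_cleanly` is a hypothetical callback standing in for a manual test boot with the given parts installed, and the component names in the usage note are illustrative.

```python
from typing import Callable, Optional, Sequence


def first_failing_component(
    components: Sequence[str],
    boots_cleanly: Callable[[list[str]], bool],
) -> Optional[str]:
    """Add components back one at a time; return the first that breaks boot.

    Returns None if the system still boots with everything installed.
    Raises RuntimeError if the stripped-down system already fails, which
    points away from the add-in parts (for example at the mainboard or
    cabling).
    """
    installed: list[str] = []
    # Baseline check with every removable part taken out.
    if not boots_cleanly(installed):
        raise RuntimeError("error persists with all components removed")
    for part in components:
        installed.append(part)
        # Pass a copy so the callback cannot alter the running list.
        if not boots_cleanly(list(installed)):
            return part
    return None
```

For example, `first_failing_component(["riser1", "riser2", "nic"], test_boot)` would return `"nic"` if the error first reappears once the NIC is back in, where `test_boot` is whatever manual check the on-site performs.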

So far I have found nothing.
This is a PCI error. In the past, this would have meant a blown capacitor, but this is the first time I've seen this error since we left Tampa. I tried removing and replacing the riser cards one at a time, but that did not work. I also removed the NIC, since it sits in a PCI slot, and that did not work either. My guess is that the motherboard needs to be replaced. Dell's system log doesn't show an error. I need to submit a ticket to Dell and will update the task once I get a response.

Dell Ticket Created

You have successfully submitted request SR1055971142.

The Dell tech came out and replaced the motherboard, but that did not fix the issue. It turns out that there is a bad cable to the backplane. A new part has been ordered.

Another Dell tech arrived today with what was believed to be the replacement part. The part was replaced and the error persisted. Several reboots and TSR reports later, we still do not know what is going on. At one point, reseating the CMOS battery worked, but the PCI error returned on the next reboot. The Dell technician is still on-site, troubleshooting with Dell tech support. Once something has been decided I will update the task.

Dell has us on a wild goose chase. Responded to their questions with the following:

  1. Has this system ever had a PCI card in slot 1? If so, what card? No, the system has not had a PCI card in slot 1 or any slot for that matter.
  2. Has this system ever had a PERC / backplane drives / storage other than the BOSS card? No, no additional RAID controllers, backplane drives, or other storage have been added. We have multiples of the same server with the same configuration and this is the only server that has given this type of error.
  3. Did the fan issue and VGA issue start after the system board was replaced? I cannot answer that question because I am not the one who replaced the board. I know that the original technician replaced it and the error persisted; he was there for several hours and was finally able to get the system through POST and booting from the hard drive. That was when it was determined to be a riser card. Technician 2 came to replace the card and the error returned. This is who you talked to the other day.