
labstore1005 A PCIe link training failure error on boot
Open, Stalled, High, Public

Description

labstore1005 displayed the following error on boot. It booted normally after pressing F1, but it's unclear whether the system will reboot successfully without manual intervention.

UEFI0067: A PCIe link training failure is observed in PCIe Slot 6 and the link
is disabled.
Do one of the following: 1) Turn off the input power to the system and turn on
again. 2) Update the PCIe device firmware. If the issue persists, contact your
service provider.

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

Event Timeline

herron created this task. Jun 30 2017, 1:52 AM
Restricted Application added a subscriber: Aklapper. Jun 30 2017, 1:52 AM
herron updated the task description. Jun 30 2017, 1:53 AM
Andrew added a subscriber: Andrew.

I tagged dc-ops because... have y'all ever seen something like this?

bd808 added a subscriber: bd808. Jun 30 2017, 2:07 AM

http://www.dell.com/support/manuals/us/en/04/dell-opnmang-sw-v8.1/EEMI_13G_v1.2-v1/UEFI-Event-Messages?guid=GUID-823669E3-2D7B-41B5-85F1-AF7A6BC11ACC&lang=en-us

UEFI0067

Message
    A PCIe link training failure is observed in arg1 and device link is disabled. 
Arguments
    arg1 = PCIe device 
Detailed Description
    A PCIe link failure is observed in the PCIe device identified in the message and device link is disabled. 
Recommended Response Action
    Do one of the following: 1) Turn off the input power to the system and turn on again. 2) Update the PCIe device firmware. If the issue persists, contact your service provider. 
Category
    System Health (UEFI = UEFI Event) 
Severity
    Severity 1 (Critical)
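
Since the plan is to watch for this message on future reboots, here is a minimal sketch of how a captured boot console log could be scanned for it. It assumes the console output has been saved as text somewhere; the function name `find_link_training_failures` is hypothetical and not part of any Dell or Wikimedia tooling. The pattern accepts both the wording seen on this host ("and the link is disabled") and the wording in the Dell message reference ("and device link is disabled").

```python
import re

# Matches the UEFI0067 message and captures arg1 (the PCIe device/slot).
# Assumption: the message text follows the wording quoted in this task.
UEFI0067 = re.compile(
    r"UEFI0067:\s*A PCIe link training failure is observed in\s+"
    r"(?P<device>.+?)\s+and\s+(?:the|device)\s+link\s+is\s+disabled\."
)


def find_link_training_failures(console_text):
    """Return the device names (arg1) from every UEFI0067 message found."""
    # The console hard-wraps long lines, so collapse all whitespace first.
    flat = " ".join(console_text.split())
    return [m.group("device") for m in UEFI0067.finditer(flat)]


if __name__ == "__main__":
    sample = (
        "UEFI0067: A PCIe link training failure is observed in PCIe Slot 6 and the link\n"
        "is disabled.\n"
    )
    print(find_link_training_failures(sample))
```

An empty list means the message was not found in the captured text, which says nothing about whether the console capture itself was complete.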

We did another reboot to downgrade the kernel back to 4.3 and the error happened again.

chasemp triaged this task as High priority. Jul 3 2017, 5:47 PM
chasemp added subscribers: Christopher, chasemp.

I tagged dc-ops because... have y'all ever seen something like this?

This server has failed to come back on its own after a reboot a few times, so we are a bit scared of it atm :)

@Cmjohnson I pinged the wrong Chris before :) As of this moment labstore1005 is the standby, so if you have time to look at this, that would be great, Chris. Thanks.

Bstorm added a subscriber: Bstorm.

This is pretty old. We'll have to reboot it again to know if this is still happening. I suspect it actually isn't.

This should be watched for during the upgrade process.

RobH assigned this task to Bstorm. Apr 13 2020, 3:34 PM
RobH added a subscriber: RobH.

Please note this was NOT in ops-eqiad, and was likely being overlooked by the eqiad onsites for that reason. (It is also not assigned to anyone, so no one is touching it.)

@Bstorm: I'm assigning this to you for feedback, please advise and detail what you need from DC ops for this task. As of now, there is no direct ask other than 'have you seen anything like this' but this is also a very very old task.

Please advise on the above and if this needs a specific person to review it, please assign to them for feedback.

JHedden changed the task status from Open to Stalled. May 5 2020, 4:30 PM
JHedden added a subscriber: JHedden.

Waiting for the next reboot of this host

RobH removed a subscriber: RobH. May 5 2020, 8:01 PM

Managed to get the error after 2 more reboots!

UEFI0067: A PCIe link training failure is observed in PCIe Slot 6 and the link
is disabled.
Do one of the following: 1) Turn off the input power to the system and turn on
again. 2) Update the PCIe device firmware. If the issue persists, contact your
service provider.

Hitting F1 and letting it boot resulted in a totally working system. I'm not sure I care enough to fight with it again until refresh.