cr1-codfw:fpc0 failure
Closed, Resolved · Public

Description

re0.cr1-codfw> show system alarms    
2 alarms currently active
Alarm time               Class  Description
2020-05-31 21:28:40 UTC  Minor  FPC 0 Temp Sensor Fail
2020-05-31 20:00:50 UTC  Major  FPC 0 Hard errors
re0.cr1-codfw> show interfaces descriptions 
Interface       Admin Link Description
et-0/0/0                   Core: asw-a-codfw:et-2/0/52 {#10702} [40Gbps DF]
et-0/0/1                   Core: asw-b-codfw:et-2/0/51 {#10703} [40Gbps DF]
et-0/2/0                   Core: asw-c-codfw:et-2/0/51 {#10704} [40Gbps DF]
et-0/2/1                   Core: asw-d-codfw:et-2/0/51 {#10705} [40Gbps DF]
re0.cr1-codfw> show chassis fpc      
                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Present          Absent
  1  Empty           
  2  Empty           
  3  Empty           
  4  Empty           
  5  Online            37     25          1       24     25     25    2048       40         29
cr1-codfw> show chassis hardware
FPC 0            REV 14   750-045372   [REDACTED]          MPCE Type 3 3D
  CPU            REV 10   711-035209   [REDACTED]          HMPC PMB 2G
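
For anyone who wants to spot this condition from a script rather than by eye, here is a minimal sketch that parses `show chassis fpc` text and lists slots that hold a card but are not Online. The sample rows are copied from the output above; the function names (`fpc_states`, `unhealthy_slots`) are hypothetical and this is not an existing Wikimedia tool.

```python
import re

# Sample rows copied from the `show chassis fpc` output in this task.
SAMPLE = """\
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Present          Absent
  1  Empty
  5  Online            37     25          1       24     25     25    2048       40         29
"""

# A slot row starts with the slot number followed by the state word.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)")


def fpc_states(output):
    """Return {slot: state} for each slot row in the CLI text."""
    states = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


def unhealthy_slots(states):
    """Slots that hold a card (not Empty) but are not Online."""
    return sorted(slot for slot, state in states.items()
                  if state not in ("Online", "Empty"))


if __name__ == "__main__":
    print(unhealthy_slots(fpc_states(SAMPLE)))
```

Treating every state other than Online and Empty as unhealthy is an assumption made for this sketch; a real check may want to tolerate transient states during boot.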

VRRP failed over cleanly to cr2-codfw, so there is no reason to depool codfw, but the linecard needs to be fixed or replaced ASAP.

Related Objects

Status: Resolved · Assigned: ayounsi

Event Timeline

ayounsi triaged this task as High priority. May 31 2020, 9:32 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Opened JTAC case 2020-0531-0098.

Logs and RSI attached to the case.

I checked the log messages and could see “I2c slave read back errors” on this FPC:

May 31 20:00:08  re0.cr1-codfw chassisd[4838]: CHASSISD_I2CS_READBACK_ERROR: Readback error from I2C slave forFPC 0 ([0x12, 0x21] -> 0x0)
May 31 20:00:08  re0.cr1-codfw kernel: PCF8584(WR): target ack failure on byte 0
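
To pull the affected FRU out of messages like the ones above, a small sketch follows. The regular expression is an assumption based only on the single chassisd line pasted here (it tolerates both "forFPC 0" as pasted and "for FPC 0"); `readback_fru` is a hypothetical helper name.

```python
import re

# The two log lines quoted in this task, copied verbatim.
CHASSISD_LINE = (
    "May 31 20:00:08  re0.cr1-codfw chassisd[4838]: "
    "CHASSISD_I2CS_READBACK_ERROR: Readback error from I2C slave "
    "forFPC 0 ([0x12, 0x21] -> 0x0)"
)
KERNEL_LINE = (
    "May 31 20:00:08  re0.cr1-codfw kernel: "
    "PCF8584(WR): target ack failure on byte 0"
)

# Assumed shape of the message: "... for<optional space><FRU type> <slot>".
PATTERN = re.compile(r"CHASSISD_I2CS_READBACK_ERROR: .*?for\s*(\w+) (\d+)")


def readback_fru(line):
    """Return (fru_type, slot) for a readback error line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    for line in (CHASSISD_LINE, KERNEL_LINE):
        print(readback_fru(line))
```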

“These ‘I2C slave’ errors are logged when the chassis process (chassisd) cannot read back information from the I2C slave (I2CS) about the indicated component (field-replaceable unit, or FRU).
The I2C master initiates all communication on the I2C bus and supplies the clock for all slave devices. If there is an issue with the I2C slave, it disrupts communication with the I2C bus and the device ID, and data on the FRU component is disrupted. chassisd identifies the failed FRU component in the log messages.”
In order to know whether or not this is a permanent problem, please proceed with the following steps:

  1. Restart the FPC: ''request chassis fpc slot 0 restart'' ----> If the alarm does not clear, then:
  2. Physically reseat (jack out and jack in) the FPC card in slot 0. ----> If the alarm does not clear, then:
  3. An RMA would need to be created.
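
The escalation order in the steps above can be sketched as a small function. This is only an illustration of the restart, reseat, RMA sequence: `alarm_cleared_after` is a hypothetical callback that performs the named step (by whatever means) and reports whether the alarm cleared; nothing here talks to a router.

```python
from typing import Callable

# Remediation steps in the order JTAC asks for them.
STEPS = ["restart", "reseat"]


def escalate(alarm_cleared_after: Callable[[str], bool]) -> str:
    """Try each step in order; fall through to an RMA if none clears the alarm."""
    for step in STEPS:
        if alarm_cleared_after(step):
            return f"resolved by {step}"
    return "RMA"


if __name__ == "__main__":
    # Neither the restart nor the reseat clears the alarm.
    print(escalate(lambda step: False))
```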

I asked for an RMA directly: even if a restart or reseat clears the issue, there is no guarantee it won't recur soon.
Waiting for JTAC.

ayounsi added a subscriber: Papaul.

JTAC doesn't want to RMA it without a restart/reseat.
Restart didn't help.

@Papaul can you re-seat cr1-codfw:fpc0 asap?

ayounsi mentioned this in Unknown Object (Task).

Opened a remote hands request instead: T254136.

Mentioned in SAL (#wikimedia-operations) [2020-06-01T17:47:07Z] <XioNoX> turn online cr1-codfw:fpc0 - T254110

The linecard went through these states:
0 Present Testing
0 Offline ---Unresponsive---
0 Present Absent
0 Offline ---Unresponsive---
It seems to be flapping between the last two states.
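
A rough way to flag this kind of flapping from periodic samples of the slot state: count how often the card returns to a state it has already been in. The sample list is the sequence observed above; `revisits`, `is_flapping` and the threshold of one revisit are assumptions made for illustration.

```python
# States observed for slot 0, in order, as listed in this task.
OBSERVED = [
    "Present Testing",
    "Offline ---Unresponsive---",
    "Present Absent",
    "Offline ---Unresponsive---",
]


def revisits(states):
    """Count transitions into a state that was already seen earlier."""
    seen = set()
    count = 0
    previous = None
    for state in states:
        # Only a change of state that lands on an old state counts.
        if state != previous and state in seen:
            count += 1
        seen.add(state)
        previous = state
    return count


def is_flapping(states, threshold=1):
    """True if the card returned to an earlier state at least `threshold` times."""
    return revisits(states) >= threshold


if __name__ == "__main__":
    print(revisits(OBSERVED), is_flapping(OBSERVED))
```

A card that simply stays in one state across samples produces no revisits, so a steady Online reading is not reported as flapping.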

Followed up with Juniper and requested an RMA.

@RobH to get ahead of Juniper: they will need the following to issue the RMA and know where to ship the part.
Note that this task is public, so feel free to make it SRE-only or email me the info.
Linecard replacement is easy, so it's better to have remote hands do it; it is then up to DC Ops whether remote hands ships the broken linecard back or we wait for Papaul.

Please share with me the following details to start the RMA process:
-Name of the point of contact:
-Email of the point of contact:
-Phone number of the point of contact:
-Shipping address:
-Street:
-City:
-State:
-Zip Code:
-Country:

https://netbox.wikimedia.org/dcim/sites/esams/

The point of contact is just 'iron mountain shipping' and the generic contact number.

I think that has all the info you are requesting? We will need to open an inbound shipment ticket as well once things ship.

IRC update: codfw not esams, heh..

https://netbox.wikimedia.org/dcim/sites/codfw/

I would list the generic info for their NOC though, which I'll append to that entry now.

ayounsi mentioned this in Unknown Object (Task). Jun 2 2020, 5:57 AM
ayounsi added a subtask: Unknown Object (Task).
ayounsi added a subtask: Unknown Object (Task).

Mentioned in SAL (#wikimedia-operations) [2020-06-03T05:14:02Z] <XioNoX> turn cr1-codfw:fpc0 online - T254110

The FPC is up and healthy. Interfaces are up as well.
Netbox updated with the new serial#.

RobH closed subtask Unknown Object (Task) as Resolved. Jun 29 2020, 4:38 PM