
cr2-codfw:fpc0 crash
Closed, Resolved, Public

Description

cr2-codfw> show system alarms 
4 alarms currently active
Alarm time               Class  Description
2021-07-21 18:14:31 UTC  Minor  FPC 0 Temp Sensor Fail
2021-07-21 17:54:23 UTC  Major  FPC 0 Hard errors
2021-07-21 17:53:48 UTC  Major  FPC 0 offlined due to unreachable destinations
2021-07-21 17:53:38 UTC  Major  FPC 0 has unreachable destinations
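To read the alarms in the order they fired, the output above can be parsed and sorted by timestamp. This is only a rough sketch: the regular expression and the `parse_alarms` helper are illustrative names made up here, not part of any Juniper tooling, and the pattern assumes the column layout shown above.

```python
import re
from collections import Counter

# Alarm lines copied from the "show system alarms" output above.
ALARMS = """\
2021-07-21 18:14:31 UTC  Minor  FPC 0 Temp Sensor Fail
2021-07-21 17:54:23 UTC  Major  FPC 0 Hard errors
2021-07-21 17:53:48 UTC  Major  FPC 0 offlined due to unreachable destinations
2021-07-21 17:53:38 UTC  Major  FPC 0 has unreachable destinations
"""

# Assumed layout: timestamp, "UTC", alarm class, free-text description.
ALARM_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\s+"
    r"(?P<cls>Major|Minor)\s+(?P<desc>.+?)\s*$"
)


def parse_alarms(text):
    """Return (time, class, description) tuples, oldest first."""
    rows = []
    for line in text.splitlines():
        m = ALARM_RE.match(line)
        if m:
            rows.append((m["time"], m["cls"], m["desc"]))
    # The timestamps are fixed-width, so a plain string sort is chronological.
    return sorted(rows)


alarms = parse_alarms(ALARMS)
by_class = Counter(cls for _, cls, _ in alarms)

for when, cls, desc in alarms:
    print(when, cls, desc)
```

Sorted this way, the unreachable-destinations alarm comes first and the temperature sensor failure last.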

Current status is loss of redundancy to all the codfw rows:

re0.cr2-codfw> show interfaces descriptions  
Interface       Admin Link Description 
et-0/0/0                   Core: asw-a-codfw:et-7/0/52 {#10706} 
et-0/0/1                   Core: asw-b-codfw:et-7/0/52 {#10707} 
et-0/2/0                   Core: asw-c-codfw:et-7/0/52 {#10708} 
et-0/2/1                   Core: asw-d-codfw:et-7/0/52 {#10709}

Logs are full of:

Jul 21 18:00:00  re0.cr2-codfw kernel: Resil 12316 IIC-SIG sent 12:54:a1:00 00000000 
Jul 21 18:00:00  re0.cr2-codfw chassisd[12316]: CHASSISD_I2CS_READBACK_ERROR: Readback error from I2C slave for FPC 0 ([0x12, 0x23] -> 0x0) 
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): target ack failure on byte 0 
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): (i2c_s1=0x08, group=0x12, device=0x54) 
Jul 21 18:00:00  re0.cr2-codfw kernel: Resil 12316 IIC-SIG sent 12:54:23:00 00000000 
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): target ack failure on byte 0 
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): (i2c_s1=0x08, group=0x12, device=0x54) 
Jul 21 18:00:00  re0.cr2-codfw kernel: Resil 12316 IIC-SIG sent 12:54:23:00 00000000
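To see which messages dominate the log, the lines can be grouped by process and leading message token. A minimal sketch, assuming the syslog layout shown above; `LINE_RE` and `tally` are hypothetical names for illustration, and the sample is just the eight lines quoted here.

```python
import re
from collections import Counter

# The eight log lines quoted above (trailing whitespace removed).
LOG = """\
Jul 21 18:00:00  re0.cr2-codfw kernel: Resil 12316 IIC-SIG sent 12:54:a1:00 00000000
Jul 21 18:00:00  re0.cr2-codfw chassisd[12316]: CHASSISD_I2CS_READBACK_ERROR: Readback error from I2C slave for FPC 0 ([0x12, 0x23] -> 0x0)
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): target ack failure on byte 0
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): (i2c_s1=0x08, group=0x12, device=0x54)
Jul 21 18:00:00  re0.cr2-codfw kernel: Resil 12316 IIC-SIG sent 12:54:23:00 00000000
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): target ack failure on byte 0
Jul 21 18:00:00  re0.cr2-codfw kernel: PCF8584(WR): (i2c_s1=0x08, group=0x12, device=0x54)
Jul 21 18:00:00  re0.cr2-codfw kernel: Resil 12316 IIC-SIG sent 12:54:23:00 00000000
"""

# Assumed layout: month, day, time, host, process with optional [pid], message.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+"
    r"(?P<proc>[\w.-]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def tally(text):
    """Count log lines per (process, first message token)."""
    counts = Counter()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and m["msg"]:
            token = m["msg"].split()[0].rstrip(":")
            counts[(m["proc"], token)] += 1
    return counts


counts = tally(LOG)
for (proc, token), n in counts.most_common():
    print(f"{n:3d}  {proc}  {token}")
```

On this sample, the kernel I2C messages (`PCF8584(WR)` and `Resil`) account for seven of the eight lines, with a single `chassisd` readback error.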

Tried to restart the FPC, but that doesn't work:

re0.cr2-codfw> request chassis fpc slot 0 restart  
FPC 0 is in transition, try again

Seems stuck in a reboot loop:

ayounsi@re0.cr2-codfw> show chassis fpc 0    
                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Present          Absent

{master}
ayounsi@re0.cr2-codfw> show chassis fpc 0    
                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Offline         ---Unresponsive---

Opening a JTAC case for possible RMA.

Details

Other Assignee
cmooney

Event Timeline

ayounsi triaged this task as High priority. Jul 21 2021, 6:29 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Case ID 2021-0721-0486 has been created for you.

First related log entry I can find referencing the FPC (the ae interface down logs came before it):

Jul 21, 2021 @ 17:53:34.000 CMTFPC: Fabric request time out pfe 0 plane 1 pg 0, trying recovery. Check recovery count.

Email from JTAC

Please perform a physical re-seat of the card. Remove it and insert it back into the chassis.

If this doesn’t work, we’ll proceed with a replacement.

Mentioned in SAL (#wikimedia-operations) [2021-07-22T08:34:36Z] <XioNoX> cr2-codfw> request chassis fpc slot 0 offline - T287110

Mentioned in SAL (#wikimedia-operations) [2021-07-22T09:11:46Z] <XioNoX> depool eqiad to reduce load on one codfw-eqiad link - T287110

@Papaul replaced the card and the interfaces have been brought back up. All seems OK.

cmooney@re0.cr2-codfw> show chassis fpc pic-status 0                      
Slot 0   Online       MPCE Type 3 3D                                
  PIC 0  Online       2X40GE QSFPP
  PIC 2  Online       2X40GE QSFPP
cmooney@re0.cr2-codfw> show interfaces descriptions | match et-0/ 
et-0/0/0        up    up   Core: asw-a-codfw:et-7/0/52 {#10706}
et-0/0/1        up    up   Core: asw-b-codfw:et-7/0/52 {#10707}
et-0/2/0        up    up   Core: asw-c-codfw:et-7/0/52 {#10708}
et-0/2/1        up    up   Core: asw-d-codfw:et-7/0/52 {#10709}
cmooney@re0.cr2-codfw> show interfaces diagnostics optics et-0/2/0 | match "receiver power" | match dBm 
    Laser receiver power                      :  0.173 mW / -7.62 dBm
    Laser receiver power                      :  0.188 mW / -7.26 dBm
    Laser receiver power                      :  0.209 mW / -6.81 dBm
    Laser receiver power                      :  0.207 mW / -6.84 dBm
cmooney@re0.cr2-codfw> show ospf interface | match "ae[1-4]" 
ae1.2001            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae1.2017            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae1.2201            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae1.402             PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
ae2.2002            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae2.2018            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae2.2118            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae2.2120            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae2.2122            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae3.2003            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae3.2019            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae4.2004            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
ae4.2020            DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
cmooney@re0.cr2-codfw> show system alarms 
No alarms currently active

Will do a few more checks then look at re-pooling eqiad.
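As a quick sanity check on the optics readings above, the mW and dBm columns are related by dBm = 10 * log10(mW). A small sketch that recomputes the dBm values from the reported mW figures; `mw_to_dbm` is a helper name chosen here for illustration.

```python
import math


def mw_to_dbm(mw):
    """Convert optical power in milliwatts to dBm (decibels relative to 1 mW)."""
    if mw <= 0:
        raise ValueError("power must be positive")
    return 10 * math.log10(mw)


# Per-lane receive power for et-0/2/0 as reported above: (mW, dBm).
reported = [(0.173, -7.62), (0.188, -7.26), (0.209, -6.81), (0.207, -6.84)]

for mw, dbm in reported:
    print(f"{mw:.3f} mW -> {mw_to_dbm(mw):.2f} dBm (reported {dbm:.2f} dBm)")
```

The recomputed values can differ from the reported ones in the last digit, since the mW figures shown are themselves rounded to three decimals.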

Mentioned in SAL (#wikimedia-operations) [2021-07-23T19:02:09Z] <topranks> De-pooling eqiad again after successful replacement of linecard in cr2-codfw T287110

Mentioned in SAL (#wikimedia-operations) [2021-07-23T19:11:30Z] <topranks> Successfully re-pooled eqiad - reversed change from yesterday after successful line card replacement in cr2-codfw - T287110

Everything still looks good: eqiad is re-pooled, and the combined stats across sites are as they were before, now with eqiad back in the pool.

Resolving task.

Shipped out the faulty line card today. Tracking information below.