cr2-esams: FPC0 Parity error
Closed, Resolved, Public

Description

re0.cr2-esams> show system alarms                                   
1 alarms currently active
Alarm time               Class  Description
2022-09-27 17:39:04 UTC  Minor  FPC 0 Minor Errors
Relevant logs:
Sep 27 17:39:04  re0.cr2-esams fpc0 XMCHIP(0): CALD4521: XMCHIP(0): DDRIF: Checksum error for WO1 - Channel 15, Address 0xd002a, Checksum Errors 1, Checksum Poison Count 1
Sep 27 17:39:04  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): DDRIF: 2x32 Instance 3, Part 1 - Write rate: 3516401, Read rate: 3516393, Total rate: 7032794
Sep 27 17:39:04  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): DDRIF: Packet memory - Write rate: 65270666, Read rate: 65270576, Total rate: 130541242
Sep 27 17:39:04  re0.cr2-esams chassisd[29541]: CHASSISD_FPC_ASIC_ERROR: <FPC 0> ASIC Error detected errorno 0x0007028a (null)
Sep 27 17:39:04  re0.cr2-esams alarmd[30723]: Alarm set: FPC id=167772264, color=YELLOW, class=CHASSIS, reason=FPC 0 Minor Errors
Sep 27 17:39:04  re0.cr2-esams craftd[29545]:  Minor alarm set, FPC 0 Minor Errors
Sep 27 17:39:04  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): DDRIF: Linkram memory - Write rate: 0, Read rate: 0, Total rate: 0
Sep 27 17:39:04  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): WO1: Packet error - Error Packets 1, Stream 36
Sep 27 17:39:05  re0.cr2-esams fpc0 CMError: /fpc/0/pfe/0/cm/0/XMCHIP(0)/0/XMCHIP_CMERROR_OCM_INTR_SRC_RDDST_PERR_MINOR (0x7028a), scope: pfe, category: functional, severity: minor, module: XMCHIP(0), type: OCM: Detected: Minor parity errors
Sep 27 17:39:06  re0.cr2-esams fpc0 Performing action log for error /fpc/0/pfe/0/cm/0/XMCHIP(0)/0/XMCHIP_CMERROR_OCM_INTR_SRC_RDDST_PERR_MINOR (0x7028a) in module: XMCHIP(0) with scope: pfe category: functional level: minor
Sep 27 17:39:06  re0.cr2-esams fpc0 Performing action cmalarm for error /fpc/0/pfe/0/cm/0/XMCHIP(0)/0/XMCHIP_CMERROR_OCM_INTR_SRC_RDDST_PERR_MINOR (0x7028a) in module: XMCHIP(0) with scope: pfe category: functional level: minor
Sep 27 17:39:06  re0.cr2-esams fpc0 Cmerror Op Set: XMCHIP(0): XMCHIP(0): OCM: Parity error detected - data32 0x8, data32_ptyerrcnt 0x1  (URI: /fpc/0/pfe/0/cm/0/XMCHIP(0)/0/XMCHIP_CMERROR_OCM_INTR_SRC_RDDST_PERR_MINOR)
Sep 27 17:39:08  re0.cr2-esams mosquitto[29615]: Allocated node for mosq : 0x79f780, Client : client-1-re0-NA_periodic_subscriber-re0, topic : /1002/1/0, max bytes in queue : 10485760, hash_size is 500, hashIndex is 0x1221000

This seems to be covered by this Juniper support article: https://supportportal.juniper.net/s/article/Parity-errors-occured-on-one-FPC-causing-minor-alarms-for-other-FPCs-on-the-same-router?language=en_US

The next step is to reboot FPC0. If that doesn't clear it, open a support ticket for an RMA.

Event Timeline

ayounsi triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.

I will reboot this tomorrow morning, Oct 6th at 08:00 and we can take it from there.

Change 839396 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Depool esams in gdns prior to reboot of line card

https://gerrit.wikimedia.org/r/839396

Change 839396 merged by Cathal Mooney:

[operations/dns@master] Depool esams in gdns prior to reboot of line card

https://gerrit.wikimedia.org/r/839396

Mentioned in SAL (#wikimedia-operations) [2022-10-06T08:54:45Z] <topranks> rebooting line card fpc 0 on cr2-esams (T318783)

Reboot completed successfully; the router is currently not showing any alarms:

root@re0.cr2-esams> show system alarms                                      
No alarms currently active

I'll leave it a week or so. If it remains clear I'll close the task; if we get more errors I'll open a TAC case with Juniper.
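
The "leave it a week and re-check" step could be scripted. Below is a minimal Python sketch, assuming the text of `show system alarms` has already been captured by some other means (collection is not shown). The helper names `parse_alarms` and `fpc_alarms` are hypothetical, and the row format is inferred only from the outputs pasted in this task, not from Juniper documentation.

```python
import re

# One data row of `show system alarms`, as seen in this task, e.g.
# "2022-09-27 17:39:04 UTC  Minor  FPC 0 Minor Errors"
# Assumption: only the Minor and Major classes appear.
ALARM_ROW = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC)\s+"
    r"(?P<cls>Minor|Major)\s+(?P<description>.+?)\s*$"
)


def parse_alarms(cli_output):
    """Return one dict per alarm row; an empty list if no rows are present."""
    alarms = []
    for line in cli_output.splitlines():
        match = ALARM_ROW.match(line.strip())
        if match:
            alarms.append(match.groupdict())
    return alarms


def fpc_alarms(cli_output, slot):
    """Keep only alarms whose description starts with 'FPC <slot> '."""
    prefix = f"FPC {slot} "
    return [a for a in parse_alarms(cli_output) if a["description"].startswith(prefix)]


if __name__ == "__main__":
    sample = (
        "1 alarms currently active\n"
        "Alarm time               Class  Description\n"
        "2022-09-27 17:39:04 UTC  Minor  FPC 0 Minor Errors\n"
    )
    for alarm in fpc_alarms(sample, 0):
        print(alarm["time"], alarm["cls"], alarm["description"])
```

Header lines and the "No alarms currently active" message do not match the row pattern, so a clean router simply yields an empty list.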

cmooney claimed this task.

Going to close this one: the device is still showing OK, with no alarms for FPC errors. We can re-open if the problem happens again.

cmooney@re0.cr2-esams> show system alarms 
Oct 13 09:52:00
No alarms currently active

{master}
cmooney@re0.cr2-esams>

It's back :(

cr2-esams> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2022-12-05 18:15:58 UTC  Minor  FPC 0 Minor Errors
Dec  5 18:15:58  re0.cr2-esams fpc0 XMCHIP(1): XMCHIP(1): FI: Error cell sent to reorder engine - Stream 0, Count 1
Dec  5 18:15:58  re0.cr2-esams fpc0 XMCHIP(0): CALD4521: XMCHIP(0): DDRIF: Checksum error for FO/WO2 - Channel 11, Address 0xb814e, Checksum Errors 1, Checksum Poison Count 1
Dec  5 18:15:58  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): DDRIF: 2x32 Instance 2, Part 1 - Write rate: 1887837, Read rate: 1887864, Total rate: 3775701
Dec  5 18:15:58  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): DDRIF: Packet memory - Write rate: 33524620, Read rate: 33524820, Total rate: 67049440
Dec  5 18:15:58  re0.cr2-esams chassisd[29541]: CHASSISD_FPC_ASIC_ERROR: <FPC 0> ASIC Error detected errorno 0x0007028a (null)
Dec  5 18:15:58  re0.cr2-esams alarmd[30723]: Alarm set: FPC id=167772264, color=YELLOW, class=CHASSIS, reason=FPC 0 Minor Errors
Dec  5 18:15:58  re0.cr2-esams craftd[29545]:  Minor alarm set, FPC 0 Minor Errors
Dec  5 18:15:59  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): DDRIF: Linkram memory - Write rate: 0, Read rate: 0, Total rate: 0
Dec  5 18:15:59  re0.cr2-esams fpc0 XMCHIP(0): XMCHIP(0): FO: Packet error - Error Packets 1, Stream 1
Dec  5 18:15:59  re0.cr2-esams fpc0 CMError: /fpc/0/pfe/0/cm/0/XMCHIP(0)/0/XMCHIP_CMERROR_OCM_INTR_SRC_RDDST_PERR_MINOR (0x7028a), scope: pfe, category: functional, severity: minor, module: XMCHIP(0), type: OCM: Detected: Minor parity errors
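
To compare this recurrence with the September incident, the DDRIF checksum lines can be reduced to their key fields. This is a minimal Python sketch; the pattern is inferred only from the two checksum lines pasted in this task, and `parse_checksum_error` is a hypothetical helper name.

```python
import re

# Assumption: the field layout matches the two "Checksum error" lines in this task.
CHECKSUM_LINE = re.compile(
    r"DDRIF: Checksum error for (?P<block>\S+) - "
    r"Channel (?P<channel>\d+), Address (?P<address>0x[0-9a-fA-F]+), "
    r"Checksum Errors (?P<errors>\d+)"
)


def parse_checksum_error(line):
    """Return block, channel, address and error count, or None if the line does not match."""
    match = CHECKSUM_LINE.search(line)
    if match is None:
        return None
    return {
        "block": match.group("block"),
        "channel": int(match.group("channel")),
        "address": int(match.group("address"), 16),
        "errors": int(match.group("errors")),
    }


if __name__ == "__main__":
    lines = [
        "Sep 27 17:39:04  re0.cr2-esams fpc0 XMCHIP(0): CALD4521: XMCHIP(0): DDRIF: "
        "Checksum error for WO1 - Channel 15, Address 0xd002a, "
        "Checksum Errors 1, Checksum Poison Count 1",
        "Dec  5 18:15:58  re0.cr2-esams fpc0 XMCHIP(0): CALD4521: XMCHIP(0): DDRIF: "
        "Checksum error for FO/WO2 - Channel 11, Address 0xb814e, "
        "Checksum Errors 1, Checksum Poison Count 1",
    ]
    for line in lines:
        print(parse_checksum_error(line))
```

Run against the two incidents, this shows that each reported a different block, channel and address (WO1, channel 15, 0xd002a versus FO/WO2, channel 11, 0xb814e).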

I tried to create a JTAC ticket for an RMA but am getting:

Our records show that the Service Contract has expired for the serial number or Software Support Reference Number (SSRN) you entered or that the Serial Number is out of warranty. A valid product Serial Number or SSRN is required to open a Technical Case. If you believe this information is incorrect, please open an Admin Case or chat with Customer Care for assistance.

Most likely due to T315378

ayounsi mentioned this in Unknown Object (Task). Dec 7 2022, 8:25 AM

JTAC case 2022-1207-600204 opened asking for an RMA, as this is the second time the issue has happened.

ayounsi mentioned this in Unknown Object (Task). Dec 13 2022, 9:38 AM

JTAC wants us to try re-seating the linecard before doing any RMA. Work is scheduled for Jan 12th. Opened procurement task T325048 for the remote hands work.

Mentioned in SAL (#wikimedia-operations) [2023-01-12T11:54:37Z] <XioNoX> re-seating cr2-esams fpc0 linecard - T318783

I shut down the linecard and remote hands re-seated it. Hopefully that solves it for good.

The issue is back:

2023-01-30 12:36:42 UTC Minor FPC 0 Minor Errors

We need to follow up with JTAC for a replacement.

This got promoted to major.

cr2-esams> show system alarms 
2 alarms currently active
Alarm time               Class  Description
2023-07-28 23:46:09 UTC  Major  FPC 0 Major Errors
2023-01-30 12:36:42 UTC  Minor  FPC 0 Minor Errors

Let's hope it doesn't fully break before the onsite visit.

Looking more into the alert and status, both ports on FPC0 PIC2 are down, one of which is the link to asw2-esams, so we have a loss of redundancy (traffic now only goes through cr3-esams).

Opened high-priority case 2023-0809-747283 asking for an RMA.

RMA in progress; Juniper is happy with the address for the replacement, and staff at the destination are aware of the delivery.

I will decom the existing faulty card on Sunday when on site and prep for return.

@ayounsi yeah I think so, the RMA is complete as far as Juniper is concerned and we are no longer using the old card.

It's unclear to me if the new card has been received in codfw; the Interxion ticket suggests it was collected in Amsterdam. That's slightly separate to this task of course.

@Jhancock.wm do you know if there was a delivery for us in codfw coming from Digital Realty / Interxion Amsterdam?

It was a box like this:

[Image: line_card.jpg]

It has a Juniper line card in it (and doesn't have the metal bracket @Papaul asked me to put in, because I found that in my suitcase when I got home - sorry!)

@cmooney I haven't received it yet. I checked with the dock in case it had arrived without us being notified, but no luck. Is there a tracking number for the package?

@Jhancock.wm not 100%, I will try to chase on that.

I am going to close this task; the FPC issue was addressed through card replacement (although we decom'd the router in the meantime).

Despite my best efforts it seems the replacement has gone missing, but we will track that elsewhere.