
cr3-ulsfo incident 22 Sep 2024
Closed, Resolved (Public)

Description

We seem to have had an incident on cr3-ulsfo this afternoon. A page came in for ping loss to the device. The logs show a serious issue on the device that caused interfaces and protocol adjacencies to reset. See the logs in P69384.

Looking at timestamps we were definitely impacted for a few minutes, for instance:

Sep 22 12:14:45  cr3-ulsfo rpd[30706]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 4.15.72.113 (External AS 3356) changed state from Established to Idle (event Stop) (instance master)
Sep 22 12:17:03  cr3-ulsfo rpd[30706]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 4.15.72.113 (External AS 3356) changed state from OpenConfirm to Established (event RecvKeepAlive) (instance master)
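The impact window per peer can be derived from these state-change lines. The following is a minimal sketch, not a tool we actually use: the function name `bgp_downtimes` is made up for illustration, and the year is an assumed parameter because the syslog timestamps do not include one.

```python
import re
from datetime import datetime

# Matches Junos RPD_BGP_NEIGHBOR_STATE_CHANGED syslog lines as pasted above.
STATE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+rpd\[\d+\]:\s+"
    r"RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer (?P<peer>\S+) .*?"
    r"changed state from (?P<old>\w+) to (?P<new>\w+)"
)


def bgp_downtimes(lines, year=2024):
    """Return (peer, timedelta) pairs for each Established -> ... -> Established gap.

    `year` is an assumption: the log timestamps carry no year.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = STATE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        peer = m["peer"]
        if m["old"] == "Established":
            # Session left Established: remember when it went down.
            down_since[peer] = ts
        elif m["new"] == "Established" and peer in down_since:
            # Session is back: record how long it was down.
            outages.append((peer, ts - down_since.pop(peer)))
    return outages


logs = [
    "Sep 22 12:14:45  cr3-ulsfo rpd[30706]: RPD_BGP_NEIGHBOR_STATE_CHANGED: "
    "BGP peer 4.15.72.113 (External AS 3356) changed state from Established "
    "to Idle (event Stop) (instance master)",
    "Sep 22 12:17:03  cr3-ulsfo rpd[30706]: RPD_BGP_NEIGHBOR_STATE_CHANGED: "
    "BGP peer 4.15.72.113 (External AS 3356) changed state from OpenConfirm "
    "to Established (event RecvKeepAlive) (instance master)",
]
outages = bgp_downtimes(logs)
for peer, gap in outages:
    print(peer, gap)
```

For the two lines above this gives a gap of 2 minutes 18 seconds for the AS 3356 session.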

First logs have this in them:

Sep 22 12:14:44  cr3-ulsfo fpc0 JGCI[EA[0:0]-intf-2] JGCI_INT_REG_LKR_1_FORCED_RETRAIN seen
Sep 22 12:14:44  cr3-ulsfo fpc0 JGCI[EA[0:0]-intf-2] JGCI_INT_REG_LKR_0_FORCED_RETRAIN seen
Sep 22 12:14:44  cr3-ulsfo fpc0 JGCI[EA[0:0]-intf-2] JGCI_INT_REG_CHN_0_TRAIN_ATTN seen; Current retrain count is 0x6
Sep 22 12:14:44  cr3-ulsfo fpc0 EA[0:0]-0 Congestion Detected, Active Zones f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f
Sep 22 12:14:44  cr3-ulsfo fpc0 EA[0:0]-1 Congestion Detected, Active Zones:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:e:f:f:f
Sep 22 12:14:44  cr3-ulsfo fpc0 EA[0:0]-2 Congestion Detected, Active Zones:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f
Sep 22 12:14:44  cr3-ulsfo fpc0 EA[0:0]-3 Congestion Detected, Active Zones:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:f:b

This Juniper KB article suggests that the built-in FPC on the MX204 may have unexpectedly restarted, which would seem to fit the symptoms.

There was a CPU spike around that time, presumably as interfaces came back up, protocols restarted and BGP reconverged, but it looks OK right now:

image.png (780×1 px, 279 KB)

BGP sessions are all back up and throughput looks normal, so for right now things seem stable:

cmooney@cr3-ulsfo> show bgp summary | match ^[0-9]
Sep 22 12:45:47
4.15.72.113            3356     178319         64       0       3       28:44 Establ
10.128.0.6            64605         67         61       0     231       28:27 Establ
10.128.0.7            64605         68         61       0     229       28:28 Establ
10.128.0.9            64600        174        191       0      26       28:34 Establ
10.128.0.11           64600        173        190       0      46       28:26 Establ
10.128.0.13           65004         61     417366       0     233       28:31 Establ
10.128.0.18           64600        174        192       0      33       28:30 Establ
80.239.192.65          1299     195026         64       0       6       28:39 Establ
103.102.166.130       65005        886        159       0      11       28:34 Establ
103.102.166.131       65005        888        915       0       8       28:33 Establ
198.35.26.6           64605         68         61       0      36       28:33 Establ
198.35.26.7           64605         68         61       0      34       28:33 Establ
198.35.26.8           64605         69         60       0      36       28:30 Establ
198.35.26.14          64605         68         61       0      36       28:32 Establ
198.35.26.193         65004     109201     288644       0       3       28:32 Establ
198.35.26.227         11820          0          0       0      35 1w3d 15:53:12 Active
198.35.26.228         11820          0          0       0      23 1w3d 16:06:44 Active
206.197.187.1         12276         58         63       0      18       28:22 Establ
206.197.187.6         20940         66         63       0       6       28:23 Establ
206.197.187.9         10310        103         64       0      14       28:24 Establ
206.197.187.11         2906         82         62       0       9       28:12 Establ
206.197.187.12         6939      29224         64       0       6       28:24 Establ
206.197.187.13        32329         66         63       0       9       28:19 Establ
206.197.187.14        19165         62         63       0      26       28:22 Establ
206.197.187.18           42         72         63       0       6       28:22 Establ
206.197.187.21        18779        122         63       0       3       28:14 Establ
206.197.187.24        10310        102         62       0      14       28:20 Establ
206.197.187.25        21928        125         63       0       7       28:21 Establ
206.197.187.30       397715         58         62       0      58       27:32 Establ
206.197.187.33        18779        112         63       0       5       28:14 Establ
206.197.187.40        63949         64         63       0       7       28:22 Establ
206.197.187.41        35008         63         63       0       5       28:17 Establ
206.197.187.45        36236         94         63       0       4       28:22 Establ
206.197.187.47        33452         59         63       0       6       28:24 Establ
206.197.187.50        15169        698        128       0     264       28:24 Establ
206.197.187.53       138997         67         63       0       9       28:24 Establ
206.197.187.56        27321          0          0       0      31 9w3d 19:09:49 Active
206.197.187.62        13335       1814         62       0      41       28:08 Establ
206.197.187.64         7500         66         63       0       8       28:17 Establ
206.197.187.65        16509        249         63       0     166       28:24 Establ
206.197.187.71        26073         72         63       0      12       28:24 Establ
206.197.187.78        26073         73         63       0      15       28:22 Establ
206.197.187.84        32035         73         63       0      25       28:21 Establ
206.197.187.87        49544        127         63       0      10       28:23 Establ
206.197.187.91        32934         66         63       0       9       28:19 Establ
206.197.187.92        32934         66         63       0      14       28:15 Establ
206.197.187.93        11170          0          0       0      25 19w4d 17:33:00 Connect
206.197.187.96        15169        206         62       0     155       28:17 Establ
206.197.187.99         8674         60         63       0       3       28:06 Establ
206.197.187.100       46997         71         64       0      42       28:24 Establ
206.197.187.103       16509        249         63       0      67       28:22 Establ
206.197.187.253       63055      28380         64       0      78       28:23 Establ
206.197.187.254       63055      19897         64       0      19       28:24 Establ
208.80.153.192        65002        880        108       0       8       28:33 Establ
208.80.153.193        65002        878        121       0       6       28:32 Establ
208.80.154.196        65001        874        121       0       6       28:35 Establ
208.80.154.197        65001        883        121       0       4       28:33 Establ
208.80.154.198        65020        889        120       0      45       28:35 Establ
2001:504:30::ba00:42:1          42         71         61       0       6       28:20 Establ
2001:504:30::ba00:2906:1        2906         73         62       0       7       28:24 Establ
2001:504:30::ba00:6939:1        6939      57187         62       0      59       28:17 Establ
2001:504:30::ba00:7500:1        7500         65         62       0       8       28:21 Establ
2001:504:30::ba00:8674:1        8674         60         62       0       3       28:21 Establ
2001:504:30::ba01:310:1       10310        111         61       0      14       28:21 Establ
2001:504:30::ba01:310:2       10310        111         61       0      15       28:22 Establ
2001:504:30::ba01:1170:1       11170          0          0       0      25 19w4d 17:32:44 Active
2001:504:30::ba01:2276:1       12276         58         62       0      18       28:08 Establ
2001:504:30::ba01:3335:2       13335        257         62       0      19       28:21 Establ
2001:504:30::ba01:5169:1       15169        363        126       0     274       28:24 Establ
2001:504:30::ba01:5169:2       15169        152         62       0     128       28:24 Establ
2001:504:30::ba01:6509:1       16509        123         62       0      82       28:21 Establ
2001:504:30::ba01:6509:2       16509        128         62       0      37       28:22 Establ
2001:504:30::ba01:8779:1       18779         60         62       0       6       28:12 Establ
2001:504:30::ba01:8779:2       18779         60         62       0       5       28:08 Establ
2001:504:30::ba02:940:1       20940         66         62       0       6       28:24 Establ
2001:504:30::ba02:1928:1       21928        146         62       0       7       28:12 Establ
2001:504:30::ba02:6073:1       26073         78         62       0      13       28:23 Establ
2001:504:30::ba02:6073:2       26073         78         62       0      13       28:20 Establ
2001:504:30::ba02:7321:1       27321          0          0       0      23 9w3d 19:10:04 Active
2001:504:30::ba03:2035:1       32035         67         61       0      25       28:15 Establ
2001:504:30::ba03:2329:1       32329         61         62       0       9       28:21 Establ
2001:504:30::ba03:2934:1       32934         65         62       0       9       28:24 Establ
2001:504:30::ba03:2934:2       32934         65         62       0      14       28:20 Establ
2001:504:30::ba03:5008:1       35008         63         62       0       5       28:24 Establ
2001:504:30::ba03:6236:1       36236         83         62       0       4       28:24 Establ
2001:504:30::ba04:9544:1       49544        104         62       0      11       28:23 Establ
2001:504:30::ba06:3055:1       63055      19006         62       0      69       28:21 Establ
2001:504:30::ba06:3055:2       63055      16663         62       0      19       28:21 Establ
2001:504:30::ba06:3949:1       63949         65         62       0       7       28:24 Establ
2001:504:30::ba13:8997:1      138997         66         62       0      11       28:24 Establ
2001:504:30::ba39:7715:1      397715         58         61       0      58       27:36 Establ
2001:1900:2100::a99        3356      70847         62       0       3       28:18 Establ
2001:2000:3080:a9a::1        1299      76316         63       0       6       28:26 Establ
2620:0:863:1:198:35:26:6       64605         68         60       0      35       28:21 Establ
2620:0:863:1:198:35:26:14       64605         67         60       0      35       28:14 Establ
2620:0:863:101:10:128:0:6       64605         67         60       0     239       28:19 Establ
2620:0:863:101:10:128:0:7       64605         69         60       0     227       28:26 Establ
2620:0:863:ffff::2       65004      67026      64311       0       3       28:23 Establ
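With this many sessions it is easy to miss the few that are not up. Below is a rough sketch for filtering the pasted output down to peers whose state is not `Establ`. The helper name `non_established_peers` is hypothetical, and it assumes the column layout shown above (peer, AS, InPkt, OutPkt, OutQ, Flaps, Last Up/Dwn, State), with the state as the last column.

```python
def non_established_peers(summary_text):
    """Return (peer, asn, state) for rows whose last column is not 'Establ'.

    Assumes rows look like the `show bgp summary | match ^[0-9]` output above.
    """
    peers = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Peer rows have at least 8 columns and a numeric AS in column 2;
        # this skips the CLI prompt and the timestamp line.
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        peer, asn, state = fields[0], int(fields[1]), fields[-1]
        if state != "Establ":
            peers.append((peer, asn, state))
    return peers


sample = """\
cmooney@cr3-ulsfo> show bgp summary | match ^[0-9]
Sep 22 12:45:47
4.15.72.113            3356     178319         64       0       3       28:44 Establ
198.35.26.227         11820          0          0       0      35 1w3d 15:53:12 Active
206.197.187.93        11170          0          0       0      25 19w4d 17:33:00 Connect
"""
down = non_established_peers(sample)
for peer, asn, state in down:
    print(peer, asn, state)
```

In the full output above, the peers that are not established (AS 11820, 27321 and 11170) have been down for between 1w3d and 19w4d, so they were already down before this incident.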

image.png (766×1 px, 373 KB)

Event Timeline

cmooney created this task.

Icinga downtime and Alertmanager silence (ID=5186e097-9b87-468b-abf6-b7a7fcd918c6) set by cmooney@cumin1002 for 1 day, 0:00:00 on 3 host(s) and their services with reason: cr3-ulsfo fpc restart and hw instability

cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt

This has now recurred, approx. 44 minutes after the first time. The first logs mention this:

Sep 22 12:58:28  cr3-ulsfo fpc0 listmgr_host_xtxn_idle(1526): EA[0:0].llm:  Host XTXN 15 busy waiting for Response, fcode_addr reg 0x0000124cfc2bef26.

Perhaps something to do with particular traffic recursing through the forwarding engine?

https://supportportal.juniper.net/s/article/MX-Lookup-chip-LU-XL-congestion-due-to-forwarding-loop-caused-by-misconfiguration?language=en_US

We'll certainly need a JTAC case, and possibly a replacement router. A reboot might help, but ulsfo is now depooled, so I'll hold off on that for now in case we need to gather data from the device first. I've downtimed cr3-ulsfo for 24 hours to stop further alerts if it happens again.

Logs from second time are in P69385

Opened high priority JTAC case 2024-0923-266479 and attached logs/debug output.

Icinga downtime and Alertmanager silence (ID=a9eff4bb-15d3-41a4-8dd6-65ccc0663c06) set by ayounsi@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason: waiting for JTAC

cr3-ulsfo

From JTAC:

[...] after engaging further resources we have been requested to attempt a full chassis reboot and check if the issue persists before proceeding with the RMA, please let us know if this has been done or if it can be done, let us know of the results.

Mentioned in SAL (#wikimedia-operations) [2024-09-24T13:37:41Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: cr3-ulsfo rebooted, repooling ulsfo, T375345]

Mentioned in SAL (#wikimedia-operations) [2024-09-24T13:37:53Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: cr3-ulsfo rebooted, repooling ulsfo, T375345]

Closing, will re-open if the issue happens again and we need to RMA it.

cr3-ulsfo> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2024-09-25 13:11:42 UTC  Minor  FPC 0 Minor Errors

I'll follow up with JTAC, but we probably should depool the site to be on the safe side as this error was showing up on and off around the previous outage.

Mentioned in SAL (#wikimedia-operations) [2024-09-25T16:10:47Z] <vgutierrez@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: cr3-ulsfo issues, T375345]

Mentioned in SAL (#wikimedia-operations) [2024-09-25T16:10:56Z] <vgutierrez@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: cr3-ulsfo issues, T375345]

Icinga downtime and Alertmanager silence (ID=dcd9deb3-f5d9-41d3-ade0-567f7154bb5b) set by ayounsi@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: router broken ☠️

cr3-ulsfo

Adding in ops-ulsfo project tag as I've been CC'd in at this point for the actual processing of the on-site steps for this failed hardware.

Inbound shipment ticket 00980858 for UPS 1Z20506Y0100053206 (already delivered today and got the shipment notice last night).

Next step is to schedule and go onsite to swap the MX204. I will bring a USB stick so we can copy over the config, wipe the existing router, and easily restore the config to the new router.

Entered ticket 00981959 for this swap to take place today:

Support,

We would like to request remote hands to assist in retrieving a shipment from your storage and swapping out a failed router in our racks, then shipping the failed router back in the same packaging to JCare support.

Scheduling: We just had another router fail at the sister caching site to this one, so the urgency has gone up. This work can take place immediately, just contact us before the actual hardware swap takes place so we can power off the old router.

Please retrieve shipment 00980858 from Juniper to Wikimedia. It contains a replacement MX204 router and a return tag for the defective one that will be removed from the rack.

Router to be swapped is located in 103.02.22:U42, serial number BK033, labeled cr3-ulsfo.

Contact info: Once you are on the datahall floor you can feel free to contact me via text or phone call. My cell phone is +1-727-255-4597.

Cadence of work:

  • Retrieve shipment 00980858 and confirm it is an MX204 router, note the serial number for our records.
  • Stage the swap in front of our rack 103.02.22. The router to be swapped is in 103.02.22:U42, serial number BK033, labeled cr3-ulsfo.
  • Please photograph/note the fibers and their placements as they need to go back into the exact same spots on the new router. (Includes the green mgmt and orange serial connections.)
  • Update us via phone call/text or ticket update and we will wipe the existing router's configuration and power it down for you to swap.
  • Remove all optics and patches from the old router, remove the old router from the rack, mount the new router, then return all optics and patches to the same ports on the new router that they populated on the old one.
  • Update us via phone call/text or ticket update so we can check and connect to the new router.
  • Please photograph the return shipment label for our records.
  • Ensure all optics and accessories were removed from the old router before boxing it back up and shipping it back with the attached return label.

Thanks!

Rob

Ticket accepted and they've retrieved the replacement shipment, should get pinged from them shortly to start the swap.

Swap completed and @Papaul confirms they can attach via serial console. The onsite portion of this troubleshooting and repair should now be considered complete, unless Papaul & netops find any issues with individual port(s).

Mentioned in SAL (#wikimedia-operations) [2024-09-30T20:56:14Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: repool ulsfo as cr3-ulsfo was replaced, T375345]

Mentioned in SAL (#wikimedia-operations) [2024-09-30T20:56:24Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: repool ulsfo as cr3-ulsfo was replaced, T375345]

Junos upgrade is complete for the system and Icinga checks are back to green. All good on the router, the site can be pooled back.
Thanks

Mentioned in SAL (#wikimedia-operations) [2024-10-01T06:44:04Z] <XioNoX> cr3-ulsfo> request vmhost snapshot - T375345

Thanks, all is good now!

Papaul claimed this task.

I still have to update Netbox with the inventory and the new serial number.

Add both power supplies in Netbox under inventory