
Add graceful-restart to cr2-esams
Closed, Resolved. Public

Description

It's somehow missing, which means a change that is supposed to be brief (like a tcp-mss change) causes the BGP sessions to go fully down, then back up. That causes 2 BGP re-convergences in a short amount of time, which caused unreachability for the users going through those paths.

Also audit all routers to make sure they have graceful-restart configured.

Event Timeline

ayounsi triaged this task as Medium priority. Feb 27 2020, 1:35 PM
ayounsi created this task.

Note that enabling graceful-restart will cause all BGP sessions to flap.

CDanis added a subscriber: CDanis. Mar 3 2020, 3:14 PM

Happy to help here, e.g. to perform this at an off-peak time in esams/knams.

ayounsi added a comment. Edited Mar 3 2020, 3:46 PM

Steps are:

  1. Depool esams.
  2. SSH to the management interface re0.cr2-esams.mgmt.esams.wmnet, which is less likely to be impacted by the flaps.
  3. Run conf, then set routing-options graceful-restart.
  4. Run commit.
  5. Check that all BGP sessions are back to Established with show bgp summary.
  6. Check that they have Options: [...] GracefulRestart in show bgp neighbor.
  7. Repool esams.
RLazarus moved this task from On-going to Follow-up on the Wikimedia-Incident board.
CDanis claimed this task. Thu, Mar 5, 7:17 PM
CDanis added a comment. Thu, Mar 5, 7:40 PM

Will do this tonight, at or after 00:00 UTC.

Currently just a few BGP sessions (AMS-IX peers) are down:

cdanis@re0.cr2-esams> show bgp summary | match "(Active|Idle|Connect)" 
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
80.249.208.180         9031          0          0       0       2 1w3d 5:44:07 Active
80.249.210.246        28598       8043       5166       0       3 1w4d 5:56:04 Idle  
2001:7f8:1::a500:9031:1        9031          0          0       0       1 1w3d 5:44:07 Active

Will check for the same set after the maintenance.
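A comparison like this can be scripted once the filtered output has been saved before and after the change. The following is a minimal sketch in Python, assuming the peer address is the first column and the state is the last column, as in the output above. The function names `down_peers` and `diff_down_peers` are hypothetical helpers, not part of any existing tooling.

```python
import ipaddress


def down_peers(summary_text):
    """Map peer address -> state for each peer line in the filtered output."""
    peers = {}
    for line in summary_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            # Skip the prompt, the header, and anything not starting with an IP.
            ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        # Assumption: the last column holds the session state.
        peers[fields[0]] = fields[-1]
    return peers


def diff_down_peers(before_text, after_text):
    """Return (newly_down, recovered) as sorted lists of peer addresses."""
    before = down_peers(before_text)
    after = down_peers(after_text)
    newly_down = sorted(set(after) - set(before))
    recovered = sorted(set(before) - set(after))
    return newly_down, recovered
```

Calling `diff_down_peers(before_text, after_text)` on the two saved outputs gives the peers that went down only after the maintenance and the peers that came back up, which are the two lists worth reading before repooling.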

Change 577363 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] depool esams for cr2 router maintenance

https://gerrit.wikimedia.org/r/577363

Change 577363 merged by CDanis:
[operations/dns@master] depool esams for cr2 router maintenance

https://gerrit.wikimedia.org/r/577363

Mentioned in SAL (#wikimedia-operations) [2020-03-06T00:02:54Z] <cdanis> T246338 depool esams for router maintenance

Committed configuration at 00:21 UTC.

It took a few minutes for all BGP sessions to be re-established, but we eventually wound up with this:

cdanis@re0.cr2-esams> show bgp summary | match "(Active|Idle|Connect)"    
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
80.249.208.180         9031          0          0       0       0        7:42 Active
80.249.209.236         1267          0          0       0       0        7:42 Connect
2001:7f8:1::a500:1267:2        1267          0          0       0       0        7:42 Active
2001:7f8:1::a500:9031:1        9031          0          0       0       0        7:42 Active

The session that was Idle is now established again. We just have one more AMS-IX peer (AS1267, WIND Telecomunicazioni S.p.A) that isn't working, for whatever reason.

GracefulRestart shows up in all neighbor options.
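Checking every neighbor by eye gets tedious with many peers, so this check can be scripted too. A minimal sketch in Python, assuming the `show bgp neighbor` output has been saved as text, that each neighbor block starts with a line beginning `Peer:` followed by `address+port`, and that options are listed on lines beginning `Options:`. That layout is an assumption and should be verified against the router's actual output; `peers_missing_graceful_restart` is a hypothetical helper.

```python
def peers_missing_graceful_restart(neighbor_text):
    """Return peers whose Options lines never mention GracefulRestart."""
    has_gr = {}
    current = None
    for raw in neighbor_text.splitlines():
        line = raw.strip()
        if line.startswith("Peer:"):
            fields = line.split()
            if len(fields) < 2:
                continue
            # Assumed layout: "Peer: <address>+<port> AS <asn> Local: ..."
            current = fields[1].split("+")[0]
            has_gr.setdefault(current, False)
        elif line.startswith("Options:") and current is not None:
            # A neighbor may have more than one Options line; any match counts.
            if "GracefulRestart" in line:
                has_gr[current] = True
    return sorted(peer for peer, ok in has_gr.items() if not ok)
```

An empty list means every neighbor found in the text advertises the option; any address returned is a session to look at before repooling.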

Calling this a success.

Change 577390 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] Revert "depool esams for cr2 router maintenance"

https://gerrit.wikimedia.org/r/577390

Change 577390 merged by CDanis:
[operations/dns@master] Revert "depool esams for cr2 router maintenance"

https://gerrit.wikimedia.org/r/577390

Mentioned in SAL (#wikimedia-operations) [2020-03-06T00:33:58Z] <cdanis> repool esams T246338

CDanis added a comment. Fri, Mar 6, 2:15 AM
cdanis@re0.cr2-esams> show bgp summary | match "(Active|Idle|Connect)" 
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
80.249.208.180         9031          0          0       0       0     1:54:33 Active
2001:7f8:1::a500:9031:1        9031          0          0       0       0     1:54:33 Active
CDanis closed this task as Resolved. Fri, Mar 6, 8:34 PM

Discussion of rolling out graceful-restart to other dual-RE routers is at T191667#5948038