
Configure BGP route damping on Anycast sessions
Closed, Resolved · Public

Description

Yesterday's network outage caused some hosts' connectivity to flap, which in turn caused BFD and BGP to flap. The anycast IPs those hosts were advertising were partially unreachable for the time it takes BFD to detect the failure (3x300ms) and BGP to converge.

To limit that issue, we can enable BGP damping on those sessions, similar to T222424.
As we have multiple anycast nodes in multiple DCs, the risk of them all being damped at the same time is very low.

The change (to do in Homer) should be along the lines of the following, re-using the existing damping parameters:

[edit protocols bgp group Anycast4]
+    damping;
[edit policy-options policy-statement anycast_import term anycast4 then]
+     damping default;

Event Timeline

ayounsi triaged this task as Medium priority. Sep 9 2020, 7:00 AM
ayounsi created this task.
Restricted Application added a project: Operations. Sep 9 2020, 7:00 AM
Restricted Application added a subscriber: Aklapper.
CDanis added a subscriber: CDanis. Wed, Sep 23, 4:01 PM

I'm no expert here, but seems reasonable enough to me.

jbond added a comment. Wed, Sep 23, 5:48 PM

Sorry, I missed this. Looks good to me.

Change 629652 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add damping on Anycast BGP sessions

https://gerrit.wikimedia.org/r/629652

Change 629652 merged by jenkins-bot:
[operations/homer/public@master] Add damping on Anycast BGP sessions

https://gerrit.wikimedia.org/r/629652

Mentioned in SAL (#wikimedia-operations) [2020-09-24T13:17:49Z] <XioNoX> add damping to anycast BGP - T262372

ayounsi closed this task as Resolved. Thu, Sep 24, 1:53 PM

This is all done.

Mentioned in SAL (#wikimedia-operations) [2020-09-28T09:06:15Z] <XioNoX> restart bird on centrallog2001 - T262372

Mentioned in SAL (#wikimedia-operations) [2020-09-28T09:17:09Z] <XioNoX> restart bird on dns2001 - T262372

ayounsi reopened this task as Open. Edited Mon, Sep 28, 9:32 AM

When bird restarts on the centrallog servers, it bounces (reconfigures) a few times:

Sep 28 09:06:18 centrallog2001 bird: Shutting down
Sep 28 09:06:18 centrallog2001 bird: Shutdown completed
Sep 28 09:06:28 centrallog2001 systemd[1]: bird.service: Succeeded.
Sep 28 09:06:28 centrallog2001 bird: Started
Sep 28 09:07:38 centrallog2001 bird: Reconfiguring
Sep 28 09:07:38 centrallog2001 bird: Reloading protocol bgp1
Sep 28 09:07:38 centrallog2001 bird: Reloading protocol bgp2
Sep 28 09:07:38 centrallog2001 bird: Reconfigured
Sep 28 09:07:47 centrallog2001 bird: Reconfiguring
Sep 28 09:07:47 centrallog2001 bird: Reloading protocol bgp1
Sep 28 09:07:47 centrallog2001 bird: Reloading protocol bgp2
Sep 28 09:07:47 centrallog2001 bird: Reconfigured
Sep 28 09:08:58 centrallog2001 bird: Reconfiguring
Sep 28 09:08:58 centrallog2001 bird: Reloading protocol bgp1
Sep 28 09:08:58 centrallog2001 bird: Reloading protocol bgp2
Sep 28 09:08:58 centrallog2001 bird: Reconfigured
Sep 28 09:09:07 centrallog2001 bird: Reconfiguring
Sep 28 09:09:07 centrallog2001 bird: Reloading protocol bgp1
Sep 28 09:09:07 centrallog2001 bird: Reloading protocol bgp2
Sep 28 09:09:07 centrallog2001 bird: Reconfigured

This triggers the router's damping mechanism, which drops the prefix.
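
For illustration, here is a minimal sketch of how a flap-damping penalty could build up during a restart like the one above. Everything numeric in it is an assumption: the thresholds and per-event penalties are the commonly documented Junos defaults (half-life 15 min, suppress 3000, reuse 750, 1000 per withdrawal or re-advertisement), which are not necessarily what our "default" damping profile uses, and the event offsets are only loosely modelled on the log timestamps. Which reconfigure is a withdrawal and which is a re-advertisement is also assumed.

```python
import math

# Assumed parameters (documented Junos defaults, NOT read from our routers).
HALF_LIFE_S = 15 * 60   # penalty halves every 15 minutes
SUPPRESS = 3000         # route is suppressed at or above this penalty
REUSE = 750             # route is reusable once the penalty decays to this
PENALTY = {"withdraw": 1000, "readvertise": 1000}

# Hypothetical timeline: seconds since the first event, loosely based on
# the bird log (09:06:18, 09:06:28, 09:07:38, 09:07:47, 09:08:58, 09:09:07).
EVENTS = [
    (0, "withdraw"), (10, "readvertise"),
    (80, "withdraw"), (89, "readvertise"),
    (160, "withdraw"), (169, "readvertise"),
]


def decay(penalty, elapsed_s, half_life_s=HALF_LIFE_S):
    """Exponentially decay a penalty over elapsed_s seconds."""
    return penalty * 0.5 ** (elapsed_s / half_life_s)


def simulate(events):
    """Return (time_s, penalty, suppressed) after each event."""
    penalty, suppressed, last_t, out = 0.0, False, None, []
    for t, kind in events:
        if last_t is not None:
            penalty = decay(penalty, t - last_t)
            # A suppressed route is released once it has decayed to REUSE.
            if suppressed and penalty <= REUSE:
                suppressed = False
        penalty += PENALTY[kind]
        if penalty >= SUPPRESS:
            suppressed = True
        last_t = t
        out.append((t, penalty, suppressed))
    return out


def seconds_until_reuse(penalty):
    """Time for a penalty to decay down to REUSE with no further flaps."""
    if penalty <= REUSE:
        return 0.0
    return HALF_LIFE_S * math.log2(penalty / REUSE)


if __name__ == "__main__":
    for t, p, s in simulate(EVENTS):
        print(f"t={t:>3}s penalty={p:7.1f} suppressed={s}")
    final_penalty = simulate(EVENTS)[-1][1]
    print(f"reuse in ~{seconds_until_reuse(final_penalty) / 60:.0f} min")
```

Under these assumed values, a handful of flaps within three minutes is enough to cross the suppress threshold, and the prefix then stays suppressed for a long time compared to the flapping itself, because the half-life is much longer than the interval between flaps.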

This is not systematic though, and each reconfigure matches the anycast_healthchecker logs:

2020-09-28 09:07:36,646 anycast-healthchecker[3823] INFO     hc-vip-syslog.anycast.wmnet  running /bin/sh -c ss -lun | fgrep -q :10514
2020-09-28 09:07:37,649 anycast-healthchecker[3823] ERROR    hc-vip-syslog.anycast.wmnet  check timed out
2020-09-28 09:07:38,026 anycast-healthchecker[3823] INFO     hc-vip-syslog.anycast.wmnet  status DOWN
2020-09-28 09:07:38,027 anycast-healthchecker[3823] INFO     hc-vip-syslog.anycast.wmnet  adding 10.3.0.4/32 in the queue
2020-09-28 09:07:38,027 anycast-healthchecker[3823] INFO     hc-vip-syslog.anycast.wmnet  wall clock time 1383.204ms
2020-09-28 09:07:38,027 anycast-healthchecker[3823] INFO     MainThread                   returned an item from the queue for hc-vip-syslog.anycast.wmnet with IP prefix 10.3.0.4/32 and action to delete from Bird configuration
2020-09-28 09:07:38,027 anycast-healthchecker[3823] INFO     MainThread                   withdrawing 10.3.0.4/32 for hc-vip-syslog.anycast.wmnet
2020-09-28 09:07:38,042 anycast-healthchecker[3823] INFO     MainThread                   Bird configuration for IPv4 is updated
2020-09-28 09:07:38,042 anycast-healthchecker[3823] WARNING  MainThread                   Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
2020-09-28 09:07:38,042 anycast-healthchecker[3823] INFO     MainThread                   reconfiguring BIRD by running /usr/sbin/birdc configure
2020-09-28 09:07:38,046 anycast-healthchecker[3823] INFO     MainThread                   reconfigured BIRD daemon

I manually removed the damping config on cr2-codfw to make sure we don't blackhole that prefix.

By comparison, on dns2001 for example, a restart doesn't cause this flapping:

dns2001:~$ cat /var/log/syslog | grep bird
Sep 28 09:17:11 dns2001 bird: Shutting down
Sep 28 09:17:11 dns2001 bird: Shutdown completed
Sep 28 09:17:21 dns2001 systemd[1]: bird.service: Succeeded.
Sep 28 09:17:21 dns2001 bird: Started
ayounsi raised the priority of this task from Medium to High. Mon, Sep 28, 9:34 AM
ayounsi added a comment. Edited Mon, Sep 28, 12:28 PM

The issue is that ss -lun | fgrep -q :10514 often takes more than 2s to complete and we don't let it retry. As this happens regularly, it sometimes happens right after the bird restart, triggering the damping.

I disabled Puppet on centrallog2001 and replaced the command with ss -4lun | fgrep -q :10514 (to ignore IPv6). If that doesn't solve the issue, I'll send a CR to make check_fail (retry) customizable (as we currently force it to 1).

After manually setting check_fail = 2 overnight, the service stopped being randomly depooled. Bird restarts didn't trigger damping either.
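
To make the effect of check_fail concrete, here is a rough sketch of the "N consecutive failures before DOWN" idea. It is not anycast_healthchecker's actual code: run_check and FailCounter are hypothetical names, and recovery on the first successful check is a simplification.

```python
import subprocess


def run_check(cmd, timeout_s):
    """Return True if the shell command exits 0 within timeout_s seconds.

    A timeout counts as a failed check, as in the "check timed out" log line.
    """
    try:
        result = subprocess.run(
            cmd,
            shell=True,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


class FailCounter:
    """Report DOWN only after check_fail consecutive failed checks."""

    def __init__(self, check_fail):
        self.check_fail = check_fail
        self.failures = 0
        self.up = True

    def record(self, ok):
        """Record one check result and return the resulting UP state."""
        if ok:
            # Simplification: a single success resets the counter and
            # brings the service back UP.
            self.failures = 0
            self.up = True
        else:
            self.failures += 1
            if self.failures >= self.check_fail:
                self.up = False
        return self.up


if __name__ == "__main__":
    counter = FailCounter(check_fail=2)
    # One slow check no longer withdraws the prefix; two in a row still do.
    for ok in [True, False, True, False, False]:
        print("UP" if counter.record(ok) else "DOWN")
```

With check_fail = 1, a single slow ss run withdraws the prefix straight away. With check_fail = 2, an isolated timeout is absorbed and only a failure that persists across two checks withdraws it, so the router sees fewer withdraw/re-advertise cycles.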

Change 630757 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Anycast: add check_fail

https://gerrit.wikimedia.org/r/630757

Change 630757 merged by Ayounsi:
[operations/puppet@production] Anycast: add check_fail

https://gerrit.wikimedia.org/r/630757

ayounsi closed this task as Resolved. Tue, Sep 29, 11:22 AM

Fixed.