
IPv6 ~20ms higher ping than IPv4 to gerrit
Closed, Resolved · Public

Description

$ mtr -4 -w gerrit.wikimedia.org
Start: 2018-12-04T07:59:42+0000
HOST: ubuntu64-web-esxi                           Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                                     0.0%    10    0.3   0.4   0.3   1.1   0.2
  2.|-- bottomless.aa.net.uk                         0.0%    10   11.1  11.2  11.0  11.8   0.2
  3.|-- e.aimless.tch.aa.net.uk                     40.0%    10   11.5  11.4  11.3  11.5   0.1
  4.|-- xe-0-1-0-3-1.r04.londen05.uk.bb.gin.ntt.net  0.0%    10   11.8  11.9  11.1  12.2   0.3
  5.|-- ae-0.r24.londen12.uk.bb.gin.ntt.net          0.0%    10   12.0  13.6  11.8  28.1   5.1
  6.|-- ae-5.r24.nycmny01.us.bb.gin.ntt.net          0.0%    10   78.9  79.4  78.6  81.3   0.9
  7.|-- ae-1.r25.nycmny01.us.bb.gin.ntt.net          0.0%    10   80.1  80.2  80.0  80.5   0.1
  8.|-- ae-9.r22.asbnva02.us.bb.gin.ntt.net          0.0%    10   85.7  85.7  85.4  86.1   0.2
  9.|-- ae-1.r05.asbnva02.us.bb.gin.ntt.net          0.0%    10   86.0  85.9  85.5  86.3   0.2
 10.|-- ae-0.a03.asbnva02.us.bb.gin.ntt.net          0.0%    10   84.9  87.1  84.4  96.4   4.4
 11.|-- xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net   0.0%    10   86.0  86.2  85.1  93.9   2.7
 12.|-- gerrit.wikimedia.org                         0.0%    10   85.9  85.8  85.6  86.0   0.1
$ mtr -w gerrit.wikimedia.org
Start: 2018-12-04T07:57:51+0000
HOST: ubuntu64-web-esxi                                                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- a.c.4.5.b.b.1.5.f.f.f.f.0.0.0.0.1.1.1.1.1.1.1.1.0.b.8.0.1.0.0.2.ip6.arpa  0.0%    10    0.3   0.4   0.3   0.4   0.0
  2.|-- a.gormless.thn.aa.net.uk                                                  0.0%    10   11.0  11.1  11.0  11.3   0.1
  3.|-- ntt.a.needless.tch.aa.net.uk                                              0.0%    10   11.4  11.4  11.3  11.6   0.1
  4.|-- xe-0-1-0-3-1.r04.londen05.uk.bb.gin.ntt.net                               0.0%    10   11.7  11.7  11.3  12.0   0.2
  5.|-- ae-0.r24.londen12.uk.bb.gin.ntt.net                                       0.0%    10   11.9  11.7  11.6  11.9   0.1
  6.|-- ae-5.r24.nycmny01.us.bb.gin.ntt.net                                       0.0%    10   80.2  80.1  79.2  81.5   0.6
  7.|-- ae-1.r25.nycmny01.us.bb.gin.ntt.net                                       0.0%    10   78.6  78.7  78.4  79.0   0.2
  8.|-- ae-9.r22.asbnva02.us.bb.gin.ntt.net                                       0.0%    10   85.6  85.8  85.4  86.6   0.4
  9.|-- ae-7.r06.asbnva02.us.bb.gin.ntt.net                                       0.0%    10   84.7  84.4  83.8  84.8   0.3
 10.|-- ae-1.a03.asbnva02.us.bb.gin.ntt.net                                       0.0%    10   84.5  86.2  84.0  99.6   4.7
 11.|-- xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net                                0.0%    10  103.9 104.0 103.8 104.3   0.2
 12.|-- gerrit.wikimedia.org                                                      0.0%    10  105.1 105.1 105.0 105.3   0.1

It seems that over IPv6 the last NTT hop (xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net) and gerrit.wikimedia.org both show ~20ms more latency than over IPv4, while every earlier hop is within about 1ms.
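To make the comparison less eyeball-driven, here is a minimal sketch that parses the hop lines of two mtr reports and prints the per-hop difference in the Avg column. The helper names (`parse_hops`, `avg_deltas`) are made up for illustration, and the sample data is just the last three hops of the two reports above.

```python
import re

# One hop line of an mtr report, e.g.
#  12.|-- gerrit.wikimedia.org  0.0%  10  85.9  85.8  85.6  86.0  0.1
# Groups: hop, host, loss, sent, last, avg, best, worst, stdev.
HOP_RE = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_hops(report: str) -> dict[int, tuple[str, float]]:
    """Return {hop_number: (host, avg_ms)} for every hop line in the report."""
    hops = {}
    for line in report.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops[int(m.group(1))] = (m.group(2), float(m.group(6)))
    return hops


def avg_deltas(v4_report: str, v6_report: str) -> dict[int, float]:
    """Return {hop_number: v6_avg - v4_avg} for hops present in both reports."""
    v4, v6 = parse_hops(v4_report), parse_hops(v6_report)
    return {n: round(v6[n][1] - v4[n][1], 1) for n in sorted(v4.keys() & v6.keys())}


# Last three hops copied from the two reports in the description.
V4_TAIL = """
 10.|-- ae-0.a03.asbnva02.us.bb.gin.ntt.net          0.0%    10   84.9  87.1  84.4  96.4   4.4
 11.|-- xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net   0.0%    10   86.0  86.2  85.1  93.9   2.7
 12.|-- gerrit.wikimedia.org                         0.0%    10   85.9  85.8  85.6  86.0   0.1
"""
V6_TAIL = """
 10.|-- ae-1.a03.asbnva02.us.bb.gin.ntt.net          0.0%    10   84.5  86.2  84.0  99.6   4.7
 11.|-- xe-0-0-28-0.a03.asbnva02.us.ce.gin.ntt.net   0.0%    10  103.9 104.0 103.8 104.3   0.2
 12.|-- gerrit.wikimedia.org                         0.0%    10  105.1 105.1 105.0 105.3   0.1
"""

deltas = avg_deltas(V4_TAIL, V6_TAIL)
for hop, delta in deltas.items():
    print(f"hop {hop}: {delta:+.1f} ms")
```

Hop numbers are compared positionally, so this only makes sense when both paths have the same length, as they do here.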

Event Timeline

Restricted Application added a subscriber: Aklapper.

From bast1002 to the hop-2 endpoints shown in the traces above, over v4 and v6:

bblack@bast1002:~$ mtr -c 10 -r -4 bottomless.aa.net.uk
Start: Tue Dec  4 12:23:35 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.2   0.2   0.2   0.4   0.0
  2.|-- ae0.cr1-eqiad.wikimedia.o  0.0%    10    0.2   0.2   0.2   0.3   0.0
  3.|-- xe-0-0-28-0.a03.asbnva02.  0.0%    10    1.7   2.8   0.6  11.6   3.4
  4.|-- ae-70.r06.asbnva02.us.bb.  0.0%    10   72.3  72.4  72.3  72.6   0.0
  5.|-- ae-2.r22.asbnva02.us.bb.g  0.0%    10    1.5   2.8   0.6  10.0   3.2
  6.|-- ae-5.r25.nycmny01.us.bb.g  0.0%    10    6.1   6.1   6.1   6.4   0.0
  7.|-- ae-1.r24.nycmny01.us.bb.g  0.0%    10    6.7   6.9   6.7   7.6   0.0
  8.|-- ae-9.r24.londen12.uk.bb.g  0.0%    10   73.7  74.6  73.7  79.4   1.7
  9.|-- ae-1.r04.londen05.uk.bb.g  0.0%    10   73.7  73.6  73.5  73.9   0.0
 10.|-- e.aimless.aa.net.uk       50.0%    10   74.6  74.7  74.6  74.7   0.0
 11.|-- bottomless.aa.net.uk       0.0%    10   74.7  74.8  74.7  74.9   0.0
bblack@bast1002:~$ mtr -c 10 -r -6 bottomless.aa.net.uk
Start: Tue Dec  4 12:23:58 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.3   0.3   0.2   0.3   0.0
  2.|-- xe-0-1-5.cr2-eqord.wikime  0.0%    10   66.5  32.2  28.3  66.5  12.0
  3.|-- 10gigabitethernet4-1.core  0.0%    10   25.8  25.1  25.0  25.8   0.0
  4.|-- 100ge16-1.core1.nyc4.he.n 10.0%    10   25.2  25.2  25.1  25.3   0.0
  5.|-- 100ge16-2.core1.lon2.he.n  0.0%    10   92.2  92.4  92.1  93.2   0.0
  6.|-- k.aimless.thn.aa.net.uk    0.0%    10   92.4  92.4  92.3  92.5   0.0
  7.|-- bottomless.aa.net.uk       0.0%    10   91.1  91.2  91.1  91.3   0.0
bblack@bast1002:~$ mtr -c 10 -r -4 a.gormless.thn.aa.net.uk
Start: Tue Dec  4 12:24:41 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    7.7   1.2   0.3   7.7   2.3
  2.|-- ae0.cr1-eqiad.wikimedia.o  0.0%    10    0.2   0.2   0.2   0.3   0.0
  3.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
bblack@bast1002:~$ mtr -c 10 -r -6 a.gormless.thn.aa.net.uk
Start: Tue Dec  4 12:25:03 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.2   0.3   0.2   0.3   0.0
  2.|-- xe-0-1-5.cr2-eqord.wikime  0.0%    10   28.3  33.5  28.3  79.0  16.0
  3.|-- 10gigabitethernet4-1.core  0.0%    10   25.0  25.1  25.0  25.6   0.0
  4.|-- 100ge16-1.core1.nyc4.he.n  0.0%    10   25.1  25.1  25.1  25.3   0.0
  5.|-- 100ge16-2.core1.lon2.he.n  0.0%    10   92.1  94.1  92.1 111.1   5.9
  6.|-- k.aimless.thn.aa.net.uk   10.0%    10   92.4  92.4  92.2  92.5   0.0
  7.|-- a.gormless.thn.aa.net.uk   0.0%    10   91.4  91.3  91.2  91.4   0.0

So, the return paths are different. Over IPv4 it goes straight from Ashburn -> NYC -> LON using NTT. Over IPv6 it takes a longer path, Ashburn -> Chicago -> NYC -> LON, via Hurricane Electric.

(But note that first hop from Ashburn to Chicago is our routers' choice, so it's possible some of our route engineering is at play here).

faidon renamed this task from IPv6 ~20ms higher ping than IPv4 to gerrit on last ntt hop to IPv6 ~20ms higher ping than IPv4 to gerrit. Dec 4 2018, 1:49 PM
faidon triaged this task as High priority.

The forward paths are nearly identical, but the reverse is not: reverse path selection is HE for IPv6 and NTT for IPv4, so different paths, and latency could be reasonably explained by that.

However, it seems that the path through HE goes via eqord, even though the destination is in London and eqiad has its own connectivity with HE. Digging a little further, it looks like the route from eqord has a localpref of 250, while the local eqiad route has a localpref of 100. This, in turn, is explained by this:

faidon@cr2-eqord> show route 2001:8b0::/32                        
[…]
2001:8b0::/32      *[BGP/170] 3d 09:47:18, MED 0, localpref 250, from 2001:504:0:4:ffff:ffff:ffff:1
                      AS path: 6939 20712 ?, validation-state: unverified
                    > to 2001:504:0:4::6939:1 via xe-0/1/4.0
                    [BGP/170] 3d 09:47:18, MED 0, localpref 250, from 2001:504:0:4:ffff:ffff:ffff:2
                      AS path: 6939 20712 ?, validation-state: unverified
                    > to 2001:504:0:4::6939:1 via xe-0/1/4.0
[…]
                    [BGP/170] 3d 09:47:19, localpref 100
                      AS path: 6939 20712 ?, validation-state: unverified
                    > to 2001:504:0:4::6939:1 via xe-0/1/4.0
[…]

So, while the intention is for the routes learned from HE in eqord to have a localpref of 100, we also learn those routes via the route servers on the very same router, and those copies get a localpref of 250. This is clearly a bug.
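To spell out why the 250 copies win, here is a minimal sketch of the first steps of BGP best-path selection (highest localpref, then shortest AS path), fed with the three paths from the `show route` output. The `Route` class and `best_path` helper are illustrative names only, and the real selection has more tie-breakers than are modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    learned_from: str          # free-text label, for illustration only
    local_pref: int
    as_path: tuple[int, ...]


def best_path(routes: list[Route]) -> Route:
    """Pick the path with the highest localpref, then the shortest AS path.

    Only the first two BGP tie-breakers are modelled; that is enough to
    show the effect discussed in this task.
    """
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


HE_PATH = (6939, 20712)

# The three paths for 2001:8b0::/32 seen on cr2-eqord.
observed = [
    Route("route server ...:1", 250, HE_PATH),
    Route("route server ...:2", 250, HE_PATH),
    Route("direct HE session", 100, HE_PATH),
]
print(best_path(observed).learned_from)    # a route-server copy wins

# If the route-server copies end up below 100 (50 is the value seen later
# in this task), the direct session's path becomes the best one.
downpreffed = [
    Route("route server ...:1", 50, HE_PATH),
    Route("route server ...:2", 50, HE_PATH),
    Route("direct HE session", 100, HE_PATH),
]
print(best_path(downpreffed).learned_from)
```

The AS path is identical in all three cases, so localpref alone decides which copy is used and advertised onwards.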

We need to either make some sort of exception (e.g. filter those routes in the route servers if Equinix allows that, or change the localpref based on AS path, or something like that), or implement T204281, which would implicitly address (or hide under the rug? :) all of this too.

Some thoughts here:

It would be ideal to differentiate between peering routes & transit routes in our HE peering and mark those appropriately (with our peering communities, localpref etc.). That would bring the peering routes in both eqiad and eqord, both over our peering & IXP, to localpref 250 in our current configuration, and address this issue. However, it does not seem like HE tags their routes with any communities to that effect, so that is not currently possible.

Another option would be to introduce a PEERING-BLACKLIST aspath list that would be matched in BGP_IXP_in (but not BGP_transit_in) and would apply a localpref of 80 or 90 for AS paths we'd like to avoid choosing peering over. That has the problem, however, that it would need to be site-specific, and may have unintended consequences if e.g. the BGP peering with 6939 is down. It's possible, but it needs a bit more thought.

Or maybe we should just get rid of the peering vs. transit localpref issue and not have to do any of this for now :)

I'm all for testing T204281, but it's probably wise to wait for January for that.

Until then, a temporary fix can be to move HE from the peering group to the transit group.

Talked to Faidon last week; we agreed that a mechanism to ignore AS paths learned from the route servers would be a useful thing to have, and not only a hotfix for this issue.
Not tested, but I *think* this would work; reviews welcome. The main thing I'm not 100% sure about is the order of import.
If they are on the same level, such as import [ BGP_sanitize_in BGP_IXP_in BGP_community_actions ]; they are processed in order (left to right).
Applying import BGP_IX_RS_in at a more specific level (the neighbor) *should* evaluate it first, and then the less specific one.
If it doesn't, then we could apply import [ BGP_sanitize_in BGP_IX_RS_in BGP_community_actions ]; and add an explicit permit at the end of BGP_community_actions

In this special HE case, this change comes at the cost of downpref-ing HE's IPv4 routes if applied to all RS.

A temporary workaround (until T204281?) could be to apply BGP_IX_RS_in only to the IPv6 route servers.

cr2-eqdfw
[edit protocols bgp group IX6 neighbor 2001:504:0:4:ffff:ffff:ffff:1]
+     import BGP_IX_RS_in;
[edit protocols bgp group IX6 neighbor 2001:504:0:4:ffff:ffff:ffff:2]
+     import BGP_IX_RS_in;
[edit policy-options policy-statement BGP_IXP_in then]
-     local-preference 250;
-     default-action accept;
+     next policy;
[edit policy-options]
+   policy-statement BGP_IX_RS_in {
+       term avoid-paths-ix-rs {
+           from as-path-group AVOID-PATHS-IX-RS;
+           then {
+               community add AVOIDED_PATH;
+           }
+       }
+       then next policy;
+   }
[edit policy-options]
    as-path-group SELECTED-PATHS { ... }
+   as-path-group AVOID-PATHS-IX-RS {
+       as-path NONE 0;
+       as-path HE "6939 .*";
+   }

1/ Define the as-paths we want to ignore from the route-servers
2/ Apply the policy BGP_IX_RS_in to the route servers only, which only adds a community on target routes
3/ Fix a misconfiguration in BGP_IXP_in, which applies the local-pref and accepts the route directly, instead of only adding a community and moving on to the next policy, BGP_community_actions, to apply the local-pref and implicitly accept the route.
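The three steps above can be sketched as a toy model of the import chain, to show why the terminating accept in BGP_IXP_in matters. This is a rough illustration, not our actual policies: the `PEERING` community name and the exact contents of `community_actions` are assumptions, the localpref of 50 for avoided paths is taken from the verification output later in this task, and BGP_sanitize_in is left out.

```python
from dataclasses import dataclass, field

ACCEPT, NEXT = "accept", "next policy"


@dataclass
class Route:
    as_path: tuple
    local_pref: int = 100
    communities: set = field(default_factory=set)


def ixp_rs_in(route):
    """Step 2: only tag paths we want to avoid learning from the route servers."""
    if route.as_path[:1] == (6939,):
        route.communities.add("AVOIDED_PATH")
    return NEXT


def ixp_in_old(route):
    """Old behaviour: set the localpref and accept, ending evaluation here."""
    route.local_pref = 250
    return ACCEPT


def ixp_in_new(route):
    """Step 3: only tag the route (hypothetical community name) and move on."""
    route.communities.add("PEERING")
    return NEXT


def community_actions(route):
    """Assumed shape: the localpref is derived from communities in one place."""
    if "AVOIDED_PATH" in route.communities:
        route.local_pref = 50
    elif "PEERING" in route.communities:
        route.local_pref = 250
    return NEXT


def run_chain(route, chain):
    """Evaluate policies left to right; a terminating accept stops the chain."""
    for policy in chain:
        if policy(route) == ACCEPT:
            break
    return route


old = run_chain(Route((6939, 20712)), [ixp_rs_in, ixp_in_old, community_actions])
new = run_chain(Route((6939, 20712)), [ixp_rs_in, ixp_in_new, community_actions])
other = run_chain(Route((2914, 20712)), [ixp_rs_in, ixp_in_new, community_actions])
print(old.local_pref, new.local_pref, other.local_pref)
```

With the old BGP_IXP_in the AVOIDED_PATH tag is added but never acted on, because evaluation stops before BGP_community_actions runs.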

  • It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the whole import chain there.
  • Would it make sense to have separate as-path groups for v4/v6? It's a bit unusual in our config, but it would address the issue with HE and avoid inadvertently downprefing HE for IPv4 for no reason.
  • If we're going to remove the local-preference setting from BGP_IXP_in and just rely on BGP_community_actions to apply based on communities (it's a good idea!), then we should probably do the same for BGP_Private_Peer_in for consistency.
  • Nitpick: the non-RS policies are called BGP_IXP_…, so let's follow that naming scheme (i.e. "BGP_IXP_RS_in", not "IX")
  • It's been a while, but I believe an import statement in the neighbor block overrides the parent one in its entirety, and does not supplement it, so we'd have to repeat the whole import chain there.

Noted, that is not an issue as it doesn't make the configuration much more complex.
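For reference, a minimal sketch of the inheritance behaviour described in that comment: the most specific import statement is the only one used, it is not merged with the group-level one. The `effective_import` helper is purely illustrative, and the behaviour is modelled as recalled above rather than checked on a router.

```python
def effective_import(global_import=None, group_import=None, neighbor_import=None):
    """Return the import chain that applies to a neighbor.

    The most specific level that defines an import wins outright; the
    chains from less specific levels are ignored, not appended.
    """
    for chain in (neighbor_import, group_import, global_import):
        if chain is not None:
            return list(chain)
    return []


group_chain = ["BGP_sanitize_in", "BGP_IXP_in", "BGP_community_actions"]

# Setting only the new policy on the neighbor would drop the rest of the chain.
only_rs = effective_import(group_import=group_chain,
                           neighbor_import=["BGP_IX_RS_in"])
print(only_rs)

# So the whole chain has to be repeated at the neighbor level.
full = effective_import(
    group_import=group_chain,
    neighbor_import=["BGP_sanitize_in", "BGP_IX_RS_in",
                     "BGP_IXP_in", "BGP_community_actions"],
)
print(full)
```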

  • Would it make sense to have separate as-path groups for v4/v6? It's a bit unusual in our config, but it would address the issue with HE and avoid inadvertently downprefing HE for IPv4 for no reason.

I think it would be better to avoid having configuration specific to the HE use case, especially if T204281 could solve the issue.
Maybe only apply it to the v6 peers until we try T204281, and then revisit how we should tackle it if the result isn't satisfactory.

  • If we're going to remove the local-preference setting from BGP_IXP_in and just rely on BGP_community_actions to apply based on communities (it's a good idea!), then we should probably do the same for BGP_Private_Peer_in for consistency.

Indeed!

  • Nitpick: the non-RS policies are called BGP_IXP_…, so let's follow that naming scheme (i.e. "BGP_IXP_RS_in", not "IX")

Noted.

Mentioned in SAL (#wikimedia-operations) [2018-12-11T16:50:24Z] <XioNoX> replace local-preference/default-action by next policy for BGP_IXP_in and BGP_Private_Peer_in on cr4-ulsfo - T211079

Mentioned in SAL (#wikimedia-operations) [2018-12-11T17:05:52Z] <XioNoX> remove redundant term classification from BGP_transit_in on cr4-ulsfo - T211079

Mentioned in SAL (#wikimedia-operations) [2018-12-11T17:23:06Z] <XioNoX> push changes tested on cr4-ulsfo to all routers - T211079

Left to push:

cr2-eqord
[edit protocols bgp group IX6 neighbor 2001:504:0:4:ffff:ffff:ffff:1]
+     import [ BGP_sanitize_in BGP_IXP_RS_in BGP_IXP_in BGP_community_actions ];
[edit protocols bgp group IX6 neighbor 2001:504:0:4:ffff:ffff:ffff:2]
+     import [ BGP_sanitize_in BGP_IXP_RS_in BGP_IXP_in BGP_community_actions ];
[edit policy-options]
+   policy-statement BGP_IXP_RS_in {
+       term avoid-paths-ixp-rs {
+           from as-path-group AVOID-PATHS-IXP-RS;
+           then {
+               community add AVOIDED_PATH;
+           }
+       }
+       then next policy;
+   }
[edit policy-options]
    as-path-group SELECTED-PATHS { ... }
+   as-path-group AVOID-PATHS-IXP-RS {
+       as-path NONE 0;
+       as-path HE "6939 .*";
+   }

To be adapted for other sites: the contents of the AVOID-PATHS-IXP-RS group and the IPs of the IX route server neighbors.
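When adapting the group per site, it helps to be clear about what "6939 .*" matches. Below is a small sketch of a matcher for a tiny subset of the AS path regex syntax: literal AS numbers, "." for any single AS, and ".*" for zero or more ASes. The `matches` helper is illustrative only, other operators are not handled, and real router behaviour should be checked with a show route aspath-regex query.

```python
def matches(pattern: str, as_path: list[int]) -> bool:
    """Match an AS path against a space-separated pattern.

    Supported terms: a literal AS number, "." (any one AS) and ".*"
    (zero or more ASes). The whole path must be consumed.
    """
    def walk(tokens, path):
        if not tokens:
            return not path
        head, rest = tokens[0], tokens[1:]
        if head == ".*":
            # Try every possible number of ASes for the wildcard.
            return any(walk(rest, path[i:]) for i in range(len(path) + 1))
        if not path:
            return False
        if head == "." or head == str(path[0]):
            return walk(rest, path[1:])
        return False

    return walk(pattern.split(), list(as_path))


# Per-site contents of the group; these are the two entries used on cr2-eqord.
AVOID_PATHS_IXP_RS = {"NONE": "0", "HE": "6939 .*"}


def is_avoided(as_path: list[int]) -> bool:
    return any(matches(p, as_path) for p in AVOID_PATHS_IXP_RS.values())


print(is_avoided([6939, 20712]))   # HE is the first AS in the path
print(is_avoided([2914, 20712]))   # path via another network
print(is_avoided([2914, 6939]))    # HE present but not first
```

Only paths where 6939 is the neighbor AS are tagged, so routes that merely transit HE further along the path are left alone.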

EDIT: BGP_IXP_in also needs to be applied to the RS

Mentioned in SAL (#wikimedia-operations) [2018-12-11T18:23:55Z] <XioNoX> push BGP_IXP_RS_in to cr2-eqord - T211079

Confirmed working, return path now takes NTT back and is ~17ms faster (to the last hop of previously shared traceroutes).

bast1002:~$ mtr a.gormless.thn.aa.net.uk --report-wide -6
Start: Tue Dec 11 18:29:07 2018
HOST: bast1002                                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikimedia.org            0.0%    10    0.2   1.0   0.2   7.8   2.3
  2.|-- ae0.cr1-eqiad.wikimedia.org                 0.0%    10    1.3   0.5   0.3   1.3   0.0
  3.|-- xe-0-0-28-0.a03.asbnva02.us.bb.gin.ntt.net  0.0%    10    0.6   4.2   0.2  33.1  10.3
  4.|-- ae-70.r06.asbnva02.us.bb.gin.ntt.net        0.0%    10    0.6   0.6   0.6   0.7   0.0
  5.|-- ae-2.r22.asbnva02.us.bb.gin.ntt.net         0.0%    10    0.4   0.4   0.4   0.5   0.0
  6.|-- ae-5.r25.nycmny01.us.bb.gin.ntt.net         0.0%    10    6.5   6.5   6.4   6.5   0.0
  7.|-- ae-1.r24.nycmny01.us.bb.gin.ntt.net         0.0%    10    9.5   6.6   6.2   9.5   1.0
  8.|-- ae-9.r24.londen12.uk.bb.gin.ntt.net         0.0%    10   74.8  75.0  74.7  76.7   0.5
  9.|-- ae-1.r04.londen05.uk.bb.gin.ntt.net         0.0%    10   73.5  73.3  73.2  73.5   0.0
 10.|-- e.aimless.tch.aa.net.uk                    30.0%    10   74.2  74.2  74.1  74.2   0.0
 11.|-- a.gormless.thn.aa.net.uk                    0.0%    10   73.2  73.1  73.0  73.4   0.0
cr2-eqord# run show route 2001:8b0::/32 
2001:8b0::/32      *[BGP/170] 1w3d 14:28:40, localpref 100
                      AS path: 6939 20712 ?, validation-state: unverified
                    > to 2001:504:0:4::6939:1 via xe-0/1/4.0
[...]
                    [BGP/170] 00:04:56, MED 0, localpref 50, from 2001:504:0:4:ffff:ffff:ffff:1
                      AS path: 6939 20712 ?, validation-state: unverified
                    > to 2001:504:0:4::6939:1 via xe-0/1/4.0
                    [BGP/170] 00:04:56, MED 0, localpref 50, from 2001:504:0:4:ffff:ffff:ffff:2
                      AS path: 6939 20712 ?, validation-state: unverified

Mentioned in SAL (#wikimedia-operations) [2018-12-11T18:54:13Z] <XioNoX> push BGP_IXP_RS_in to all routers (but don't apply it to any peers, needs to be done manually) - T211079

Mentioned in SAL (#wikimedia-operations) [2018-12-11T19:57:15Z] <XioNoX> apply BGP_IXP_RS_in and avoid HE to cr4-ulsfo - T211079

The issue is not present in eqdfw, eqiad or esams, as HE is not sending those routes through the RS there.
Pushing the "avoid HE prefixes from the RS" change to those sites anyway, to ensure the issue doesn't show up if for some reason we start getting HE routes via those RS.

ayounsi changed the task status from Open to Stalled. Dec 11 2018, 8:31 PM

All done, marking the task as stalled until T204281

Actually, this can be closed.