Stop prioritizing peering over transit
Closed, Resolved, Public

Description

Long-overdue follow-up to T186835#4089227 (paraphrasing most of it).

When investigating routing/peering in eqsin, I noticed that some prefixes were taking sub-optimal paths, for example:

202.70.77.250 - 119.0ms - 6279
  * 9498 23752 23752 23752 23752 23752 23752 23752 23752 ?   (local-pref 250)
    3491 9498 23752 23752 23752 23752 23752 23752 23752 23752 ? (local-pref 250)
    6453 23752 ?   (local-pref 100)
103.202.217.6 - 186.0ms - 32819
  * 3491 41095 59103 59105 59105 59105 59105 ?   (local-pref 250)
    6453 2518 59105 ?   (local-pref 100)

The shorter AS path isn't chosen because it has a lower local-pref.
This is due to our current policy of prioritizing peering over transit, which might no longer be relevant:

  1. If a prefix is learned from a peer, its AS path will most often be shorter anyway (fewer intermediaries)
  2. Prioritizing peers overrides the destination network's traffic engineering policies, as we can see in the example above (we ignore the AS-prepending), and could hit bottlenecks or cause sub-optimal routing
  3. It requires custom tuning to work around this sub-optimal routing (when we notice it)
  4. The cost savings of sending traffic through free links (vs. paid transit) are nil (we are far from our commits) as long as there is no massive traffic change
  5. It increases the configuration and routing complexity (various rules and routing decisions)

I think we should stop prioritizing peers (especially ones operating at a large geographical scope) over transit in terms of local-pref: routes learned from peers would use the default value of 100 instead of the current local-pref of 250.
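To illustrate why local-pref masks the shorter path, here is a minimal Python sketch of the first two BGP decision steps only (highest local-pref, then shortest AS path). The `Route` class and `best_route` function are hypothetical names for illustration, and real routers evaluate further tie-breakers (origin, MED, and so on) that are left out here. The routes are the ones listed above for 202.70.77.250.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    as_path: Tuple[int, ...]
    local_pref: int


def best_route(routes: List[Route]) -> Route:
    # Simplified selection: highest local-pref wins first,
    # AS-path length is only consulted to break local-pref ties.
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


# Routes for 202.70.77.250 as shown in the description.
routes = [
    Route((9498,) + (23752,) * 8, 250),        # learned via peering
    Route((3491, 9498) + (23752,) * 8, 250),   # learned via peering
    Route((6453, 23752), 100),                 # learned via transit
]

# Current policy: peering routes carry local-pref 250.
current = best_route(routes)

# Proposed policy: every route keeps the default local-pref of 100.
proposed = best_route([Route(r.as_path, 100) for r in routes])

print("current: ", current.as_path[0], "path length", len(current.as_path))
print("proposed:", proposed.as_path[0], "path length", len(proposed.as_path))
```

With equal local-pref, the destination's AS-prepending takes effect again and the two-hop path via 6453 is selected.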

A test was done previously in eqsin (see T186835#4121297) and I'd like to run the same test on a larger scale: at least esams, at best globally.

The 3 aspects to monitor are:

  1. Link capacity (e.g. traffic shifts and saturates a transit link)
  2. Transit commits (billing)
  3. Performance (are we seeing any improvement or degradation)

The first has been verified and we have plenty of capacity; the last would require the help of the Performance team.

The test would be successful if neither a performance degradation nor a massive traffic shift occurs (neither is expected).

Event Timeline

ayounsi triaged this task as Medium priority. Sep 13 2018, 9:32 PM
ayounsi created this task.
Restricted Application added a project: Operations. Sep 13 2018, 9:32 PM
Restricted Application added a subscriber: Aklapper.

Sounds interesting. Keep Perf in the loop as you start to think about how to do this, and what your target geos might be.

The following needs to be pushed to the routers to stop adding a higher local_pref to routes learned via peering (IXP + private).

[edit policy-options policy-statement BGP_community_actions]
!     inactive: term peer-private-peer { ... }
!     inactive: term peer { ... }

If successful, those terms could be deleted instead of deactivated.

Gilles added a subscriber: Gilles. Jan 10 2019, 7:52 PM

Following @ayounsi's request, I've put together per-DC real user monitoring performance metrics using the following Hive query:

SELECT
    day,
    SUBSTR(recvfrom, 8, 5) AS dc,
    PERCENTILE(event.responseStart - event.connectStart, 0.5) AS median_ttfb,
    PERCENTILE(event.loadEventStart - event.responseStart, 0.5) AS median_plt
FROM event.navigationtiming
WHERE year = 2019 AND month = 1 AND day < 11
GROUP BY day, SUBSTR(recvfrom, 8, 5);

Which results in the following data: https://docs.google.com/spreadsheets/d/1LfVw4nWzqH7Tp9hgCNkbs5M5X9oJGVlwQYq_kIsj9vE/edit?usp=sharing

We can see that the 2 key metrics likely to be affected by this change (time-to-first-byte and page load time) are very stable per DC. This is a sufficient baseline to verify the effect of this change. After the change has been in effect for a few days in a DC (e.g. esams), we can run this query again and see whether any overall change occurred.
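For readers who want to reproduce the aggregation outside Hive, the query's logic can be sketched in plain Python as follows. This is only an illustration: the function names, hostnames, and timing values in the sample rows are invented, and the real `event.navigationtiming` rows nest the timing fields under `event`. The one detail worth noting is that Hive's `SUBSTR` is 1-indexed, so `SUBSTR(recvfrom, 8, 5)` corresponds to the Python slice `[7:12]`.

```python
from collections import defaultdict
from statistics import median


def dc_from_recvfrom(recvfrom):
    # Hive SUBSTR(recvfrom, 8, 5): 5 characters starting at 1-indexed position 8.
    return recvfrom[7:12]


def median_metrics(rows):
    # Group by (day, dc), then take the median TTFB and page load time per group.
    groups = defaultdict(lambda: {"ttfb": [], "plt": []})
    for row in rows:
        key = (row["day"], dc_from_recvfrom(row["recvfrom"]))
        groups[key]["ttfb"].append(row["responseStart"] - row["connectStart"])
        groups[key]["plt"].append(row["loadEventStart"] - row["responseStart"])
    return {
        key: (median(values["ttfb"]), median(values["plt"]))
        for key, values in groups.items()
    }


# Hypothetical sample rows (hostnames and timings are made up for the example).
rows = [
    {"day": 1, "recvfrom": "cp3030.esams.wmnet",
     "connectStart": 10, "responseStart": 210, "loadEventStart": 1210},
    {"day": 1, "recvfrom": "cp3031.esams.wmnet",
     "connectStart": 20, "responseStart": 320, "loadEventStart": 1820},
    {"day": 1, "recvfrom": "cp3032.esams.wmnet",
     "connectStart": 0, "responseStart": 250, "loadEventStart": 1450},
]

result = median_metrics(rows)
for (day, dc), (ttfb, plt) in sorted(result.items()):
    print(day, dc, ttfb, plt)
```

Running the same function over rows from before and after the change gives two sets of per-DC medians that can be compared directly.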

elukey added a subscriber: elukey. Jan 11 2019, 3:02 PM
ayounsi updated the task description. Jan 14 2019, 11:35 PM

Scheduling the work for Tuesday, January 22nd, 16:00 UTC; scope is Amsterdam only.
That gives us the remainder of the week to monitor for any issues. Then collect/analyze results after the All Hands.

Mentioned in SAL (#wikimedia-operations) [2019-01-22T16:12:31Z] <XioNoX> deactivate local pref for peering sessions in es/knams - T204281

ayounsi added a comment. Edited · Jan 22 2019, 5:19 PM

Initial observation shows a ~600-800Mbps traffic shift from peering to transit, which is small and within the expected range.

Gilles added a comment. Feb 6 2019, 8:15 AM

I've updated the Google Spreadsheet with the figures up to yesterday. From the end users' perspective, nothing seems to have changed in terms of median time-to-first-byte: it's in the same range as before the change. Same for median page load time.

I think you can go ahead and roll out that change to all DCs.

Mentioned in SAL (#wikimedia-operations) [2019-02-19T05:17:35Z] <XioNoX> deleted previously deactivated BGP_community_actions terms - T204281

Doing the change in ulsfo:

[edit policy-options policy-statement BGP_community_actions]
-    term peer-private-peer {
-        from community PEER_PRIVATE_PEER;
-        then {
-            local-preference 250;
-            next policy;
-        }
-    }
-    term peer {
-        from community PEERING_ROUTE;
-        then {
-            local-preference 250;
-            next policy;
-        }
-    }

For reference, here are all transit ports stacked:
https://librenms.wikimedia.org/graphs/id=7219,16765,7159/type=multiport_bits_separate/
And here is peering:
https://librenms.wikimedia.org/graphs/id=16787/type=port_bits/

Mentioned in SAL (#wikimedia-operations) [2019-02-19T05:31:11Z] <XioNoX> delete local pref for peering sessions in ulsfo - T204281

This caused a ~80Mbps traffic drop on the peering link.

Mentioned in SAL (#wikimedia-operations) [2019-02-26T22:13:06Z] <XioNoX> delete local pref for peering sessions in eqsin - T204281

In eqsin, the private-peer term was removed a while back to do traffic engineering specific to this site.

[edit policy-options policy-statement BGP_community_actions]
-    term peer {
-        from community PEERING_ROUTE;
-        then {
-            local-preference 250;
-            next policy;
-        }
-    }

For reference, here are all transit ports stacked:
https://librenms.wikimedia.org/graphs/id=17836,17835,13948/type=multiport_bits_separate/
And here is peering:
https://librenms.wikimedia.org/graphs/id=13958,17840/type=multiport_bits_separate/

~80Mbps traffic shift to transit too.

Did something happen on 2019-02-21? I was looking at the ulsfo RUM perf metrics for the change in that DC. No change is apparent after the 2019-02-19 config change, except for that particular day (the 21st), which stands out as having noticeably bad performance for ulsfo. Pretty much the same TTFB as eqsin users on that day.

Circa 2019-02-21, eqsin was depooled to install a new router, and most of the users normally mapped to eqsin had fallen back to ulsfo temporarily, which would distort the stats of "ulsfo users" considerably.

Mentioned in SAL (#wikimedia-operations) [2019-02-27T20:53:56Z] <XioNoX> delete local pref for peering sessions in codfw/eqdfw - T204281

cr1/2-codfw + cr2-eqdfw

[edit policy-options policy-statement BGP_community_actions]
-    term peer-private-peer {
-        from community PEER_PRIVATE_PEER;
-        then {
-            local-preference 250;
-            next policy;
-        }
-    }
-    term peer {
-        from community PEERING_ROUTE;
-        then {
-            local-preference 250;
-            next policy;
-        }
-    }

Peering:
https://librenms.wikimedia.org/graphs/id=16721/type=port_bits/

Transit:
https://librenms.wikimedia.org/graphs/id=8288%2C8209%2C16334/type=multiport_bits_separate/

Because of the way codfw/eqdfw peers with eqiad and eqord, and the fact that we have similar peers in several IXPs,
removing the preferred local-pref in Dallas caused the IX traffic to shift to the peering points in Chicago and (to a lesser extent) Ashburn.

There is no risk of saturation, but routing to those AS# is now sub-optimal until the same change is applied to eqord and eqiad.
Seeing how stable this rollout has been at other sites (with little traffic shift), I'll push the same change in eqord, then eqiad.

Mentioned in SAL (#wikimedia-operations) [2019-02-27T21:26:35Z] <XioNoX> delete local pref for peering sessions in eqord - T204281

Mentioned in SAL (#wikimedia-operations) [2019-02-27T21:57:21Z] <XioNoX> delete local pref for peering sessions in eqiad - T204281

Shifts from peering to transit are:
~200Mbps in eqdfw
~300Mbps in eqord
~300Mbps in eqiad

mark added a subscriber: mark. Mar 1 2019, 12:36 PM
ayounsi closed this task as Resolved. Mar 5 2019, 12:50 AM

Everything here is done. Will reopen if there are any signs of issues down the road.