
drmrs: initial geodns configuration
Closed, Resolved (Public)

Description

Initial geodns configuration for allowing some public traffic flow into our new drmrs site!

  • Define drmrs in geodns config, mapping only our own internal drmrs networks to it (already done in https://gerrit.wikimedia.org/r/c/operations/dns/+/771342 )
  • Enable as primary destination for Cyprus as initial public traffic test
  • Expand initial testing to PT (Portugal)
  • Expand initial testing to ES (Spain)
  • Expand initial testing to FR (France)
  • Enable as fallback destination when esams is offline, replacing the old geo-maps-esams-offline hack
  • Do some esams->drmrs failover testing!
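The primary/fallback behaviour in the list above can be pictured as an ordered per-country list of sites, where the answer is the first site that is not depooled. The Python below is only a rough illustration of that idea, not the actual geodns configuration in operations/dns: the `PREFERENCES` table, the orderings in it, and the `pick_site` helper are all assumptions made up for this example.

```python
# Hypothetical illustration: ordered site preferences per country.
# The orderings below are assumptions, not copied from operations/dns.
PREFERENCES = {
    "CY": ["drmrs", "esams", "eqiad"],
    "PT": ["drmrs", "esams", "eqiad"],
    "NL": ["esams", "drmrs", "eqiad"],  # esams primary, drmrs as fallback
}


def pick_site(country, depooled=frozenset()):
    """Return the first preferred site for `country` that is still pooled."""
    for site in PREFERENCES[country]:
        if site not in depooled:
            return site
    raise LookupError(f"no pooled site left for {country}")


print(pick_site("NL"))                      # esams
print(pick_site("NL", depooled={"esams"}))  # drmrs
print(pick_site("CY", depooled={"drmrs"}))  # esams
```

With drmrs listed ahead of eqiad for esams-primary countries, depooling esams keeps that traffic in Europe instead of needing a separate "esams offline" map.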

Beyond this, probably in separate subtasks that will take a bit longer:

  • Research latency estimations and create a reasonably-optimal map for sending all of the appropriate countries to drmrs as their primary choice
  • Phase this map in over several commits to ramp up to the full map
  • If the esams/drmrs load split seems excessively uneven, consider moving some large-traffic cases in Western Europe, where the latency difference is small, to help level the two sites out.

Event Timeline

Change 771354 had a related patch set uploaded (by BBlack; author: Ayounsi):

[operations/dns@master] GeoDNS Cyprus to drmrs

https://gerrit.wikimedia.org/r/771354

Change 771631 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] geodns: remove geo-maps-esams-offline hack

https://gerrit.wikimedia.org/r/771631

Change 771632 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] geodns: add drmrs fallback for esams to whole map

https://gerrit.wikimedia.org/r/771632

Change 771354 merged by BBlack:

[operations/dns@master] GeoDNS Cyprus to drmrs

https://gerrit.wikimedia.org/r/771354

Change 771672 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] geodns: maxmind now has CY in EU rather than AS

https://gerrit.wikimedia.org/r/771672

Change 771672 merged by BBlack:

[operations/dns@master] geodns: maxmind now has CY in EU rather than AS

https://gerrit.wikimedia.org/r/771672

jbond triaged this task as Medium priority. Mar 21 2022, 11:58 AM

Arzhel and I discussed this a bit, and we're going to add a few more countries manually for now before proceeding with the esams-resiliency patches. Arzhel identified PT, ES, and FR as good targets: they'll get us significantly more traffic than drmrs has now and keep caches hotter, and they all see at least some latency improvement at drmrs. We'll roll these out in that order (smallest first), and then see about working on the esams failover parts!

Change 772876 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] map Portugal to drmrs

https://gerrit.wikimedia.org/r/772876

Change 772876 merged by BBlack:

[operations/dns@master] map Portugal to drmrs

https://gerrit.wikimedia.org/r/772876

Change 773244 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] map Spain to drmrs

https://gerrit.wikimedia.org/r/773244

Change 773245 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] map France to drmrs

https://gerrit.wikimedia.org/r/773245

PT was pretty smooth. ES is likely to follow later today, closer to when its daily traffic cycle begins to trend downwards.

Change 773244 merged by BBlack:

[operations/dns@master] map Spain to drmrs

https://gerrit.wikimedia.org/r/773244

Change 773245 merged by BBlack:

[operations/dns@master] map France to drmrs

https://gerrit.wikimedia.org/r/773245

For esams failover testing: we're planning to attempt this on Thursday. The idea is to merge the outstanding patches and then depool esams when it's ~halfway or more through its daily downslope in traffic, and then keep an eye out for any transit saturation (due to transit imbalances that might need some route engineering), especially when the next daily upslope starts.

Graph of the daily cycle of esams+drmrs requests: https://w.wiki/4$8e

Plan details:

(all times UTC, starting on Thursday and then rolling through to Friday!)

  1. ~22:xx or so - merge https://gerrit.wikimedia.org/r/c/operations/dns/+/771631 + https://gerrit.wikimedia.org/r/c/operations/dns/+/771632 (makes drmrs the fallback choice for esams)
  2. ~23:00 - Depool esams in geodns using the normal admin_state mechanism (this is a little past halfway down the daily traffic downslope, usually).
  3. ~23:00 - ~06:00 - We're in the deep part of the daily low through most of this, pretty low risk, and caches will be gaining some contents that will be useful later.
  4. ~06:00 - This is roughly the point in time on the upswing where we'll see new traffic levels higher than what we had at the outset at 23:00.
  5. ~09:00 - ~21:00 - This is roughly the peak traffic period (a plateau with some inflection points).
  6. ~23:00 - re-pool esams and end experiment, having survived 24h.

Again, the primary risk is traffic imbalance leading to the saturation of a single transit. There are smaller risks of saturating a peering or transport connection, but those scenarios seem less likely. The ports to keep an eye on:

Transits:
https://librenms.wikimedia.org/device/device=239/tab=port/port=23135/
https://librenms.wikimedia.org/device/device=239/tab=port/port=23133/
https://librenms.wikimedia.org/device/device=240/tab=port/port=23199/

Peerings:
https://librenms.wikimedia.org/device/device=239/tab=port/port=23132/
https://librenms.wikimedia.org/device/device=240/tab=port/port=23197/

Transports:
https://librenms.wikimedia.org/device/device=239/tab=port/port=23134/
https://librenms.wikimedia.org/device/device=240/tab=port/port=23198/

If one of them looks likely to (or does!) saturate, or if any other mystery arises, we can re-pool esams to alleviate it. It should go smoothly. We plan to have some Traffic folks around for most of this, but in any case, anyone can simply revert the esams depool patch if there are issues!
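As a rough sketch of the "watch the ports, re-pool if one nears saturation" rule, the Python below flags any port whose outbound rate reaches a chosen fraction of its capacity. Everything in it is an assumption for illustration: the port labels, the 10 Gbps capacities, the 0.8 threshold, and the sample rates are placeholders, not values from this task or from librenms.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    name: str             # hypothetical label, e.g. "transit-a"
    out_gbps: float       # observed outbound rate
    capacity_gbps: float  # link capacity (assumed; not stated in this task)


def ports_at_risk(samples, threshold=0.8):
    """Return names of ports whose utilization is at or above `threshold`."""
    return [s.name for s in samples if s.out_gbps / s.capacity_gbps >= threshold]


# Placeholder numbers for illustration only, not measurements from the test.
samples = [
    PortSample("transit-a", out_gbps=8.5, capacity_gbps=10.0),
    PortSample("transit-b", out_gbps=3.0, capacity_gbps=10.0),
    PortSample("peering-a", out_gbps=4.0, capacity_gbps=10.0),
]

at_risk = ports_at_risk(samples)
if at_risk:
    print("consider re-pooling esams; ports near saturation:", at_risk)
```

A threshold below 1.0 leaves some headroom, since re-pooling via DNS only shifts traffic gradually as cached answers expire.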

Change 771631 merged by BBlack:

[operations/dns@master] geodns: remove geo-maps-esams-offline hack

https://gerrit.wikimedia.org/r/771631

Change 771632 merged by BBlack:

[operations/dns@master] geodns: add drmrs fallback for esams to whole map

https://gerrit.wikimedia.org/r/771632

Change 776003 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] Remove one last esams-offline note

https://gerrit.wikimedia.org/r/776003

Change 776004 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] Depool esams to test drmrs at full EMEA load

https://gerrit.wikimedia.org/r/776004

Change 776003 merged by BBlack:

[operations/dns@master] Remove one last esams-offline note

https://gerrit.wikimedia.org/r/776003

Change 776004 merged by BBlack:

[operations/dns@master] Depool esams to test drmrs at full EMEA load

https://gerrit.wikimedia.org/r/776004

Mentioned in SAL (#wikimedia-operations) [2022-03-31T23:01:11Z] <bblack> esams->drmrs failover test begins - T304089

Change 776009 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/dns@master] Remove esams-offline note from README

https://gerrit.wikimedia.org/r/776009

Change 776009 merged by BBlack:

[operations/dns@master] Remove esams-offline note from README

https://gerrit.wikimedia.org/r/776009

Note - we've made a last-minute change of plans about the timeline of the experiment, and decided to shorten it by one hour. We'll be re-pooling esams at ~22:00 UTC today, not ~23:00 as originally planned. Rationale is something like:

  • At 22:00, we should be past the true peak of the daily EMEA cycle, so there's little value in that extra hour of testing
  • Our caches cap all object TTLs at 24h
  • Therefore, if we wait a full 24h before re-pooling the (currently user-free) esams, virtually everything in the caches there will be expired.
  • Even if we had waited the full 24h, the caches would re-use some stale objects to help with inrush, but this would still trigger a large spike in background fetches to the applayer, risking transport saturation or application load impacts from the esams inrush.
  • Those risks, late on a Friday, aren't worth it for the near-zero value of the extra hour of testing when we're already past the peak traffic time.
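The TTL argument above is simple arithmetic, sketched here in Python. The function name `freshness_window` is made up for this example, the depool time is rounded to 23:00 UTC, and the sketch assumes nothing refreshes the esams caches while the site is depooled. The result is only an upper bound: an object inside the window is fresh only if it was also stored with a long enough TTL.

```python
from datetime import datetime, timedelta, timezone

TTL_CAP = timedelta(hours=24)  # "Our caches cap all object TTLs at 24h"


def freshness_window(depool_at, repool_at, ttl_cap=TTL_CAP):
    """Length of the pre-depool period whose objects could still be fresh.

    An object last fetched at time t expires no later than t + ttl_cap, so at
    repool time only objects fetched after (repool_at - ttl_cap) can be fresh.
    Assumes nothing refreshes the cache while the site is depooled.
    """
    oldest_fresh_fetch = repool_at - ttl_cap
    return max(depool_at - oldest_fresh_fetch, timedelta(0))


depool = datetime(2022, 3, 31, 23, 0, tzinfo=timezone.utc)
print(freshness_window(depool, depool + timedelta(hours=23)))  # 1:00:00
print(freshness_window(depool, depool + timedelta(hours=24)))  # 0:00:00
```

Re-pooling after 23h leaves at most the final hour of pre-depool fetches unexpired, while waiting the full 24h leaves none.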

Change 776045 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/dns@master] Revert "Depool esams to test drmrs at full EMEA load"

https://gerrit.wikimedia.org/r/776045

Change 776045 merged by BBlack:

[operations/dns@master] Revert "Depool esams to test drmrs at full EMEA load"

https://gerrit.wikimedia.org/r/776045

Test concluded, and esams is re-pooled. More analysis and planning to follow next week I'm sure, but the basic highlights are:

  • We went 23h (including all of the daily peak period) with all the EMEA traffic on drmrs while esams was fully de-pooled
  • No major issues
  • Telia transit peaked at ~6.6Gbps outbound, the highest by far.
  • Total transit+peering peaked somewhere around 16Gbps (rough manual sum from librenms graphs)
  • @ayounsi did some minor traffic engineering to offload our heaviest transit link out of caution during the ramp-up - https://gerrit.wikimedia.org/r/c/operations/homer/public/+/776157/1/templates/includes/policies/drmrs-paths.conf
  • Basically proves we've got a reliable plan for site-level redundancy within EMEA now, and can depool either of esams or drmrs without having their bulk traffic shifted back across the pond to eqiad in the US, which adds a lot of latency for users and has historically had a tendency to cause various saturation issues.
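To put the transit imbalance in perspective, the two approximate peaks quoted above can be turned into a share. This is back-of-the-envelope Python only: both inputs are rough readings, the two peaks may not have occurred at the same moment, and the `share` helper is just for this example.

```python
# Approximate peaks quoted in the summary above (both are rough readings).
TELIA_PEAK_GBPS = 6.6
TOTAL_PEAK_GBPS = 16.0  # "rough manual sum from librenms graphs"


def share(part, total):
    """Fraction of `total` carried by `part`."""
    return part / total


telia_share = share(TELIA_PEAK_GBPS, TOTAL_PEAK_GBPS)
print(f"Telia share of peak egress: {telia_share:.0%}")  # 41%
```

So a single transit carried roughly two fifths of the peak transit+peering total.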

Thanks @MMandere and @ssingh for providing coverage for various portions of the 23h test window, and Arzhel for keeping an eye on transit balance!

ayounsi claimed this task.

I think everything here is done, and follow-up is in T311472: DRMRS: Geodns Configuration -- Phase 2
Feel free to re-open if needed.