improve GeoDNS-to-edge mapping
Open, Low, Public

Description

Currently, here's how users get 'mapped' to a specific CDN point of presence:

  • We get a DNS request from that user's resolver.
  • We use MaxMind to geolocate that resolver's IP address.
  • We return the IP address of the geographically closest edge location.
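
As a minimal sketch of the last step above (not the production GeoDNS logic): given a resolver location that has already been looked up, pick the site with the smallest great-circle distance. The site coordinates are rough, illustrative values rather than anything from a real config, and `closest_edge` and `haversine_km` are hypothetical helper names.

```python
import math

# Approximate, illustrative coordinates (lat, lon); assumptions, not production data.
EDGE_SITES = {
    "eqiad": (39.0, -77.5),   # Ashburn area
    "ulsfo": (37.8, -122.4),  # San Francisco area
    "esams": (52.4, 4.9),     # Amsterdam area
    "eqsin": (1.3, 103.8),    # Singapore
    "magru": (-23.5, -46.6),  # Sao Paulo area
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def closest_edge(resolver_location, sites=EDGE_SITES):
    """Return the name of the site geographically closest to the resolver."""
    return min(sites, key=lambda name: haversine_km(resolver_location, sites[name]))


# A resolver geolocated to Paris maps to the Amsterdam-area site,
# regardless of where the actual user sits or what the latency is.
print(closest_edge((48.85, 2.35)))
```

Note that the only inputs are the resolver's estimated coordinates and the site coordinates, which is exactly why the issues listed next arise.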

This has the following issues:

  • The resolver IP address used doesn't necessarily correlate with user location (for instance, the user might use Google Public DNS or a similar service)
  • No IP geolocation service has perfect accuracy
  • The geographically closest site is often, but not always, the lowest-latency one

Improving on this has been discussed many times over the years. Here's a proposal for how we might do so, based on both Arzhel's design document from 2021 and a conversation on #wikimedia-traffic IRC a few weeks ago.

  • Create site-specific subdomains for all of our sites, perhaps something like edge-timing-ulsfo.wikimedia.org (although I think we probably want to use a different domain name for such purposes; see also T263847 and T292866)
  • Configure those domains to serve NEL response headers setting both failure_fraction and success_fraction to 1.0, with reports going into our existing NEL pipeline (see T257527). Configure a long TTL on that policy, so that reporting any failures actually happens.
  • Write and deploy some client-side JS that (with a small probability on each pageview?) might fetch a small piece of content from each of our edge sites on those domains. (It's possible we could reuse some parts of Probnik for this, although the tool itself seems to be stagnant since 2019.)
  • Complete T304373 so that NEL data is available in Analytics
  • Design and implement a pipeline in Analytics that will aggregate NEL reports, taking into account how many samples we get from a network, decaying the weight of older samples, etc., to generate something like our geo-maps file (but for networks, not countries). The file format is TBD; it might be best to use something like mmdbwriter, for instance.
  • Use that file to serve GeoDNS responses, after some careful evaluation of the impact of this change (probably with a few iterations of improvement/fixes)
  • Later on: use Alt-Svc to 'solve' the cases where the resolver location<>user location mapping is very wrong (see T208242)
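
To make the aggregation step in the list above more concrete, here is a rough sketch of one way it could work. Everything in it is an assumption for illustration: the input tuple shape, the exponential half-life decay, the minimum-weight threshold, and the function name `best_site_per_network`. A real pipeline would likely want something more robust than a weighted mean (a weighted median or percentile, for example).

```python
from collections import defaultdict


def best_site_per_network(samples, half_life_days=7.0, min_weight=3.0):
    """Pick the lowest-latency site per network from timing samples.

    samples: iterable of (network, site, rtt_ms, age_days) tuples.
    Each sample's weight halves every `half_life_days`, so old data fades out.
    (network, site) pairs whose total weight is below `min_weight` are ignored,
    so a handful of samples cannot decide the mapping for a network.
    """
    # (network, site) -> [weighted RTT sum, weight sum]
    sums = defaultdict(lambda: [0.0, 0.0])
    for network, site, rtt_ms, age_days in samples:
        weight = 0.5 ** (age_days / half_life_days)
        acc = sums[(network, site)]
        acc[0] += weight * rtt_ms
        acc[1] += weight

    # network -> {site: weighted mean RTT}, keeping only well-sampled pairs
    candidates = defaultdict(dict)
    for (network, site), (rtt_sum, weight) in sums.items():
        if weight >= min_weight:
            candidates[network][site] = rtt_sum / weight

    return {net: min(sites, key=sites.get) for net, sites in candidates.items()}


samples = (
    [("192.0.2.0/24", "ulsfo", 20.0, 0.0)] * 4
    + [("192.0.2.0/24", "eqiad", 80.0, 0.0)] * 4
    + [("198.51.100.0/24", "magru", 15.0, 0.0)]  # too few samples to count
)
print(best_site_per_network(samples))
```

The resulting network-to-site dictionary is the kind of data that would then be written out in whatever file format is chosen for the GeoDNS side.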

Event Timeline

CDanis updated the task description.

We did a somewhat experimental version of this work as @JameelKaisar's intern project in T332024 (GeoIP mapping experiments) and related tasks. The infrastructure pieces have mostly gone untouched since then, although we've used the data for some work like magru optimization (T363722).

Here is an annotated version of the original task description:
  • Create site-specific subdomains for all of our sites, perhaps something like edge-timing-ulsfo.wikimedia.org (although I think we probably want to use a different domain name for such purposes; see also T263847 and T292866)

Done in T332025 -- for example measure-magru.wikimedia.org. They are CNAMEs to upload-lb.

  • Configure those domains to serve NEL response headers setting both failure_fraction and success_fraction to 1.0, with reports going into our existing NEL pipeline (see T257527). Configure a long TTL on that policy, so that reporting any failures actually happens.

Done in T334608, although ultimately we rely on direct instrumentation from the client-side JS (see also T337317#8958296)
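
For reference, a policy of the kind described in that bullet could be built roughly like this. This is a sketch, not what T334608 actually deployed: the header and field names follow the W3C Network Error Logging and Reporting API drafts, while the endpoint URL, group name, 30-day `max_age`, and the `nel_headers` helper itself are hypothetical.

```python
import json


def nel_headers(report_url, max_age_seconds=30 * 24 * 3600, group="nel"):
    """Build Report-To and NEL response headers that report every request.

    A long max_age keeps the policy cached in the browser, so a later
    failure to reach the site can still be reported.
    """
    report_to = {
        "group": group,
        "max_age": max_age_seconds,
        "endpoints": [{"url": report_url}],
    }
    nel = {
        "report_to": group,  # must match the Report-To group name
        "max_age": max_age_seconds,
        "failure_fraction": 1.0,  # report all failures
        "success_fraction": 1.0,  # report all successes too
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


# Hypothetical reporting endpoint, for illustration only.
for name, value in nel_headers("https://reports.example.org/nel").items():
    print(f"{name}: {value}")
```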

  • Write and deploy some client-side JS that (with a small probability on each pageview?) might fetch a small piece of content from each of our edge sites on those domains. (It's possible we could reuse some parts of Probnik for this, although the tool itself seems to be stagnant since 2019.)

Done in T334417 -- the probenet code is shipped as part of Extension:WikimediaEvents

  • Complete T304373 so that NEL data is available in Analytics

Not necessary, because the client-side JS reports through EventGate (via Extension:EventLogging).

  • Design and implement a pipeline in Analytics that will aggregate NEL reports, taking into account how many samples we get from a network, decaying the weight of older samples, etc., to generate something like our geo-maps file (but for networks, not countries). The file format is TBD; it might be best to use something like mmdbwriter, for instance.

Not even close to done. But I can offer some Jupyter notebooks and Python scripts that aggregate the data in useful ways and produce plots.

  • Use that file to serve GeoDNS responses, after some careful evaluation of the impact of this change (probably with a few iterations of improvement/fixes)
  • Later on: use Alt-Svc to 'solve' the cases where the resolver location<>user location mapping is very wrong (see T208242)

Oh, and one related thing: we should fix T347114 -- I think that's just a VCL change.