Currently, here's how users get 'mapped' to a specific CDN point of presence:
- We get a DNS request from that user's resolver.
- We use MaxMind to geolocate that resolver's IP address.
- We return the IP address of the geographically closest edge location.
This has the following issues:
- The resolver IP address used doesn't necessarily correlate with user location (for instance, the user might use Google Public DNS or a similar service)
- No IP geolocation service has perfect accuracy
- The geographically closest site is often, but not always, the lowest-latency one
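To make the first issue concrete, here is a minimal sketch of "return the geographically closest site". It is a simplification: a plain great-circle comparison stands in for the real geo-maps configuration, the site coordinates are rough city-level approximations chosen for illustration, and `closest_site` is a hypothetical helper, not anything in our DNS stack.

```python
import math

# Approximate, illustrative coordinates (latitude, longitude) for a few sites.
# These are assumptions for the sketch, not authoritative data.
EDGE_SITES = {
    "ulsfo": (37.77, -122.42),
    "eqiad": (39.04, -77.49),
    "esams": (52.37, 4.90),
}

def great_circle_km(a, b):
    """Haversine distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def closest_site(resolver_location, sites=EDGE_SITES):
    """Pick the site nearest to where the *resolver* was geolocated."""
    return min(sites, key=lambda name: great_circle_km(resolver_location, sites[name]))

# A user in Amsterdam whose resolver geolocates near Ashburn is sent to
# whichever site is closest to the resolver, not to the user.
user_location = (52.37, 4.90)
resolver_location = (39.0, -77.5)
print(closest_site(resolver_location), "vs", closest_site(user_location))
```

The input to the lookup is the resolver's location, so any mismatch between resolver and user carries straight through to the answer.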
Improving on this has been discussed many times over the years. Here's a proposal for how we might do so, based on both Arzhel's design document from 2021 and a conversation on #wikimedia-traffic IRC a few weeks ago.
- Create site-specific subdomains for all of our sites, perhaps something like edge-timing-ulsfo.wikimedia.org (although I think we probably want to use a different domain name for such purposes; see also T263847 and T292866)
- Configure those domains to serve NEL response headers setting both failure_fraction and success_fraction to 1.0, with reports going into our existing NEL pipeline (see T257527). Configure a long TTL (the policy's max_age) so that clients still hold the policy, and therefore still report, when a failure occurs.
- Write and deploy some client-side JS that (with a small probability on each pageview?) might fetch a small piece of content from each of our edge sites on those domains. (It's possible we could reuse some parts of Probnik for this, although the tool itself seems to be stagnant since 2019.)
- Complete T304373 so that NEL data is available in Analytics
- Design and implement a pipeline in Analytics that aggregates NEL reports, taking into account how many samples we get from each network, decaying the weight of older samples, etc., to generate something like our geo-maps file (but for networks, not countries). The file format is TBD; it might be best to use something like mmdbwriter, for instance.
- Use that file to serve GeoDNS responses, after some careful evaluation of the impact of this change (probably with a few iterations of improvement/fixes)
- Later on: use Alt-Svc to 'solve' the cases where the resolver's location is a very poor proxy for the user's location (see T208242)
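The aggregation step is the least specified part of this proposal, so here is a rough sketch of one way it could work. It uses an exponentially decayed weighted mean of latency per (network, site) pair and drops pairs with too little recent data. Everything in it is an assumption for discussion: the `Sample` fields, the `half_life_days` and `min_weight` parameters, keying networks by prefix, and treating a NEL report's elapsed time as the latency signal.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    network: str       # client network key, e.g. a prefix or ASN (hypothetical)
    site: str          # edge site that was probed
    latency_ms: float  # latency signal, e.g. elapsed time from a NEL report
    age_days: float    # age of the sample at aggregation time

def best_site_per_network(samples, half_life_days=7.0, min_weight=3.0):
    """Return {network: site} using decayed mean latency per (network, site)."""
    # (network, site) -> [sum of weight * latency, sum of weight]
    totals = defaultdict(lambda: [0.0, 0.0])
    for s in samples:
        # A sample loses half its weight every half_life_days.
        weight = 0.5 ** (s.age_days / half_life_days)
        acc = totals[(s.network, s.site)]
        acc[0] += weight * s.latency_ms
        acc[1] += weight

    best = {}
    for (network, site), (weighted_sum, weight_sum) in totals.items():
        # Skip pairs without enough (recent) evidence to trust the mean.
        if weight_sum < min_weight:
            continue
        mean = weighted_sum / weight_sum
        if network not in best or mean < best[network][1]:
            best[network] = (site, mean)
    return {network: site for network, (site, _mean) in best.items()}

samples = (
    [Sample("198.51.100.0/24", "ulsfo", 20.0, 0.0)] * 4
    + [Sample("198.51.100.0/24", "eqiad", 80.0, 0.0)] * 4
    + [Sample("203.0.113.0/24", "esams", 15.0, 0.0)]  # too few samples
)
print(best_site_per_network(samples))
```

Networks that fail the `min_weight` check are simply absent from the result, so the GeoDNS side would need a fallback for them (presumably the existing geographic mapping).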