Overview
The overall goal of the project: gather latency data directly from end users to help improve our mapping of users to CDN sites.
Goals:
- Quickly build & deploy an experimental infrastructure for collecting real-user latency measurements towards all CDN sites
  - It should also be easy to replace or tear down
- Measure latency on small fetches only -- no large object fetches, bandwidth testing, etc.
- Directly reporting network RTT is not necessary, as long as whatever we do measure is well-correlated with RTT and/or the overall "user experience" of using the site
- Try out a few different reporting mechanisms (at least in the early stages)
- Flexibility in the system's choice of when to take measurements
  - Uniform sampling across all users is not ideal:
    - we believe that regions/networks with small numbers of users tend to have lower-quality GeoIP & RIPE Atlas data
    - setting a uniform sampling rate high enough to capture those small regions/networks would mean collecting far too many datapoints from larger networks
- Ideal end goal (a bit of a stretch): gather some real data and generate a report showing where our current mapping could be improved the most
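To make the non-uniform sampling idea concrete, here is a minimal sketch in Python (used only as neutral pseudocode; the real logic would live in the traffic stack or in JS). It assumes we have a rough daily pageview estimate per network or region (for example per ASN or GeoIP region) and aims for a roughly fixed number of measurements from each one, instead of a single global rate. The function name, the 100-samples-per-day target, and the traffic figures are all hypothetical.

```python
def sampling_probability(pageviews_per_day: float,
                         target_samples_per_day: float = 100.0) -> float:
    """Per-pageview probability of triggering a measurement for one network/region.

    Hypothetical policy: aim for about `target_samples_per_day` measurements
    from every network, however large or small, rather than one uniform rate.
    """
    if pageviews_per_day <= 0:
        return 1.0  # no traffic estimate: measure whenever we get the chance
    # Cap at 1.0: small networks get sampled on every pageview.
    return min(1.0, target_samples_per_day / pageviews_per_day)


if __name__ == "__main__":
    # Hypothetical daily pageview counts, for illustration only.
    traffic = {"small-network": 50, "large-network": 10_000_000}
    for name, views in traffic.items():
        p = sampling_probability(views)
        print(f"{name}: p={p:g}, ~{views * p:g} samples/day")
    # By contrast, a uniform rate high enough to fully sample the small
    # network (p=1.0) would also measure all 10M large-network pageviews.
```

Capping the probability at 1.0 is the key design choice: small networks contribute every pageview they have, while large networks are sampled just enough to reach the target.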
Non-goals:
- A ready-for-full-scale measurement & reporting system
- A completed, productionized pipeline for synthesizing latency measurements into GeoIP/GeoDNS mappings
  - This includes not worrying, for now, about the indirection between end users and their DNS resolvers
Rough plan of work
- Configure our CDN to allow measurements to be made
- Evaluate and experiment with possible mechanisms for reporting results
  - NEL (Network Error Logging) with success_fraction set on the special measurement domains
  - Probnik, or similar bespoke JS
- Build a mechanism for triggering measurement collection in the background of regular wiki pageviews
  - Likely to be JS code within MediaWiki
  - It seems desirable to let the traffic stack choose when to trigger a fetch, since GeoIP information is readily available there; this also decouples any triggering rules from MediaWiki code deploys
  - Simplest communication method from the CDN to the JS: cookies
    - Set a very-low-TTL (1 minute?) cookie to cause a measurement to happen in this session
    - Set a longer-TTL (2 days? 1 week?) cookie to inhibit measurements for the near future, so that no one user repeatedly incurs that cost
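For the NEL reporting option, a measurement domain would send a NEL policy whose success_fraction asks browsers to report a sample of successful fetches as well as failures; each such report carries an elapsed_time in milliseconds, which should correlate with RTT without being RTT itself. The Python sketch below builds the two response headers involved (NEL, plus the Report-To header that defines the endpoint group the policy names) and summarises a batch of received reports per measured hostname. The collector URL, group name, and max_age are hypothetical placeholders, and the parsing assumes the report shape described in the NEL specification (outer type "network-error", body type "ok" for successful requests).

```python
import json
import statistics
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical endpoint group and collector URL; nothing here is decided yet.
REPORT_GROUP = "latency"
COLLECTOR_URL = "https://nel-collector.example.org/report"


def nel_response_headers(success_fraction: float = 1.0,
                         max_age_s: int = 86400) -> dict[str, str]:
    """Headers a measurement domain could send so browsers also report successes."""
    report_to = {"group": REPORT_GROUP, "max_age": max_age_s,
                 "endpoints": [{"url": COLLECTOR_URL}]}
    nel = {"report_to": REPORT_GROUP, "max_age": max_age_s,
           "success_fraction": success_fraction, "failure_fraction": 1.0}
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


def median_elapsed_ms_by_host(reports: list[dict]) -> dict[str, float]:
    """Collector side: median elapsed_time (ms) of successful fetches, per hostname."""
    by_host = defaultdict(list)
    for report in reports:
        body = report.get("body", {})
        # Keep only reports about successful requests.
        if report.get("type") != "network-error" or body.get("type") != "ok":
            continue
        host = urlsplit(report["url"]).hostname
        by_host[host].append(body["elapsed_time"])
    return {host: statistics.median(times) for host, times in by_host.items()}
```

Using the median rather than the mean keeps a few very slow fetches from dominating the per-site figure.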
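The cookie handshake can be sketched as two small functions, again in Python as neutral pseudocode: one for the CDN, which decides whether to add the trigger and inhibit cookies to an HTML response, and one standing in for the page's JS, which measures only when the trigger cookie is present. The cookie names, the TTLs (taken from the tentative "1 minute?" and "2 days?" values), and setting both cookies at once are assumptions, not decisions from this plan. sample_prob stands for whatever per-request rate the traffic stack picks, for example from GeoIP data.

```python
import random
from http.cookies import SimpleCookie

# Hypothetical cookie names and TTLs.
TRIGGER_COOKIE = "latencyMeasure"
INHIBIT_COOKIE = "latencyInhibit"
TRIGGER_TTL_S = 60               # "1 minute?"
INHIBIT_TTL_S = 2 * 24 * 3600    # "2 days?"


def _set_cookie(name: str, ttl_s: int) -> str:
    # Deliberately not HttpOnly: the page's JS must read it via document.cookie.
    return f"{name}=1; Max-Age={ttl_s}; Path=/; Secure; SameSite=Lax"


def cdn_cookies(cookie_header: str, sample_prob: float,
                rng: random.Random) -> list[str]:
    """CDN side: Set-Cookie values to add to an HTML response (possibly none)."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    if INHIBIT_COOKIE in jar:
        return []  # measured recently; don't ask this user again yet
    if rng.random() >= sample_prob:
        return []  # not sampled on this request
    # Trigger a measurement now and suppress further ones for a while.
    return [_set_cookie(TRIGGER_COOKIE, TRIGGER_TTL_S),
            _set_cookie(INHIBIT_COOKIE, INHIBIT_TTL_S)]


def client_should_measure(document_cookie: str) -> bool:
    """Client side (would be JS in MediaWiki): measure only if triggered."""
    jar = SimpleCookie()
    jar.load(document_cookie)
    return TRIGGER_COOKIE in jar
```

Setting the inhibit cookie together with the trigger means a sampled user incurs at most one measurement per inhibit period, however many pages they view.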