GeoIP mapping experiments
Open, Needs Triage, Public

Description

Overview

The overall goal of the project: gather latency data directly from end-users to help improve our mapping of users to CDN sites.

Goals:
  • Quickly build & deploy an experimental infrastructure for collecting real-user latency measurements towards all CDN sites
    • Should also be easy to replace or tear down
  • Measure latency on small fetches only -- no large object fetches, bandwidth testing, etc
    • Directly reporting network RTT not necessary, as long as whatever we do measure is well-correlated with RTT and/or the overall "user experience" of using the site
  • Try out a few different reporting mechanisms (at least in the early stages)
  • Flexibility in the system's choice of when to take measurements
    • Uniform sampling across all users is not ideal:
      • we believe that regions/networks with small # of users correlate with lower-quality GeoIP & RIPE Atlas data
      • setting a uniform sampling rate high enough to capture those small regions/networks will mean collecting far too many datapoints from larger networks
  • Ideal end goal (somewhat stretchy): gather some actual data, generate a report of where our current mapping could be most improved
Non-goals:
  • A ready-for-full-scale measurement & reporting system
  • A completed, productionized pipeline for synthesizing latency measurements into GeoIP/GeoDNS mappings
    • This includes not worrying about the indirection between end users and resolvers

Rough plan of work

  • Configure our CDN to allow measurements to be made
  • Evaluate / experiment with possible results reporting mechanisms
    • NEL success_fraction on the special measurement domains
    • Probnik, or similar bespoke JS
  • Build a mechanism for triggering measurement collection in the background of regular wiki pageviews
    • Likely to be JS code within Mediawiki
    • Seems desirable to allow the traffic stack to choose when to trigger a fetch, as we have easy access to GeoIP information there; this also decouples any triggering rules from Mediawiki code deploys
      • Easiest and simplest communication method from CDN to JS: cookies (a rough sketch of the JS side follows after this list)
      • Set a very-low-TTL (1 minute?) cookie to cause a measurement to happen in this session
      • Set a longer-TTL (2 days? 1 week?) cookie to inhibit measurements for the near future, to avoid any one user repeatedly incurring that cost
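As a rough sketch of the JS side of this, assuming hypothetical cookie names (NetProbeTrigger / NetProbeInhibit) set by the traffic layer, and a placeholder entry point for the actual measurement code:

// Cookie names are placeholders; the real names would be chosen by the traffic layer.
function hasCookie( name ) {
  return document.cookie.split( '; ' ).some( function ( c ) {
    return c.indexOf( name + '=' ) === 0;
  } );
}

// Measure only when the CDN asked for it (short-TTL trigger cookie) and the user
// hasn't been probed recently (longer-TTL inhibit cookie).
if ( hasCookie( 'NetProbeTrigger' ) && !hasCookie( 'NetProbeInhibit' ) ) {
  runNetworkProbes(); // placeholder for the measurement code itself
}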

Related Objects

Event Timeline

Hi @CDanis, can you please associate one or more active project tags with this task (via the Add Action...Change Project Tags dropdown)? That will allow others to see a task when looking at project workboards or searching for tasks in certain projects, and get notified about a task when watching a related project tag. Thanks!

Hi, @CDanis. Thanks for creating this ticket. Would you mind expanding on the nature of the report? Thanks!

Brett, this project, which Jameel is working on for his internship, is to collect latency data from users to all of our DCs. This will help improve our current GeoIP-only mapping.

This is the main tracking task and we will expand it as the project progresses; you can ignore it for now.

Thanks for that, @ayounsi! Are you aware of https://gerrit.wikimedia.org/g/operations/software/latency-measurement ? It may or may not be relevant but I wanted to make sure it wasn't forgotten. :)

Yes, we haven't forgotten :)

In any productionized version of the GeoIP mapper pipeline it's likely we'd also use RIPE Atlas data. However the focus of this task is enabling experiments with latency data sourced from real users.

I'll update this ticket with more detail on the overall plan this week.

Krinkle awarded a token.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle added a subscriber: Krinkle.

(I'm responding here to an email sent to the Performance Team.)

This is an exciting project to see happen. We love measuring stuff and are happy to guide the integration bits with MediaWiki. I'll provide some feedback here on what I see in progress so far, and offer some possible next steps and avenues that may be worth exploring.

Measurement domain

From the task description:

"Create and deploy per-CDN-site DNS domains", T332025.

Makes sense to me overall.

I see you're going with tying it to the connection for upload-lb (from where we fetch thumbnails) rather than text-lb (from where we serve the canonical domains with their HTML, CSS/JS, and API routes). The task mentions load considerations on our cache proxy infra. I can't speak to that; I imagine upload-lb is the right call for those reasons.

One point I would add, which may be worth weighing if you haven't already, is that by tying it to an existing connection (as opposed to neither of these, via a new IP+cert), you should expect some amount of network delay from head-of-line blocking. Browsers generally share the underlying TLS+HTTP2 connection for efficiency, even across multiple domains, if they share the same IP and cert.

That may be fine at first glance. I expect the request/response messages from the pageview process to be similar across regions and population groups. To avoid bias from what you measure on a specific person's device, it'd be important to shuffle the order, so that whichever direction the contention is more prominent in (my guess is that the first measurement probably gets more contention from chatter with the main process, but even if it's the other way around), shuffling accounts for it.

What's harder to account for is the contention with the measurement for the DC that the user is already assigned to for their pageview. Even after priming the DNS cache and warming up the TLS/HTTP2 connection for the other DCs, it will remain the case that their generally assigned DC will have other stuff on the connection whereas the others don't. This could cause a bias such that the theoretical "true" second-best option might come out on top. Having a separate IP and cert would not remove contention entirely, but it would reduce it to general (unbiased) bandwidth contention that affects all measurements equally.

Collecting measurements

We usually collect measurements in JavaScript, and send beacons either to our very lightweight StatsV endpoint (e.g. using a few histogram buckets and incrementing the counts), which ends up straight in Graphite/Grafana, or using an EventGate schema, which requires a bit of processing and can then either be sent on for long-term continuity and public visualisation through Prometheus/Grafana, or, for one-off analysis, queried via Hadoop/Hive (CLI) or Turnilo.

From the task description:

"NEL success_fraction on the special measurement domains", T334608

Clever! It's great when declarative/native reporting can be used, avoiding the need to develop procedural instrumentation. It's not often that this is an option on the web platform; NEL (and maybe WebKit's Click Attribution experiment) is one of the very few cases where this works today. I didn't know that the W3C Network Error Logging API (NEL) provided a success_fraction feature. If that works, it could be ideal here.
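For reference, success_fraction lives in the NEL policy header that the measurement domains would send alongside a Report-To group; something like the following (the group name and intake URL here are made-up placeholders):

Report-To: { "group": "network-probes", "max_age": 86400,
             "endpoints": [ { "url": "https://intake.example.wikimedia.org/nel" } ] }
NEL: { "report_to": "network-probes", "max_age": 86400,
      "success_fraction": 1.0, "failure_fraction": 1.0 }

With success_fraction set, the browser samples that fraction of successful requests to the origin and reports them to the configured endpoint, alongside failures.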

While the spec describes what it should report, it looked ambiguous what it would report for a "success". To test this empirically I opened a Glitch (source) that sends to https://public.requestbin.com/r/enacmey4vt9b/.

From a quick glance at the data it produces, one concern may be that its timing information is not very detailed. The output seems limited to the body.elapsed_time attribute. Are there more timings we can gather with NEL? Maybe through some header configuration?

I believe NEL's elapsed_time is the time from fetchStart until the point in time where the "error" (or I guess in this case, success) is recorded, which is useful in relation to user experience, i.e. how long from when a user initiated a link navigation until it visibly failed, no matter where in the stack the specific timeout/error happened.

Another option might be to use the Resource Timing API. As I didn't know about NEL's "success" feature, this tends to be my go-to recommendation for measurements like this. It works like this:

var url = 'https://measure-eqiad.wikimedia.org/measure';
fetch( url ).then( resp => resp.text() ).finally( function () {
  var entry = performance.getEntriesByType( 'resource' ).find( res => res.name === url );
  console.log( entry );
} );
//> PerformanceResourceTimingPrototype {
//>   nextHopProtocol:  "h2",
//>   fetchStart:        61811,
//>   domainLookupStart: 61815, domainLookupEnd: 61952, # DNS
//>   connectStart:      61952, connectEnd:      62126, # TCP + TLS
//>   responseStart:     62212                           # HTTP time to first byte
//> }

When we decide to take measurements against these domains, we'll serially fetch each site URL 3-5 times. This will provide us with some measurements that reuse the TCP + TLS sessions, giving us a truer estimate of RTT, and removing bias towards the primary CDN site (which the browser likely already has an existing session with).

With Resource Timing you can exclude DNS, e.g. responseStart - connectStart would give you the TCP/TLS/HTTP roundtrip time for a cold start after DNS. Or even connectStart -> connectEnd if we don't need to measure HTTP specifically and the connection itself suffices. I guess that would suffice, but TCP+HTTP might be more representative of what we care about, which is browsers making HTTP requests for pageviews.
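As a sketch of the deltas one could pull out of each Resource Timing entry (which one we actually keep is an open choice; note the measurement responses would need a Timing-Allow-Origin header for these attributes to be exposed cross-origin):

// Candidate metrics from one PerformanceResourceTiming entry. On a reused
// connection the domainLookup*/connect* timestamps coincide, so DNS/connect
// deltas are only meaningful on the first probe to each site.
function probeMetrics( entry ) {
  return {
    dns: entry.domainLookupEnd - entry.domainLookupStart,   // DNS lookup
    connect: entry.connectEnd - entry.connectStart,         // TCP + TLS setup
    coldTtfb: entry.responseStart - entry.connectStart,     // TCP/TLS/HTTP after DNS, cold start
    warmTtfb: entry.responseStart - entry.requestStart      // HTTP round trip on a warm connection
  };
}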

If you go with this approach, I'd recommend distributing the JavaScript code via Extension:WikimediaEvents which is already enqueued on all pageviews. This is important as otherwise the new code would incur additional startup manifest overhead (performance guide), and would take 7-14 days to propagate to all pageviews. By adding it to WikimediaEvents the code is live globally within 5 minutes (RL architecture). You'll be limiting the instrumentation with sampling/conditionals, but the code will be live.

Feel free to tag me in code review on WikimediaEvents changes.

Storing measurements

If you go with NEL, I guess it'd naturally go through EventGate->Kafka and from there to Logstash. I don't know if NEL already goes to something more easily queryable in aggregate yet; possibly Hadoop could be plugged in there, if it isn't already, to e.g. perform Hive queries to calculate percentiles by region or visualise in Turnilo.

If you go with the JS approach, I'd recommend using StatsV for ingest. That could look something like this (documentation). I'm normally London-based, but visiting California as you can see:

// ulsfo 68
// codfw 183
// eqiad 274
// esams 500
// drmrs 549
// eqsin 570
var duration = 274;
mw.track( 'counter.geoip_mapping.eqiad.NL.le_300', 1 ); // 250-300ms bucket

This would then be immediately queryable in Grafana by DC and country code, where you could stack or compare bucket sizes by hour, day, week, etc. Our Graphite server has its retention/aggregation rules set up such that counts remain reliable over large periods of time.
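As a rough sketch of how that could be computed at probe time (bucket boundaries, the metric naming, and reading the country code from the GeoIP cookie are all assumptions for illustration, not settled choices):

// Illustrative only: bucket boundaries, metric layout, and the GeoIP cookie format
// (country code as the first ':'-separated field) are assumptions.
function reportToStatsv( dc, durationMs ) {
  var buckets = [ 50, 100, 200, 300, 500, 1000 ];
  var bucket = buckets.find( function ( b ) { return durationMs <= b; } );
  var label = bucket ? 'le_' + bucket : 'gt_1000';
  var geo = ( document.cookie.match( /(?:^|;\s*)GeoIP=([^;]*)/ ) || [] )[ 1 ] || '';
  var country = geo.split( ':' )[ 0 ] || 'unknown';
  mw.track( 'counter.geoip_mapping.' + dc + '.' + country + '.' + label, 1 );
}

reportToStatsv( 'eqiad', 274 ); // increments counter.geoip_mapping.eqiad.<CC>.le_300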

If you anticipate more complex analysis needs, then for the JS approach you'd want to go with the Event Platform (create an event schema plus EventStreams config), and send the beacon using mw.eventLog.logEvent (docs). It would then end up in Kafka and Hadoop, from where you could compute aggregates with Hive SQL or something like that.
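For that route, a minimal sketch of the beacon call (the schema name and fields here are hypothetical placeholders, purely to show the shape of the call):

// Hypothetical schema name and fields; the real schema would be defined on the Event Platform.
mw.eventLog.logEvent( 'NetworkProbe', {
  dc: 'eqiad',
  duration_ms: 274,
  probe_number: 2,   // e.g. the 2nd of several serial probes, with the 1st discarded
  protocol: 'h2'
} );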

(I'm responding here to an email sent to the Performance Team.)

This is an exciting project to see happen. We love measuring stuff and are happy to guide the integration bits with MediaWiki. I'll provide some feedback here on what I see in progress so far, and offer some possible next steps and avenues that may be worth exploring.

Thanks so much for the very detailed reply and the excitement ❤

Measurement domain

From the task description:

"Create and deploy per-CDN-site DNS domains", T332025.

Makes sense to me overall.

I see you're going with tying it to the connection for upload-lb (from where we fetch thumbnails) rather than text-lb (from where we serve the canonical domains with their HTML, CSS/JS, and API routes). The task mentions load considerations on our cache proxy infra. I can't speak to that; I imagine upload-lb is the right call for those reasons.

One point I would add, which may be worth weighing if you haven't already, is that by tying it to an existing connection (as opposed to neither of these, via a new IP+cert), you should expect some amount of network delay from head-of-line blocking. Browsers generally share the underlying TLS+HTTP2 connection for efficiency, even across multiple domains, if they share the same IP and cert.

That may be fine at first glance. I expect the request/response messages from the pageview process to be similar across regions and population groups. To avoid bias from what you measure on a specific person's device, it'd be important to shuffle the order, so that whichever direction the contention is more prominent in (my guess is that the first measurement probably gets more contention from chatter with the main process, but even if it's the other way around), shuffling accounts for it.

What's harder to account for is the contention with the measurement for the DC that the user is already assigned to for their pageview. Even after priming the DNS cache and warming up the TLS/HTTP2 connection for the other DCs, it will remain the case that their generally assigned DC will have other stuff on the connection whereas the others don't. This could cause a bias such that the theoretical "true" second-best option might come out on top. Having a separate IP and cert would not remove contention entirely, but it would reduce it to general (unbiased) bandwidth contention that affects all measurements equally.

All good points.

At least for this initial version of the experiment we explicitly don't want to provision a new IP or get new certs; the former especially is a bunch of work.

So, how does this proposed measurement scheduling sound:

  • Wait until after window.onload, so all resources should have been fetched
  • After that, do several measurements serially against each CDN site (possibly in shuffled order), throwing away the results of the first probe against each site, with a short delay between probes (sketched below).

I think this should remove both the bias towards the primary site (because of the already-existing TLS session) and the bias away from it (because of queuing on that same session).

Additionally, I think this will minimize user impact on low-end devices, where additional TLS session negotiations might be impactful?
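A minimal sketch of that scheduling, following the measure-<site> domain naming from the earlier example (the site list, probe count, and delay are placeholder values, and recordMeasurement() stands in for the actual Resource Timing / reporting code):

// Shuffle the site order, probe each site serially, and discard the first result
// per site (it pays the DNS/TLS setup cost); values below are placeholders.
var sites = [ 'eqiad', 'codfw', 'ulsfo', 'eqsin', 'esams', 'drmrs' ];

function delay( ms ) {
  return new Promise( function ( resolve ) { setTimeout( resolve, ms ); } );
}

window.addEventListener( 'load', async function () {
  sites.sort( function () { return Math.random() - 0.5; } ); // crude shuffle, fine for a sketch
  for ( var site of sites ) {
    for ( var i = 0; i < 4; i++ ) {
      var url = 'https://measure-' + site + '.wikimedia.org/measure?' + Date.now();
      await fetch( url, { cache: 'no-store' } ).catch( function () {} );
      if ( i > 0 ) {
        recordMeasurement( site, url ); // look up the Resource Timing entry for this URL and report it
      }
      await delay( 200 );
    }
  }
} );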

Collecting measurements

We usually collect measurements in JavaScript, and send beacons either to our very lightweight StatsV endpoint (e.g. using a few histogram buckets and incrementing the counts), which ends up straight in Graphite/Grafana, or using an EventGate schema, which requires a bit of processing and can then either be sent on for long-term continuity and public visualisation through Prometheus/Grafana, or, for one-off analysis, queried via Hadoop/Hive (CLI) or Turnilo.

From the task description:

"NEL success_fraction on the special measurement domains", T334608

Clever! It's great when declarative/native reporting can be used, avoiding the need to develop procedural instrumentation. It's not often that this is an option on the web platform; NEL (and maybe WebKit's Click Attribution experiment) is one of the very few cases where this works today. I didn't know that the W3C Network Error Logging API (NEL) provided a success_fraction feature. If that works, it could be ideal here.

While the spec describes what it should report, it looked ambiguous what it would report for a "success". To test this empirically I opened a Glitch (source) that sends to https://public.requestbin.com/r/enacmey4vt9b/.

From a quick glance at the data it produces, one concern may be that its timing information is not very detailed. The output seems limited to the body.elapsed_time attribute. Are there more timings we can gather with NEL? Maybe through some header configuration?

I believe NEL's elapsed_time is the time from fetchStart until the point in time where the "error" (or I guess in this case, success) is recorded, which is useful in relation to user experience, i.e. how long from when a user initiated a link navigation until it visibly failed, no matter where in the stack the specific timeout/error happened.

This is my understanding as well.

Another option might be to use the Resource Timing API

Yeah, we've decided for now to experiment with both reporting approaches. Getting the NEL data is "free", and it will be interesting to compare it with the Resource Timing data.

When we decide to take measurements against these domains, we'll serially fetch each site URL 3-5 times. This will provide us with some measurements that reuse the TCP + TLS sessions, giving us a truer estimate of RTT, and removing bias towards the primary CDN site (which the browser likely already has an existing session with).

With Resource Timing you can exclude DNS, e.g. responseStart - connectStart would give you the TCP/TLS/HTTP roundtrip time for a cold start after DNS. Or even connectStart -> connectEnd if we don't need to measure HTTP specifically and the connection itself suffices. I guess that would suffice, but TCP+HTTP might be more representative of what we care about, which is browsers making HTTP requests for pageviews.

If you go with this approach, I'd recommend distributing the JavaScript code via Extension:WikimediaEvents which is already enqueued on all pageviews. This is important as otherwise the new code would incur additional startup manifest overhead (performance guide), and would take 7-14 days to propagate to all pageviews. By adding it to WikimediaEvents the code is live globally within 5 minutes (RL architecture). You'll be limiting the instrumentation with sampling/conditionals, but the code will be live.

Feel free to tag me in code review on WikimediaEvents changes.

Very helpful advice, thanks!

Storing measurements

If you go with NEL, I guess it'd naturally go through EventGate->Kafka and from there to Logstash. I don't know if NEL already goes to something more easily queryable in aggregate yet; possibly Hadoop could be plugged in there, if it isn't already, to e.g. perform Hive queries to calculate percentiles by region or visualise in Turnilo.

See also T304373: Also intake Network Error Logging events into the Analytics Data Lake :)

If you go with the JS approach, I'd recommend using StatsV for ingest. That could look something like this (documentation). I'm normally London-based, but visiting California as you can see:

// ulsfo 68
// codfw 183
// eqiad 274
// esams 500
// drmrs 549
// eqsin 570
var duration = 274;
mw.track( 'counter.geoip_mapping.eqiad.NL.le_300', 1 ); // 250-300ms bucket

This would then be immediately queryable in Grafana by DC and country code, where you could stack or compare bucket sizes by hour, day, week, etc. Our Graphite server has its retention/aggregation rules set up such that counts remain reliable over large periods of time.

Hadn't considered using statsv but it does look handy for a quick analysis. Would the JS code extract the user's country code from our GeoIP cookie?

If you anticipate more complex analysis needs, then for the JS approach you'd want to go with the Event Platform (create an event schema plus EventStreams config), and send the beacon using mw.eventLog.logEvent (docs). It would then end up in Kafka and Hadoop, from where you could compute aggregates with Hive SQL or something like that.

Yeah we're also preparing a schema for EventGate :)
It seems very likely that (at least in the long run) we'll want to run complicated queries using Pyspark or similar, and then (very very eventually!) have a whole pipeline running there to then generate a map we can feed back into gdnsd.

First of all, thank you Timo and Chris for the detailed information.

Measurement domain

  • The shuffling of the targets/domains is implemented in the Probnik demo (here) as well as in the custom library (here) that we created.
  • As Chris mentioned, we are thinking of discarding the results of the first request to each DC to remove the bias towards the currently assigned DC. But as you mentioned, having a separate IP + cert would give unbiased results. For the separate cert, I think it won't be difficult, but the separate IP will need some work; maybe setting up a new server with a dedicated public IP at each DC. Therefore we may not provision new IPs for the initial version.
  • We were also comparing serial vs parallel probes to the DCs. As Chris mentioned, we can go with serial probes to minimize user impact on low-end devices.

Collecting measurements

  • You are right, using NEL success_fraction would be an out-of-the-box solution. But we would have very little data to work with (only elapsed_time, which I guess is the same as the duration attribute returned by the Performance API in the success case). And I don't think we can make the browser send more detailed information via NEL.
  • I am also somewhat inclined towards using the Resource Timing API because it provides detailed timing data. I think we should use responseStart - requestStart because it will give us consistent results across multiple requests. connectEnd - connectStart will give a result only for the first request; for subsequent requests it will return zero because the TCP connection will be reused. The same goes for TCP+HTTP: we will be able to capture TCP timing for the first request only. And due to HTTP/2 coalescing, the TCP timing for the primary DC may be biased.
  • This is also mentioned in the doc I shared with you.

Storing measurements

  • I like the idea of using buckets in StatsV, but we are currently working on ingesting data using the Event Platform. We have already created a schema for it (here).

It's awesome to see this moving along! One minor point:

This would then be immediately queryable in Grafana by DC and country code, where you could stack or compare bucket sizes by hour, day, week, etc. Our Graphite server has its retention/aggregation rules set up such that counts remain reliable over large periods of time.

Hadn't considered using statsv but it does look handy for a quick analysis. Would the JS code extract the user's country code from our GeoIP cookie?

While even this data would be interesting to look at in the short term, in the long term we really want the measurement data to be aggregated per IP network (/24 for IPv4, /48 for IPv6, as these are the minimum route sizes), rather than per country.
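A small sketch of that aggregation key (kept in JavaScript to match the other examples here, though in practice this would live in the analysis pipeline; it is simplified and ignores corner cases like IPv4-mapped IPv6 addresses):

// Aggregate a client IP to the minimum announceable route size:
// /24 for IPv4, /48 for IPv6. Simplified sketch, not a full IP parser.
function prefixKey( ip ) {
  if ( ip.indexOf( ':' ) !== -1 ) {
    // IPv6: expand '::' just enough to recover the first three 16-bit groups.
    var halves = ip.split( '::' );
    var head = halves[ 0 ] ? halves[ 0 ].split( ':' ) : [];
    var tail = halves[ 1 ] ? halves[ 1 ].split( ':' ) : [];
    while ( head.length + tail.length < 8 ) {
      head.push( '0' );
    }
    return head.concat( tail ).slice( 0, 3 ).join( ':' ) + '::/48';
  }
  // IPv4: keep the first three octets.
  return ip.split( '.' ).slice( 0, 3 ).join( '.' ) + '.0/24';
}

// prefixKey( '198.51.100.23' )     -> '198.51.100.0/24'
// prefixKey( '2001:db8:abcd::1' )  -> '2001:db8:abcd::/48'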

Change 927238 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/mediawiki-config@master] Enable user network probe events

https://gerrit.wikimedia.org/r/927238

Change 927238 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable user network probe events

https://gerrit.wikimedia.org/r/927238

Mentioned in SAL (#wikimedia-operations) [2023-06-05T17:28:14Z] <cdanis@deploy1002> Started scap: Backport for [[gerrit:927238|Enable user network probe events (T332024)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-05T17:30:01Z] <cdanis@deploy1002> cdanis: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-05T17:38:16Z] <cdanis@deploy1002> Finished scap: Backport for [[gerrit:927238|Enable user network probe events (T332024)]] (duration: 10m 02s)

Change 927647 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion

https://gerrit.wikimedia.org/r/927647

Change 927647 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - development.network.probe- disable canary events and hadoop ingestion

https://gerrit.wikimedia.org/r/927647

Temporarily disabling hadoop ingestion and canary events. The ctx field needs a schema.

Mentioned in SAL (#wikimedia-operations) [2023-06-06T13:06:09Z] <otto@deploy1002> Synchronized wmf-config/ext-EventStreamConfig.php: EventStreamConfig - Disable canary events and hadoop ingestion for development.network.probe - T332024 (duration: 07m 17s)

Mentioned in SAL (#wikimedia-analytics) [2023-06-06T13:09:16Z] <ottomata> EventStreamConfig - temporarily Disable canary events and hadoop ingestion for development.network.probe stream - T332024

Change 927669 had a related patch set uploaded (by Jameel Kaisar; author: Jameel Kaisar):

[schemas/event/primary@master] Fix: Add schema to ctx field of network probe schema

https://gerrit.wikimedia.org/r/927669

Change 927669 merged by jenkins-bot:

[schemas/event/primary@master] Fix: Add schema to ctx field of network probe schema

https://gerrit.wikimedia.org/r/927669

Change 928116 had a related patch set uploaded (by Jameel Kaisar; author: Jameel Kaisar):

[operations/puppet@production] GeoIP experiments: Stop Network Probes

https://gerrit.wikimedia.org/r/928116

Change 928117 had a related patch set uploaded (by Jameel Kaisar; author: Jameel Kaisar):

[operations/puppet@production] GeoIP experiments: Stop NEL Success Reports

https://gerrit.wikimedia.org/r/928117