
Compare actual webrequest traffic with traffic that was in-sample for the A/A test
Closed, Resolved · Public · 2 Estimated Story Points

Description

TL;DR: 0.88% over two small tests; more analysis coming.

According to this graph, we were fully sampling traffic during these two periods:

From 2025-06-02 14:27:00 to 2025-06-02 14:37:30 we got 210127 - 51952 = 158175 abtested requests.
From 2025-06-02 15:30:00 to 2025-06-02 15:31:30 we got 265998 - 245296 = 20702 abtested requests.

To get the number of abtested requests, we have to edit the grafana graph to show the absolute value of that metric's counter, then subtract the count at the start time from the count at the end time. We also have to take site out of the breakdown so we get the sum across all data centers.
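
The counter-delta method above is just subtraction; as a quick arithmetic sketch (counter values as read off the edited Grafana panel):

```python
# Abtested-request count for a window = counter at window end - counter at window start.
windows = {
    "14:27:00-14:37:30": (51952, 210127),   # (start counter, end counter)
    "15:30:00-15:31:30": (245296, 265998),
}
counts = {w: end - start for w, (start, end) in windows.items()}
print(counts)                # {'14:27:00-14:37:30': 158175, '15:30:00-15:31:30': 20702}
print(sum(counts.values()))  # 178877
```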

English Wikipedia was the only wiki with sampling set up, and it was configured to enroll 1% of all requests. The number of webrequests to enwiki can be found in the wmf.webrequest table; see the comments for the analysis.

Event Timeline

Made a table for others to peek at:

 create table sample_webrequest_for_xlab_2025_06_02 as
(select * from wmf.webrequest
  where (   dt between '2025-06-02T14:27:00' and '2025-06-02T14:37:30'
        or  dt between '2025-06-02T15:30:00' and '2025-06-02T15:31:30')
    and normalized_host.project = 'en'
    and normalized_host.project_family = 'wikipedia'
    and year=2025 and month=6 and day=2 and hour in (14, 15)
);

(had to use 16G executors... weird)

This table has 33624340 rows in it... which translates to about 50k requests / second. That sounds roughly correct.
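
A back-of-the-envelope check on that rate (assuming the two windows above: 10.5 minutes plus 1.5 minutes, 720 seconds of traffic in total):

```python
rows = 33_624_340                     # rows in sample_webrequest_for_xlab_2025_06_02
window_seconds = (10.5 + 1.5) * 60    # 14:27:00-14:37:30 plus 15:30:00-15:31:30
rate = rows / window_seconds
print(f"{rate:,.0f} req/s")           # ~46,700 req/s, i.e. roughly 50k
```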

(EDIT: this was based on a bad reading of the graph and is now wrong: in this very rough look, we were hoping for 1% and actually got something like 0.006%.)

That doesn't sound right to me.

I'll quote here for posterity and convergence, what I said in the Slack thread for an initial estimate:

Hot take / rough math on the ~1% abtest run that took place at ~15:30 UTC and how the stats look to me:
If I look at the "abtested requests" grafana graph, and sum the ~15:30 peak across all 7x sites (by editing the panel to remove the site selector and just grab all in aggregate), I get a global peak of 262 abtested reqs/sec during that event (2 minute averaging making it a little fuzzy, but still).
If I look at Turnilo's "Webrequest sampled live" at 15:30, and filter for text cluster and URI host en.wikipedia.org I see a 1-minute hits value at 13.4K.
Doing the math to turn this into reqs/sec (/60 * 128), I get ~28.6K total rps during that minute for enwiki webrequest. 262 / 28600 => ~0.92% - which seems reasonably close to 1% given all the fuzziness involved in the various graphs/averaging/etc and comparing across two different accounting systems, basically. (edited)
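
The arithmetic in that quote, spelled out (the 128 factor undoes the 1:128 sampling of the sampled-webrequest stream; the 13.4K and 262 figures are read off Turnilo and Grafana respectively):

```python
turnilo_hits_per_min = 13_400                  # 1-minute hits: text cluster, en.wikipedia.org
total_rps = turnilo_hits_per_min / 60 * 128    # undo the 1:128 sampling
abtested_rps = 262                             # global peak from the abtested-requests graph
print(f"{total_rps:,.0f} rps")                 # ~28,587 rps (~28.6K)
print(f"{abtested_rps / total_rps:.2%}")       # ~0.92%
```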

But looking at the numbers you're quoting:

2025-06-02 14:27:00 to 2025-06-02 14:37:30 we got 1679 requests
2025-06-02 15:30:00 to 2025-06-02 15:31:30 we got 319 requests

That graph is giving a rate in reqs/second (averaged over 2-minute windows), not a count of requests, and (at least in the default view linked) it is only showing esams, which is 1 of 7 edge sites. If I edit the graph panel, remove the site selection (all global stats together as one), and get rid of the rate() function (just show the global counter's absolute value): the counter reads 51952 @ 14:27:00 and 210127 @ 14:37:30, for a total of 158175 tested requests during that time window. Applying the same method to 15:30:00 - 15:31:30, I get another 19488 there. By this method, I'm counting the total across those two windows as 177663.

This table has 33624340 rows in it... which translates to about 50k requests / second. That sounds roughly correct.

I think that sounds very roughly correct as a rate for en.wikipedia.org + en.m.wikipedia.org, but the latter wasn't part of the test definition that I saw. The mobile variant is worth somewhere in the ballpark of 40% of the combined rate of the two, around that timeframe.

You're absolutely right. Mistakes were made.

  1. I read the graph wrong, exactly as you detail above. This changes the count of abtested requests to exactly what you said, Brandon; of course, we're reading the same graph now. I repeated the method you used and found almost the same numbers. There was only a tiny difference on the second window, where I got 20702; these numbers are now updated in the description.
  2. I saw but didn't register that we didn't include en.m.wikipedia.org. It should have been included, and we'll have to update that part of the authority=varnish endpoint.

With those two fixes in place, the math now makes sense. For both of the tests that we ran yesterday, in the time windows specified above, the total number of abtested requests is 178877. In the same period, the count of webrequests is:

 select count(*)
   from default.sample_webrequest_for_xlab_2025_06_02
  where not array_contains(normalized_host.qualifiers, 'm');

-- NOTE: I checked and distinct qualifiers here are just [] or ['m']

20294660

So 178877 / 20294660 = 0.0088, which means 0.88%
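
As a quick sketch of that final ratio (the per-window counts from the description and the non-mobile webrequest count from the query above):

```python
abtested = 158_175 + 20_702     # the two test windows; totals 178877
webrequests = 20_294_660        # rows without the 'm' qualifier in the sample table
print(f"{abtested / webrequests:.2%}")  # 0.88%
```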

Pretty close. We're going to do a follow-up look at a longer time span and compare the events that make it through EventGate with the pageviews observed in our test window.

For our investigation later, just parking this here:

 select length(uri_query) > 0 as has_query,
        count(*)
   from pageview_actor
  where pageview_info['project'] = 'en.wikipedia'
    and access_method = 'desktop'
    and agent_type = 'user'
    and year=2025 and month=6 and day=2 and hour=11
    and is_pageview
    and not is_redirect_to_pageview
  group by length(uri_query) > 0
;

So 178877 / 20294660 = 0.0088, which means 0.88%
Pretty close.

Good news then!

Honestly, I don't expect this level of comparison to come out any better than "roughly in the right ballpark", and ~0.88% vs 1% seems close enough for a basic gut-check verification of varnish-level request stats. Even in the best case, there will be random statistical variance between the request rates of the 1% of selected agents and the average request rates of the whole population.

There are probably other, more trivial errors in play too (e.g. timing misalignment between the different record sources being compared here; the possibility that some class of internal reqs, healthchecks, etc. counts in one place and not the other; the timing of config pushes, given I think you're pushing/pulling config with a past start time and a future end time; etc.).

Some of that error would certainly be reduced by using a longer window and only checking the stats well inside the interior of the window (to ensure start/end stamps and config-push timing don't have an impact). Obviously, the analytics view of JS-submitted metrics should be a much better view of the world (as it will self-select out non-cookie-preserving bots, etc.).

Milimetric set the point value for this task to 2. Jun 12 2025, 4:36 PM