
Identify and label prefetch proxy data in our traffic
Open, Medium, Public

Description

In June 2022, Google launched a new Private prefetch proxy feature in Chrome Mobile. We believe that this has led to an increase in automated traffic in Readers data (see T341848#9014932).
We want to find a way to identify (and then decide whether to remove) this traffic by labeling it as such in webrequest and other derived pageview tables, so that we can answer a few questions:

  • Are all the private prefetch pageviews correctly labeled as automated traffic?
  • We observed similar but slightly smaller increases in user traffic. Is some of the prefetch data being mislabeled as user traffic?
  • Or are both happening at the same time?

This is possible in part by checking the Sec-Purpose: prefetch; anonymous-client-ip request header. (T341848#9143300)

There are similar headers from other browsers, although not necessarily in the context of an intervening IP-address-shielding proxy service (for measurement purposes on standard web logs, however, they may yield similar behavior). To allow for greater coverage of the probable headers, patch activity starting in January 2024 will try to catch more of these.

To learn more about some popular browsers, start from the historical article at https://lionralfs.dev/blog/exploring-the-usage-of-prefetch-headers/ , then notice the following as well:

  • The Purpose: prefetch header is used in Safari / WebKit, at least in some contexts. It seems probable that this will be updated eventually, although it may be necessary to file a bug or ask on WebKit's Slack or similar to prompt for clarity about standardization. Note that this isn't to be confused with iCloud Private Relay (e.g., the exit IPs mentioned in an iCloud Private Relay developer-focused page); it is possible that private Safari builds take certain prefetch behavior into account, but that would require extended observation, and there are diminishing returns in looking further than exit IPs and UA.
  • Over the past year, Firefox has transitioned from X-Moz: prefetch to the standardized header with its single token, Sec-Purpose: prefetch, according to recently merged code and MDN.
  • Chrome / Chromium (also the engine for newer Edge) has a couple of apparently canonical values (Sec-Purpose: prefetch and Sec-Purpose: prefetch;anonymous-client-ip; the latter being the one that spurred this task) and another possibly employed omnibox incantation (Sec-Purpose: prefetch;prerender per code and sparse-to-empty search engine results, but also much more visibly at https://developer.chrome.com/docs/web-platform/prerender-pages). The Private prefetch proxy article mentions a header of Sec-Purpose: Prefetch; anonymous-client-ip, so we'll try to handle that in case there's some magic in the way Chrome is configured to talk to Google services like Google search...but we'll also look expressly for the lowercased value without any spaces. (A small sketch just after this list illustrates how these variants could be normalized.)
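
To make the header variants above concrete, here is a minimal Python sketch, for illustration only, of how the different prefetch-related request headers could be normalized into a single label; the label names are invented for this example and are not what any pipeline uses.

# Illustrative sketch only: maps the prefetch-related request headers listed
# above to a single label. The label names are invented for this example.
def classify_prefetch(headers):
    """Return a prefetch label for a dict of lowercased request headers, or None."""
    sec_purpose = (headers.get("sec-purpose") or "").strip().lower()
    purpose = (headers.get("purpose") or "").strip().lower()
    x_moz = (headers.get("x-moz") or "").strip().lower()

    if sec_purpose:
        # Chrome / Chromium and newer Firefox use the standardized Sec-Purpose header.
        if "anonymous-client-ip" in sec_purpose:
            return "chrome_private_prefetch_proxy"
        if "prerender" in sec_purpose:
            return "prerender"  # omnibox / prerender-pages case
        if "prefetch" in sec_purpose:
            return "prefetch"
    if purpose == "prefetch":
        return "prefetch_legacy_purpose"  # Safari / WebKit, some contexts
    if x_moz == "prefetch":
        return "prefetch_legacy_x_moz"  # older Firefox
    return None

# Example: the header value that spurred this task
print(classify_prefetch({"sec-purpose": "prefetch;anonymous-client-ip"}))
# -> chrome_private_prefetch_proxy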

Event Timeline

  • Note that we would not want to change any of our existing dimensions (like agent_type) to indicate prefetch pageviews, since this would break our reporting and have consequences for Superset dashboards. Instead, find a way to store this in an existing field or create a new field.
Mayakp.wiki changed the subtype of this task from "Spike" to "Task".Sep 20 2023, 9:04 PM
Mayakp.wiki moved this task from Incoming to Watching on the Movement-Insights board.
mforns removed mforns as the assignee of this task.Oct 2 2023, 5:40 PM
mforns updated the task description. (Show Details)

I imagine this requires:

  1. Modify the data collection pipeline (probably Varnishkafka and/or Gobblin + wmf_raw.webrequest schema) to collect the Sec-Purpose header.
  2. Modify the refine_webrequest job to either pass the header to wmf.webrequest or calculate an is_prefetch flag (with corresponding Hive schema changes)

At this point we could already start investigating, no? Or do we need to forward the field into pageview_hourly as well?
@JAllemandou @Milimetric thoughts?

If we start having data about which webrequest hits are prefetch or not, we definitely would be able to investigate! I'm in favor of moving fast and passing this header through as a new webrequest field. No change would be needed in Gobblin, only in the wmf_raw.webrequest and wmf.webrequest schemas, as well as the refine_webrequest hql to forward the field.
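
As a rough illustration of that route, a hypothetical forwarded field could be turned into an is_prefetch flag roughly as below; the sec_purpose column is invented for this sketch and does not exist in wmf.webrequest (the thread later settles on X-Analytics instead).

# Hypothetical sketch: assumes a raw `sec_purpose` string field had been
# forwarded into wmf.webrequest (it was not; the X-Analytics route was chosen).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

wr = spark.table("wmf.webrequest").where("year=2023 and month=12 and day=5")
wr = wr.withColumn(
    "is_prefetch",
    F.lower(F.coalesce(F.col("sec_purpose"), F.lit(""))).contains("prefetch"),
)
wr.groupBy("is_prefetch").count().show()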

@JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?

Would the header be translated into an x-analytics value?

@JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?

I don't think the change is complicated, but I don't feel confident doing it myself, not having touched the varnishkafka code ever. Maybe with some help from @elukey if he's not yet gone, or anyone else having already done those things.

Would the header be translated into an x-analytics value?

It's a good idea. This approach would prevent us from having to change any job on the cluster to access the data. The downside is that if there are errors it could pollute the whole x-analytics header, but I think it's worth going that way nonetheless.
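
For context, X-Analytics is a semicolon-delimited key=value request header (raw rows later in this task show values like https=1;client_port=33743;nocookies=1), so the prefetch signal would become one more key in that shared map, which is also why a malformed value could pollute the rest of it. A minimal Python parsing sketch, for illustration only:

# Minimal sketch of reading an X-Analytics value as a key=value map; the
# example value mirrors raw rows seen later in this task. A prefetch key
# (e.g. prefetch_sec_purpose, the name eventually used) would simply appear
# as one more segment.
def parse_x_analytics(value):
    result = {}
    for segment in value.split(";"):
        if "=" in segment:
            key, _, val = segment.partition("=")
            result[key.strip()] = val.strip()
    return result

print(parse_x_analytics("https=1;client_port=33743;nocookies=1"))
# {'https': '1', 'client_port': '33743', 'nocookies': '1'}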

+1 using x_analytics if possible!

Removing Data SRE in favour of Traffic

@KOfori is this something Traffic would be able to help us with? We are unfortunately neither able to make the change nor really review it, as we don't know enough about that layer.

This is the varnish code (VCL) that does analytics-y things to create and update the X-analytics header. Adding stuff here would prevent us from having to change varnishkafka. Or maybe I misunderstood the whole thing, which is always possible in Varnish land :)

@JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?

I don't think the change is complicated, but I don't feel confident doing it myself, not having touched the varnishkafka code ever. Maybe with some help from @elukey if he's not yet gone, or anyone else having already done those things.

IIUC one of the ideas is to add another request header to the webrequest format right? If so it is sufficient to modify:

https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/cache/kafka/webrequest.pp#L144

The format of that string is not very well documented, but there is a guideline in:

https://github.com/wikimedia/operations-software-varnish-varnishkafka/blob/master/varnishkafka.c#L808

There are other requests headers, for example:

%{Referer@referer}i %{User-Agent@user_agent}i %{Accept-Language@accept_language}i

So in theory it should be sufficient to add %{Sec-Purpose@sec_purpose}i somewhere in the string, depending on the position that you want this field to appear in the webrequest json event payload (most likely you don't care, so it can go anywhere). The part after the @ in the config states the name of the field in the Webrequest json payload.

Important note: since the profile::cache::kafka::webrequest's format field is not taken from hiera, once we modify it and puppet-merge it, it is deployed on all caching nodes during the subsequent puppet runs. So I'd advise testing it on a single caching node first (with the help of Traffic, or if you want I can check as well) and then applying the big change afterwards. Or we can modify the puppet class to be more customizable, not a big change either. Lemme know what you prefer (new field vs X-Analytics).
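
To visualize what those format tokens do, here is a small Python-only illustration (this is not how varnishkafka parses its config) of the header-to-field mapping described above; the only addition to the examples from the comment is the proposed Sec-Purpose token.

# Illustrative sketch only: shows how %{Header@field}i tokens in the format
# string map request headers to webrequest JSON field names.
import re

format_snippet = (
    "%{Referer@referer}i %{User-Agent@user_agent}i "
    "%{Accept-Language@accept_language}i %{Sec-Purpose@sec_purpose}i"
)

for header, field in re.findall(r"%\{([^@}]+)@([^}]+)\}i", format_snippet):
    print(f"request header {header!r} -> webrequest JSON field {field!r}")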

Change 980911 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka::webrequest: allow to customize the format

https://gerrit.wikimedia.org/r/980911

I created a tmux session on cp4038 with the following:

sudo varnishlog -n frontend -q 'ReqHeader:Sec-Purpose eq "prefetch; anonymous-client-ip"'

The above should return all the requests with the header with the value that we are looking for. I found already requests with Sec-Purpose set, but with other values, so in theory the Varnishkafka config that I wrote above should be ok.

Confirmed, I found a request with:

-   ReqHeader      sec-purpose: prefetch;anonymous-client-ip

So it is a good confirmation, I think we can probably skip https://gerrit.wikimedia.org/r/980911 and modify format directly.

Change 980912 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka::webrequest: add the Sec-Profile req header

https://gerrit.wikimedia.org/r/980912

If the final decision is to proceed with a new field in Webrequest, https://gerrit.wikimedia.org/r/c/operations/puppet/+/980912 should be sufficient :)

I like where @elukey is going with this.

I saw this ticket and thought I'd share some perspective. It may be this has all been considered (e.g., in conversations, bot detection in practice, etc.), so sorry if I'm duplicating anything!

A potential "scrappy" option while waiting for any header enrichment is to identify this traffic from fields already in the logs. This allows sidestepping an edge cache change amidst the near-holiday season (but to be clear, header enrichment is still strongly advisable, and if folks closer to this codebase are comfortable, the header enrichment is the fastest "quick" way to get a good sense of magnitude...as @elukey mentions, this can be extrapolated via Kafka, too).

The source IP list mentioned in the article, plus the browser vintage implied by the article, plus the very high likelihood of the traffic originating from a Google search engine result (although the list of sources will probably fluctuate as Google examines the offering), are of special interest (note especially the Referer and User-Agent in @elukey's comment at T346463#9387474), and there's likely to be a high degree of correlation with the ISP of "Google" (because these seemingly autogenerated lists are ASN-mapped). EDIT: n.b., additional browsers from Google now appear to be making requests via this mechanism (or similar mechanisms, at least).

To give an idea of the sorts of signals, the following simple query gives an impression of likely traffic using this prefetch behavior, coming from four Google IPs in the list linked above.

select
http_status,
http_method,
uri_host,
uri_path,
referer,
is_pageview,
user_agent_map['os_family'],
user_agent_map['browser_family'],
user_agent_map['os_major'],
user_agent_map['browser_major'],
user_agent_map['device_family'],
access_method,
referer_class,
isp_data['isp'],
tls,
webrequest_source
from
wmf.webrequest
where
year = 2023
and month = 12
and day = 5
and ip in ('72.14.201.204', '72.14.201.205', '72.14.201.206', '72.14.201.207');

Looking at wmf_raw.webrequest, I also see that traffic from this very short list of IPs in this sample query (which is by definition not the fully expanded list of IPs) doesn't appear to be making any requests other than for webpages, as far as I can tell. They don't seem to be requesting images or other subresources from what I can see.

Whether this holds for all IPs in the set I don't know, but it could become knowable. It's more expensive to reverse map IPs not making requests for additional subresources, but it's possible.
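
As one illustration of how that could become knowable, here is a hedged PySpark sketch (the content_type buckets are a simplification invented for this example) that breaks down what those same sample IPs request besides pageviews:

# Rough sketch: for the four sample Google IPs above, break requests down by
# pageview flag and a coarse resource kind derived from content_type.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sample_ips = ["72.14.201.204", "72.14.201.205", "72.14.201.206", "72.14.201.207"]

wr = (
    spark.table("wmf.webrequest")
    .where("year=2023 and month=12 and day=5")
    .where(F.col("ip").isin(sample_ips))
)

resource_kind = (
    F.when(F.col("content_type").startswith("text/html"), "html")
    .when(F.col("content_type").startswith("image/"), "image")
    .otherwise("other")
    .alias("resource_kind")
)

wr.groupBy("is_pageview", resource_kind).count().orderBy("count", ascending=False).show(truncate=False)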

We could also turn off the prefetch feature at our servers, as Google's documentation suggests, but it may take a little while for such a change to percolate, plus we could get dinged in search engine results. If we were to do this, we'd probably want to A/B test (the standard there is usually 60 days of traffic, to let things work their way through the system) to make sure we know the ramifications.

I wasn't sure if other folks had yet examined the link with IP, ISP, and Referer in other related analysis, although I saw UA mentioned as a possible consideration. I'm getting some dashboard timeouts and haven't trawled all the comments...I'm trying to get this out before I take some time off, so I'm erring on the side of sharing early without a complete check.

Now, technically the header mentioned in the article could be spoofed, however unlikely that is. So, if we want to be "sure" that it's this prefetch traffic, more fields probably need to be examined during refine anyway if we're going for correctness.

This is the varnish code (VCL) that does analytics-y things to create and update the X-analytics header.

Can we do this instead of adding a new field? It means VCL code, but I prefer that to adding new fields where possible.

Change 981352 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header

https://gerrit.wikimedia.org/r/981352

It so happens I sniped myself into looking into this after noticing a lot of google ips in trending streaming pages. Here are pyspark snippets with more results.

Main observations:

  • there are a significant number of wikipedia pageviews (~17%) from google private prefetch proxy ip addresses
    • these pageviews do not result in any other requests from the user directly to wikipedia servers (the term "prefetch" is somewhat confusing)
    • google's blogpost states: From past experiments, we know that this feature typically results in less than 2% extra requests for main resources (for example HTML documents).
    • google publishes the subnets for the ips used by google private proxy (see the IP-matching sketch after the suggestions below)
  • of the pageviews coming from google proxy ips, ~57% are classified as automated
    • this is due to the fingerprint-based mechanisms for bot detection, as the traffic comes from few ips (1639 in this analysis) and the user agents are simplified
    • this means we partially include google proxy requests in the pageview statistics, i.e. we neither exclude nor include them as a whole
  • the vast majority of google prefetches originate from chrome on Android (86%)
    • the global market share of android is 70%
    • for wmf pageviews classified as user: 64.6%
    • if we include the pageviews tagged as automated coming from google proxy servers: 69.7%
  • the hypothesis that we wrongly classify a significant number of real pageviews as automated still stands. More investigation is warranted.

Suggestions:

  • To trigger google private proxy requests and analyse the scenarios
    • Needs to happen on Android or Windows
  • Google private proxy is only used if there is no cookie, but wikipedia is using cookies
    • Why are there so many requests from Android? could it be e.g. the search bar on the android homescreen?
  • More analysis
    • what is the relationship with google referred requests? why are there so many of these from Chrome than others?
    • this analysis is based only on a single hour of pageview actor data, expand/verify
  • Google private proxy requests are not cached by google
    • If anything we are receiving more traffic as a result (e.g. as not all requests are actual pageviews)
    • Can we get more information from google about the nature of these requests? Can they tell us what percentage is humans loading a wikipedia page?
    • Can/should we disable google private proxy as an experiment?
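
Building on the observation above that Google publishes the subnets used by its private prefetch proxy, here is a hedged PySpark sketch of matching request IPs against that list. The geofeed URL appears later in this task; the sketch assumes it is a standard CSV geofeed with the IP prefix in the first column, and a Python UDF like this is fine for exploration but slow at scale.

# Hedged sketch: label webrequest rows whose client IP falls inside the
# published Chrome prefetch proxy ranges.
import ipaddress
import urllib.request

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

GEOFEED_URL = "https://www.gstatic.com/chrome/prefetchproxy/prefetch_proxy_geofeed"

with urllib.request.urlopen(GEOFEED_URL) as resp:
    lines = resp.read().decode("utf-8").splitlines()

# Assumes one prefix per line in the first CSV column, comments starting with '#'.
networks = [
    ipaddress.ip_network(line.split(",")[0].strip(), strict=False)
    for line in lines
    if line.strip() and not line.startswith("#")
]

def in_prefetch_proxy(ip):
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(addr in net for net in networks)

spark = SparkSession.builder.getOrCreate()
is_proxy_ip = F.udf(in_prefetch_proxy, BooleanType())

wr = spark.table("wmf.webrequest").where("year=2024 and month=1 and day=18 and hour=16")
wr.groupBy(is_proxy_ip("ip").alias("from_prefetch_proxy"), "agent_type").count().show()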

VCL patch submitted by @Ottomata (https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352) looks good to me, and @elukey's CR to add a new WebRequest field also looks good (https://gerrit.wikimedia.org/r/c/operations/puppet/+/980912). In terms of deployment, updating the VCL is slightly easier than updating the varnishkafka JSON format, as VCL updates don't require restarting varnish while the varnishkafka change requires a varnishkafka restart to be applied.

Either approach seems fine to me and I don't have strong opinions on which is better. I have +1d the VCL change based on @Ottomata's preference expressed in T346463#9390648.

The only thing I would point out is that we will need to update https://wikitech.wikimedia.org/wiki/X-Analytics as well, with reference to this change.

updating the varnishkafka JSON format

Also, this would require schema changes in Hive.

So ya let's go with VCL!

Change 980911 abandoned by Elukey:

[operations/puppet@production] profile::cache::kafka::webrequest: allow to customize the format

Reason:

https://gerrit.wikimedia.org/r/980911

Hi DPE team, can you please let me know the status of this request? I was not able to get any results for Sec-Purpose: Prefetch; anonymous-client-ip when querying the x_analytics field.
And a follow-up question: can we get this added as a flag or new column to the readers tables derived from webrequest (pageview_hourly and pageview_daily) so it's easier for us to query and visualize? Please let me know if this requires a new phab task.

@Mayakp.wiki the patch to watch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352/. This has not yet been merged and deployed. When it is, you'll start seeing the changes in x_analytics.

IIRC, the decision was to wait until the new year, so as not to risk a mistake while people were out on holidays.

I'm about to go out on sabbatical and parent leave for a while, so we probably should get someone else to deploy this. @BTullis @Gehel ?

I'm scheduling time with @Mayakp.wiki and @MGerlach to discuss potential future use cases soon, but if folks familiar with VCL could give the latest version of the patch a look, it'd be appreciated. I updated the Description a bit to note some additional considerations - it dawned on me we ought to capture some different browsers so that we can hopefully cover the broad majority of browser prefetch, irrespective of the use of an intervening proxy architecture. This doesn't solve the difficulty of properly classifying whether a pageview actually materialized in the user's browser (e.g., from the cache of a prefetched resource), but it hopefully provides more coverage in case we are curious about different browser vintages.

@Vgutierrez @BTullis https://gerrit.wikimedia.org/r/c/operations/puppet/+/981352 is ready for review. Would it be possible to review and arrange for a deployment next week?

In the meeting with @Mayakp.wiki and @MGerlach today we discussed the approach. You'll see the gist of the code changes since Andrew's version reflected in the updated task Description (basically, making it possible to better understand prefetch across the multiple major browsers). The extra piece I just added several minutes ago was one additional X-Analytics key to denote the "version" of the Chrome private prefetch proxy header: although UA is a signal we can use, we expect varying approaches from Chrome / Google in the future and hope to be able to quickly differentiate them as a separate dimension.

In the Android emulator, it's possible to make Chrome initiate this sort of request. Unfortunately, it seems that in the emulator there may be some lower-level networking issue, at least from a Mac (Intel-based in this case), because it shows a network-related error, and a click on the article title from a Google SERP issues a plain GET not loaded from cache, based on a quick look with DevTools and kafkacat. I somewhat strongly expect that the network-related error would not be present (i.e., the fetch would succeed and be cached) on a real physical device when one is able to trigger Chrome private proxy prefetch.

I have Fire devices where I may be able to sideload a Chrome APK and try to connect the debugger - with any luck the networking stuff would Just Work™️. I tried with someone who has a newer Samsung (Android 14) device, using Chrome 120 with developer mode, ADB USB debugging, and Chrome DevTools; we cleared all browsing history and ratcheted up the prefetch settings in Chrome (it's in Chrome's triple dot menu > Settings > Privacy) before searching Google, but weren't able to trigger the prefetch with a few Google queries of the form wikipedia topic ... so I may need to enlist some help from others with Android devices to see if they can reproduce.

Anyway, here is what it looks like so far on the client side.

{F41668102, size=full}

I managed to make a connection via the Chrome private prefetch proxy using a Fire with a sideloaded Chrome 120 APK. In this case the User-Agent is perceived as a desktop one by Google, but processed as mobile in the Wikipedia infrastructure, so Chrome saw it as a 302 (delivered from the Wikipedia edge via the Google proxy) while in the Google SERP.

Screenshot 2024-01-12 at 4.59.11 PM.png (1×3 px, 925 KB)

By the way, here are the corresponding wmf_raw.webrequest fields for this latest SERP. Notice how two prefetch requests were made from the same SERP, but the exit IP differs a little.

hostname	dt	ip	cache_status	http_status	response_size	http_method	uri_host	uri_path	uri_query	content_type	referer	x_forwarded_for	user_agent	accept_language	x_analytics

cp1114.eqiad.wmnet	2024-01-12T22:57:12Z	2001:4860:7:70e::ed	int-front	302	0	GET	en.wikipedia.org	/wiki/Dolphin		-	https://www.google.com/	NULL	Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36	en-us,en;q=0.9	https=1;client_port=33743;nocookies=1


cp1102.eqiad.wmnet	2024-01-12T22:57:12Z	2001:4860:7:30e::ff	int-front	302	0	GET	en.wikipedia.org	/wiki/Miami_Dolphins		-	https://www.google.com/	NULL	Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36	en-us,en;q=0.9	https=1;client_port=36439;nocookies=1

Thanks for the updates @dr0ptp4kt, and nice that you are able to reproduce such a google proxy request.

One thing that I am interested in is finding the dominant cause of the google proxy requests. The google proxy only applies if there is no cookie, and the vast majority of proxy requests (93%) are google.com referred. The overall large number of google proxy requests - e.g. 16% (!) of all webrequests in the small analysis I ran - in conjunction with the results you posted above, makes me think that most of these are coming from SERP pages started from the search bar on android. Did the "dolphin" search result in prefetch requests for all wikipedia links without clicking on them? At this scale of requests we can expect that the vast majority of users didn't modify the prefetch settings in chrome, so there must be a commonly used mechanism that triggers the google proxy on android.

Thanks @fkaelin. Yes, those prefetches happened without clicking on the links. It seems to occur both for searches originating from the location bar and for searches entered into the <input> search field on the Google search webpage.

There have been some different calls-to-action in the Chrome mobile browser over the past years, and it's a little hard to trace back through all of those and remember what exactly came to pass and how we got to the point of the default settings.

But, what I saw for the default settings with Chrome on both the friend's Android 14 phone and on my sideloaded APK on the Fire was the following.

Go to Chrome's triple dot vertical ellipsis ("overflow") menu > Settings > Privacy and security > Preload pages.

(   ) No preload
Pages load only after you open them

( 🔘 ) Standard preloading
Some of the pages you visit are preloaded. Pages may be preloaded through Google servers when linked from a Google site.

(   ) Extended preloading
More pages are preloaded. Pages may be preloaded through Google servers when processed by other sites.

I ended up cranking it up to "Extended preloading" to make it be more aggressive in my sideloaded APK, but I'm pretty sure that "Standard preloading" is all that it takes for many browsing scenarios.

Based on the total impression volume in Search Console of, for example, English Wikipedia, as well as the likely clickthrough ratios, the level of requests here sounds about right.

I added a link in the Description to Chrome source that seems relevant for the header setting mechanism, but I think one would need to spend some more quality time looking for tokens like "prefetch" and "preconnect" and the like in https://github.com/chromium/chromium/ to find their usage, and probably doing some runtime checks with the browser (DevTools and possibly with native runtime debugging) to track down any extra special processing that's happening in Google markup, JS, HTTP headers, and possibly any specific packaging of Chrome (then again, maybe there's nothing additional in the packaging of the distributed browser w.r.t. the base source).

Just to put something concrete (not saying this is the thing), here's an interesting unit test on the prefetch predictor mechanism:

https://github.com/chromium/chromium/blob/00719be3079e3366cf22ef952bccfea2762b4aae/chrome/browser/predictors/resource_prefetch_predictor_unittest.cc

I also saw this: https://chromium-review.googlesource.com/c/chromium/src/+/5012037/4/content/browser/renderer_host/navigation_request.cc

But it seems like that file disappeared (EDIT: actually, I need to download the bigger Git repo, I think...but these repos are huge!). I'm trying to git log -S"Sec-Purpose" on Chromium but it's making my fan run on my laptop :)

@fkaelin Sec-Purpose: prefetch;prerender is mentioned for the omnibox use case at https://developer.chrome.com/docs/web-platform/prerender-pages , so I've added that, as well as Chrome's apparent link preview functionality (Sec-Purpose: prefetch;prerender;preview).

Change 981352 merged by Vgutierrez:

[operations/puppet@production] varnish: enrich X-Analytics for browser prefetch / prerender / preview

https://gerrit.wikimedia.org/r/981352

It's live and looking good in kafkacat. Now we wait a little for stuff to show up in the analytics tables. Thanks @Vgutierrez and @BTullis for the additional reviews and thanks @Vgutierrez for the deployment.

It's entering the analytics system based on the following query:

select
  http_status,
  hour,
  x_analytics_map['prefetch_sec_purpose'],
  x_analytics_map['prefetch_purpose'],
  x_analytics_map['prefetch_x_moz'],
  count(1)
from wmf.webrequest
where
  year = 2024
  and month = 1
  and day = 18
  and http_status in (200, 301, 302, 304)
  and (x_analytics_map['prefetch_sec_purpose'] is not null
    or x_analytics_map['prefetch_purpose'] is not null
    or x_analytics_map['prefetch_x_moz'] is not null)
group by
  http_status,
  hour,
  x_analytics_map['prefetch_sec_purpose'],
  x_analytics_map['prefetch_purpose'],
  x_analytics_map['prefetch_x_moz'];

Nice!

from pyspark.sql import functions as F  # `spark` session assumed available (e.g. pyspark shell)

pa = spark.table("wmf.pageview_actor").where("""year=2024 and month=1 and day=18 and hour=16""")
prefetch_fields = ['prefetch_sec_purpose', 'prefetch_purpose', 'prefetch_x_moz']
cols = [F.col("x_analytics_map").getItem(f).isNotNull().alias(f) for f in prefetch_fields]
pa.groupBy(*cols).count().orderBy("count", ascending=False).show(1000, truncate=False)


+--------------------+----------------+--------------+--------+
|prefetch_sec_purpose|prefetch_purpose|prefetch_x_moz|count   |
+--------------------+----------------+--------------+--------+
|false               |false           |false         |50747641|
|true                |true            |false         |6545852 |
|false               |true            |false         |5111    |
|true                |false           |false         |122     |
|false               |false           |true          |29      |
+--------------------+----------------+--------------+--------+
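
As a small follow-up sketch reusing the pa DataFrame and F import from the snippet above, the overall share of rows carrying the new standardized key can be computed directly:

# Continues the snippet above: fraction of rows in this hour that carry the
# prefetch_sec_purpose key in x_analytics_map.
total = pa.count()
tagged = pa.where(F.col("x_analytics_map").getItem("prefetch_sec_purpose").isNotNull()).count()
print(f"{tagged}/{total} = {tagged / total:.1%} of rows carry prefetch_sec_purpose")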

Change 991563 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview"

https://gerrit.wikimedia.org/r/991563

Change 991563 merged by Btullis:

[operations/puppet@production] Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview"

https://gerrit.wikimedia.org/r/991563

Change 992782 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] varnish: enrich X-Analytics for browser prefetch / prerender / preview

https://gerrit.wikimedia.org/r/992782

Change 992782 merged by Btullis:

[operations/puppet@production] varnish: enrich X-Analytics for browser prefetch / prerender / preview

https://gerrit.wikimedia.org/r/992782

It's back up and running, and the following query is producing results. Note that the is_goog_isp field is mainly there to help resolve whether traffic was likely to have come via a Google proxy server; all the usual caveats apply: ISP mappings can change, isp_data['isp'] can contain strings that merely include "Google" without being an exact match, and so on. The IP list at https://www.gstatic.com/chrome/prefetchproxy/prefetch_proxy_geofeed , as used in @fkaelin's analysis at T346463#9393571, adds more precision.

select
day,
hour,
is_pageview,
isp_data['isp'] = "Google" as is_goog_isp,
agent_type,
access_method,
http_method,
http_status,
x_analytics_map['prefetch_sec_purpose'] as psec,
x_analytics_map['prefetch_purpose'] as ppur,
x_analytics_map['prefetch_x_moz'] as pmoz,
x_analytics_map['chrome_private_prefetch_version'] as ver,
count(1) as ct,
user_agent_map['browser_family'] as brow
from wmf.webrequest
where
year = 2024
and month = 1
and day = 25
and hour > 11
and (x_analytics_map['prefetch_sec_purpose'] is not null or x_analytics_map['prefetch_purpose'] is not null or x_analytics_map['prefetch_x_moz'] is not null)
group by
day,
hour,
is_pageview,
isp_data['isp'] = "Google",
agent_type,
access_method,
http_method,
http_status,
x_analytics_map['prefetch_sec_purpose'],
x_analytics_map['prefetch_purpose'],
x_analytics_map['prefetch_x_moz'],
x_analytics_map['chrome_private_prefetch_version'],
user_agent_map['browser_family']
order by
day,
hour,
is_pageview,
is_goog_isp,
agent_type,
access_method,
http_method,
http_status,
psec,
ppur,
pmoz,
ver,
brow
limit 20000;

Change 980912 abandoned by Elukey:

[operations/puppet@production] profile::cache::kafka::webrequest: change the JSON format

Reason:

https://gerrit.wikimedia.org/r/980912