
Investigate relation of Prefetch feature to increase in automated traffic and impact on unique devices
Open, Medium, Public, Spike

Description

We have been observing spikes in our automated bot traffic since June 2022. Based on our conclusions from T341848#9170131, we have identified that the rise in pageview traffic (automated and user) can be attributed to Chrome's private prefetch feature, which was released in June 2022.

This task has been modified to remove our initial hypothesis that this was due to UA deprecation; it will record all the discussions we've had so far and the next steps.

Event Timeline

Mayakp.wiki changed the subtype of this task from "Task" to "Spike". May 19 2023, 4:10 PM

moving to Research-Freezer until more information becomes available.

@MGerlach will consult on this. Assigning the investigatory task to @Mayakp.wiki.

kzimmerman triaged this task as Medium priority. Jun 21 2023, 7:29 PM

We are meeting this week to decide whether to turn off prefetch from the Wikimedia side (T218618#9028512), which could potentially reduce traffic by 1-2B requests per month.

Decisions:
With the assumption that we tag prefetch requests as automated traffic and don't see a corresponding decline in user traffic, we don't see an immediate need to turn off this feature.

  • Another reason to keep this feature is that it enhances user experience by reducing load times, which is beneficial for the consumers of our projects and could lead to an increase in unique devices (an annual plan metric that we are aiming to move up).

Questions:

  1. Since a connection is established before a user accesses our page (which is how this feature works), what happens after the user actually clicks the link? Does this get recorded as a user pageview (even though the browser doesn't have to send a request again)?

Action Items:

  • What happens when a user clicks on the prefetched page? Is that recorded as a pageview?
    • Connect with Partnerships to do fact-finding with Google
    • Dig into our data on user traffic and how we’re segmenting traffic
  • Circle back with DE and discuss this

There are several questions where it would be great to get answers from Partnerships and Google:

  • They should have statistics on the number of prefetches to our site (since these go through their private proxy). Do these match our estimated numbers?
  • Clarify how they make requests for pages that users navigate to but that are already prefetched.
  • Statistics on the speed benefit for users: how large is the benefit of this feature for users (ideally broken down by region)? There is a cost-benefit trade-off for us. If we prefetched everything for everyone, the time to serve would be very small; however, our infrastructure couldn't cope with that. So how large is the improvement in user experience compared to the additional cost of serving 2B pageviews per month? For example, if there were a 1ms improvement per 1B additional pageviews it would not be worth it, but maybe it would be at 100ms per 1B additional pageviews (see the sketch below). I could imagine folks from SRE/Traffic have opinions/expertise on this; I am sure they have some metrics/dashboards on latency, but I can't find them now. This blog post mentions some metrics around this (so maybe Brandon might be someone to reach out to about this?).
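A rough back-of-the-envelope framing of that trade-off; every number and the cost unit below are hypothetical placeholders, not measurements:

```python
# Hypothetical framing of the prefetch trade-off described above.
# All inputs are illustrative placeholders, not real measurements.

def prefetch_tradeoff(latency_saved_ms: float,
                      prefetched_views_per_month: float,
                      extra_requests_per_month: float,
                      cost_per_billion_requests: float) -> dict:
    """Compare aggregate user time saved against the extra serving cost."""
    # Total user-facing time saved per month, converted from ms to hours.
    user_hours_saved = latency_saved_ms * prefetched_views_per_month / 1000 / 3600
    # Extra serving cost per month (the per-billion unit cost is made up).
    extra_cost = extra_requests_per_month / 1e9 * cost_per_billion_requests
    return {"user_hours_saved_per_month": user_hours_saved,
            "extra_cost_per_month": extra_cost}

# e.g. 100 ms saved per prefetched view that a user actually opens,
# against 2B additional requests served per month:
print(prefetch_tradeoff(latency_saved_ms=100,
                        prefetched_views_per_month=2e9,
                        extra_requests_per_month=2e9,
                        cost_per_billion_requests=1000))
```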

Follow up on Action Items

  • Dig into our data on user traffic and how we’re segmenting traffic

Discussions with @MGerlach: there also seems to be a smaller increase in user traffic following a similar pattern to the automated traffic, and it is indeed a bit more difficult to understand. Some speculation:

  • No decrease in user traffic: this could be explained if the prefetch is not getting the "correct" pages, i.e. the ones users will actually navigate to. That means we get a lot of prefetches, but users then navigate to other pages.
  • Increase in user traffic: this could be explained by automated traffic from the prefetches being labeled as user traffic. I am not sure how the private proxy from Chrome works, but on our side we only see that lots of requests from the prefetches have the same origin (e.g. IP), which our automated-traffic algorithm will then mark as automated. However, I could imagine that the algorithm does not work 100% correctly, i.e. some of the prefetches might appear as genuine traffic since they are not explicitly labeled as such and we are just applying the heuristics from the bot-detection algorithm (a rough sketch of this scenario follows below).
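A minimal sketch of that misclassification scenario, assuming a purely per-IP volume threshold; the threshold, record shape, and labels are hypothetical and do not reproduce the actual detection pipeline:

```python
from collections import Counter

# Hypothetical sketch of the speculation above: prefetch-proxy requests share an
# origin IP, so a simple per-IP volume threshold marks most of them as automated,
# while any origin that stays under the threshold keeps the "user" label.
# This is NOT the real bot-detection pipeline; the threshold is made up.
REQUESTS_PER_IP_THRESHOLD = 500

def label_requests(requests):
    """requests: list of dicts with an 'ip' key; returns one label per request."""
    requests = list(requests)
    counts = Counter(r["ip"] for r in requests)
    return ["automated" if counts[r["ip"]] > REQUESTS_PER_IP_THRESHOLD else "user"
            for r in requests]
```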
  • Circle back with DE and discuss this

Had a meeting with @mforns where we saw that user traffic mimics automated traffic patterns, which makes us wonder whether prefetch is showing up as user traffic and is the cause of the increases in user traffic as well.

image.png (attached chart, 40 KB)

  1. We need to find a way to tag the prefetch proxy traffic, ideally in webrequest and other derived pageview tables, for easy analysis.
    • This is possible using the 'Sec-Purpose: prefetch; anonymous-client-ip' request header (see the sketch after this list).
    • Note that we would not want to change any of our existing dimensions (like agent_type) to indicate prefetch pageviews, since this would break our reporting and has consequences for Superset dashboards. Instead, find a way to store this in an existing field or create a new field.
  2. Simulate UA deprecation in a non-Chrome browser and run the bot pipeline (automated bot, or unique devices) to see the difference.
    • We will get the true number of unique devices, and simulating shows a change of x%.
    • If it does not, then this is another piece of evidence that UA deprecation doesn't relate to the increase in automated traffic, and we can be confident in discarding this hypothesis.
    • If it does, then extrapolate for Chromium.
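A minimal sketch of what tagging on that header could look like. Only the 'Sec-Purpose' header value comes from the discussion above; the record layout and the 'is_prefetch' field name are assumptions for illustration:

```python
# Sketch: flag prefetch-proxy requests by the 'Sec-Purpose' request header,
# as described in item 1 above. The record structure and 'is_prefetch' field
# are hypothetical; only the header name/value come from the task.
PREFETCH_PURPOSE = "prefetch;anonymous-client-ip"

def is_prefetch_request(headers: dict) -> bool:
    """True if the request carries the private-prefetch-proxy Sec-Purpose header."""
    value = headers.get("Sec-Purpose", "").replace(" ", "").lower()
    return value == PREFETCH_PURPOSE

# Example usage on a hypothetical webrequest-like record:
record = {"headers": {"Sec-Purpose": "prefetch; anonymous-client-ip"},
          "uri_path": "/wiki/Example"}
record["is_prefetch"] = is_prefetch_request(record["headers"])  # -> True, stored in a new field
```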
  1. We need to find a way to tag the prefetch proxy traffic, ideally in webrequest and other derived pageview tables, for easy analysis.
    • This is possible using the 'Sec-Purpose: prefetch; anonymous-client-ip' request header.
    • Note that we would not want to change any of our existing dimensions (like agent_type) to indicate prefetch pageviews, since this would break our reporting and has consequences for Superset dashboards. Instead, find a way to store this in an existing field or create a new field.

The choice here seems to be between categorizing these as automata or spider, because user is definitely not correct.

  • automata is more about automated user agents that we detect from traffic patterns
  • spider is more about actual web crawlers. This might be a bit of a stretch, but these are like real-time spiders: they're essentially pre-crawling the page to give their users a better experience. Of the available options, this seems the best. We could also add a new field, and we should think carefully either way (a possible ordering is sketched below).
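If the spider option were chosen, the precedence could look roughly like this; the helper flags and their ordering are assumptions for illustration, not the actual agent-type derivation:

```python
# Sketch of how prefetch requests could be folded into the existing labels if the
# 'spider' option were chosen. The flags and ordering are assumptions; the real
# agent-type logic is not reproduced here.
def classify_agent(is_prefetch: bool, is_known_crawler: bool, looks_automated: bool) -> str:
    if is_prefetch:
        return "spider"      # option discussed above: treat proxy prefetches like real-time spiders
    if is_known_crawler:
        return "spider"
    if looks_automated:
        return "automata"
    return "user"
```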
  • Dig into our data on user traffic and how we’re segmenting traffic

We were able to look into our traffic and get stats on the number of prefetch requests, thanks to the implementation in T346463.

  • Connect with Partnerships to do fact-finding with Google

We met with Google on Mar 20, 2024 and were able to get answers to some of our questions, but we need to follow up with them via email to decide whether we should switch off the prefetch feature. See the notes doc for more info.

Mayakp.wiki renamed this task from Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices to Investigate relation of Prefetch feature to increase in automated traffic and impact on unique devices. Tue, Apr 23, 6:19 PM
Mayakp.wiki removed a project: Product-Analytics.
Mayakp.wiki updated the task description. (Show Details)

We are leaning towards switching off the Prefetch feature, since the fraction of requests with no cookies is 17.5%. Numbers are in the notebook (last cell).
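A minimal sketch of the kind of calculation behind that figure, assuming request records with a hypothetical 'has_cookie' flag; the 17.5% itself comes from the notebook, not from this code:

```python
# Hypothetical sketch: share of requests that carry no cookies.
# Field names are made up; this only mirrors the notebook's calculation in spirit.
def no_cookie_fraction(requests) -> float:
    total = no_cookie = 0
    for r in requests:
        total += 1
        if not r.get("has_cookie"):
            no_cookie += 1
    return no_cookie / total if total else 0.0
```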

According to the public documentation, if there is a cookie, a fetch happens but the response is not used:

“Chrome currently restricts the usage of Private Prefetch Proxy to websites for which the user has no cookies or other local state. If there is a cookie for a resource, Chrome will make an uncredentialed fetch but will not use the response”