Consult on investigation of relation between UA deprecation and increase in automated traffic
Closed, ResolvedPublic

Description

Support work by @Mayakp.wiki investigating the increase in automated traffic and its relation to UA deprecation. See parent task for more details.

Event Timeline

MGerlach triaged this task as Medium priority. (Jul 14 2023, 8:10 AM)
MGerlach created this task.
MGerlach changed the subtype of this task from "Spike" to "Task". (Jul 14 2023, 8:18 AM)

weekly update:

  • after the initial coordination last week, I spent some time this week taking a more detailed look at the data
  • from this, I have a preliminary hypothesis: the increase in automated traffic is not caused by UA deprecation but by a new feature in Chromium, released in June 2022, that introduces speculation rules which "provide a mechanism for web content to permit prefetching or prerendering of certain URLs." There are more details about this feature in a blogpost on the private prefetch proxy in Chrome.

Some more detailed thoughts about the hypothesis

  • Starting in June 2022 we see an increase of about 2B pageviews per month marked as automated traffic. ( wikistats: https://stats.wikimedia.org/#/all-projects/reading/total-page-views/normal|bar|2-year|agent~automated|monthly )
  • almost all of this increase can be attributed to the Chrome Mobile browser; other browsers do not show similar increases/decreases (superset - internal only)
  • when zooming in on traffic from the Chrome Mobile browser, before June 2022 there was essentially no automated traffic at all. the increase in automated traffic starting in June 2022 is not accompanied by a decrease in user-traffic (superset - internal).
  • in addition, the increase in traffic seems organic in the sense that it similarly affects all regions (continents/countries) and projects
  • this suggests to me that there is additional incoming traffic. this is not consistent with the UA-deprecation hypothesis: the UA deprecation would lead to user-traffic being wrongly labeled as automated traffic, so the increase in automated traffic should be accompanied by a decrease in user-traffic. if anything, we see a slight increase in user-traffic as well. these changes cannot be attributed to seasonal fluctuations/trends, since no other browser shows similar changes.
  • This leaves the question of where these additional pageviews could come from. I checked the release notes of new versions of Chromium around the time we see the increase (June 2022): https://blog.chromium.org/2022/06/ . They mention a new feature called speculation rules, which refers to prefetching of certain URLs. The idea is explained in more detail in this blogpost.

Starting with Chrome 103 for Android, Chrome will gradually roll out a private prefetch proxy feature to speed up out-going navigations from Google Search and other participating websites by 30% at the median. This private prefetch proxy feature allows the prefetching of cross-origin content without exposing user information to the destination website until the user navigates.

While I haven't fully grasped the details of what the feature does, it is consistent with our observations:

  • the additional automated pageviews start in June 2022 and only for Chrome Mobile (when the feature seems to have been introduced for that specific browser)
  • access comes from all geographic regions and goes to all projects because it originates with actual users
  • the pageviews are marked as automated because they are sent via a private proxy ("the proxy inherently prevents the destination server from seeing the user's IP address.") and because of additional efforts to prevent user identification ("For example, the User-Agent header sent by prefetch proxy only carries limited information.")
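
The reasoning above (relabeling of user-traffic versus genuinely additional traffic) can be sketched as a comparison of monthly pageview counts before and after the change. This is only an illustration, not the analysis that was run: the function name, the tolerance value, and all numbers in the usage note are hypothetical.

```python
def classify_change(user_before, user_after, auto_before, auto_after, tolerance=0.25):
    """Illustrative heuristic to tell 'relabeling' from 'additional traffic'.

    Relabeling (UA-deprecation hypothesis): the rise in automated pageviews
    is offset by a drop of similar size in user pageviews.
    Additional traffic (prefetch hypothesis): automated pageviews rise while
    user pageviews stay flat or rise.
    The tolerance is an arbitrary, hypothetical threshold.
    """
    auto_delta = auto_after - auto_before
    user_delta = user_after - user_before
    if auto_delta <= 0:
        return "no increase in automated traffic"
    # Fraction of the automated increase that is offset by a user decrease.
    offset = -user_delta / auto_delta
    if offset >= 1 - tolerance:
        return "relabeling"
    if offset <= tolerance:
        return "additional traffic"
    return "mixed"
```

With made-up monthly counts, a 2B rise in automated pageviews paired with a 2B drop in user pageviews would be classified as "relabeling", while the same rise paired with flat or slightly higher user pageviews would be classified as "additional traffic", which is the pattern described for Chrome Mobile.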

weekly update:

  • Scheduled coordination meeting to discuss next steps and future planned work.

weekly update:

  • no update (follow-up meeting is scheduled for next week)

weekly update:

  • had a meeting where we discussed next steps in the work T336715#9132510
  • one question was whether we should turn off pre-fetching on the Wikimedia side; no decision has been made yet since it requires input from other teams such as Data Engineering/Partnerships (outside the scope of this task)
  • one possible follow-up is to clarify whether pages that are pre-fetched and then visited are also logged as pageviews with user_agent="user" (we are sure that pre-fetched pages are logged as automated, but not sure whether the follow-up navigation to that page also leads to requests labeled as "user")

@MGerlach does the pre-fetch traffic have headers that identify it as such when it comes through as webrequests?

I think it can be identified by the 'Sec-Purpose: prefetch; anonymous-client-ip' request header.
The documentation in their repository says:

"""publishers can opt-out for individual requests, for example, when dealing with temporary traffic spikes or other issues. Publishers should look for the Purpose: prefetch or the new 'Sec-Purpose: prefetch; anonymous-client-ip' request header and respond with an HTTP 403 (Forbidden) (see location for an example use case)."""
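
As a rough sketch of what such a check could look like, the snippet below inspects a dictionary of request headers for the 'Sec-Purpose: prefetch; anonymous-client-ip' value and picks a 403 response when opting out. The function names and the `opt_out` flag are hypothetical, and this says nothing about how (or whether) the header shows up in our webrequest data. It only checks the newer Sec-Purpose form; the older 'Purpose: prefetch' header mentioned in the quote is ignored here.

```python
def is_prefetch_proxy_request(headers):
    """Return True if the headers look like a private-prefetch-proxy prefetch."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    sec_purpose = normalised.get("sec-purpose", "")
    # Split "prefetch; anonymous-client-ip" into its token and parameters.
    parts = [part.strip().lower() for part in sec_purpose.split(";")]
    return parts[0] == "prefetch" and "anonymous-client-ip" in parts[1:]


def status_for(headers, opt_out=False):
    """Pick a status code: 403 to decline a proxied prefetch when opting out, else 200."""
    if opt_out and is_prefetch_proxy_request(headers):
        return 403
    return 200
```

For example, `status_for({"Sec-Purpose": "prefetch; anonymous-client-ip"}, opt_out=True)` yields 403, while a request without the header is served as usual.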

weekly update:

  • read through the documentation in more detail. the aim was to check whether pre-fetch traffic carries any headers by which it could be identified (see above T341848#9143300)
  • one potential signature is the 'Sec-Purpose: prefetch; anonymous-client-ip' request header; however, I am not sure if and how this would appear in our webrequest logs

weekly update:

  • an interesting analysis in T336084#9157037 shows that UA deprecation rollout phase 6 co-occurs with a pronounced drop in user-agent entropy. however, the increase in automated traffic appears several months earlier. this supports the hypothesis from this analysis that the increase in automated traffic is not due to the UA deprecation but has a different cause (the prefetch proxy)
  • coordinated with @Mayakp.wiki and we agreed that the next steps don't require any support from Research at the moment, as we have likely found the cause of the increase. how to address the issue will be tracked in other tasks (and likely by other teams).
  • therefore, I consider this task done and am closing it.
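
For reference, one common way to quantify the user-agent entropy mentioned in the first bullet above is the Shannon entropy of the distribution of user-agent strings. The sketch below assumes that definition; it is not the code used in T336084, and the exact metric there may differ.

```python
import math
from collections import Counter


def user_agent_entropy(user_agents):
    """Shannon entropy (in bits) of the empirical distribution of user-agent strings."""
    counts = Counter(user_agents)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the observed user-agent strings.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this definition, entropy is highest when requests are spread evenly over many distinct user-agent strings and falls as they collapse onto a few reduced strings, which is the kind of drop one would expect from UA deprecation.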