
[SPIKE] Model impact of User-Agent deprecation on top line metrics
Closed, Resolved | Public | 5 Estimated Story Points

Description

Some preliminary analysis shows a small reduction in actor_signature entropy as a result of User-Agent deprecation. However, it's not necessarily true that this implies a small change in top line metrics. We use actor_signatures in our pageview and unique devices pipelines. They are computed using the first 200 characters of the user agent string as well as the IP and other details about each request. A large change in User-Agent entropy may cause a smaller change in actor_signature entropy, but have an outsized effect on something like automata detection.

One way to model the impact of the changes is to compare the existing output of our pipelines with a simulated User-Agent deprecation. Currently we do this:

get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map) AS actor_signature

And we could instead do something like this:

get_actor_signature(ip, concat(
    user_agent_map['os_family'], '-',
    user_agent_map['browser_family'], '-',
    user_agent_map['browser_major'], '-',
    user_agent_map['wmf_app_version']
), accept_language, uri_host, uri_query, x_analytics_map) AS actor_signature_after_change

We can use that as input into the rest of our pipelines, to estimate impact on top line metrics.
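The comparison described above can be sketched on toy data. This is a minimal illustration, not the pipeline's code: `toy_signature` is a hypothetical stand-in for `get_actor_signature` (whose real implementation is not shown in this task), it ignores `uri_query` and `x_analytics_map`, and the sample requests are made up.

```python
import hashlib
import math
from collections import Counter


def toy_signature(ip, ua_part, accept_language, uri_host):
    """Hypothetical stand-in for get_actor_signature: hash of the joined fields."""
    raw = "|".join([ip, ua_part[:200], accept_language, uri_host])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def reduced_ua(ua_map):
    """Mirror the concat(...) above: keep only coarse user_agent_map fields."""
    return "-".join([
        ua_map["os_family"],
        ua_map["browser_family"],
        ua_map["browser_major"],
        ua_map["wmf_app_version"],
    ])


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up requests: the first two share an IP and differ only in UA details
# that the reduced user agent no longer carries.
chrome_map = {"os_family": "Android", "browser_family": "Chrome Mobile",
              "browser_major": "110", "wmf_app_version": "-"}
requests = [
    ("10.0.0.1", "Mozilla/5.0 (Linux; Android 12; Pixel 6) Chrome/110.0.5481.65", chrome_map),
    ("10.0.0.1", "Mozilla/5.0 (Linux; Android 12; SM-G991B) Chrome/110.0.5481.153", chrome_map),
    ("10.0.0.2", "Mozilla/5.0 (Windows NT 10.0) Firefox/109.0",
     {"os_family": "Windows", "browser_family": "Firefox",
      "browser_major": "109", "wmf_app_version": "-"}),
    ("10.0.0.3", "Mozilla/5.0 (Macintosh) Version/16.3 Safari/605.1.15",
     {"os_family": "Mac OS X", "browser_family": "Safari",
      "browser_major": "16", "wmf_app_version": "-"}),
]

before = [toy_signature(ip, ua, "en-US", "en.wikipedia.org") for ip, ua, _ in requests]
after = [toy_signature(ip, reduced_ua(m), "en-US", "en.wikipedia.org") for ip, _, m in requests]

entropy_before = shannon_entropy(before)
entropy_after = shannon_entropy(after)
print(f"distinct signatures: {len(set(before))} -> {len(set(after))}")
print(f"entropy (bits): {entropy_before:.2f} -> {entropy_after:.2f}")
```

In this toy sample the two Chrome requests collapse into one signature, so the signature count and the entropy both drop. On real data the same two columns would be computed side by side and fed into the downstream jobs.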

Remaining tasks

  • project plan including:
    • resourcing plan based on skills required and capacity building
    • bounded scope
    • cross-team work for instrumentation outlined
    • and timeline

Details

Other Assignee
Milimetric

Event Timeline

@Milimetric, thank you so much for simulating the User-Agent deprecation impact and sharing your results. Here's what I'm planning to do next:
I'll rerun this analysis for time periods before the change was rolled out (June 2022 and Feb 2023) and compare the results with yours from May 3rd.
I'll share my findings here.
If the differences are small, we'll check periodically and give this a low priority; if they are large enough to affect trends, we will push for finding a solution.

@Mayakp.wiki: you would be able to get the synthetic actor signature using the user_agent_map we retain in pageview_hourly, but we discard the user_agent string after 90 days, so you can only go back to around February 2023. We didn't observe any loss in Device Family Entropy until around late April (I think this dashboard should show it but it's not loading for me atm).

I think this analysis is useful only as a rough estimate, because a small drop in actor signature entropy might still hide lots of lost traffic if, for example, the signatures we lose are disproportionately responsible for lots of webrequests. So one idea is just to run some of the pipelines all the way from pageview_actor to pageview_hourly and unique devices and compare the results with what we already computed. We could pick March so we have a whole month to compare, but I could use some help picking a range that is small enough not to kill the cluster while still being useful.
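To illustrate why a small loss of distinct signatures can still move a lot of traffic, here is a toy sketch. The threshold rule is an assumption made only for illustration and is not the pipeline's actual automated-traffic heuristic; the signature names and request counts are made up.

```python
from collections import Counter

# Hypothetical cutoff: a signature with more requests than this is labeled automated.
THRESHOLD = 100


def automated_requests(counts, threshold=THRESHOLD):
    """Total requests attributed to signatures above the threshold."""
    return sum(n for n in counts.values() if n > threshold)


# Requests per signature before the simulated User-Agent reduction.
before = Counter({"sig_a": 60, "sig_b": 70, "sig_c": 10, "sig_d": 500})

# Two signatures that differed only in UA detail collapse into one.
collapse = {"sig_a": "sig_ab", "sig_b": "sig_ab", "sig_c": "sig_c", "sig_d": "sig_d"}

after = Counter()
for sig, n in before.items():
    after[collapse[sig]] += n

total = sum(before.values())
auto_before = automated_requests(before)
auto_after = automated_requests(after)
print(f"distinct signatures: {len(before)} -> {len(after)}")
print(f"requests labeled automated: {auto_before}/{total} -> {auto_after}/{total}")
```

Only one signature is lost here, yet the merged signature crosses the cutoff and all of its requests change label. That is the kind of effect an entropy comparison alone would not reveal, which is why rerunning the pipelines end to end gives a better estimate.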

Ah yes! We don't have data beyond 90 days, so I re-ran your analysis for one hour on Feb 13, 2023 and didn't find any differences from when you ran it for May. Linking the document here.

As discussed in today's sync meeting, I have opened a separate task for us to collaborate with Research and dig deeper into this based on our observations in the data.
T336715 Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices

I'll work with @kzimmerman to understand if there's anything further we can do on this task or if we can close.

For next steps, we should re-start the discussion to request User-Agent Client-Hints and get High Entropy Hints into webrequest logs T295073 to see if the UA deprecation affected our automated bot detection pipeline and resulted in the changes we are seeing in monthly top line metrics (reported in T336715)

> For next steps, we should re-start the discussion to request User-Agent Client-Hints and get High Entropy Hints into webrequest logs T295073

@Mayakp.wiki to clarify, do you mean T301238: UA_CH - Implement getting High Entropy Hints into webrequest logs? The solution we've discussed in T257893: [EPIC] Support User-Agent Client Hints header in CheckUser is to only request client hint headers for actions that CheckUser extension would log (editing, creating an account, logging in).

In practice, this means:

  • anonymous users who start an editing action (whether they complete it or not) or visit Special:CreateAccount or Special:UserLogin (whether they proceed to create an account / log in or not) will send client hint headers
  • logged-in users will always send client hint headers
  • CheckUser won't store client hint headers unless the user does an edit/login/account creation action; but presumably those headers are going to be visible in web request logs somewhere (cc @BBlack)

so in summary: web request logs would have client hint headers for all logged-in users, and for a relatively small subset of anonymous users.

@kostajh : Thank you for clarifying when the hints would be logged. I misunderstood that it would be for everyone who visits our sites.
We are seeing changes in our topline metrics and want to verify if this is due to the impact of UA deprecation or not. This is being investigated in T336715. I am seeing changes in our metrics corresponding to the reduction phases that were rolled out.

Based on the findings of that task we can decide if we should receive client hints, and if it should be enabled for all visitors.

> Based on the findings of that task we can decide if we should receive client hints, and if it should be enabled for all visitors.

We can discuss it, but the rough consensus (as I read it) from a lot of discussion in T257893: [EPIC] Support User-Agent Client Hints header in CheckUser is that we should not enable it for all requests, because that runs counter to the intent of client hints overall (reducing passive fingerprinting) and we'd theoretically put ourselves at risk of getting flagged by Chrome as a site that is requesting too much data. (See also the draft Privacy Budget proposal.)

> so in summary: web request logs would have client hint headers for all logged-in users, and for a relatively small subset of anonymous users.

Correction: in our current proposed implementation, we should be able to avoid requesting client hint headers for all logged-in users. So we will only request client hint headers for specific actions (editing, logging in, creating an account), and once those actions are done, we'll unset the client hint header request.
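The "request hints only for specific actions, then unset" behaviour can be sketched as a small decision function. This is an illustrative sketch, not the CheckUser implementation: the function name, the choice of hints, and the use of `action == "edit"` to detect an editing action are assumptions, as is the idea that an empty `Accept-CH` value is how the request gets unset.

```python
# High-entropy client hint header names (real hint names; which ones would
# actually be requested is an assumption for this sketch).
HIGH_ENTROPY_HINTS = [
    "Sec-CH-UA-Model",
    "Sec-CH-UA-Platform-Version",
    "Sec-CH-UA-Full-Version-List",
]

# Pages named in the discussion above.
HINTED_PAGES = {"Special:CreateAccount", "Special:UserLogin"}


def accept_ch_header(title, action):
    """Return a value for the Accept-CH response header (hypothetical helper).

    Hints are requested only for editing, login, and account-creation views;
    every other response returns an empty value, assumed here to tell the
    browser to stop sending the hints.
    """
    if action == "edit" or title in HINTED_PAGES:
        return ", ".join(HIGH_ENTROPY_HINTS)
    return ""


print(repr(accept_ch_header("Main_Page", "view")))
print(repr(accept_ch_header("Main_Page", "edit")))
print(repr(accept_ch_header("Special:UserLogin", "view")))
```

Under this scheme, ordinary page views never carry the extra hints, so webrequest logs would see them only on the small share of requests tied to those actions.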

from @Mayakp.wiki

...the major concern or change we were seeing in our pageview metrics was the increase in automated traffic.. I paired with Martin in Research and from https://phabricator.wikimedia.org/T341848#9014932 , we believe this is happening because of the Prefetch feature by Google. So I am not sure if the UA deprecation is causing a huge issue but it would be great if DPE ran the tests to confirm this https://phabricator.wikimedia.org/T336084.

Reassigning to @mforns & @Milimetric

VirginiaPoundstone renamed this task from Model impact of User-Agent deprecation on top line metrics to [SPIKE] Model impact of User-Agent deprecation on top line metrics. Aug 25 2023, 5:08 PM

I need a background task while I work on other things. I will take this one if it's OK.
@Milimetric let me know if you want to tackle this.

Milimetric updated Other Assignee, added: Milimetric; removed: mforns.

Here's a spreadsheet with an analysis on the impact of the UserAgent deprecation so far (as of today).
My summarized takeaway is that so far there's not a visible degradation of the automated traffic detection.
https://docs.google.com/spreadsheets/d/1y1mxagwM5FI5y1qQdlM76xeQKgfHCUQtvJbszuozMiA/edit#gid=0

After a bit of investigation, I think there's good news:
IIUC, the Chrome User Agent reduction has already been rolled out completely.

See https://www.chromium.org/updates/ua-reduction/
The rollout of phase 6 (removal of mobile device_family and os_major) matches exactly the drop in User Agent entropy we see in the spreadsheet above.

There's still a phase 7 rollout (the last one). This phase will shut down support for deprecation trial, which is a mechanism where sites can opt out of the User Agent reduction until the last moment.
However, we did not opt out using that mechanism, so phase 7 shouldn't affect us at all.

If my understanding is correct, we should not see further entropy loss in Chrome User Agents.
@Milimetric, does it make sense to you?

It makes sense, @mforns. It's just a little strange: I would've expected the minor versions to be in the first 200 characters of the strings, and indeed we saw a drop in the UA string entropy, so it's strange that this doesn't affect everything else. But I believe you and I'll just file that away with other life curiosities :)

I did find T265057: SPIKE: consider problems to data pipelines as a result of reduced user agent entropy in Google Chrome if you want to take a look at that in combination with this.