
Assess impact of ua-parser update on core metrics
Closed, ResolvedPublic8 Estimated Story Points

Description

In T192464, the user agent parsing regexes from ua-parser are being updated for the first time in two years. We need to know whether/how this affects our traffic data, at least for the core metrics.

  • What percentage of global human pageviews (i.e. those with agent_type = 'user', a core metric we report to the board on a monthly basis) are going to be reclassified as spider pageviews?
  • What percentage of monthly Wikipedia uniques devices (a core metric we report to the board on a monthly basis, calculated based on webrequests with agent_type = 'user') is going to be removed by improved spider classification?
  • Are there any browsers or browser versions whose place in the browser support matrix is changing due to improved classification in the new version?

Event Timeline

What percentage of monthly Wikipedia uniques devices (a core metric we report to the board on a monthly basis, calculated based on webrequests with agent_type = 'user') is going to be removed by improved spider classification?

Unique devices traffic should not be affected by this change. Besides the user/bot distinction, there is the nocookies flag, and also a step discounting cookieless users that make loads of requests and are thus likely bots. The unique devices algorithm does not run on all data marked as "user" pageviews.


The algorithm does use the agent_type = 'user' condition (https://github.com/wikimedia/analytics-refinery/blob/master/oozie/unique_devices/per_project_family/monthly/unique_devices_per_project_family_monthly.hql#L42). I think you are raising an interesting question about whether that condition is redundant given the nocookie condition and fingerprinting, but without some actual data I would not be comfortable relying on that assumption for our core metrics. Certainly the algorithm itself hasn't relied on it so far.
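For context, a rough sketch of how these filters interact (field names and the request-count threshold here are hypothetical; the real logic is in the HQL linked above):

```python
# Simplified, hypothetical sketch of the unique-devices filtering logic.
# The actual implementation is the refinery HQL linked above; the threshold
# and field names are illustrative only.

NOCOOKIE_REQUEST_THRESHOLD = 100  # hypothetical cutoff for "loads of requests"

def count_unique_devices(requests):
    """requests: iterable of dicts with keys
    'agent_type', 'has_cookie', 'fingerprint' (e.g. an IP + UA signature)."""
    seen = set()
    nocookie_counts = {}
    for r in requests:
        if r["agent_type"] != "user":   # ua-parser + WMF regex classification
            continue
        if r["has_cookie"]:
            seen.add(r["fingerprint"])
        else:
            nocookie_counts[r["fingerprint"]] = nocookie_counts.get(r["fingerprint"], 0) + 1
    # Discount cookieless signatures with an implausibly high request count:
    # these are likely undetected bots.
    for fp, n in nocookie_counts.items():
        if n < NOCOOKIE_REQUEST_THRESHOLD:
            seen.add(fp)
    return len(seen)
```

The sketch shows why the agent_type filter and the nocookie discounting overlap but are not identical: the latter only catches cookieless high-volume signatures.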

but without some actual data I would not be comfortable relying on the assumption for our core metrics. Certainly the algorithm itself hasn't relied on it so far.

Mmm, the research behind this metric's bot handling was plentiful, and much of the traffic marked "user" is excluded from counting. The bulk of the work we did on excluding bot traffic wrongly labeled as "user" is explained here (the nocookie flag exists because of bots, and ditto for counting requests with a given signature):

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews/Bots_Research

Milimetric moved this task from Incoming to Data Quality on the Analytics board.


Mmm, the research behind this metric's bot handling was plentiful, and much of the traffic marked "user" is excluded from counting. The bulk of the work we did on excluding bot traffic wrongly labeled as "user" is explained here:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews/Bots_Research

Yes, I'm familiar with that page. As far as I can see, it doesn't report any results about how many non-nocookie requests might be coming from undetected bots. And in any case, this was 2015 and it's 2018 now.
Apropos, I see that under "worklog" there you mentioned "[then-]recent updates to the bot regex that affect this data; our regex catches more bots via user agent (quite a bit more)", underlining the importance of the present task in general.

Anyway. Thanks for taking this on, let's see what Francisco finds out!

Sampling one hour of activity in all Wikimedia projects.

Total requests: 385M
Requests with a different ua_parsed dictionary: 173M (44%)

Differing requests by OS family:

Screen Shot 2018-05-09 at 12.19.26 PM.png (728×1 px, 65 KB)

The bulk of user agent strings that report a different result than two years ago (94% of differing requests, 163M) are devices running Windows. uap-core changed its Windows rules so that the version is no longer folded into the OS family name.

E.g. "Windows 10" used to be classified as {"family": "Windows 10", "major": null}; now it is {"family": "Windows", "major": "10"}.
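The rule change can be illustrated with a toy parser (a sketch only — these are not the actual uap-core regexes, and the NT-to-marketing-name mapping is abbreviated):

```python
import re

# Windows NT kernel version -> marketing name (abbreviated mapping).
NT_TO_MARKETING = {"10.0": "10", "6.3": "8.1", "6.2": "8", "6.1": "7"}

def parse_os_old(ua):
    """Old uap-core style: version folded into the family name."""
    m = re.search(r"Windows NT (\d+\.\d+)", ua)
    if m and m.group(1) in NT_TO_MARKETING:
        return {"family": "Windows " + NT_TO_MARKETING[m.group(1)], "major": None}
    return {"family": "Other", "major": None}

def parse_os_new(ua):
    """New uap-core style: family is just "Windows", version captured separately."""
    m = re.search(r"Windows NT (\d+\.\d+)", ua)
    if m and m.group(1) in NT_TO_MARKETING:
        return {"family": "Windows", "major": NT_TO_MARKETING[m.group(1)]}
    return {"family": "Other", "major": None}

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
```

Anything that groups or diffs on the full parsed dictionary will see every Windows request as "changed", which is why Windows dominates the differing-requests count.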

Of the remaining 10M differing requests, I found the following:

  • 4.4M (44%) requests coming from Chrome Mobile whose device wasn't recognised (shown as Generic_Android) are now identified with their correct brand.

Screen Shot 2018-05-09 at 1.21.26 PM.png (706×1 px, 61 KB)

  • 3.2M (32%) requests that used to be identified as Chrome Mobile are now Chrome Mobile Webview
  • 2.8M (28%) requests that used to be attributed to Apple's CFNetwork are now correctly identified as iOS apps, most notably what appears to be the Wikipedia app. Here's the distribution of requests that used to be UA'd as CFNetwork:

Screen Shot 2018-05-09 at 1.10.35 PM.png (708×1 px, 161 KB)

Finally, this is the overall distribution of all 385M requests in that period, by device, OS, and browser:

https://docs.google.com/spreadsheets/d/15pSfoO65NKH4zMNHC6tS6wCgWrBc6Qhm1TNA7CxKi2Y/edit?usp=sharing

As suggested before, the main impact of this change is that Windows will replace Android as the device family with the most requests; its request count will more than triple.

@Nuria @Tbayer regarding this change, is there anything else you think I should find out? I'll fix the tests in eventlogging and review Nuria's fix and when you think this is all ready we'll deploy the change.

A note about user agent strings identified as Spider: there don't seem to be breaking changes in bot metrics from the perspective of ua string parsing (in eventlogging we use an additional regex precisely to cover known cases that uaparser doesn't catch), but here are my findings:

On the previous version of uaparser, 13.8M requests from this sample were identified as spider (about 4% of all requests). The new release catches a total of 13.9M (a 0.7% increase), which indicates a slight improvement in identifying bots. All 133k of these newly identified requests used to have None (or null, for refinery) as their device brand. They break down as follows:

  • 93k requests (70%) previously only shown as Chrome with null device now are marked as Google Web Preview spider user agents
  • 26k requests (19%) are now identified as WhatsApp web preview
  • 11.5k requests (8%) are now identified as preview from the Qwant search engine

In the opposite direction, only 28 requests are no longer identified as Spider in the new version of uaparser, and they are all from Chrome Mobile Webview.
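A minimal sketch of how such a before/after comparison can be tallied, assuming hypothetical old_parse/new_parse callables that return (family, is_spider) for a UA string:

```python
from collections import Counter

def spider_transitions(uas, old_parse, new_parse):
    """Count user agents whose spider classification changed between two
    parser versions. old_parse/new_parse are hypothetical callables
    returning (family, is_spider)."""
    gained, lost, transitions = 0, 0, Counter()
    for ua in uas:
        old_family, old_spider = old_parse(ua)
        new_family, new_spider = new_parse(ua)
        if not old_spider and new_spider:
            gained += 1  # newly caught as spider
            transitions[f"{old_family}=>{new_family}"] += 1
        elif old_spider and not new_spider:
            lost += 1    # no longer classified as spider
    return gained, lost, transitions
```

The transitions counter yields exactly the kind of "Other=>WhatsApp" style breakdown reported later in this thread.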

Added percentages to all absolute figures to improve clarity.

Super interesting findings, thanks @fdans! CCing @chelsyx regarding the implications for iOS.

To get closer to the objectives of this task as spelled out in the task description, can we repeat the same calculations limited to pageviews (webrequests with is_pageview = true)?

Also, can we document the exact queries/calculation method used, either here or in the spreadsheet (per general practice, but also to potentially enable others to drill down into more specific aspects)? It may also be worth applying them to a larger sample (a full day or week), considering the strong periodicities in our webrequest data.

Ok, so let's do the same exercise with only pageviews, in the same one hour sample:

Total pageviews: 31.3M
Pageviews with a different ua_parsed dictionary: 9.6M (30.7%)

Of those differing user agent strings, 8.7M (92.6%) come from Windows; this is the same issue discussed above. Taking a look at the remaining 7.4% (0.9M) of differing pageviews, we see more or less the same distribution as in the general request analysis: 630K pageviews correspond to Generic Android devices now being correctly identified, and the rest is consistent with the CFNetwork and Chrome Mobile changes.

If y'all want to take a look at the sample data, I turned the humble scripts that I used in the first report into a Jupyter Notebook, which you can find in my home folder at notebook1003 (name is pageviews_change_uaparser). This is my first notebook, so forgive the messy/repetitive code and questionable styling, but the data is there if you'd like to extract some more insights.

Thanks @fdans!

Related to question #3 in the task description, I noticed that the number of IE7 pageviews dropped a lot from May 21 to May 22; it seems these are now counted mainly as IE11 (and some as IE8 and IE9):

IE7 views by version, 2018-04-20...2018-05-29.png (810×1 px, 114 KB)

(Source: Turnilo)

Do we know what's going on here?

This is of particular importance because since last fall we have been able to correct for the large amount of anomalous IE traffic from Pakistan (which in https://phabricator.wikimedia.org/T157404, back in April 2017, we concluded was an artifact) and some other countries by filtering out IE7 views.

Hey @Tbayer, it seems that until we updated the regexes, ua-parser took the MSIE version token at face value rather than inferring the actual version from the browser engine token (Trident). As you can see below, the two often disagree, so pretty much all the variations of this anomalous traffic were being parsed with the wrong version. Here are some test cases:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2)

This is the most common IE user agent string coming from Iran. Prior to this upgrade of ua-parser, we were parsing it as 'family': 'IE', 'major': '7', because the version was taken from the MSIE 7.0 token. But as you can see in the ua string itself, that token only indicates that the agent is compatible with IE7; the actual IE version should be inferred from the Trident version. Trident 4.0 means Internet Explorer 8, so the latest version of ua-parser correctly reflects that ('family': 'IE', 'major': '8').

A different case, also very common in Iran/Pakistan/India/Afghanistan:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)

You can see that in this case it's Trident/7.0, which the previous version of ua-parser (together with the MSIE 7.0 compatibility token) incorrectly reported as IE7. With this change in uap-core, the library parses Trident browser engines in much more detail, so the new version returns 'family': 'IE', 'major': '11'.
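The Trident-to-IE mapping is well documented (Trident 4.0 through 7.0 correspond to IE 8 through 11); the corrected logic can be sketched like this (a sketch, not the actual uap-core regexes):

```python
import re

# Trident engine versions map directly to IE versions:
# 4.0 -> IE8, 5.0 -> IE9, 6.0 -> IE10, 7.0 -> IE11.
TRIDENT_TO_IE = {"4": "8", "5": "9", "6": "10", "7": "11"}

def ie_major(ua):
    """Prefer the Trident token over the (possibly compatibility-mode) MSIE token."""
    m = re.search(r"Trident/(\d+)\.\d+", ua)
    if m and m.group(1) in TRIDENT_TO_IE:
        return TRIDENT_TO_IE[m.group(1)]
    # Old behavior / fallback: take the MSIE token at face value.
    m = re.search(r"MSIE (\d+)\.\d+", ua)
    return m.group(1) if m else None
```

Applied to the two user agent strings above, this returns "8" for the Trident/4.0 case and "11" for the Trident/7.0 case.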

Thanks for the explanation, @fdans ! It seems like the best option for now regarding T176023 is to convert that kind of pageview_hourly query into an equivalent (if much slower) webrequest query that uses the old ua-parser regex. (Or do you happen to see a better solution?)
[edited to fix wrong link]

@Tbayer can you clarify? I think you're linking to the wrong task.

@Tbayer can you clarify? I think you're linking to the wrong task.

Right, sorry - I fixed the link above.

@Tbayer ok, so looking at IE traffic in these countries for the months before the update, we can see that IE11 traffic was about 1–1.5% of all those pageviews.

Screen Shot 2018-06-19 at 6.39.44 PM.png (1×2 px, 351 KB)

I think the right approach is simply to assume that 98.5% of the IE traffic in those countries is bogus. Reparsing ua strings in webrequest with the old regexes would be a very long-running query carrying pretty much the same error as this approach.

Basically, we're going from assuming that all IE7 pageviews are false, to assuming that 98.5% of IE11 pageviews are false.
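In other words, the filtering rule changes from dropping IE7 outright to discounting a fixed share of IE11. As trivial arithmetic (the 98.5% figure is the estimate above):

```python
def genuine_ie11_estimate(ie11_pageviews, bogus_fraction=0.985):
    """Estimate genuine IE11 pageviews in the affected countries, assuming
    (per the discussion above) that 98.5% of that traffic is the anomalous
    formerly-IE7 traffic."""
    return ie11_pageviews * (1 - bogus_fraction)
```

E.g. for 1M reported IE11 pageviews, only about 15k would be treated as genuine.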



Thanks. I'm now trying out excluding all IE traffic from these countries (Iran, Pakistan, Afghanistan).
Excluding just IE11 would not seem sufficient, considering that (as you already indicated above in T193578#4242326) the traffic formerly classified as IE7 now falls into several different versions (besides IE11, also a substantial number as IE8, etc.):

IE pageviews by version Iran, Pakistan, Afghanistan April-June 2018.png (876×1 px, 135 KB)

(Source: Turnilo)

Ping @Tbayer let's mark this as resolved?

We got a lot of good information here (thanks again, also for sharing the SWAP notebook!)
But the three questions spelled out in the task description have not yet been marked as resolved. If I have overlooked the answers, please feel free to point that out and tick the corresponding checkboxes (preferably also linking the answers in the task description so folks can find them easily).

Regarding the first one ("What percentage of global human pageviews [...] are going to be reclassified as spider pageviews?"), we have some partial information from your great results in T193578#4196915 . But as I pointed out later that day in T193578#4198163, these still need to be applied to the agent_type = 'user' subset. I might take a look at your notebook now to see if I can work it out myself.


For the record: At the Wikimania hackathon, @fdans and I looked a bit further into this. To apply the agent_type = 'user' condition one needs to take the Wikimedia-specific spiderPattern from webrequest refinery code and combine it with the (old vs new) general spider detection from ua-parser that @fdans already made available for analysis in his notebook. (I'm not sure about the most efficient way to do this for the refinery code part.)
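Schematically, the combined classification looks something like the following sketch (the pattern here is a stand-in — the actual Wikimedia-specific spiderPattern is defined in the refinery source):

```python
import re

# Stand-in for the Wikimedia-specific spiderPattern in refinery.
# Illustrative only; the real pattern lives in the refinery source code.
WMF_SPIDER_PATTERN = re.compile(r"crawler|spider|https?://", re.IGNORECASE)

def agent_type(ua, uaparser_says_spider):
    """Combine ua-parser's spider detection with the WMF regex (OR'd together,
    as done in refine_webrequest.hql)."""
    if uaparser_says_spider or WMF_SPIDER_PATTERN.search(ua):
        return "spider"
    return "user"
```

Because the two detectors are OR'd, a request is only counted as "user" if neither flags it, which is why spiders newly caught by ua-parser may or may not change agent_type depending on whether the WMF regex already caught them.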

Wrapping up this task with the remaining questions this week would be great, as it would enable me to provide a better assessment of the current year-over-year traffic trend to the board and on the Audiences page.

I do apologize for the delay on this. I've run the same study on bots as in https://phabricator.wikimedia.org/T193578#4196915 but adding is_pageview=true to the webrequest query.

In this hour sample there are a total of 21,608,228 pageviews. With the previous version of uaparser, 3.90M pageviews from this sample were identified as spider (about 18% of all pageviews). The new release catches a total of 3.91M (a 0.25% increase), which indicates a slight improvement in identifying bots. This is how the user agent family of the 6,139 differing pageviews changed:

  • "Other=>WhatsApp": 4,484 (73%)
  • "Chrome=>Chrome": 1,506 (24%)
  • "Safari=>StatusCakeBot": 125 (2%)
  • "PingdomBot=>PingdomBot": 24 (0.4%)

The only difference in the case of PingdomBot is that, even though it has the word "Bot" in its name, it previously wasn't being identified as a spider.

Only 3 pageviews were reidentified as non-bot, all three of them reclassified as Google Mobile Webview.

I do apologize for the delay on this. I've run the same study on bots as in https://phabricator.wikimedia.org/T193578#4196915 but adding is_pageview=true to the webrequest query.

In this hour sample there are a total of 21,608,228 pageviews.

From T193578#4210948 I had understood it was 31.3M pageviews instead - or was that a different sample? (BTW, due to the strong daily and weekly seasonality of our traffic, I would recommend extending the analysis to one week - I can take care of that later if you share the new queries too.)

On the previous version of uaparser, from this sample 3.90M pageviews were identified as spider (about 18% of all pageviews). The new release catches a total of 3.91M (a 0.25% increase), which indicates a slight improvement in identifying bots.

Thanks! So as mentioned earlier, the last remaining step needed for answering the first question posed in the task, i.e. to replicate the impact on spider/non-spider pageviews as we actually measure them, would be to combine ua-parser's spider detection with our own from the webrequest refinery code, or estimate the difference by other means. It appears from your last result that we can ignore the quantitative effect of views being reclassified from bot to non-bot. Hence the impact on our actual pageviews would lie between 0% (in the unlikely case that all the spiders added in the new ua-parser version were already caught by our own regex) and the difference you determined here for the ua-parser-only detection (in the more likely case that none of them had been caught by our own regex). Thoughts?
I should note that for this question, it would also be good to know the impact on subsections of our traffic, e.g. for desktop views only.

the last remaining step needed for answering the first question posed in the task, i.e. to replicate the impact on spider/non-spider pageviews as we actually measure them, would be to combine ua-parser's spider detection with our own from the webrequest refinery code, or estimate the difference by other means.

Since our own regex has not changed, the biggest impact the ua-parser update could have on bot-identified pageviews is 0.25%, so that is your upper bound, right? (Setting aside running those numbers on a longer sample to, as you mention, take weekly patterns into account.)
Please see: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L99

the last remaining step needed for answering the first question posed in the task, i.e. to replicate the impact on spider/non-spider pageviews as we actually measure them, would be to combine ua-parser's spider detection with our own from the webrequest refinery code, or estimate the difference by other means.

Since our own regex has not changed, the biggest impact the ua-parser update could have on bot-identified pageviews is 0.25%, so that is your upper bound, right? (Setting aside running those numbers on a longer sample to, as you mention, take weekly patterns into account.)

Yes exactly, that's what I meant by "the difference you determined here for the ua-parser-only detection" above.

Please see: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L99

Yes, that's where both definitions are combined (with "OR"). In T193578#4488445 I already linked to the separate part of the refinery code where the Wikimedia-specific regex is defined.

Given that we have provided an upper bound for the effect of this change, neither @fdans nor I understand what the remaining ask is. The WMF-grown regex has not changed, so its effect on pageviews hasn't either; the ua-parser bot identification has changed, and we know its effect (barring weekly seasonality, as you mention) can be as high as 0.25%. We really feel this quantification is sufficient to account for the effect of the ua-parser update.

I've stored the two jupyter notebooks I've used in github: https://github.com/fdansv/ua_compare_notebooks

Here are the increases in bot recognition after applying the patch, for every day of a week in July:

day         increase
Sun Jul  8  0.244%
Mon Jul  9  0.288%
Tue Jul 10  0.300%
Wed Jul 11  0.309%
Thu Jul 12  0.299%
Fri Jul 13  0.296%
Sat Jul 14  0.262%

What percentage of global human pageviews (i.e. those with agent_type = 'user', a core metric we report to the board on a monthly basis) are going to be reclassified as spider pageviews?

We are reclassifying very few requests: roughly a 0.25% increase in bot traffic over prior numbers. You can use the provided notebook to extract more precise answers should you want to do so.

What percentage of monthly Wikipedia uniques devices (a core metric we report to the board on a monthly basis, calculated based on webrequests with agent_type = 'user') is going to be removed by improved spider classification?

Simplifying: if you assume 20% bot traffic (see: https://bit.ly/2NhD8wD), that leaves 80% human traffic. A 0.25% increase in bot traffic then means roughly a 0.06% decrease in "user" pageview traffic; such a small drop in pageviews is likely to have an unmeasurable effect on unique devices.
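Spelling the estimate out explicitly (a sketch; the result depends on reading the 0.25% as a relative increase in bot-classified traffic):

```python
# Assumed shares; see the linked bots research for the ~20% bot figure.
bot_share = 0.20
user_share = 1 - bot_share           # 0.80
bot_relative_increase = 0.0025       # ~0.25% more requests classified as bots

# Fraction of *total* traffic reclassified from "user" to "spider":
reclassified = bot_share * bot_relative_increase   # 0.0005, i.e. 0.05% of all traffic

# Relative decrease in "user" pageview traffic:
user_drop = reclassified / user_share              # ~0.0006, i.e. roughly 0.06%
```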

Are there any browsers or browser versions whose place in the browser support matrix is changing due to improved classification in the new version?

Probably not: other than the IE7 → IE11 shift already mentioned, the major change is the grouping of Windows OS versions under "Windows" rather than more specific labels like "Windows CE". See: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os , https://analytics.wikimedia.org/dashboards/browsers/#mobile-site-by-browser and https://www.mediawiki.org/wiki/Compatibility#Modern_(Grade_A) . You can follow up with the Performance Team to see if there have been recent changes to the list.

Nuria set the point value for this task to 8.