
Investigate surprising "10% Other" portion of Analytics Browsers report
Closed, Resolved · Public · 13 Estimated Story Points

Authored By: Krinkle, Jul 19 2023, 3:31 PM
Referenced Files
F56766930: desktop-os.png
Jul 29 2024, 3:11 PM
F56766929: desktop-browser.png
Jul 29 2024, 3:11 PM
F56766931: mobile-os.png
Jul 29 2024, 3:11 PM
F56766933: mobile-browser.png
Jul 29 2024, 3:11 PM
F56766932: all-os.png
Jul 29 2024, 3:11 PM
F56766937: all-browser.png
Jul 29 2024, 3:11 PM
F56377471: Crazy Query Plan.png
Jul 12 2024, 2:46 PM

Description

Background

For the past 8 years or so, the Browsers report at analytics.wikimedia.org has shown an "Other" portion of 10-12%.

In my view, this continuously calls into question the validity of the data and makes it difficult to trust. This lack of trust stems from the following assumptions being difficult to believe in their totality. The assumptions below are based solely on my own interpretation, and I expect them to be incorrect or incomplete. Having said that, I've shared these interpretations many times over the years and have yet to encounter an alternate explanation, including in conversations with PMs, Directors, and engineers working on the measurement methods and underlying datasets.

  1. 100% = only from "user" type (excludes known bots and spiders).
  2. 100% = only page views (excludes page loads during other actions such as edit, history, search, special pages, and non-HTML requests; regardless of user agent string).
  3. Other = other browser families, representing a long tail of lesser known and "fake" browsers.

I have no trouble believing there are thousands or even millions of lesser known browser families or distinct (unparsable) user agent strings seen in pageview traffic.

The part that's unbelievable is that we have a combined 12% of (assumed-human) page view traffic worldwide coming from lesser known browsers. If true, this would be very significant, worth talking about publicly and widely to support organisations like Open Web Advocacy in speaking up for browser diversity. And we should then strive to publish some kind of dataset that provides more insight into what the "biggest" of the smallest browsers are.

What's also hard to believe is that the market share of the main browsers is as high or as low as reported, based on other information available.

  • 31.9% Chrome Mobile
  • 21.2% Mobile Safari
  • 16.4% Chrome
  • 11.7% Other
  • 3.4% Edge
  • 3.3% Firefox

Impact

When we make product decisions around which part of our global audience we can support at a certain level (Basic/Modern, Grade C/A, as per mw:Compatibility#Browsers), we face a theoretical maximum of 88%. That's a pretty low ceiling.

Based on stats.wikimedia.org, that's more than 2 billion page views every month (of 24 billion), and 190 million unique devices (of 1.6 billion), that we can't account for.


Compared to other data

https://en.wikipedia.org/wiki/Usage_share_of_web_browsers

| Source | Chrome | Firefox | Other |
| --- | --- | --- | --- |
| analytics.wikimedia.org (Week of 2023-07-09) | 48% | 3.3% | 12% |
| StatCounter (Worldwide, June 2023) | 62.58% | 2.81% | 1.37% |
| W3Counter (Dec 2022) | 71.2% | 3.0% | 5.7% |

Investigation so far

I've raised this numerous times internally, including to Jon Katz (then-Director in WMF Product) in 2020, in the hopes someone could analyze this.

The issue is also raised regularly when I publicly share Wikimedia's browser statistics (example 1). Most recently, Sime Vidas (of WebPlatformNews fame) raised it again on social media (example 2).

So, I'll try to investigate it now and report my findings here.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| Update airflow dag that computes the metrics | repos/data-engineering/airflow-dags!800 | milimetric | bump-browser-metric-jar | main |
| Clean up the browser general DAG | repos/data-engineering/airflow-dags!797 | milimetric | finalize-browser-general-fix | main |

Event Timeline


@SNowick_WMF @nettrom_WMF @mpopov can you share ways this change would positively or negatively affect your work supporting product decision makers?

Idea 1 has the level of information and detail that my teams would find useful, especially in conversations around browser versions, support levels, and such.

@Krinkle: Thank you for sharing the results of your queries in a manner consistent with the Data Publication Guidelines.

From Product Analytics side, idea 1 would make it less likely for P+T teams to request our support for self-serviceable questions around browser/OS usage.

I can see Idea 1 potentially being helpful to teams answering questions like these themselves:
T303301: Support on understanding traffic and behaviors for users on legacy browsers (somewhat timely)
T250385: Estimate readership from Internet Explorer 8
That said, most folks at the Foundation have access to tools like Turnilo (which was used to answer that first one) and Superset. Staff also have access to more detailed data than what is released publicly. Maybe the browsers report on analytics.wikimedia.org is also more accessible / easier to use for staff than Turnilo and Superset are? (Although I'd like to get to a state where everyone in P+T is data-literate and proficient with both of those.)

I think where the public data is really helpful is for conversations with communities / product decisions made in collaboration with communities, since volunteers don't have access to the more detailed data that is only available internally.

Idea 1 ("Roll up") (Good simplicity, High detail). We release 1 main dataset, as before. But rather than only one large global "Other" bucket, we create smaller "Other" buckets within a given OS or browser family. This means we would, for example, have "Firefox 8%" made up of:

3% (Firefox 112, Windows 10)
2% (Firefox 112, macOS)
2% (Firefox 112, other), covers OS/version combos too small, including uncommon Linux/Windows versions.
1% (Firefox other, Windows 10), covers uncommon versions of Firefox on Windows 10.

Seems like folks are on board for Idea 1. Speak now or forever hold your peace. Will likely start work on this June 16th.

CC @Htriedman

Reading up on this thread now! I think that idea 1 sounds good and shouldn't be privacy-breaking if we report counts/percentages that are above 250 views. Given that we get something like 600m views per day, that lower threshold accounts for 0.000042% of our traffic.
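For what it's worth, the arithmetic checks out (using the ~600m views/day figure from the comment above):

```python
# Sanity check: 250 views as a share of roughly 600 million pageviews per day.
daily_views = 600_000_000  # approximate figure, per the comment above
threshold_views = 250

pct = threshold_views / daily_views * 100
print(f"{pct:.6f}%")  # 0.000042%
```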

@Krinkle

I think we could modify (split) the schema of the table wmf.browser_general and recompute it, so that idea 2 does indeed drastically reduce the other bucket for charts that are browser-only or OS-only.
We could just have 3 tables:

  • one grouped by browser_family and browser_major (would back up browser-based charts)
  • one grouped by os_family and os_major (would back up OS-based charts)
  • the one that we already have now, grouped by all fields (would back up browser+OS-based charts)

When aggregating the first 2 tables, since there are fewer dimensions, buckets would be bigger, and I think applying the same 0.1% threshold would indeed result in a much smaller "Other" bucket.
The only drawback of this solution is that if we want the change to apply retroactively, we'll have to re-run the calculations since 2015, but I think it is possible.
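A toy illustration of why the split tables help (sqlite3, made-up numbers and a hypothetical cutoff): with fewer dimensions the buckets merge, so rows that the combined browser+OS table would drop under the threshold survive in the browser-only aggregate.

```python
# Illustrative sketch of the three-table idea: aggregating by browser only
# (fewer dimensions, bigger buckets) before applying the threshold keeps
# rows that the full browser+OS breakdown would drop.
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE views (browser TEXT, os TEXT, n INTEGER)')
con.executemany('INSERT INTO views VALUES (?,?,?)', [
    ('Firefox', 'Windows', 40), ('Firefox', 'Linux', 35),
    ('Firefox', 'macOS', 25), ('Chrome', 'Windows', 900),
])
threshold = 50  # hypothetical cutoff

# combined table: every individual Firefox row is under the threshold
combined = con.execute(
    'SELECT browser, os, SUM(n) FROM views GROUP BY browser, os '
    'HAVING SUM(n) >= ?', (threshold,)).fetchall()

# browser-only table: Firefox survives because its buckets merge (40+35+25)
browser_only = con.execute(
    'SELECT browser, SUM(n) FROM views GROUP BY browser '
    'HAVING SUM(n) >= ?', (threshold,)).fetchall()

print(combined)      # only Chrome/Windows clears the cutoff
print(browser_only)  # both Chrome and Firefox clear it
```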


Also,
I think "Other" is not a good name for that bucket.
It implies that everything that is in it, is different from what is outside of it.
But actually, the current "Other" bucket in, say, ("Windows", "iOS", "Android", "Other", ...) does mostly contain requests from "Windows", "iOS", "Android", etc.
So, I think the name is misleading.
I'm not sure what could be a good name, but some ideas: "Unreported", "Hidden", "Protected", "Small buckets", "Privacy sensitive", "Unknown", ...?

Made up some slides to help think about this data:

https://docs.google.com/presentation/d/1jnL4E0LVEnh1WsNK7IM-mS2tQpIXxZJN2tz7iP05_OA/edit#slide=id.g2e65179e474_0_328

What Marcel says above I think is more valuable for the browser-only and OS-only reports than anything we can do with the combined report. Because of what I show in the slides. Adding to that, for the combined report, we could continue with idea 1 in two ways:

  1. keep browser_general the way it is now, with a threshold of 0.1%
  2. recompute browser_general with a threshold of 250 views

If we recompute, we'd have to bucket at some point, not for privacy but for readability of the report. So a question for @Timo and others, what level of detail is useful here? Do we need a list with everything that is above 250 views? I looked at just Chrome and Firefox browser families and for 1 day we have 1500 different combinations of all dimensions that have more than 250 hits. That seems more like noise than information to me, so what would be a good size of this list? I can then try to optimize the output towards that.

I think "Other" is not a good name for that bucket. […]
But actually, the current "Other" bucket in, say, ("Windows", "iOS", "Android", "Other", ...) does mostly contain requests from "Windows", "iOS", "Android", etc.
So, I think the name is misleading. […]

I'd agree. My thinking around "Idea 1" is that this would become less common. The global "Other" bucket would, I believe, mostly be reported under buckets like "Windows 10 / Firefox other". This is based on the assumption that most lost buckets today are in the long tail of browser versions, under OS versions and browser families that are themselves fairly large. A small contingent of noise and uncommon browser or OS families would continue to fall under "Other", of course.

What I am starting to realize, though, is that if we go with "Idea 1" and follow its implication of continuing a single dataset in the same flat, re-usable format we have today, we have a choice to make. Rolling up browser_version first or os_version first seems to produce the same output in my limited testing. I expect the main difference and trade-off will be in rolling up browser_family vs os_family first. This will dictate whether you end up with more "Other / real browser" rows or more "Real OS / Other" rows.

My instinct is that we have far more diversity in browser_family/version (and thus more loss due to division) than in os_family/version, which is more homogeneous and resistant to data loss, I think? If that's the case, then rolling up by os_family would be more valuable, I imagine. Example at P65115.

ok, I have some results for us to peruse, from rolling up in different ways. First of all, here's my query, so we can debate whether or not it's accurate.

-- CTEs omitted for brevity, quick descriptions here (all data from 2024-06-10):
-- pageview_sanitized: filtered to (user) pageviews on (desktop, mobile_web)
--                     AGGREGATED by all dimensions, filtered to HAVING > 250 views
-- totals: the total view_count for all requests on this day, no filtering

SELECT
    'all' as aggregation_type,
    p.access_method,
    p.os_family,
    p.os_major,
    p.browser_family,
    p.browser_major,
    p.view_count,
    p.view_count * 100 / totals.total_view_count AS percent_of_total
FROM
    pageview_sanitized p
    CROSS JOIN totals
WHERE
    p.view_count * 100 / totals.total_view_count > 0.1

UNION ALL

SELECT
    CASE
        WHEN os_family is null THEN 'only_browser_family'
        WHEN browser_family is null THEN 'only_os_family'
        WHEN browser_major is null and os_major is null THEN 'browser_and_os_families'
        WHEN os_major is null THEN 'all_but_os_major'
        WHEN browser_major is null THEN 'all_but_browser_major'
        ELSE 'unknown'
    END as aggregation_type,
    p.access_method,
    COALESCE(p.os_family, 'other-families') as os_family,
    COALESCE(p.os_major, 'other-versions') as os_major,
    COALESCE(p.browser_family, 'other-families') as browser_family,
    COALESCE(p.browser_major, 'other-versions') as browser_major,
    sum(p.view_count) as view_count,
    sum(p.view_count) * 100 / MIN(totals.total_view_count) AS percent_of_total
FROM
    pageview_sanitized p
    CROSS JOIN totals
WHERE
    p.view_count * 100 / totals.total_view_count <= 0.1
GROUP BY
    access_method,
    os_family,
    os_major,
    browser_family,
    browser_major
GROUPING SETS (
    (access_method, os_family, browser_family),                 -- 'browser_and_os_families'
    (access_method, os_family, os_major, browser_family),       -- 'all_but_browser_major'
    (access_method, os_family, browser_family, browser_major),  -- 'all_but_os_major'
    (access_method, browser_family),                            -- 'only_browser_family'
    (access_method, os_family)                                  -- 'only_os_family'
)
HAVING
    sum(p.view_count) * 100 / MIN(totals.total_view_count) > 0.1

So, the first SELECT grabs all data which, aggregated by all columns (os family, browser family, os major, browser major) represent more than 0.1% of the total pageviews on this day.

And the second SELECT then tries five different grouping sets and labels results by them:

(access_method, os_family, browser_family),                 -- 'browser_and_os_families'
(access_method, os_family, os_major, browser_family),       -- 'all_but_browser_major'
(access_method, os_family, browser_family, browser_major),  -- 'all_but_os_major'
(access_method, browser_family),                            -- 'only_browser_family'
(access_method, os_family)                                  -- 'only_os_family'

In the next comment I'll dump all the data so we can all look at it and make decisions. For those who want to look, I've inserted it in an Iceberg table, analytics_iceberg.milimetric.bg_01. Access this on SQLLab, or from a stat machine:

kinit
...
presto
use analytics_iceberg.milimetric;
desc bg_01;

This spreadsheet (1) has all the different aggregations in separate sheets. The name of the sheet is the aggregation type. Described here:

| Sheet | Description |
| --- | --- |
| all | aggregate by browser, browser major, os, and os major; keep only if total views > 0.1% |
| browser_and_os_families | aggregate anything NOT > 0.1% by browser and os families; keep results that are now > 0.1% |
| all_but_browser_major | aggregate anything NOT > 0.1% by all but browser major; keep results that are now > 0.1% |
| all_but_os_major | aggregate anything NOT > 0.1% by all but os major; keep results that are now > 0.1% |
| only_browser_family | aggregate anything NOT > 0.1% by browser family; keep results that are now > 0.1% |
| only_os_family | aggregate anything NOT > 0.1% by os family; keep results that are now > 0.1% |

[1] https://docs.google.com/spreadsheets/d/1a6JiDupQn4I7JujyoIvcR8oDJ942DPYTyCBQ_P60ARo/edit?usp=sharing

The long and the short of it is that we can get that "other" to about 2% if we simply roll up remaining data by browser family and os family. We could get fancier but let's see what folks think about just this approach.

From a discussion with @Krinkle about the data, a preliminary idea of how to roll up is:

first ( os, os_major, browser, browser_major ), the rest by ( os, browser, browser_major ), the rest by ( os, browser ), the rest by ( browser ) and everything else by "Other" where we will think very hard what word to use instead of "Other".

By "the rest" here we mean everything that didn't previously roll up over the threshold. We discussed the threshold a bit, and found that if an item has more than 100k hits in a week, it feels substantial enough to include. This corresponds to roughly 0.01%, an order of magnitude finer than the 0.1% used before. I'll run some queries and see where this lands us in terms of output and what assumptions about our current jobs would need to be re-evaluated as a result. For intuition, 100k hits in a week is roughly 10 hits per minute, so it makes sense to report on these.
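A quick check of the per-minute intuition above:

```python
# 100k hits per week expressed as a per-minute rate.
minutes_per_week = 7 * 24 * 60        # 10080
rate = 100_000 / minutes_per_week
print(f"{rate:.1f} hits per minute")  # 9.9, i.e. roughly 10 per minute
```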

We get a ton more detailed results this way, and the total coverage increases to 99.7%. Still not 99.9%, but I think we may have too much detail at some point. I'm fairly happy with these results, and I'm going to prepare the new browser general query as a gerrit change. It'll be good to get some review.

 select coalesce(aggregation_type, 'overall') as agg_type,
        sum(view_count) views,
        sum(percent_of_total) percent_of_total,
        count(*) as distinct_results
   from bg_15k
  group by aggregation_type
  grouping sets ((), (aggregation_type))
;

Below, agg_type says which columns we're aggregating by, with overall being an aggregate of the other four aggregates. We progressively aggregate as described in the previous comment.

| agg_type | views | percent_of_total | distinct_results |
| --- | --- | --- | --- |
| overall | 498502429 | 99.73385455358878 | 1182 |
| browser_major_and_os_major (all) | 485338098 | 97.10010715965323 | 904 |
| all_but_os_major | 5306051 | 1.0615653764205122 | 209 |
| browser_and_os_families | 7706058 | 1.5417274280794317 | 62 |
| only_browser_family | 152222 | 0.030454589435624194 | 7 |

The simpler way to do this, just two phases as opposed to progressive, gets us fairly similar results, with about 200 fewer rows which are all detailing specific browser versions.

select coalesce(aggregation_type, 'overall bg_250') as agg_type,
       sum(view_count) views,
       sum(percent_of_total) percent_of_total,
       count(*) as distinct_results
  from bg_15k_simple
 group by aggregation_type
 grouping sets ((), (aggregation_type))
| agg_type | views | percent_of_total | distinct_results |
| --- | --- | --- | --- |
| overall | 498408744 | 99.71511128250266 | 979 |
| browser_major_and_os_major (all) | 485338098 | 97.10010715965323 | 904 |
| browser_and_os_families | 13070646 | 2.615004122849415 | 75 |

So cool @Milimetric! Thanks for the data study.
The process makes sense to me!
Only question I have is: the query cuts the data 3 times,
once in pageview_sanitized (mentioned in the CTE description "filtered to HAVING > 250 views");
once in the first select "p.view_count * 100 / totals.total_view_count > 0.1";
and lastly in the second select "sum(p.view_count) * 100 / MIN(totals.total_view_count) > 0.1".
Wouldn't it be better to just cut once at the very end? Wouldn't it further reduce the Other bucket size?

Great question, @mforns. This was mostly for performance reasons. I couldn't find a way to get Spark to optimally work on the full day of pageviews without first aggregating it like this to > 250. But the execution plan I ended up with looks pretty wild. Let's talk tomorrow when you have some time. I'm attaching the change here.

Change #1049281 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] [WIP] Implement new way to aggregate browser statistics

https://gerrit.wikimedia.org/r/1049281

Milimetric set the point value for this task to 13. (Jul 3 2024, 7:50 PM)

Apologies for the week delay here, I was out sick, picking it back up soon.

Ok, sent updated code, it's fast now due to a CACHE statement, but that doesn't change the query plan which is still absolutely nuts, check this out:

Crazy Query Plan.png (4×3 px, 1 MB)

Heya @Milimetric, sorry for taking so long to review this.
I left a comment and a +1, I think that the code looks great and that we can deploy this 👍
This new query is so cool! Kudos :-)

ok, moving to ready to deploy. I'm going to ping @Krinkle one more time for data review. I executed this as I was testing and the results are available in milimetric.browser_general_test. You can query this like this:

https://superset.wikimedia.org/sqllab/?savedQueryId=928 (using presto_analytics_iceberg as the database, since it's an Iceberg table like in prod)

Or from a stat box you can do:

presto
use analytics_iceberg.milimetric;

 select *
   from milimetric.browser_general_test
  where day = date '2024-06-14'
  order by view_count asc

And for dates, I ran 2024-06-10 through 2024-06-17, so those should all be available. The "leftover" or "Other" bucket is now identified by this query:

select *
  from milimetric.browser_general_test
 where day = date '2024-06-17'
   and os_family = 'Redacted'
   and os_major = 'Redacted'
   and browser_family = 'Redacted'
   and browser_major = 'Redacted'
 order by view_count asc

And for the 17th of June, it's a total of about 1.2 million hits, or about 0.26%. Let me know what you think and if I can help make the data more accessible for testing.

👍 This is great!

First impressions (NOTE: This only covers about a two-week period, and they're not the same two-week periods):

  • For desktop, the proportion of "Other" was actually even higher at 15%.
  • For browser usage, the 11% Other appears to be made up of about ~2% for the top three (Chrome, Safari, Firefox), with the remaining ~9% spread over smaller browsers, many of which now get their own name represented, which is very nice to see.
  • It turns out Firefox may be more popular than Microsoft Edge on desktop, with Firefox at 13.5% vs Edge at 11.9%.
  • For OS usage, the "8.6% Other" has largely gone to Windows, which received a huge boost from 63% to 71.8%. It's not surprising that the OS numbers went only to a handful of existing names, given that we don't really have a prominent long tail of OSes (that is, Linux distributions and brandings report together under the unified "Linux", these being among the first privacy-preserving UA string portions to be frozen and unified).

Details:

I've run a couple of queries to compare with what's on https://analytics.wikimedia.org/dashboards/browsers/. I'm also preserving a visual copy here because, assuming the change is retroactive, this'll make for a nice comparison post-deployment.

For now I'm mainly looking at how much the top three are growing (i.e. how much of the old tail isn't really "Other"), and how much Other/Redacted shrink by.

| view | current pie | current data (2024-06-24 – 2024-07-21) | new data from milimetric.browser_general_test |
| --- | --- | --- | --- |
| all-browser | all-browser.png (1×1 px, 163 KB) | Chrome Mobile 29%, Mobile Safari 23%, …, Other 11%, Firefox 3.5% | Chrome Mobile 30.6%, Mobile Safari 22.4%, …, Firefox 4.6%, Redacted 0.2%, Other 0.2% |
| all-os | all-os.png (1×1 px, 102 KB) | Android 34%, iOS 29%, Other 11%, …, Linux 0.82% | Android 37.0%, iOS 28.9%, …, Linux 1.5%, Redacted 0.3%, Other 0.3% |
| mobile-browser | mobile-browser.png (1×1 px, 143 KB) | Chrome Mobile 43%, Mobile Safari 34%, Other 8.6% | Chrome Mobile 45.5%, Mobile Safari 33.4%, …, Redacted 0.2% |
| mobile-os | mobile-os.png (1×1 px, 92 KB) | Android 49%, iOS 42%, Other 8.6% | Android 55.0%, iOS 43.1%, Redacted 0.2%, Other 0.1% |
| desktop-browser | desktop-browser.png (1×1 px, 128 KB) | Chrome 49%, …, Other 15%, Edge 11%, Firefox 11%, Safari 9.9% | Chrome 56.2%, Firefox 13.5%, Edge 11.9%, Safari 10.4%, …, Redacted 0.3% |
| desktop-os | desktop-os.png (1×1 px, 89 KB) | Windows 63%, macOS 18%, Other 15%, Linux 2.4%, Chrome OS 0.75%, Ubuntu 0.29% | Windows 71.8%, macOS 19.9%, Linux 3.9%, Chrome OS 1.4%, Other 0.5%, Redacted 0.5%, Ubuntu 0.5% |
My queries
krinkle@stat1011$ presto
presto> use analytics_iceberg.milimetric;

presto> select sum(view_count) from milimetric.browser_general_test;
 3790423883 

presto> select access_method, sum(view_count) from milimetric.browser_general_test group by access_method;
 desktop       | 1255768536 
 mobile web    | 2534655347 

presto:milimetric> select browser_family, CAST(1000*sum(view_count)/3790423883 AS DOUBLE)/10 _pc from milimetric.browser_general_test group by browser_family order by _pc desc limit 20;
    browser_family     | _pc  
-----------------------+------
 Chrome Mobile         | 30.6 
 Mobile Safari         | 22.4 
…
 Firefox               |  4.6 
 Other                 |  0.2 
 Redacted              |  0.2 

presto:milimetric> select os_family, CAST(1000*sum(view_count)/3790423883 AS DOUBLE)/10 _pc from milimetric.browser_general_test group by os_family order by _pc desc limit 20;
   os_family   | _pc  
---------------+------
 Android       | 37.0 
 iOS           | 28.9 
 Windows       | 24.2 
 Mac OS X      |  6.9 
 Linux         |  1.5 
…
 Redacted      |  0.3 
 Other         |  0.3 

presto:milimetric> select browser_family, CAST(1000*sum(view_count)/2534655347 AS DOUBLE)/10 _pc from milimetric.browser_general_test where access_method='mobile web' group by browser_family order by _pc desc limit 20;
       browser_family       | _pc  
----------------------------+------
 Chrome Mobile              | 45.5 
 Mobile Safari              | 33.4 
 Chrome Mobile iOS          |  4.9 
 Samsung Internet           |  3.7 
 Google                     |  3.7 
 Chrome                     |  2.3 
 Firefox Mobile             |  1.3 
 Chrome Mobile WebView      |  1.0 
 Opera Mobile               |  0.5 
 Yandex Browser             |  0.5 
 Edge Mobile                |  0.3 
 Safari                     |  0.2 
 Crosswalk                  |  0.2 
 Firefox iOS                |  0.2 
 Redacted                   |  0.2 

presto:milimetric> select os_family, CAST(1000*sum(view_count)/2534655347 AS DOUBLE)/10 _pc from milimetric.browser_general_test where access_method='mobile web' group by os_family order by _pc desc limit 20;
   os_family   | _pc  
---------------+------
 Android       | 55.0 
 iOS           | 43.1 
 Windows       |  0.7 
 Mac OS X      |  0.4 
 Redacted      |  0.2 
 Linux         |  0.2 
 Other         |  0.1 

presto:milimetric> select browser_family, CAST(1000*sum(view_count)/1255768536 AS DOUBLE)/10 _pc from milimetric.browser_general_test where access_method='desktop' group by browser_family order by _pc desc limit 20;
  browser_family  | _pc  
------------------+------
 Chrome           | 56.2 
 Firefox          | 13.5 
 Edge             | 11.9 
 Safari           | 10.4 
 Opera            |  2.6 
 IE               |  1.2 
 Yandex Browser   |  0.9 
 Other            |  0.7 
 Apple Mail       |  0.4 
 Chrome Mobile    |  0.4 
 Mobile Safari    |  0.3 
 Redacted         |  0.3 

presto:milimetric> select os_family, CAST(1000*sum(view_count)/1255768536 AS DOUBLE)/10 _pc from milimetric.browser_general_test where access_method='desktop' group by os_family order by _pc desc limit 20;
 os_family | _pc  
-----------+------
 Windows   | 71.8 
 Mac OS X  | 19.9 
 Linux     |  3.9 
 Chrome OS |  1.4 
 Android   |  0.6 
 Other     |  0.5 
 Redacted  |  0.5 
 Ubuntu    |  0.5 
 iOS       |  0.4

great, moving this to get deployed. Steps will be:

  • make a backup of the existing browser reports and a new temporary dashboard to point to them, so we have something to compare
  • backup browser/general data to hdfs:///wmf/data/archive/browser/general_as_of_2024-08-07
  • deploy the new job, let it overwrite the existing output, working backwards through history until it's done. If something goes wrong, restore from the backup above and revert the script that computes it.
  • vet data one more time, now that we have more of it and we can do longitudinal queries to make sure no weird anomalies were introduced (ideally do this while it's running to head off any big problems)
  • compare dashboards
  • fix pivot job (this is throwing errors on the new data)
  • write a little note about it

Change #1059914 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/analytics.wikimedia.org@master] Add temporary dashboard pointing to old data

https://gerrit.wikimedia.org/r/1059914

Change #1059914 merged by Milimetric:

[analytics/analytics.wikimedia.org@master] Add temporary dashboard pointing to old data

https://gerrit.wikimedia.org/r/1059914

Change #1049281 merged by Milimetric:

[analytics/refinery@master] Implement new way to aggregate browser statistics

https://gerrit.wikimedia.org/r/1049281

Status update on this: the new job is running, I'm going to keep it here until we vet the data. But new data should start showing up right away, and we can compare dashboards side by side and day by day:

UPDATE: the below didn't work, I just ended up deleting the DAG and setting its start date to 2015-06-01

For reference, I cleared all the tasks that this dag ran, and that will refresh data for 2 years. We can decide then if we want to do the full history:

analytics@an-launcher1002:/home/milimetric$ airflow-analytics tasks clear -e 2024-08-05 -s 2015-06-01 browser_general_daily

The backfill job should be done sometime this weekend, and I'll rerun the weekly job then.

Change #1062044 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Remove scripts related to old hive version

https://gerrit.wikimedia.org/r/1062044

Change #1062044 merged by Milimetric:

[analytics/refinery@master] Remove scripts related to old hive version

https://gerrit.wikimedia.org/r/1062044

the new graphs are up. The pivot transformation failed for all the browser family reports, so I'm still fixing that. But, for example, we can now compare these two:

old data: https://analytics.wikimedia.org/dashboards/browsers-previous/#mobile-site-by-browser/browser-family-timeseries
new data: https://analytics.wikimedia.org/dashboards/browsers/#mobile-site-by-browser/browser-family-timeseries

NOTE: With the new detail, some of the line graphs are automatically hiding lines. Dashiki automatically shows only the top 5 if there are more than 40. This is hard-coded but can be easily changed or made configurable, let me know what you think.

Hm.. it seems the "Other" bucket has grown slightly larger than our prediction of 0.26% at T342267#9998984. That could be fine, but I wanted to share it in case it's surprising:

  • All by browser: Other 1.0%
  • All by OS: Other 0.89%
  • Desktop by OS: Other 1.2%
  • Desktop by browser: Other 1.5%

This is from the new data in sunburst graphs for 2024-07-08 through today (i.e. the default).

From what I can see, "Redacted" is used for browser versions under a browser family, or under "Other", and e.g. for OSes under a browser (in "by OS and browser"). The top level has no "Redacted" afaik. But either way, that would be in addition to the above, so for now I'm looking at the top-level "Other" only.

Hm.. it seems the "Other" bucket has grown slightly larger than our prediction of 0.26% at T342267#9998984

This is going to probably confuse a lot of people so I have to be very careful in wording and documentation. But I think this makes sense.

The "Other" that we're seeing now is just coming from UA parser actually identifying the stuff as "Other". There was some old data remaining because of a few failed tasks (all cleaned up now, should sync soon).

The thing that we figured would be ~ 0.26% in total would be all the items with "Redacted" in every dimension added together. So I see values like that, for example on 2024-05-05 in this table.

If that's still ok, then we're good. If you think it's too confusing, let's talk.

Hm.. it seems the "Other" bucket has grown slightly larger than our prediction of 0.26% at T342267#9998984

[…] The "Other" that we're seeing now is just coming from UA parser actually identifying the stuff as "Other". […]

The thing that we figured would be ~ 0.26% in total would be all the items with "Redacted" in every dimension added together. […]

I see. This'll take some getting used to, but I agree this makes sense.

So to confirm, this means:

  • over 1% of page views (after deducting known bots and spiders) are coming from clients with user agents that are entirely unknown to ua-parser. That is, the "Other" is already there in the raw wmf.webrequest_text dataset, and we've not created or normalized anything else to "Other".
  • 0.26% is "Redacted" where we replace/normalize/summarise for privacy reasons browser/OS names in our pipeline.

I think that's a great outcome for this task. It's useful to learn that we have 1% of "real" Other in our "user" pageviews. It's higher than might seem ideal, but it reflects reality, and is not an issue with the Browsers dataset.

I'm curious what common patterns in recent unsanitized webrequest data might exist where user_agent_map.browser_family=Other. This may inform updates to ua-parser. Or, perhaps more likely, improvements to our Bot/Spider definition. Considering how large the list of recognised browsers is in the upstream ua-parser software, it seems unlikely to me that we'd have so many millions of pageviews every week from truly unknown browsers. I suspect these are ad-hoc scripts and bots scraping Wikipedia with novel/unique UA strings that don't follow the convention of including the word "Bot", and are not prominent enough to be explicitly listed as spiders/bots by us. Maybe? (The assumption I'm making is that we default a pageview to "user" unless classified as "bot" or "spider", and that we don't classify unknown UAs as bot or spider currently. I've not verified this assumption.)

Anyway, that's for another time :)

So to confirm, this means:

  • over 1% of page views (after deducting known bots and spiders) are coming from clients with user agents that are entirely unknown to ua-parser. That is, the "Other" is already there in the raw wmf.webrequest_text dataset, and we've not created or normalized anything else to "Other".
  • 0.26% is "Redacted" where we replace/normalize/summarise for privacy reasons browser/OS names in our pipeline.

Exactly

I think that's a great outcome for this task. It's useful to learn that we have 1% of "real" Other in our "user" pageviews. It's higher than might seem ideal, but it reflects reality, and is not an issue with the Browsers dataset.

Thanks so much for sticking with me throughout it, I'm happy with the result too.

I'm curious what common patterns in recent unsanitized webrequest data might exist where user_agent_map.browser_family=Other. This may inform updates to ua-parser. Or, perhaps more likely, improvements to our Bot/Spider definition. Considering how large the list of recognised browsers is in the upstream ua-parser software, it seems unlikely to me that we'd have so many millions of pageviews every week from truly unknown browsers. I suspect these are ad-hoc scripts and bots scraping Wikipedia with novel/unique UA strings that don't follow the convention of including the word "Bot", and are not prominent enough to be explicitly listed as spiders/bots by us. Maybe? (The assumption I'm making is that we default a pageview to "user" unless classified as "bot" or "spider", and that we don't classify unknown UAs as bot or spider currently. I've not verified this assumption.)

Anyway, that's for another time :)

That's a very good guess, and now we're not blind to it anymore. I haven't seen it pop up in cases where we suspect bot spikes, but these could just be background, non-spiky traffic and still be a problem. The whole bot-detection discussion seems to have everyone overwhelmed, and has the whole time I've been here. I think it would take a good cross-org effort, and we're not great at that yet.

I had missed that this had been tackled.
The new numbers make a lot more sense to me. Thank you for the work @Milimetric!