Browser Reports. Break down "Other" bucket a little more?
Closed, Declined
Public

Description

Our number 3 browser grouping (after Chrome and Mobile Safari) is "Other", which is quite large at 13% of traffic. Possibly a little too high-level to help. :-)

https://browser-reports.wmflabs.org/#all-sites-by-browser/browser-family-and-major-hierarchical-view

  • Chrome 24%
  • Mobile Safari 16%
  • Other 13%
  • IE 11%
  • Firefox 9.2%

Event Timeline

Let me explain why this is happening.

We are cutting the long tail: any combination of dimensions whose share is smaller than 0.05% is folded into "Other".

See: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/browser/general/coordinator.properties#L62
and: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/browser/general/browser_general.hql

For example, three records like the following, which are part of the long tail:

Chrome 43 windows 11 0.03%
Chrome 42 linux 0.01%
Safari ios8 0.02%

are being translated into:

Other Other 0.06%

And progressively so.
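
A minimal Python sketch of the collapsing described above. This is not the actual HQL job: the function name, the view counts, and the "Chrome 49" row are made up for illustration, and the threshold is assumed to be a percentage of total views.

```python
from collections import Counter

OTHER = "Other"

def cut_long_tail(counts, threshold_pct):
    """Fold every (browser, os) bucket whose share of total views is
    below threshold_pct into a single (Other, Other) bucket."""
    total = sum(counts.values())
    result = Counter()
    for (browser, os_family), views in counts.items():
        share_pct = 100 * views / total
        if share_pct < threshold_pct:
            # Every dimension of a small bucket is rewritten at once.
            result[(OTHER, OTHER)] += views
        else:
            result[(browser, os_family)] += views
    return result

# Hypothetical view counts out of 10,000 total views, so that
# 3 views = 0.03%, 1 view = 0.01% and 2 views = 0.02%.
counts = {
    ("Chrome 43", "Windows"): 3,
    ("Chrome 42", "Linux"): 1,
    ("Safari", "iOS 8"): 2,
    ("Chrome 49", "Windows"): 9994,
}

cut = cut_long_tail(counts, threshold_pct=0.05)
for key, views in cut.items():
    print(key, views)
```

The three small buckets end up as one (Other, Other) bucket with 6 views, i.e. 0.06% of the total, even though each of them was individually below the threshold.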

There are some improvements that we are going to make to the percentage calculations for this data. Note, however, that cutting the long tail at 0.05% doesn't imply that your Other bucket will be small: it really depends on the dimensions you are keeping and the diversity of your data.

Let's wait and see whether the Other bucket is truly too big to have actionable data.

Notes for self:

The way to fix this issue is to do an intermediate calculation for OS and another one for browsers, and cut the long tail independently in each. The drawback is more complex storage (one table per UI split).

Also, the longer the time interval you calculate percentages over (monthly versus daily), the more likely it is that your Other bucket gets bigger, not smaller.

Another action we can take is anonymizing in a less aggressive way.
Today, when we find a bucket that is too small, we rewrite all the dimensions of the bucket as 'other'.
This is the easier approach, but it could be less aggressive: anonymizing just one of the record's dimensions may be enough for the bucket to stop being too small.
So, we could take the same approach as we did with the pageview_hourly table.
This would mean, though, transforming the Hive job that creates the intermediate browser_general into a Spark job, which we know takes longer to develop.
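
A rough Python sketch of what anonymizing one dimension at a time could look like. It is an assumption-laden illustration, not the pageview_hourly implementation: the function name, the order in which dimensions are blanked, and the sample counts are all hypothetical.

```python
from collections import Counter

OTHER = "Other"

def anonymize_per_dimension(counts, threshold_pct, dim_order):
    """For buckets below the threshold, rewrite one dimension at a time
    as Other and re-aggregate, instead of blanking the whole record."""
    total = sum(counts.values())
    current = Counter(counts)
    for dim in dim_order:
        merged = Counter()
        for key, views in current.items():
            if 100 * views / total < threshold_pct:
                # Blank only this dimension; the others are kept for now.
                key = key[:dim] + (OTHER,) + key[dim + 1:]
            merged[key] += views
        current = merged
    # Note: a bucket can still be too small after every dimension has been
    # blanked; a real job would need a policy for that leftover.
    return current

# Hypothetical (browser, os) view counts out of 10,000 total views.
counts = {
    ("Chrome 43", "Windows"): 3,
    ("Chrome 42", "Windows"): 4,
    ("Safari", "iOS 8"): 2,
    ("Chrome 49", "Windows"): 9991,
}

# Blank the browser (index 0) first, then the OS (index 1).
result = anonymize_per_dimension(counts, threshold_pct=0.05, dim_order=[0, 1])
for key, views in result.items():
    print(key, views)
```

With these numbers, rewriting everything at once would put 9 views into (Other, Other). Blanking one dimension at a time keeps 7 of them as (Other, Windows), and only 2 views fall through to (Other, Other).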

Right, anonymizing per dimension rather than per record will work better, but the Other bucket might still be sizable.

Nuria removed the point value for this task. Apr 18 2016, 4:45 PM
Nuria raised the priority of this task from Low to Medium. Apr 18 2016, 4:47 PM
Nuria edited projects, added Analytics; removed Analytics-Kanban.
Nuria moved this task from Incoming to Dashiki on the Analytics board.

An easier approach to fix this issue, per @mforns's suggestion:

Let's create two tables, one OS based and one browser family based. That way we can keep our current anonymization thresholds, but the number of dimensions in each table is smaller, which gives us more detailed reports.
If we do it this way, we can populate the new tables while maintaining the old one, and once the data is backfilled we can swap them and delete the old table that has browser family and OS together.
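
A small Python sketch of the two-table idea, assuming the same 0.05% threshold is applied to each table separately. The function name and the view counts are hypothetical, and the real tables would be produced by the Hive job rather than in Python.

```python
from collections import Counter

OTHER = "Other"

def cut_single_dimension(counts, dim, threshold_pct):
    """Aggregate (browser, os) counts on one dimension only, then fold
    values below threshold_pct of total views into Other."""
    total = sum(counts.values())
    by_dim = Counter()
    for key, views in counts.items():
        by_dim[key[dim]] += views
    table = Counter()
    for value, views in by_dim.items():
        label = value if 100 * views / total >= threshold_pct else OTHER
        table[label] += views
    return table

# Hypothetical (browser family, os) view counts out of 10,000 total views.
counts = {
    ("Chrome", "Windows"): 9990,
    ("Chrome", "Linux"): 4,
    ("Safari", "Linux"): 3,
    ("Safari", "iOS"): 3,
}

browser_table = cut_single_dimension(counts, dim=0, threshold_pct=0.05)
os_table = cut_single_dimension(counts, dim=1, threshold_pct=0.05)
print(dict(browser_table))
print(dict(os_table))
```

In the combined table, the three small combinations (4, 3 and 3 views) would all fall below the threshold and add up to 10 views of Other. Split by dimension, the browser table needs no Other bucket at all and the OS table only folds the 3 iOS views into Other.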

Nuria renamed this task from Break down "Other" a little more? to Browser Reports. Break down "Other" bucket a little more?. Jul 31 2017, 4:59 PM
odimitrijevic added a subscriber: odimitrijevic.

The deprecation of the User-Agent header and the possible move to User Headers will prompt a reworking of browser categorization.