Analyze data differences between `access_method` derived from URL and from x-analytics
Closed, Resolved (Public)

Description

The webrequest `x_analytics` field now provides an `ismobile` key, which will allow the `.m` part of URLs to be removed in the near future.
This task is about analyzing and discussing the differences between the two ways of deriving `access_method` for webrequest data: from the URL and from x-analytics.
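
To make the comparison concrete, here is a minimal sketch of the two derivations. It is not the production code: the function names, the host-matching rule, and the assumption that `ismobile` arrives as the string `"1"` in a parsed x-analytics map are all illustrative. Mobile app detection is left out entirely, so only the desktop / mobile web split is modeled.

```python
from typing import Mapping


def access_method_from_host(uri_host: str, normalize_case: bool = True) -> str:
    """Hypothetical URL-based rule: an 'm' subdomain label means mobile web."""
    host = uri_host.lower() if normalize_case else uri_host
    # Drop the registered domain and TLD, keep only the subdomain labels.
    subdomains = host.split(".")[:-2]
    return "mobile web" if "m" in subdomains else "desktop"


def access_method_from_x_analytics(x_analytics_map: Mapping[str, str]) -> str:
    """Hypothetical header-based rule: ismobile == '1' means mobile web."""
    return "mobile web" if x_analytics_map.get("ismobile") == "1" else "desktop"


# The two rules agree on ordinary hosts.
print(access_method_from_host("en.m.wikipedia.org"))
print(access_method_from_x_analytics({"ismobile": "1"}))

# Without case normalization, a capital-M host falls through to desktop,
# while the header-based rule would still report mobile web.
print(access_method_from_host("en.M.wikipedia.org", normalize_case=False))
```

The `normalize_case` flag is only there to show how a case-sensitive host match could produce rows where the URL says desktop but `ismobile` is set.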

Event Timeline

First analysis and results:

select
  webrequest_source,
  access_method,
  x_analytics_map['ismobile'],
  is_pageview,
  is_redirect_to_pageview,
  count(1)
from wmf.webrequest
where year = 2025
  and month = 8
  and day = 6
group by
  webrequest_source,
  access_method,
  x_analytics_map['ismobile'],
  is_pageview,
  is_redirect_to_pageview
order by
  webrequest_source,
  access_method,
  x_analytics_map['ismobile'],
  is_pageview,
  is_redirect_to_pageview
;

Results with percentage of grand total and comment additions:

| webrequest_source | access_method | x_analytics_map[ismobile] | is_pageview | is_redirect_to_pageview | count(1) | manually_added_percentage | comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| text | desktop | NULL | FALSE | FALSE | 4020553802 | 34.25% | |
| text | desktop | NULL | FALSE | TRUE | 394881856 | 3.36% | |
| text | desktop | NULL | TRUE | FALSE | 413184857 | 3.52% | |
| text | desktop | 1 | FALSE | FALSE | 1 | 0.00% | Capital M in .m URL not normalized in webrequest: correctness improvement :) |
| text | desktop | 1 | TRUE | FALSE | 6 | 0.00% | Capital M in .m URL not normalized in webrequest: correctness improvement :) |
| text | mobile app | NULL | FALSE | FALSE | 396594427 | 3.38% | |
| text | mobile app | NULL | TRUE | FALSE | 9279961 | 0.08% | |
| text | mobile app | 1 | FALSE | FALSE | 1796122 | 0.02% | Not important, webrequest categorisation won't change. Interesting to understand though. |
| text | mobile app | 1 | TRUE | FALSE | 43671 | 0.00% | Not important, webrequest categorisation won't change. Interesting to understand though. |
| text | mobile web | NULL | FALSE | FALSE | 26699202 | 0.23% | Rows missing ismobile for an exact match. Neither pageviews nor redirect_to_pageview, so no impact on unique_devices or pageview metrics. |
| text | mobile web | NULL | FALSE | TRUE | 1338976 | 0.01% | Rows missing ismobile for an exact match. redirect_to_pageview is used only in unique_devices_per_project_family, which doesn't split by access_method. No impact on unique_devices or pageview metrics. |
| text | mobile web | 1 | FALSE | FALSE | 2524182280 | 21.50% | |
| text | mobile web | 1 | FALSE | TRUE | 120918652 | 1.03% | |
| text | mobile web | 1 | TRUE | FALSE | 379504262 | 3.23% | Perfect match - awesome :) |
| upload | desktop | NULL | FALSE | FALSE | 3325355452 | 28.33% | |
| upload | desktop | 1 | FALSE | FALSE | 673114 | 0.01% | I guess those rows have more precise qualification :) |
| upload | mobile app | NULL | FALSE | FALSE | 123668514 | 1.05% | |
| upload | mobile web | NULL | FALSE | FALSE | 399 | 0.00% | Only non-wiki hosts with .m qualifier. |
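
The percentage column was added by hand. As a rough sketch of how it can be reproduced, the snippet below recomputes each row's share of the grand total from the counts listed above. The tuple layout and the helper name `percentage_of_total` are assumptions made for illustration, not part of the original analysis.

```python
# (webrequest_source, access_method, ismobile, is_pageview,
#  is_redirect_to_pageview, count), copied from the results above.
rows = [
    ("text", "desktop", None, False, False, 4020553802),
    ("text", "desktop", None, False, True, 394881856),
    ("text", "desktop", None, True, False, 413184857),
    ("text", "desktop", "1", False, False, 1),
    ("text", "desktop", "1", True, False, 6),
    ("text", "mobile app", None, False, False, 396594427),
    ("text", "mobile app", None, True, False, 9279961),
    ("text", "mobile app", "1", False, False, 1796122),
    ("text", "mobile app", "1", True, False, 43671),
    ("text", "mobile web", None, False, False, 26699202),
    ("text", "mobile web", None, False, True, 1338976),
    ("text", "mobile web", "1", False, False, 2524182280),
    ("text", "mobile web", "1", False, True, 120918652),
    ("text", "mobile web", "1", True, False, 379504262),
    ("upload", "desktop", None, False, False, 3325355452),
    ("upload", "desktop", "1", False, False, 673114),
    ("upload", "mobile app", None, False, False, 123668514),
    ("upload", "mobile web", None, False, False, 399),
]

grand_total = sum(row[-1] for row in rows)


def percentage_of_total(count: int, total: int) -> float:
    """Share of the grand total, rounded to two decimals like the table."""
    return round(100.0 * count / total, 2)


for *dimensions, count in rows:
    print(dimensions, count, f"{percentage_of_total(count, grand_total):.2f}%")
```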

The data looks good enough IMO to change the underlying access_method algorithm and update the unique_devices_per_domain job.
I'll start doing that.
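
For a sense of scale, the sketch below adds up the rows where the two methods point in different directions. Which rows count as a disagreement is an assumption here: desktop rows with `ismobile` set, and mobile web rows without it. Mobile app rows are left out because their categorisation will not change. The grand total is the sum of the `count(1)` column in the results above.

```python
GRAND_TOTAL = 11_738_675_554  # sum of all count(1) values in the results

# (access_method, ismobile, is_pageview, count) for the rows assumed to disagree.
disagreeing_rows = [
    ("desktop", "1", False, 1),
    ("desktop", "1", True, 6),
    ("desktop", "1", False, 673_114),
    ("mobile web", None, False, 26_699_202),
    ("mobile web", None, False, 1_338_976),
    ("mobile web", None, False, 399),
]

disagreeing = sum(count for *_, count in disagreeing_rows)
disagreeing_pageviews = sum(
    count for _, _, is_pageview, count in disagreeing_rows if is_pageview
)
share_pct = 100.0 * disagreeing / GRAND_TOTAL

print(f"disagreeing requests: {disagreeing} ({share_pct:.2f}% of total)")
print(f"of which pageviews:   {disagreeing_pageviews}")
```

Under these assumptions the disagreeing rows are a small fraction of a percent of all requests, and only a handful of them are pageviews.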