Page MenuHomePhabricator
Paste P6054

Top 50 browsers requesting /api/rest_v1/page/pdf (excluding bots and spiders) between 0:00 - 12:00 on 2017/09/26
ArchivedPublic

Authored by phuedx on Sep 28 2017, 12:45 PM.
+----------------+-----------+-----------+-----------------+----------------+----------------+-------+
| os_family | os_major | os_minor | browser_family | browser_major | browser_minor | n |
+----------------+-----------+-----------+-----------------+----------------+----------------+-------+
| Windows 7 | - | - | Chrome | 60 | NULL | 2931 |
| Windows 7 | - | - | Firefox | 55 | NULL | 2773 |
| Windows 7 | - | - | IE | 11 | NULL | 2229 |
| Windows 10 | - | - | Chrome | 60 | NULL | 2145 |
| Windows 10 | - | - | Firefox | 55 | NULL | 1596 |
| Windows 7 | - | - | Chrome | 61 | NULL | 1470 |
| Windows 10 | - | - | Chrome | 59 | NULL | 1318 |
| Windows 10 | - | - | Chrome | 61 | NULL | 1239 |
| Windows XP | - | - | Chrome | 49 | NULL | 1213 |
| Windows 8.1 | - | - | Firefox | 55 | NULL | 681 |
| Windows 8 | - | - | Chrome | 50 | NULL | 648 |
| Windows 8.1 | - | - | Chrome | 60 | NULL | 628 |
| Windows 10 | - | - | Edge | 15 | NULL | 572 |
| Windows 10 | - | - | IE | 11 | NULL | 547 |
| Windows XP | - | - | Firefox | 52 | NULL | 486 |
| Windows 8.1 | - | - | IE | 11 | NULL | 449 |
| Other | - | - | UC Browser | 9 | NULL | 335 |
| Windows 7 | - | - | Chrome | 58 | NULL | 323 |
| Windows 7 | - | - | Opera | 47 | NULL | 299 |
| Windows 8.1 | - | - | Chrome | 61 | NULL | 296 |
| Windows 7 | - | - | Firefox | 42 | NULL | 290 |
| Mac OS X | 10 | 12 | Safari | 11 | NULL | 272 |
| Windows Vista | - | - | Firefox | 52 | NULL | 254 |
| Mac OS X | 10 | 12 | Chrome | 60 | NULL | 248 |
| Windows 7 | - | - | Firefox | 52 | NULL | 230 |
| Ubuntu | - | - | Firefox | 55 | NULL | 230 |
| Windows XP | - | - | Iron | 45 | NULL | 206 |
| Windows 7 | - | - | Firefox | 3 | NULL | 185 |
| Windows 10 | - | - | Opera | 47 | NULL | 176 |
| Android | 5 | 0 | Firefox Mobile | 55 | NULL | 173 |
| Windows 10 | - | - | Edge | 14 | NULL | 170 |
| Android | 5 | 1 | Chrome Mobile | 49 | NULL | 140 |
| Windows Vista | - | - | Chrome | 49 | NULL | 137 |
| Windows 7 | - | - | IE | 10 | NULL | 134 |
| Windows 7 | - | - | IE | 9 | NULL | 133 |
| Windows 7 | - | - | Firefox | 51 | NULL | 128 |
| Mac OS X | 10 | 12 | Firefox | 55 | NULL | 128 |
| Linux | - | - | Firefox | 52 | NULL | 120 |
| Windows 7 | - | - | Firefox | 45 | NULL | 118 |
| Windows 7 | - | - | Firefox | 54 | NULL | 113 |
| Mac OS X | 10 | 12 | Safari | 10 | NULL | 109 |
| Windows XP | - | - | IE | 6 | NULL | 108 |
| Windows 7 | - | - | Sogou Explorer | 1 | NULL | 108 |
| Windows 7 | - | - | Chrome | 56 | NULL | 105 |
| Windows 7 | - | - | Chrome | 55 | NULL | 103 |
| Windows 7 | - | - | Chrome | 43 | NULL | 101 |
| Mac OS X | 10 | 12 | Firefox | 53 | NULL | 98 |
| Android | 5 | 0 | Chrome Mobile | 44 | NULL | 98 |
| Windows 7 | - | - | Firefox | 56 | NULL | 95 |
| Windows 10 | - | - | Chrome | 50 | NULL | 84 |
+----------------+-----------+-----------+-----------------+----------------+----------------+-------+
!set outputformat table
select
user_agent_map["os_family"] as os_family,
user_agent_map["os_major"] as os_major,
user_agent_map["os_minor"] as os_minor,
user_agent_map["browser_family"] as browser_family,
user_agent_map["browser_major"] as browser_major,
user_agent_map["browser_minor"] as browser_minor,
count(*) as n
from
# Analytics Engineering provide a detailed description of wmf.webrequest and
# wmf_raw.webrequest here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
wmf.webrequest
where
# PROTIP: elukey and Joal from #wikimedia-analytics recommend starting small
# and increasing the scope of the query until you're satisfied with your
# results - start with an hour, then a couple of hours, then a day, and
# finally a week.
#
# This is because year, month, day, and hour are all Hive partitions that
# map to hourly imports of Varnish logs in HDFS. By constraining the time
# range for your query, you minimise the number of imports that the cluster
# will load, parse, and process to produce results.
#
# NOTE: You should ask in #wikimedia-analytics before analysing a month's
# worth of data as the size of the initial dataset is ~35 TB (!).
year = 2017
and month = 9
and day = 26
and hour >= 0
and hour < 12
# The Varnish cluster that handled the request (see https://wikitech.wikimedia.org/wiki/Varnish#Cache_Clusters).
#
# PROTIP: elukey and Joal also recommend selecting a source whenever possible
# as it's a Hive partition. Doing so can reduce the size of the initial
# dataset by a factor of ~2.
and webrequest_source = 'text'
and uri_path like "%rest_v1/page/pdf%"
# Exclude known bots and spiders.
and agent_type = "user"
group by
# https://issues.apache.org/jira/browse/HIVE-1683
user_agent_map["os_family"],
user_agent_map["os_major"],
user_agent_map["os_minor"],
user_agent_map["browser_family"],
user_agent_map["browser_major"],
user_agent_map["browser_minor"]
sort by
n desc
limit 50;

Event Timeline

phuedx changed the title of this paste from Top 50 browsers requesting /api/rest_v1/page/pdf (excluding known bots or spiders) to Top 50 browsers requesting /api/rest_v1/page/pdf (excluding known bots or spiders) between 0:00 - 12:00 on 2017/09/26.

Draft query to introduce percentages to the output:

!set outputformat table;

set hive.mapred.mode=nonstrict;

with pdf_requests as (
  select
    user_agent_map["os_family"] as os_family,
    user_agent_map["browser_family"] as browser_family,
    count(*) as n
  from
    wmf.webrequest
  where
    year = 2017
    and month = 10
    and day = 11
    and hour = 1
    and webrequest_source = 'text'
    and uri_path like "%rest_v1/page/pdf%"
    and agent_type = "user"
  group by
    user_agent_map["os_family"],
    user_agent_map["browser_family"]
),
total_pdf_requests as (
  select
    sum(n) as total
  from
    pdf_requests
)

select
  pdf_requests_top_50.os_family as os_family,
  pdf_requests_top_50.browser_family as browser_family,
  pdf_requests_top_50.n as n,
  round(pdf_requests_top_50.n / total_pdf_requests.total * 100, 2) as pct
from
  total_pdf_requests
cross join (
  select
    *
  from
    pdf_requests
  order by
    n desc
  limit 50
) as pdf_requests_top_50
order by
  n desc

;
phuedx changed the title of this paste from Top 50 browsers requesting /api/rest_v1/page/pdf (excluding known bots or spiders) between 0:00 - 12:00 on 2017/09/26 to Top 50 browsers requesting /api/rest_v1/page/pdf (excluding bots and spiders) between 0:00 - 12:00 on 2017/09/26.Oct 14 2017, 1:14 PM
phuedx edited the content of this paste. (Show Details)