This task captures the work to re-run the queries in T176467#3658144 including OS percentages.
Description
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | • JKatzWMF | T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service | |||
Resolved | ovasileva | T176467 [Spike 3hrs] Which browsers are downloading PDFs? | |||
Resolved | ovasileva | T177969 [Spike 2h] Which browsers are downloading PDFs (including OS percentages)? |
Event Timeline
Using the query I got a days worth of data. Not clear if we need more data than that.
select user_agent_map["os_family"] as os_family, count(*) as c from webrequest where year=2017 AND month=9 AND day=26 AND uri_query LIKE "%action=redirect-to-electron%" group by user_agent_map["os_family"] sort by c desc;
os_family | total | % of total |
Other | 2626 | 33.56% |
Windows 7 | 1691 | 21.61% |
Windows 10 | 1602 | 20.47% |
Mac OS X | 547 | 6.99% |
Windows XP | 438 | 5.60% |
Windows 8.1 | 405 | 5.18% |
Windows Vista | 160 | 2.04% |
Windows | 70 | 0.89% |
Android | 69 | 0.88% |
Linux | 55 | 0.70% |
Windows 98 | 54 | 0.69% |
Windows 8 | 30 | 0.38% |
Chrome OS | 27 | 0.35% |
iOS | 19 | 0.24% |
Ubuntu | 11 | 0.14% |
Windows CE | 9 | 0.12% |
Windows 2000 | 4 | 0.05% |
Linux Mint | 3 | 0.04% |
Windows 95 | 2 | 0.03% |
Bada | 1 | 0.01% |
BlackBerry OS | 1 | 0.01% |
Symbian OS | 1 | 0.01% |
I'm not 100% sure what kind of traffic classes as "other" - but it looks like a large degree of it is bots from a quick browse.
The time frame I chose isn't exhaustive but here's a couple of examples of what "other" OS and/or browser families are. Note well that I'm excluding known bots and spiders from the result set.
+---------------------------------+-------+ | user_agent | n | +---------------------------------+-------+ | MyApp/0.1 | 5412 | | Mozilla/4.0 (compatible;) | 1 | | 12345 | 8 | | abc | 1 | | HTTrack/3.0x | 2 | | Mozilla/5.0 | 1 | | Yandex DocViewer | 2 | | Dorado WAP-Browser/1.0.0 | 4 | | UNTRUSTED/1.0/MAUI WAP Browser | 4 | +---------------------------------+-------+ select user_agent, count(*) as n from wmf.webrequest where year = 2017 and month = 10 and day = 11 and hour >= 0 and hour < 12 and webrequest_source = 'text' and uri_path like "%rest_v1/page/pdf%" # Exclude known bots and spiders. and agent_type = "user" and user_agent_map["os_family"] = "Other" and user_agent_map["browser_family"] = "Other" group by user_agent;
@ovasileva: MyApp. The User-Agent header sent by the client can be anything you want it to be if you're so inclined.
Edit: I've certainly done the same in my older hobby projects. My guess would be that that's what this is, a hobby project.
Thanks for running that query, @Jdlrobson.
How did you calculate the percentages on a per-row basis? Per your comment in the parent task, T176467#3640968, the uri_path we're searching for is like '%rest_v1/page/pdf%'.
I ask because I hadn't done this in one query in HiveQL before and I wanted to see if I could. Fortunately, @Milimetric helped me out and I arrived at the following:
https://pythonhosted.org/feedparser/http-useragent.html may be responsible for the user agent. People love to copy and paste examples :)
As for calculating percentages I did those manually in a spreadsheet.
We should get this copied on to https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality?veaction=editsource#Userbase
How's this?
https://www.mediawiki.org/w/index.php?title=Reading%2FWeb%2FPDF_Functionality&type=revision&diff=2586749&oldid=2579845
Can someone check and resolve?
@phuedx Thanks for documenting the query used (T177969#3687130 )! Can you also specify the timespan for which it was ran? (I.e. the concrete values of M, N, O, P and Q.) I re-ran it for a different timespan - October 31 - and got quite different results. E.g. 38.15% for Chrome 61 on Windows 10 instead of 8.83%, but only 6.55% for "Other" instead of 14.38%, etc. This may be because the total number of downloads in the timespan used was too low, and hence the statistical error (random variation) too large.
-- adapted from https://phabricator.wikimedia.org/P6147 : !set outputformat table set hive.mapred.mode=nonstrict; with pdf_requests as ( select user_agent_map["os_family"] as os_family, user_agent_map["browser_family"] as browser_family, user_agent_map["browser_major"] as browser_major, count(*) as n from wmf.webrequest where year = 2017 and month = 10 and day = 31 and webrequest_source = 'text' and uri_path like "%rest_v1/page/pdf%" and agent_type = "user" group by user_agent_map["os_family"], user_agent_map["browser_family"], user_agent_map["browser_major"] ), total_pdf_requests as ( select sum(n) as total from pdf_requests ) select os_family, browser_family, browser_major, round(n / total * 100, 2) as pct from total_pdf_requests cross join ( select * from pdf_requests order by n desc limit 50 ) as pdf_requests_top_50 order by pct desc ; +----------------+-----------------+----------------+--------+--+ | os_family | browser_family | browser_major | pct | +----------------+-----------------+----------------+--------+--+ | Windows 10 | Chrome | 61 | 38.15 | | Windows 7 | Chrome | 61 | 8.05 | | Other | Other | - | 6.55 | | Windows 7 | Firefox | 56 | 4.73 | | Windows 10 | Firefox | 56 | 4.28 | | Windows 7 | IE | 11 | 3.66 | | Windows XP | Chrome | 49 | 1.62 | | Mac OS X | Safari | 11 | 1.58 | | Windows 8.1 | Chrome | 61 | 1.45 | | Windows 10 | Edge | 15 | 1.45 | | Windows XP | Firefox | 52 | 1.35 | | Windows 10 | Chrome | 62 | 1.26 | | Windows 7 | Chrome | 62 | 1.26 | | Windows 10 | IE | 11 | 1.18 | | Windows 10 | Chrome | 59 | 1.03 | | Windows 8.1 | Firefox | 56 | 0.92 | | Windows 8.1 | IE | 11 | 0.8 | | Windows 10 | Firefox | 53 | 0.77 | | Windows 10 | Edge | 14 | 0.63 | | Windows 7 | Yandex Browser | 17 | 0.58 | | Mac OS X | Chrome | 61 | 0.48 | | Windows 7 | Opera | 48 | 0.48 | | Ubuntu | Firefox | 56 | 0.47 | | Android | UC Browser | 11 | 0.42 | | Android | Chrome Mobile | 61 | 0.35 | | Windows 8.1 | Chrome | 62 | 0.34 | | Windows 10 | Opera | 48 | 0.33 | | Windows Vista | Firefox | 52 | 0.33 | | Mac OS X | Firefox | 56 | 0.3 | | Windows 8 | Chrome | 61 | 0.29 | | Windows 7 | Chrome | 43 | 0.28 | | Mac OS X | Safari | 10 | 0.27 | | Windows 7 | Firefox | 52 | 0.27 | | Windows 7 | Chrome | 60 | 0.26 | | Windows 10 | Edge | 16 | 0.25 | | Windows XP | IE | 6 | 0.24 | | Android | Android | 4 | 0.21 | | Ubuntu | Chromium | 57 | 0.2 | | Windows 7 | Chrome | 55 | 0.2 | | Windows 8.1 | Chrome | 55 | 0.19 | | Windows Vista | Firefox | 43 | 0.18 | | Windows 8 | Firefox | 56 | 0.18 | | Linux | Firefox | 52 | 0.17 | | Linux | Chrome | 61 | 0.17 | | Windows XP | Chrome | 46 | 0.16 | | Windows 10 | Edge | 13 | 0.16 | | Windows 8.1 | Chrome | 58 | 0.16 | | Windows 7 | Chrome | 59 | 0.16 | | Mac OS X | Firefox | 48 | 0.16 | | Android | Opera Mini | 30 | 0.16 | +----------------+-----------------+----------------+--------+--+ 50 rows selected (341.405 seconds)
Thanks, @Tbayer. The timespan I used was 0:00 - 12:00 on 2017/10/11, i.e.
year = 2017 and month = 10 and day = 11 and hour >= 0 and hour < 12
I've since removed the result set from that paste so that it's generic, so I'll reproduce it below for posterity.
+----------------+-----------------+----------------+--------+ | os_family | browser_family | browser_major | pct | +----------------+-----------------+----------------+--------+ | Other | Other | - | 14.38 | | Windows 7 | Chrome | 61 | 12.42 | | Windows 10 | Chrome | 61 | 8.83 | | Windows 7 | IE | 11 | 7.33 | | Windows 7 | Firefox | 56 | 6.59 | | Windows 10 | Firefox | 56 | 3.82 | | Windows 10 | Edge | 15 | 3.24 | | Windows 8.1 | Chrome | 61 | 3.07 | | Windows XP | Chrome | 49 | 2.2 | | Windows 10 | Chrome | 59 | 1.53 | | Windows 10 | IE | 11 | 1.51 | | Windows 8.1 | Firefox | 56 | 1.31 | | Windows XP | Firefox | 52 | 1.22 | | Windows 8 | Chrome | 61 | 1.15 | | Windows 8.1 | IE | 11 | 1.15 | | Mac OS X | Safari | 11 | 0.9 | | Windows 7 | Firefox | 53 | 0.89 | | Windows 7 | Firefox | 52 | 0.78 | | Ubuntu | Firefox | 56 | 0.78 | | Windows XP | IE | 6 | 0.7 | | Windows 7 | Chrome | 55 | 0.68 | | Windows 7 | Firefox | 55 | 0.62 | | Mac OS X | Chrome | 61 | 0.62 | | Android | UC Browser | 11 | 0.6 | | Windows 10 | Edge | 14 | 0.59 | | Windows 7 | Opera | 48 | 0.53 | | Android | Chrome Mobile | 61 | 0.49 | | Windows 10 | Opera | 48 | 0.44 | | Windows 7 | Chrome | 60 | 0.4 | | Windows Vista | Chrome | 49 | 0.39 | | Windows 7 | Yandex Browser | 17 | 0.37 | | Windows 10 | Firefox | 55 | 0.37 | | Mac OS X | Safari | 10 | 0.36 | | Windows 10 | Chrome | 50 | 0.34 | | Android | Android | 4 | 0.33 | | Mac OS X | Firefox | 56 | 0.33 | | Windows 10 | Chrome | 60 | 0.32 | | Windows 8.1 | Chrome | 43 | 0.3 | | Android | Amazon Silk | 60 | 0.29 | | Windows 7 | Sogou Explorer | 1 | 0.27 | | Windows 8 | IE | 10 | 0.26 | | Windows 7 | IE | 8 | 0.26 | | Windows 8 | Opera | 12 | 0.25 | | Windows 7 | IE | 9 | 0.25 | | Linux | Firefox | 52 | 0.25 | | Mac OS X | Firefox | 53 | 0.24 | | Windows 7 | Firefox | 45 | 0.24 | | Windows 10 | Firefox | 57 | 0.24 | | Windows 7 | Firefox | 38 | 0.22 | | Windows 10 | Firefox | 47 | 0.21 | +----------------+-----------------+----------------+--------+
Thanks @phuedx! So I think this discrepancy shows it's better to rely on a longer timespan. Below, I have extended it to four weeks (Oct 1-28), which is actually OK to do provided one is content to relegate one's query to new "nice" queue on Hive (and prepared to wait a bit longer in case there are more timely queries running in the normal, non-nice queue - but even so, this query only took less than two hours for these four weeks' worth of data).
T176467#3658144, which gave rise to this task, also included the suggestion to compare the PDF download frequencies to the analogous frequencies from our general traffic, which seems a better test of the hypothesis that older browser are more likely to download PDFs (see also the original task description at T176467 ). I have done this below: pdf_pct is the same percentage as above, pv_pct is the percentage among pageviews in general, and pdf_propensity is the quotient of these two, measuring how much more likely users with that browser/OS combination are to download PDFs, compared with a general reader.
E.g. readers using the then-current version (61) of Chrome on Android seem to be remarkably unlikely to download PDFs, even when compared to other Android browsers like UC Browser.
os_family | browser_family | browser_major | pdf_downloads | pdf_pct | pv_pct | pdf_propensity |
---|---|---|---|---|---|---|
Windows 10 | Chrome | 61 | 412779 | 14.92 | 8.29 | 1.8 |
Other | Other | - | 329230 | 11.9 | 1.08 | 11.05 |
Windows 7 | Chrome | 61 | 320450 | 11.58 | 6.47 | 1.79 |
Windows 7 | Firefox | 56 | 147262 | 5.32 | 1.61 | 3.31 |
Windows 7 | IE | 11 | 136150 | 4.92 | 2.81 | 1.75 |
Windows 10 | Firefox | 56 | 118543 | 4.28 | 1.84 | 2.33 |
Windows 8.1 | Chrome | 61 | 68286 | 2.47 | 1.51 | 1.63 |
Windows 10 | Edge | 15 | 62190 | 2.25 | 1.37 | 1.64 |
Windows XP | Chrome | 49 | 41995 | 1.52 | 0.43 | 3.51 |
Windows XP | Firefox | 52 | 36140 | 1.31 | 0.23 | 5.65 |
Windows 10 | IE | 11 | 36084 | 1.3 | 1.09 | 1.19 |
Windows 10 | Chrome | 59 | 34859 | 1.26 | 0.04 | 29.63 |
Windows 8.1 | IE | 11 | 30582 | 1.11 | 0.6 | 1.85 |
Windows 7 | Firefox | 55 | 30602 | 1.11 | 0.34 | 3.27 |
Windows 8.1 | Firefox | 56 | 28305 | 1.02 | 0.35 | 2.93 |
Mac OS X | Safari | 11 | 28002 | 1.01 | 1.21 | 0.84 |
Windows 10 | Edge | 14 | 25656 | 0.93 | 0.39 | 2.38 |
Windows 10 | Firefox | 55 | 21301 | 0.77 | 0.36 | 2.15 |
Mac OS X | Chrome | 61 | 20958 | 0.76 | 1.9 | 0.4 |
Windows 7 | Chrome | 58 | 19489 | 0.7 | 0.12 | 5.89 |
Ubuntu | Firefox | 56 | 17439 | 0.63 | 0.17 | 3.78 |
Windows 7 | Opera | 48 | 17197 | 0.62 | 0.31 | 1.99 |
Windows 8 | Chrome | 61 | 16166 | 0.58 | 0.22 | 2.62 |
Windows 7 | Chrome | 62 | 14995 | 0.54 | 0.2 | 2.75 |
Android | Chrome Mobile | 61 | 14525 | 0.52 | 11.82 | 0.04 |
Windows 10 | Opera | 48 | 13902 | 0.5 | 0.3 | 1.67 |
Windows 7 | Firefox | 52 | 13290 | 0.48 | 0.22 | 2.23 |
Windows 7 | Chrome | 50 | 13191 | 0.48 | 0.03 | 15.05 |
Android | UC Browser | 11 | 13217 | 0.48 | 0.51 | 0.93 |
Windows 10 | Chrome | 50 | 13395 | 0.48 | 0.6 | 0.8 |
Windows 7 | Chrome | 60 | 12765 | 0.46 | 0.28 | 1.62 |
Mac OS X | Safari | 10 | 11969 | 0.43 | 0.76 | 0.57 |
Windows 7 | Chrome | 52 | 11594 | 0.42 | 0.1 | 4.08 |
Mac OS X | Firefox | 56 | 10688 | 0.39 | 0.28 | 1.36 |
Windows 10 | Firefox | 53 | 10392 | 0.38 | 0.02 | 22.13 |
Windows 10 | Chrome | 62 | 10387 | 0.38 | 0.27 | 1.38 |
Windows 7 | Chrome | 43 | 10414 | 0.38 | 0.02 | 17.03 |
Windows 7 | Firefox | 57 | 9998 | 0.36 | 0.03 | 11.52 |
Android | Android | 4 | 9429 | 0.34 | 0.92 | 0.37 |
Windows Vista | Firefox | 52 | 9141 | 0.33 | 0.07 | 4.8 |
Windows 7 | Firefox | 45 | 9115 | 0.33 | 0.07 | 4.97 |
Windows XP | IE | 6 | 8762 | 0.32 | 0.53 | 0.6 |
Windows 10 | Chrome | 60 | 7712 | 0.28 | 0.28 | 1.0 |
Windows 10 | Edge | 13 | 7111 | 0.26 | 0.07 | 3.52 |
Windows 8.1 | Firefox | 55 | 7071 | 0.26 | 0.08 | 3.28 |
Windows 7 | Yandex Browser | 17 | 6592 | 0.24 | 0.21 | 1.15 |
Windows CE | IE | 4 | 6673 | 0.24 | 0.15 | 1.56 |
Android | Android | 2 | 6557 | 0.24 | 0.18 | 1.33 |
Windows 7 | Chrome | 55 | 6250 | 0.23 | 0.08 | 2.83 |
Linux | Chrome | 61 | 5948 | 0.21 | 0.14 | 1.55 |
Query used:
SET mapred.job.queue.name=nice; set hive.mapred.mode=nonstrict; with pdf_requests as ( select user_agent_map['os_family'] as os_family, user_agent_map['browser_family'] as browser_family, user_agent_map['browser_major'] as browser_major, count(*) as n from wmf.webrequest where year = 2017 and month = 10 and day <= 28 and webrequest_source = 'text' and uri_path like '%rest_v1/page/pdf%' and agent_type = 'user' group by user_agent_map['os_family'], user_agent_map['browser_family'], user_agent_map['browser_major'] ), total_pdf_requests as ( select sum(n) as total from pdf_requests ), pageviews AS ( select user_agent_map['os_family'] as os_family, user_agent_map['browser_family'] as browser_family, user_agent_map['browser_major'] as browser_major, SUM(view_count) AS PVs from wmf.pageview_hourly where year = 2017 and month = 10 and day <= 28 and agent_type = 'user' group by user_agent_map['os_family'], user_agent_map['browser_family'], user_agent_map['browser_major'] ), total_pageviews AS ( select sum(PVs) AS total_PVs FROM pageviews ) SELECT pdf_requests_top_50.os_family AS os_family, pdf_requests_top_50.browser_family AS browser_family, pdf_requests_top_50.browser_major AS browser_major, n AS PDF_downloads, round(n / total * 100, 2) as PDF_pct, ROUND(PVs / total_PVs * 100, 2) AS PV_pct, ROUND( (n / total) / (PVs / total_PVs) , 2) AS PDF_propensity from total_pdf_requests cross join ( select * from pdf_requests order by n desc limit 50 ) as pdf_requests_top_50 JOIN total_pageviews CROSS JOIN pageviews ON pdf_requests_top_50.os_family = pageviews.os_family AND pdf_requests_top_50.browser_family = pageviews.browser_family AND pdf_requests_top_50.browser_major = pageviews.browser_major order by PDF_pct desc ;