
[Spike 3hrs] Which browsers are downloading PDFs?
Closed, Resolved (Public)

Description

We know how many times the "Download as PDF" link is clicked per day and how many PDFs are downloaded per day, but we don't know which browsers users are using to make these requests. Let's test the hypothesis that users are using older browsers with little or no PDF support to download PDFs.

This data should be available in [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#wmf.webrequest | the wmf.webrequest table ]].

Outcomes

  • We understand and document the range of browsers that are being used to download PDFs rendered by the Electron render service.

@phuedx: See T176467#3642849 and https://www.mediawiki.org/w/index.php?title=Reading%2FWeb%2FPDF_Functionality&type=revision&diff=2573674&oldid=2571758.

Notes

  1. AFAIK the PDF download URL is /wiki/Special:ElectronPdf?action=redirect-to-electron

@phuedx: In T176467#3640968, @Jdlrobson points out that it's /api/rest_v1/page/pdf.

Event Timeline

ovasileva moved this task from Incoming to Upcoming on the Web-Team-Backlog board.

Would it be possible to add a little more detail to this task? I'm not sure if it's quite ready to work on. I'm intentionally a bit out of the loop on OCG so silly questions ahead:

  • Do we already track browser info in a user-agent header, or will additional instrumentation need to be implemented as part of this task? (I'm trying to understand whether the scope of this task is to analyze the data we have or to figure out what we need to instrument.)
  • Where should the results be published? Just here on the ticket, a wiki, a Google Doc, a mailing list, or somewhere else?
  • Should this task be a subtask of one of the OCG or print style epics?
  • Does a dashboard for this info already exist or is this something we mine from the EventLogging database directly?

Thanks, @Niedzielski! I think I've answered all of your questions in T176467#3631394 and T176467#3631472.

phuedx updated the task description. (Show Details)
ovasileva renamed this task from [Spike] Which browsers are downloading PDFs? to [Spike 3hrs] Which browsers are downloading PDFs?. Sep 26 2017, 4:36 PM

I've looked at a day's and a month's worth of data.

Most downloads are coming from Firefox and Chrome (in particular Chrome 60), so I don't think the hypothesis "that users are using older browsers with little or no PDF support to download PDFs" holds. What's more likely is that users are on slow connections [1] or are right-clicking and opening in a new tab when they print. [Ignore that last point; I was getting confused with the printable version.]

A significant number of downloads seems to come from bots/crawlers.

Method for collecting data:

ssh stat1005.wikimedia.org
hive --database wmf
describe webrequest;
select user_agent, count(*) as c from webrequest where year=2017 AND month=9 AND day=26 AND uri_query LIKE "%action=redirect-to-electron%" group by user_agent sort by c desc;
select user_agent, count(*) as c from webrequest where year=2017 AND month=9 AND uri_query LIKE "%action=redirect-to-electron%" group by user_agent sort by c desc;

Results

Data collected for today and September.
https://docs.google.com/spreadsheets/d/1VoWUY2oSoLN-em18qjVy7MGf6zITWDZQEjBds_HHwr4/edit?usp=sharing

[1] Note: when our desktop JS does not finish loading or is interrupted, users who click the print-to-PDF link will be taken to the ElectronPdf interface instead of the JS browser print. (JS is required to convert the link into a browser print button.)

Neato!

I've been able to get Beeline to output nicely formatted tabular data before with:

!set outputformat table

For example:

[0]
+-------------+-----------+-----------+--------------------+----------------+----------------+--------+
|  os_family  | os_major  | os_minor  |   browser_family   | browser_major  | browser_minor  |   n    |
+-------------+-----------+-----------+--------------------+----------------+----------------+--------+
| Windows 10  | -         | -         | Chrome             | 50             | NULL           | 21222  |
| Other       | -         | -         | Googlebot          | 2              | NULL           | 9442   |
| Other       | -         | -         | YandexBot          | 3              | NULL           | 8645   |
| Other       | -         | -         | Scrapy             | 1              | NULL           | 7946   |
| Windows 7   | -         | -         | Chrome             | 34             | NULL           | 7739   |
| Windows 7   | -         | -         | Chrome             | 60             | NULL           | 6012   |
| Other       | -         | -         | Other              | -              | NULL           | 5772   |
| Windows 10  | -         | -         | Chrome             | 60             | NULL           | 5527   |
| Windows XP  | -         | -         | IE                 | 6              | NULL           | 5516   |
| Other       | -         | -         | Apache-HttpClient  | 4              | NULL           | 4682   |
+-------------+-----------+-----------+--------------------+----------------+----------------+--------+

select
  user_agent_map["os_family"] as os_family,
  user_agent_map["os_major"] as os_major,
  user_agent_map["os_minor"] as os_minor,
  user_agent_map["browser_family"] as browser_family,
  user_agent_map["browser_major"] as browser_major,
  user_agent_map["browser_minor"] as browser_minor,
  count(*) as n
from wmf.webrequest
where
  year = 2017
  and month = 9
  and uri_query like "%action=redirect-to-electron%"

  -- PROTIP: elukey and Joal from #wikimedia-analytics notified me
  -- that selecting a source (if possible) can cut the size of the
  -- initial dataset in two.
  --and webrequest_source = 'text'

group by

  -- https://issues.apache.org/jira/browse/HIVE-1683
  user_agent_map["os_family"],
  user_agent_map["os_major"],
  user_agent_map["os_minor"],
  user_agent_map["browser_family"],
  user_agent_map["browser_major"],
  user_agent_map["browser_minor"]
sort by
  n desc
limit 10;

In future, I'll drop the os_{major,minor} and browser_minor columns.

Regarding running queries over a week's or a month's worth of data: I've just been advised in analytics-engineering that we should approach them before running queries on a data set of that size. I was told that the size of the initial data set for [0] in T176467#3639268 (wmf.webrequest for the month of September) is ~35 TB, and mapping/reducing that amount of data starts to cause issues 😐
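As a rough back-of-envelope check (assuming the ~35 TB figure above and that the data is spread evenly across September's hourly partitions, which is an approximation), each partition constraint cuts the scan proportionally:

```python
# Rough estimate of how much data each Hive partition constraint avoids
# scanning, assuming September's wmf.webrequest totals ~35 TB spread
# evenly over 30 days of hourly partitions.
MONTH_TB = 35.0
DAYS_IN_MONTH = 30
HOURS_IN_DAY = 24

per_day_tb = MONTH_TB / DAYS_IN_MONTH           # one day partition
per_hour_gb = per_day_tb / HOURS_IN_DAY * 1024  # one hour partition

# Constraining webrequest_source = 'text' roughly halves the input again.
per_hour_text_gb = per_hour_gb / 2

print(f"day: ~{per_day_tb:.2f} TB, hour: ~{per_hour_gb:.0f} GB, "
      f"hour (text only): ~{per_hour_text_gb:.0f} GB")
```

So a single-hour query over the text source touches on the order of tens of GB rather than tens of TB, which is why the "start with an hour" advice matters.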

Wait… isn't the PDF download URL /api/rest_v1/page/pdf/(title)?
I can't find where the generatePdfDownloadLink hook is used (should it be removed?!).

That would explain things... Don't we need to use a query like:

select uri_path as c from webrequest where year=2017 AND month=9 AND day=26 AND uri_path LIKE '%rest_v1/page/pdf%' limit 1;

(??)

If so the query would be:

select user_agent_map, count(*) as c from webrequest where year=2017 AND month=9 AND day=26 AND uri_path LIKE '%rest_v1/page/pdf%' group by user_agent_map sort by c desc;

I've been scared to kick this off as I'm:

  • 1) Not sure I understand correctly what we're trying to measure.
  • 2) Not sure what process (if any, other than an IRC ping) is needed to run this query (per above).

@Jdlrobson: The numbers from Pivot point to that being the correct URL: https://pivot.wikimedia.org/#webrequest/totals/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkdmeANroVazsApU6jFmw55uOfETKUlxpq3adLvAUNHj8IhUt3OLZVUA/BkxACVicgBzcSUAEwBXBmI9XgAFAEYAESpmMGorAFpM+TK0ILT8KPijekczF0wrAhJDe3rTENd8dyVhOXxk0gksVMJfSu6p82sOMFsNDpMnWZ7gABtSApSNqgA3Yg3Er2AoYixSKEcwCX3MqDHYx/jBRQIMETGmXnIMBhERwUZQAuopNCDkOREhsNkpCNslppgAiwBpIRCoTCNkA==

Remembering that the dataset that Pivot is using is a 1/128 random sample of wmf.webrequest, we're looking at circa 80 k requests/day to a URL like that.
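The scale-up is just multiplication by the sampling rate. A minimal sketch (the 625/day sampled figure below is illustrative, not Pivot's exact number):

```python
# Pivot reads a 1/128 random sample of wmf.webrequest, so an observed
# (sampled) count scales up by a factor of 128 to estimate the total.
SAMPLING_FACTOR = 128

def estimate_total(sampled_count: int, factor: int = SAMPLING_FACTOR) -> int:
    """Estimate the true request count from a sampled count."""
    return sampled_count * factor

# Illustrative: ~625 sampled requests/day implies ~80,000 actual requests/day.
print(estimate_total(625))  # 80000
```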


  • 2) Not sure what the process (if any other than an IRC ping) is needed to run this query (per above)

Doing initial exploratory work in Pivot was highly recommended by the folk in the analytics-engineering channel as it's remarkably cheap/unimpactful.

Edit

… and once you've explored Pivot's dataset as best you can, then start your queries small (an hour of data), then a couple of hours, then a day. If you're thinking about running a query for a week or a month (or longer), then raise your hand in their channel and they can help you out.

The following URL shows that Chrome 60, Firefox 55, and IE 11 make up the majority of views to the generated PDFs: https://pivot.wikimedia.org/#webrequest/totals/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkdmeANroVazsApU6jFmw55uOfETKUlxpq3adLvAUNHj8IhUt3OLZVUA/BkxACVicgBzcSUAEwBXBmI9XgAFAEYAESpmMGorAFoAZgAGeUq0ILT8KPijekczF0wrAhJDeybTENd8dyVhOXxk0gksVMJfGr7Z82sOMFsNbpMnBf7gABtSApTtqgA3Ym3Er2AoYixSKEcwCSPMqEnYl/jBRQIMEUmmXnIGAYIlOCkqAF1FJpwchyIlttslIQ9qtNMBkWANDC0YlmNQGBJiLFyGBgDC4QigA==

Let's test the hypothesis that users are using older browsers with little or no PDF support to download PDFs.

Based on the above, the hypothesis is invalid.

and once you've explored Pivot's dataset as best you can, then start your queries small (an hour of data), then a couple of hours, then a day

Is a 1/128 random sample not enough to get these answers?

FWIW I've spent over 3 hours on this task now and I feel like the returns are diminishing…

Is a 1/128 random sample not enough to get these answers?

It should be. The advice from the AE folk was general, not specific to this task. If we only have to use Pivot, then great!

Okay I've summarised this here: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Userbase

The main learning from this is that most of the users of this service are on Windows-based machines.

This comment was removed by Jdlrobson.

Okay I've summarised this here: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Userbase

The main learning from this is that most of the users of this service are on Windows-based machines.

👍

Confirmed below:

+----------------+-----------+-----------+-----------------+----------------+----------------+-------+
|   os_family    | os_major  | os_minor  | browser_family  | browser_major  | browser_minor  |   n   |
+----------------+-----------+-----------+-----------------+----------------+----------------+-------+
| Windows 7      | -         | -         | Chrome          | 60             | NULL           | 2931  |
| Windows 7      | -         | -         | Firefox         | 55             | NULL           | 2773  |
| Windows 7      | -         | -         | IE              | 11             | NULL           | 2229  |
| Windows 10     | -         | -         | Chrome          | 60             | NULL           | 2145  |
| Windows 10     | -         | -         | Firefox         | 55             | NULL           | 1596  |
| Windows 7      | -         | -         | Chrome          | 61             | NULL           | 1470  |
| Windows 10     | -         | -         | Chrome          | 59             | NULL           | 1318  |
| Windows 10     | -         | -         | Chrome          | 61             | NULL           | 1239  |
| Windows XP     | -         | -         | Chrome          | 49             | NULL           | 1213  |
| Windows 8.1    | -         | -         | Firefox         | 55             | NULL           | 681   |
| Windows 8      | -         | -         | Chrome          | 50             | NULL           | 648   |
| Windows 8.1    | -         | -         | Chrome          | 60             | NULL           | 628   |
| Windows 10     | -         | -         | Edge            | 15             | NULL           | 572   |
| Windows 10     | -         | -         | IE              | 11             | NULL           | 547   |
| Windows XP     | -         | -         | Firefox         | 52             | NULL           | 486   |
| Windows 8.1    | -         | -         | IE              | 11             | NULL           | 449   |
| Other          | -         | -         | UC Browser      | 9              | NULL           | 335   |
| Windows 7      | -         | -         | Chrome          | 58             | NULL           | 323   |
| Windows 7      | -         | -         | Opera           | 47             | NULL           | 299   |
| Windows 8.1    | -         | -         | Chrome          | 61             | NULL           | 296   |
| Windows 7      | -         | -         | Firefox         | 42             | NULL           | 290   |
| Mac OS X       | 10        | 12        | Safari          | 11             | NULL           | 272   |
| Windows Vista  | -         | -         | Firefox         | 52             | NULL           | 254   |
| Mac OS X       | 10        | 12        | Chrome          | 60             | NULL           | 248   |
| Windows 7      | -         | -         | Firefox         | 52             | NULL           | 230   |
| Ubuntu         | -         | -         | Firefox         | 55             | NULL           | 230   |
| Windows XP     | -         | -         | Iron            | 45             | NULL           | 206   |
| Windows 7      | -         | -         | Firefox         | 3              | NULL           | 185   |
| Windows 10     | -         | -         | Opera           | 47             | NULL           | 176   |
| Android        | 5         | 0         | Firefox Mobile  | 55             | NULL           | 173   |
| Windows 10     | -         | -         | Edge            | 14             | NULL           | 170   |
| Android        | 5         | 1         | Chrome Mobile   | 49             | NULL           | 140   |
| Windows Vista  | -         | -         | Chrome          | 49             | NULL           | 137   |
| Windows 7      | -         | -         | IE              | 10             | NULL           | 134   |
| Windows 7      | -         | -         | IE              | 9              | NULL           | 133   |
| Windows 7      | -         | -         | Firefox         | 51             | NULL           | 128   |
| Mac OS X       | 10        | 12        | Firefox         | 55             | NULL           | 128   |
| Linux          | -         | -         | Firefox         | 52             | NULL           | 120   |
| Windows 7      | -         | -         | Firefox         | 45             | NULL           | 118   |
| Windows 7      | -         | -         | Firefox         | 54             | NULL           | 113   |
| Mac OS X       | 10        | 12        | Safari          | 10             | NULL           | 109   |
| Windows XP     | -         | -         | IE              | 6              | NULL           | 108   |
| Windows 7      | -         | -         | Sogou Explorer  | 1              | NULL           | 108   |
| Windows 7      | -         | -         | Chrome          | 56             | NULL           | 105   |
| Windows 7      | -         | -         | Chrome          | 55             | NULL           | 103   |
| Windows 7      | -         | -         | Chrome          | 43             | NULL           | 101   |
| Mac OS X       | 10        | 12        | Firefox         | 53             | NULL           | 98    |
| Android        | 5         | 0         | Chrome Mobile   | 44             | NULL           | 98    |
| Windows 7      | -         | -         | Firefox         | 56             | NULL           | 95    |
| Windows 10     | -         | -         | Chrome          | 50             | NULL           | 84    |
+----------------+-----------+-----------+-----------------+----------------+----------------+-------+

!set outputformat table

select
  user_agent_map["os_family"] as os_family,
  user_agent_map["os_major"] as os_major,
  user_agent_map["os_minor"] as os_minor,
  user_agent_map["browser_family"] as browser_family,
  user_agent_map["browser_major"] as browser_major,
  user_agent_map["browser_minor"] as browser_minor,
  count(*) as n
from

  -- Analytics Engineering provide a detailed description of wmf.webrequest and
  -- wmf_raw.webrequest here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
  wmf.webrequest

where

  -- PROTIP: elukey and Joal from #wikimedia-analytics recommend starting small
  -- and increasing the scope of the query until you're satisfied with your
  -- results: start with an hour, then a couple of hours, then a day, and
  -- finally a week.
  --
  -- This is because year, month, day, and hour are all Hive partitions that
  -- map to hourly imports of Varnish logs in HDFS. By constraining the time
  -- range for your query, you minimise the number of imports that the cluster
  -- will load, parse, and process to produce results.
  --
  -- NOTE: You should ask in #wikimedia-analytics before analysing a month's
  -- worth of data as the size of the initial dataset is ~35 TB (!).
  year = 2017
  and month = 9
  and day = 26
  and hour >= 0
  and hour < 12

  -- The Varnish cluster that handled the request (see https://wikitech.wikimedia.org/wiki/Varnish#Cache_Clusters).
  --
  -- PROTIP: elukey and Joal also recommend selecting a source whenever possible
  -- as it's a Hive partition. Doing so can reduce the size of the initial
  -- dataset by a factor of ~2.
  and webrequest_source = 'text'

  and uri_path like "%rest_v1/page/pdf%"

  -- Exclude known bots and spiders.
  and agent_type = "user"

group by

  -- https://issues.apache.org/jira/browse/HIVE-1683
  user_agent_map["os_family"],
  user_agent_map["os_major"],
  user_agent_map["os_minor"],
  user_agent_map["browser_family"],
  user_agent_map["browser_major"],
  user_agent_map["browser_minor"]
sort by
  n desc
limit 50;

phuedx updated the task description. (Show Details)

Over to you, @ovasileva!

Looks good! Also interesting. I wonder why there are so many Windows users.

Reopening this task per our discussion during standup. I'll try to summarize our discussion shortly.

In summary:

The results presented by @Jdlrobson in T176467#3643492 and on-wiki, and by me in T176467#3643704, invalidate the hypothesis that the users of this feature (Download as PDF) are using platforms that don't support printing to PDF. Given this, it's not clear why we're supporting this feature, as it's not clear who we're supporting it for.

@Jdlrobson suggested that this is an education problem, i.e. that we should invest effort in educating users about how to print to PDF rather than building and maintaining a system that does it for them. For me, this is compelling, given that the Electron-based PDF rendering service routinely falls over under the diversity and rate of articles being rendered (see T176627#3651737), but I don't think we can completely drop the feature either. My feeling is that we should only offer this feature to those who need it, and that a next step might be to correctly identify those users.

We all agreed that this needs more discussion.

@ovasileva: Is the above fair? Are there user stories anywhere that we could look at for the Download as PDF feature?

Okay I've summarised this here: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Userbase

The main learning from this is most of the users of this service are on Windows based machines.

The queries above look great, but is there a reason why we didn't simply calculate the OS percentages directly, to quantify what we mean by "the majority of our traffic"? One could then also compare them to the OS frequencies from our general pageviews.
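For what it's worth, the aggregation being suggested is straightforward. Here's a sketch in Python using a handful of rows from the table above (not the full result set, so the percentages are only illustrative):

```python
from collections import Counter

# Roll per-(OS, browser) counts up into OS-family shares. These rows are
# a small subset of the Beeline output above, for illustration only.
rows = [
    ("Windows 7", "Chrome", 2931),
    ("Windows 7", "Firefox", 2773),
    ("Windows 10", "Chrome", 2145),
    ("Mac OS X", "Safari", 272),
    ("Ubuntu", "Firefox", 230),
]

by_os = Counter()
for os_family, _browser_family, n in rows:
    by_os[os_family] += n

total = sum(by_os.values())
for os_family, n in by_os.most_common():
    print(f"{os_family}: {100 * n / total:.1f}%")
```

The same roll-up could of course be done in Hive by grouping on user_agent_map["os_family"] alone and dividing by the overall count.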