Analysis data in MobileWebSectionUsage
Closed, ResolvedPublic

Description

Questions that need answering:

  • In what percentage of sessions in the main namespace does a user open a section?
  • What percentage of users that open a session also have service worker installed?
  • What is the median number of sections a user opens in a given session?
  • What is the median number of sections in article

Duration: 1hr

Jdlrobson updated the task description. (Show Details)
Jdlrobson raised the priority of this task from to High.
Jdlrobson added subscribers: Tbayer, Gilles, bearND and 16 others.
Jdlrobson added a subscriber: JKatzWMF.

@JKatzWMF can you or Tilman do this? I would like this done next sprint.

@Tbayer I think this is doable, let me know if not.
@Jdlrobson, by session do you mean 'pageview'? Not sure how we can manage session as 'site visit' without more infrastructure.

I have a nasty looking query snippet I use for medians, as sql does not handle well on its own:

Medians via SQL--in the code below, I am looking for the median 'event duration', where event_duration is a field, for section opens per page you will obviously need some additional operations:

SET @prev =0;
SET @rownum = 0;

SELECT sq.dates, AVG(sq.event_duration) AS median_val, sq.row_number FROM (

SELECT t1.row_number, t1.event_duration, t1.dates FROM(


SELECT IF(@prev < d.dates, @rownum:=1, @rownum:= @rownum +1) AS row_number, d.event_duration, @prev := d.dates, d.dates AS dates
FROM (
SELECT *, LEFT(TIMESTAMP, 8) AS dates
FROM Popups_11625443
ORDER BY dates
) d
ORDER BY dates, event_duration)
AS t1

INNER JOIN
(
SELECT count(*) as total_rows, left(d.TIMESTAMP, 8) AS dates 
  FROM Popups_11625443 d
  
  GROUP BY dates
) AS t2

ON t1.dates = t2.dates

WHERE 1=1 
## this is currently set to median, but divide by 4 if you want 1st quartile and divide by 4/3 if you want 75th%...etc.  (divide by 2 for median...)
AND t1.row_number >= t2.total_rows/2
AND t1.row_number <= t2.total_rows/2+1
ORDER BY t1.dates
) sq

GROUP BY sq.dates
ORDER BY sq.dates
;

See sessionId on https://meta.wikimedia.org/wiki/Schema:MobileWebSectionUsage

We use a random token to capture the device/user to allow us to tie entered events with open-section / close-section

Technically it should not expire so if you see duplicate session tokens it means it's the same user.

Happy to look into this - I think writing the queries themselves won't be very much work (thanks for the snippet, @JKatzWMF), but we should make sure we are interpreting the data correctly.

@Jdlrobson , a few quick questions as I'm starting to look at this:

  1. Just to double-check that I'm assuming correctly: Does event_isTablet = 0 exclude all those readers who are shown the sections uncollapsed on initial page load already? (E.g. if I view en.m.wikipedia.org on my desktop computer in a large enough browser window, does that count as a tablet view?)
  2. What user action does "entering the schema" correspond to, exactly?
  3. At what point is this session ID (token) generated, and how is it stored?

Happy to look into this - I think writing the queries themselves won't be very much work (thanks for the snippet, @JKatzWMF), but we should make sure we are interpreting the data correctly.

@Jdlrobson , a few quick questions as I'm starting to look at this:

  1. Just to double-check that I'm assuming correctly: Does event_isTablet = 0 exclude all those readers who are shown the sections uncollapsed on initial page load already? (E.g. if I view en.m.wikipedia.org on my desktop computer in a large enough browser window, does that count as a tablet view?)

Means is browser window > 768

  1. What user action does "entering the schema" correspond to, exactly?

The user has been bucketed and toggling is enabled and available

  1. At what point is this session ID (token) generated, and how is it stored?

Generated and cached in local storage. In future visits same value if stored successfully is reused

dr0ptp4kt moved this task from Backlog to Feature on the Reading-Admin board.Nov 13 2015, 12:51 AM
dr0ptp4kt moved this task from Feature to Doing on the Reading-Admin board.

The percentage of mobile web users whose device could support Service Worker, among those who open a section:
30.5% (as of November 25)

As a sanity-check, the percentage among all users bucketed in this schema is 29.4%, and the percentage among all events is 29.8%.

--------------
SELECT COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151125%' AND event_eventName = 'open-section'
--------------

COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId)
511290


--------------
SELECT COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151125%' AND event_eventName = 'open-section' AND event_hasServiceWorkerSupport = 1
--------------

COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId)
156018

156 018 / 511 290 = 0.3051...

--------------
SELECT COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151125%' AND event_hasServiceWorkerSupport = 1
--------------

COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId)
453906


--------------
SELECT COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151125%'
--------------

COUNT(DISTINCT event_hasServiceWorkerSupport, event_sessionId)
1543384

453 906 / 1 543 384 = 0.294...

--------------
SELECT SUM(IF(event_hasServiceWorkerSupport =1,1,0))/SUM(1) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151125%'
--------------

SUM(IF(event_hasServiceWorkerSupport =1,1,0))/SUM(1)
0.2984

...

  1. What user action does "entering the schema" correspond to, exactly?

The user has been bucketed and toggling is enabled and available

  1. At what point is this session ID (token) generated, and how is it stored?

Generated and cached in local storage. In future visits same value if stored successfully is reused

@Jdlrobson (welcome back!), after looking at this more closely, I think we need to discuss further how useful this schema is for answering question #1 (percentage of sessions where the user opens a section). From conversations elsewhere, I now understand that the ID does not expire. That means that the "entered" event basically only ever occurs once per user, right? And we don't have a baseline of active sessions on a given day, say - we register only those where "open" and "close" events occur.
We could restrict queries to those sessions that also have an "entered" event on the same day. But that would distort the result a lot toward new readers who have never used the site before (in the time since the instrumentation was created).
Let me know if I'm missing something here. In case the "entered" event is actually triggered more often than that, we need to specify when exactly.

Tbayer added a comment.EditedNov 30 2015, 7:38 PM

In the meantime, an answer related to the fourth question (median number of sections per page, weighed by usage on mobile web). Here too we are limited by the instrumentation. Limited to "entered" events, i.e. IIUC the first ever Wikipedia/Wikimedia page that each reader sees (since the schema was installed), the median was 5.5 sections during November 25. Here's the histogram:


(It's actually cut off on the right, e.g. there were still 13 events on pages that had 100 sections and the longest page showing up in this sample had 369 sections.)

Query:

SELECT event_sectionCount AS sectionCount, COUNT(*) FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151125%' AND event_eventName = 'entered' GROUP BY sectionCount ORDER BY sectionCount
Tbayer added a comment.Dec 2 2015, 6:48 AM

...

  1. What user action does "entering the schema" correspond to, exactly?

The user has been bucketed and toggling is enabled and available

  1. At what point is this session ID (token) generated, and how is it stored?

Generated and cached in local storage. In future visits same value if stored successfully is reused

@Jdlrobson (welcome back!), after looking at this more closely, I think we need to discuss further how useful this schema is for answering question #1 (percentage of sessions where the user opens a section). From conversations elsewhere, I now understand that the ID does not expire. That means that the "entered" event basically only ever occurs once per user, right? And we don't have a baseline of active sessions on a given day, say - we register only those where "open" and "close" events occur.
We could restrict queries to those sessions that also have an "entered" event on the same day. But that would distort the result a lot toward new readers who have never used the site before (in the time since the instrumentation was created).
Let me know if I'm missing something here. In case the "entered" event is actually triggered more often than that, we need to specify when exactly.

For the record: @Jdlrobson and I talked this through in person today, and he clarified that the "entered" event is in fact triggered whenever the user opens a new page (provided that the user has been bucketed and that section toggling is enabled and available, i.e. the conditions specified earlier above) - not just when the session ID is first set. I have also added this information to the schema description. That should remove the concern regarding question 1, I'm running more queries now.

Tbayer added a comment.Dec 3 2015, 5:11 AM

Update: Having clarified the definition of the "entered" event above (there are various meanings of "session" floating around here...), I first reproduced @Jdlrobson's query from his Nov 6 email, now with a full day of current data:

39.9% of the non-tablet mobile users who viewed a mainspace page on November 30 opened a section there.

This basically confirms JonrR's earlier result (43%) about the first question.

data source
Number of non-tablet users entering the schema (by viewing a mainspace page) on Nov 30:

--------------
SELECT COUNT( DISTINCT event_sessionId ) FROM log.MobileWebSectionUsage_14321266  WHERE timestamp LIKE '20151130%' AND event_eventName = 'entered' AND event_isTablet = 0 AND event_namespace = 0
--------------

COUNT( DISTINCT event_sessionId )
1297407

Number of non-tablet users opening a section on a mainspace page on Nov 30: 


--------------
SELECT COUNT( DISTINCT event_sessionId ) FROM log.MobileWebSectionUsage_14321266  WHERE timestamp LIKE '20151130%' AND event_eventName = 'open-section' AND event_isTablet = 0 AND event_namespace = 0
--------------

COUNT( DISTINCT event_sessionId )
517233


517 233 / 1 297 407 = 0.3986..

(BTW these two queries took a bit over six hours each, in parallel.)

Tbayer added a comment.Dec 3 2015, 5:28 AM

Now on to the third question:
"What is the median number of sections a user opens in a given session?"
For "session" interpreted as user (i.e. all events with the same session ID), we already know the answer from the last result: Because more than half of all users don't open a session, the median is 0.

But "session" should be interpreted differently for this question, basically as pageview - or more precisely, the set of all events in a given timeframe (e.g. 1 day) with the same session ID (user) and page ID (and on the same wiki, to account for pages on different wikis having identical page IDs, even though that kind of collision is very unlikely here). The median is very likely 0 too, but I'm running a query to confirm and to get the full histogram:

SELECT sections_opened, COUNT(*) as number_of_pageviewsessions FROM (SELECT event_sessionId, event_PageId, wiki, (COUNT(*)-1) AS sections_opened FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151201%' AND (event_eventName = 'entered' OR event_eventName = 'open-section') AND event_isTablet = 0 AND event_namespace = 0 AND event_sessionId <> '' AND event_pageId <> 0 GROUP BY event_sessionId, event_PageId, wiki) AS pageviewsessions GROUP BY sections_opened ORDER BY sections_opened

I hope this query will finish by tomorrow, will report the result here.

It would be interesting to know for users who do open a section how many sections they tend to load (do they open all of them, a few of them - just one and get bored?) - this would help inform whether to serve all content on scroll or to load content of sections when opened.

@Tbayer @Jdlrobson
Had a brief conversation with @madhuvishy and @kevinator about bots and they seemed to think that around 30% of our traffic is bots, which would probably register as sessions/pvs (for the event logging purposes) but not for opens. So it might make sense to take the section opens/pages that you find as the lowest possible figure that could be significantly higher.

@Tbayer @Jdlrobson
Had a brief conversation with @madhuvishy and @kevinator about bots and they seemed to think that around 30% of our traffic is bots, which would probably register as sessions/pvs (for the event logging purposes) but not for opens. So it might make sense to take the section opens/pages that you find as the lowest possible figure that could be significantly higher.

Good point, see also T117631.
My understanding has been that web crawlers prefer the desktop site (as an impression from e.g. the HTTPS-only rollout in June where the drop in "human" pageviews - which we conjecture to have consisted mostly of undetected bots -happened mostly on desktop). But we can also try to check user agents. @madhuvishy and @kevinator, do you know of an easy way to apply the bot detection logic that we use for pageviews on Hive to EventLogging, too?

@Tbayer I think applying a regexp to the UA column in Eventlogging is the simplest way to start. You can probably borrow from the implementation of the PageView definition. There's also another regexp developed by research that @madhuvishy might be able to dig up.

Here is the result of the above mentioned query regarding the third question (about the number of sections a user opens in a given session):


As anticipated, the median is 0.

The same on a log scale:


BTW: Without having done an actual fit, in this range (0-12 opens and a bit beyond) it looks pretty linear in the log view, meaning that the frequency drops exponentially as number of section opens increases.

(For the record, this query actually only took 100 minutes, and JonR and I already discussed the result a bit yesterday before drafting T120292: MobileWebSectionUsage schema changes to measure impact of section collapsing .)

It would be interesting to know for users who do open a section how many sections they tend to load (do they open all of them, a few of them - just one and get bored?) - this would help inform whether to serve all content on scroll or to load content of sections when opened.

This is already partially answered by the histogram above - although it doesn't account for the overall number of sections in each page, which plays a role too,

Query and raw result below:

--------------
SELECT sections_opened, COUNT(*) as number_of_pageviewsessions FROM (SELECT event_sessionId, event_PageId, wiki, (COUNT(*)-1) AS sections_opened FROM log.MobileWebSectionUsage_14321266 WHERE timestamp LIKE '20151201%' AND (event_eventName = 'entered' OR event_eventName = 'open-section') AND event_isTablet = 0 AND event_namespace = 0 AND event_sessionId <> '' AND event_pageId <> 0 GROUP BY event_sessionId, event_PageId, wiki) AS pageviewsessions GROUP BY sections_opened ORDER BY sections_opened
--------------

sections_opened number_of_pageviewsessions
0       780431
1       270149
2       118197
3       57584
4       28935
5       14940
6       7922
7       4230
8       2234
9       1183
10      725
11      420
12      247
13      158
14      108
15      57
16      49
17      34
18      27
19      34
20      30
21      11
22      16
23      17
24      15
25      11
26      10
27      6
28      6
29      1
30      3
31      2

@Tbayer is there a wiki link I can point to and close out this task?

Tbayer added a comment.EditedJan 22 2016, 7:16 PM

@Jdlrobson See my email from December 29: I summarized some things on this Meta-wiki page. (I've also uploaded another of the above charts, which had been used in the presentation at the dev summit, to Commons here.) Let me know whether this works for your purposes, or what else is necessary.

@Tbayer That section of the wiki page is fine, but I feel that the vizs on https://phabricator.wikimedia.org/T118041#1854054 would be much more interesting than those two concrete articles, along with a small comment about them (median is 0, etc).

Would you mind updating it?

Very interesting!

@Tbayer That section of the wiki page is fine, but I feel that the vizs on https://phabricator.wikimedia.org/T118041#1854054 would be much more interesting than those two concrete articles, along with a small comment about them (median is 0, etc).

Would you mind updating it?

Very interesting!

OK, so the fact that the median is 0 was already implied mathematically by the result that only 39.9% of users opened a section, but you're right, it's worth calling that out explicitly and adding the chart of sections opened per pageview as illustration. Just did so.

Tbayer closed this task as Resolved.Mar 1 2016, 11:48 PM
Tbayer claimed this task.
Tbayer added a project: Reading-analysis.

Per discussion with JonR, I have closed this now, because all four questions in the task description have been answered and the examination of the subsequent A/B experiment is taking place in T128536.

@Tbayer is https://meta.wikimedia.org/wiki/Research:Which_parts_of_an_article_do_readers_read still the best page to capture the outcome of this research or are there other wikis? Let's make sure we link to them from https://www.mediawiki.org/wiki/Reading/Web

Tbayer added a comment.Mar 3 2016, 9:07 PM

@Tbayer is https://meta.wikimedia.org/wiki/Research:Which_parts_of_an_article_do_readers_read still the best page to capture the outcome of this research or are there other wikis?

For the present ticket, I think that page still works OK. In general, there has been a call recently to publish things more often in the framework of the research index on Meta and I think this might make sense for many of our data analysis too, as it was done for various Mobile projects in the past. I may create such a Meta page for recording the outcome of T128536 (the A/B experiment).

Let's make sure we link to them from https://www.mediawiki.org/wiki/Reading/Web

Perhaps create a subpage like https://www.mediawiki.org/wiki/Reading/Web/Analysis_and_research ? This should also collect the various results we have been publishing here on Phabricator.

Danny_B renamed this task from Spike: Analysis data in MobileWebSectionUsage to Analysis data in MobileWebSectionUsage.Jul 11 2016, 6:59 PM
Danny_B edited projects, added Spike; removed Tracking.