
[Spike 1hr] Investigate alternate ways of defining reading depth
Closed, Resolved (Public)

Description

Background
With the advent of hovercards, our classic definition of session depth might be misleading as a success metric: hovercards might cause users to read further into an article rather than wider across a range of articles. Thus, we'd like to have a definition of "reading depth". @Tbayer suggested the following options as additions to the Popups schema that might allow us to measure this.

Acceptance Criteria
Investigate the possibility of adding the following to the Popups schema and select the one with the easier technical implementation:

  • Track browser scroll position
  • Enumerating each link - a more relative measure, but might be simpler

Event Timeline

ovasileva renamed this task from Investigate alternative ways of defining reading depth to Investigate alternate ways of defining reading depth. Sep 12 2016, 3:50 PM

When would the events be sent? On every scroll? Or on interactions, adding the overall scroll position (in %)?

ovasileva renamed this task from Investigate alternate ways of defining reading depth to [Spike 1hr] Investigate alternate ways of defining reading depth. Sep 12 2016, 4:48 PM
ovasileva updated the task description.

When would the events be sent? On every scroll? Or on interactions, adding the overall scroll position (in %)?

I think both might be options, and we could decide based on how difficult they would be to implement and how likely each is to overload EventLogging at reasonable sampling rates.

The latter is what the Android app does in Schema:MobileWikiAppPageScroll: send the lowest position reached once the user leaves the page, alongside the total scroll flux (up+down, but without jumps) during that page view.

Regarding the first option, perhaps a cheap way of implementing it - restricted to the case of the Popups schema only - might be to simply send the position alongside each link interaction event. In the case of the MobileWebSectionUsage schema, we already had something similar in the form of the section number (together with the total number of sections on the page) that was sent with each "scrolled into view" event.
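A minimal sketch of that piggybacking idea, for illustration only (the helper and field name are hypothetical, not part of Schema:Popups):

// Hypothetical helper: attach the current scroll position to the data of
// an existing link interaction event instead of creating a new event class.
function withScrollPosition( eventData ) {
    eventData.scrollY = window.pageYOffset; // pixels scrolled from the top
    return eventData;
}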

@Jdlrobson What was the approach we used when we did the research about how far readers read? (for the sections collapsed vs. not collapsed)

This is about general reading behavior, not about sending the position on the page with the hovercards interactions.

@Jhernandez if I recall correctly we set up a scroll handler that sent events whenever a section heading was scrolled into view / opened, etc.

Scroll handlers should usually be avoided, but a similar approach could be used here to give a sense of how far down a user read. We could log the timestamp, page height, and current scroll position.
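A rough sketch of what such a handler might look like, with simple throttling to keep the cost down (logScrollSample and the 250ms interval are illustrative assumptions):

// Hypothetical sketch: a throttled scroll handler that records the
// timestamp, page height, and current scroll position.
var scrollTimer = null;
$( window ).on( 'scroll', function () {
    if ( scrollTimer !== null ) {
        return; // throttle: sample at most once per 250ms
    }
    scrollTimer = setTimeout( function () {
        scrollTimer = null;
        // logScrollSample stands in for whatever logging call we'd use.
        logScrollSample( {
            timestamp: Date.now(),
            pageHeight: $( document ).height(),
            scrollY: window.pageYOffset
        } );
    }, 250 );
} );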

@Jdlrobson What was the approach we used when we did the research about how far readers read? (for the sections collapsed vs. not collapsed)

That's the MobileWebSectionUsage schema I already mentioned above. Among the data we extracted from it were the number of sections opened per pageview and the time spent on a page (pageview session duration, approximated as the time between the opening of the page and the last action on the page recorded in that schema, i.e. a section being opened, collapsed, or scrolled into view).

This is about general reading behavior, not about sending the position on the page with the hovercards interactions.

Sure, the idea behind my second suggestion (enumerating each link, and sending that number along with each existing event from Schema:Popups) was to keep things simple and not to create an entirely new schema or event class that we would then need to debug etc. (Also, I think @ovasileva wants to use this new depth data specifically for the Hovercards analysis, so having it in a separate schema with different sampling would create headaches anyway.)

We should also keep in mind that we are in the process of investigating retention metrics more generally, so in the longer term we should consider aligning these efforts (@JKatzWMF may want to weigh in too).

@Tbayer, @ovasileva can anyone share a link to the current definition of "session depth"? Thanks.

Edit:

To quote Tilman from IRC:

there's actually no canonical definition (yet), but in https://phabricator.wikimedia.org/T139319 we used the term both for the number of link interaction events during one pageview session, and for the number of pageview sessions (more precisely, distinct pageTokens) per browser session

Within the task description, session depth was meant as the number of pages viewed within a browser session (or the number of distinct pageTokens recorded for that session ID), but I believe we've also been using the other definition (counting link interaction events). This task is more interested in "page depth", with the goal of being able to analyze page depth within the A/B test, something like:
"In group A, users scroll to X position on average, in group B they scroll to Y"

To make things easier let's number the options:
#1: Track browser scroll position
#2: Enumerating each link - a more relative measure, but might be simpler

Advantages of #1:

  • Easy to implement

Disadvantages of #1:

  • Since articles have different heights, this number by itself doesn't tell us much. For example, the Obama article is long and Héauville is not. On the Obama article the user may have scrolled a little and left without reading the whole page, while on the Héauville article the user may have read the whole thing. Comparing these two scenarios without taking article length into account, we might infer that the user scrolled more on the Obama article, when in fact they didn't scroll more relative to the article's length. To remedy this, we could calculate the scroll position relative to the height of the viewport, which would give us a better picture (a sketch follows this list).
  • The same article viewed in a different orientation or browser size returns different values for the scroll position, which may be an issue as the comparison won't be apples to apples.
  • Reading the scroll position causes the browser to recalculate styles and layout, which is a performance concern. This is not always problematic; it is costly only when the document style or layout has changed.
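A minimal sketch of the normalization remedy from the first disadvantage above (illustrative only; nothing here exists in the Popups code):

// Hypothetical helper: express the scroll position in viewport heights
// ("screens scrolled"), so long and short articles become comparable.
function getScrollDepthInScreens() {
    return window.pageYOffset / window.innerHeight;
}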

Advantages of #2:

  • Easy to implement, maybe a tiny bit harder than #1.

Disadvantages of #2:

  • None that I can think of.

Both approaches suffer from the same problem: articles tend to have more links at the top than at the bottom. This is problematic because the user may have read to the end of the article without hovering over a link, either because there are fewer links there (compared to the top) or because, having read the article, they are already familiar with the linked concepts.

Given the above comparison, I'd say we implement #2.

@bmansurov - would it be possible to measure the ratio of scroll position to article length? Something like this: http://scrolldepth.parsnip.io/

Also, removing this as a blocker - whatever approach we choose, we can implement it after the A/B test starts.

Can "enumerating each link" be explained in a bit more detail? Does this mean counting the total number of links hovered on a page? Adding markup to give each link in an article a consecutive number and logging that number during hovers to try to determine how far they've read? Or performing the calculation on the fly (using an 'a' selector query)? What about content that is sometimes in a different order on mobile vs. desktop (e.g. infobox)? Would we be logging new events for these links (e.g. when they scroll into view), or appending to existing ones?

Option 1 has been explained in pretty good detail regarding implementation strategies; we should give the same respect to option 2. My gut feeling is that #2 will bring with it more of a performance hit than #1, especially on long articles, but I'm not certain.

I understood "enumerating each link" as the relative position of each available link compared to the total number of available links.

I understood "enumerating each link" as the relative position of each available link compared to the total number of available links.

Total number of links on the page (which varies by skin and can be as many as 4000+)? Only links within '.mw-body'? Only Hovercards-capable links? It's also still unclear whether the link's relative number will only be logged during Hovercards events or in some more general way, given Joaquin's comment about how this is about general reading behavior and not specifically Hovercards.

Not all links are eligible for Popups; there are some conditions in the code. I meant only those links.
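For illustration, enumerating those links might look roughly like this (isPopupsEligible stands in for the real conditions in the Popups code, and the '.mw-body' scope is an assumption):

// Hypothetical sketch: give each Popups-eligible link a relative position
// (0..1) that could be attached to existing link interaction events.
var $links = $( '.mw-body a' ).filter( function () {
    return isPopupsEligible( this ); // placeholder for the real checks
} );
$links.each( function ( index ) {
    $( this ).data( 'linkPosition', index / $links.length );
} );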

@ovasileva yes it's possible.

I think this route might get us where we want without a lot of the disadvantages of #1. Would it be significantly more difficult than #1?

@ovasileva yes it's possible.

I think this route might get us where we want without a lot of the disadvantages of #1. Would it be significantly more difficult than #1?

No, just about the same difficulty.

Can "enumerating each link" be explained in a bit more detail? Does this mean counting the total number of links hovered on a page? Adding markup to give each link in an article a consecutive number and logging that number during hovers to try to determine how far they've read?

Yes, that [edit: the latter, i.e. adding a consecutive number] was the idea.

I should have mentioned this overview page as general context: https://meta.wikimedia.org/wiki/Research:Which_parts_of_an_article_do_readers_read

Or performing the calculation on the fly (using an 'a' selector query)? What about content that is sometimes in a different order on mobile vs. desktop (e.g. infobox)? Would we be logging new events for these links (e.g. when they scroll into view), or appending to existing ones?

As mentioned here, the initial idea was to piggyback them on the existing link interaction events in Schema:Popups, also to minimize extra analysis and instrumentation debugging headaches when applying this new metric to the Hovercards A/B tests. I would be glad to see a more universally applicable metric to come out of this, but in that case we should also make sure to connect with the existing discussions about establishing general retention metrics for Reading.

There will surely be some ambiguities in how to define the ordering and which links to include; and yes, link position is just a proxy for position on the page. But if we restrict it to Hovercards-eligible links, that should correspond pretty closely to the enwiki clickstream dataset that was the basis for the three papers cited here:
https://meta.wikimedia.org/wiki/Research:Which_parts_of_an_article_do_readers_read#Links_clicked - these researchers were able to draw quite a few conclusions despite these limitations, which makes me optimistic that we too would get usable data out of this approach.

@bmansurov, @Tbayer - given the above discussion, I'd suggest solution #1 paired with the ratio to page length. I think that would also scale as a metric to other projects. Thoughts?

I thought the conversation above pointed at solution #2.

I'm struggling to follow the conversation so maybe it would help to summarize the conclusion.

Another idea which we may not have considered is to use setInterval or requestAnimationFrame.
This will only work while the browser window is open.
It could log the bottom scroll position of the browser window every 5/10 seconds. If combined with the height of the document, you could get a good sense of how far down the article they are (what %age).
An event could be logged at the start to give a sense of how many people have left before 10s.

The following code would compute what %age of the article has been read:

100 * ( window.scrollY + $( window ).height() ) / $( document ).height()
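For illustration, wiring that up with setInterval might look something like the following (the 10-second interval and the idea of keeping only the furthest point reached are assumptions):

// Hypothetical sketch: every 10 seconds, record the furthest point the
// bottom of the viewport has reached, as a percentage of the document.
var maxPercentRead = 0;
setInterval( function () {
    var percent = 100 * ( window.scrollY + $( window ).height() ) /
        $( document ).height();
    maxPercentRead = Math.max( maxPercentRead, percent );
}, 10000 );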

Here is a third option (independent of, but partly similar to, @Jdlrobson's idea) which @ovasileva and I discussed earlier today:

Measure the time spent on the page, as a different sort of proxy for engagement instead of geometrical reading depth. The Discovery team uses something similar already; see T119352#2618073 and subsequent comments. Perhaps we could reuse some of @EBernhardson's code and basically add the "checkin" events from Schema:TestSearchSatisfaction2 to Schema:Popups? Concretely, the "action" field could take on a new value "checkin", and we could either add a new field storing the check-in interval (10s, 20s, ..., 50s, 60s, 90s, 120s, ...) that's only used for checkin events, or repurpose the existing totalInteractionTime field. One advantage would be that this approach has already been tested in production, including uncovering unforeseen complexities such as the "tab not visible" situation brought up at T119352#2619765.
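A hedged sketch of what such check-ins could look like in code (the "checkin" action and field are the proposal above, not existing schema fields):

// Hypothetical sketch: fire a "checkin" event at increasing intervals,
// mirroring the TestSearchSatisfaction2 instrumentation.
var checkins = [ 10, 20, 30, 40, 50, 60, 90, 120 ]; // seconds
checkins.forEach( function ( seconds ) {
    setTimeout( function () {
        // mw.eventLog.logEvent is EventLogging's standard entry point;
        // the fields sent here are proposed, not yet in Schema:Popups.
        mw.eventLog.logEvent( 'Popups', {
            action: 'checkin',
            checkin: seconds
        } );
    }, seconds * 1000 );
} );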

@Tbayer this seems cheaper and could potentially lead to a total-time-spent metric, which is very valuable (and which we track on the apps).

Not sure how to help out on this. To be clear, what is keeping this from being signed off - a decision from @ovasileva?

I think we should go with option 3 - should we investigate some ways of doing this and track it with this card?

If we go with option #3, we should consider the browser width as well - maybe a ratio of the width to the height. The reason is that the same article may fit in a tablet-sized window but not in a mobile-phone-sized window. In the first case, if we don't take the window width into account, we'd assume the user has read the whole article, when in fact they may have read only the first sentence. Once we take the width into account, we can compare small screens only to small screens, and large screens to large screens.
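For illustration, attaching the viewport dimensions to each event could look like this (the field names are hypothetical):

// Hypothetical helper: report the viewport dimensions with each event so
// reading-depth figures are compared only across similar screen sizes.
function getViewportContext() {
    return {
        windowWidth: window.innerWidth,
        windowHeight: window.innerHeight
    };
}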

@bmansurov - I'm a bit confused. If we're measuring check-in events based on time (vs. scroll position) as @Tbayer suggested in T145388#2661200, scroll position or width would not be necessary. From what I see we can either:

  1. Measure the %ge of the article read (using the ratio of scroll position to browser width) and add check-ins once we reach a certain percentage
  2. Measure time spent - add check-ins based on how long they've been looking at the article

I think 1 makes more sense to me in terms of precision, but if we can reuse some of the code from 2, that would be okay as well.

Also, created T147314: Add reading depth to hovercards schema (defined as time spent on page) to hold the a/c coming out of this spike.

Oh I see where the confusion is coming from. I was (wrongly) referring to Jon's suggestion as the 3rd option.

How are we doing with:

and select the one with the easier technical implementation

IMO it doesn't seem like there's much difference in implementation effort. Thoughts?

How are we doing with:

and select the one with the easier technical implementation

IMO it doesn't seem like there's much difference in implementation effort. Thoughts?

Testing (also via exploratory data analysis) and debugging should be considered part of the implementation effort, especially considering our recent journey with the Popups schema. To me, this seems like a strong argument in favor of adapting the already-tested time-spent instrumentation from the TestSearchSatisfaction2 schema (described in more detail as the "third option" above).

OK, a proposal, as I believe both are very helpful in telling us what we need to know:

  1. We create both cards
  2. Prioritize the time-spent version as high - get to it in sprint 84, hopefully
  3. Prioritize the %ge read as normal - get to it a few sprints down the road (once we've already experimented with adding ^ to the schema)

Thoughts?

I'd personally just go with @Tbayer's suggestion (time-spent) given that it seems somewhat lighter to implement and that it is a metric that has already been used by another vertical.

For % read there are more variables to add (viewport + scroll position) and more performance overhead (binding to scroll events), so I'd take it off the table unless it is exactly what we need.

None of what I've said goes against creating both tasks and prioritizing as you said, though. So that's a fine outcome.


If we go with time-spent, I'd add a new field to Schema:Popups to store that info instead of repurposing another one (which has value on its own).

@Jhernandez - sounds good. For hovercards, I still think it'd be nice to have the % read as well - since we want to test the hypothesis that users read further into articles - but I'll line up that card later. For now, @Jhernandez, @bmansurov - could you help me fill out the a/c in https://phabricator.wikimedia.org/T147314? We would have to choose an appropriate interval for each check-in, but to start with, I guess we can take the ones from the Discovery team, unless you have something more concrete in mind?

@ovasileva That sounds good to me, but whatever you and Tilman feel is correct. I've edited that description.

I'm resolving for now, let's reopen if we need more investigation.