Compare user preference for Top Read vs. Trending Articles in Explore feed
Closed, Resolved, Public

Description

Study report: Comparing most read and trending edits for Top Articles feature

The Explore feed in Android and iOS(?) currently displays the top read articles from yesterday. There is a proposal (T140102) to replace and/or supplement this feed with articles that have seen a recent uptick in edits (trending articles), because the trending list is updated in real time and is less likely to contain stale recommendations.

This study will inform this proposal by attempting to answer the following questions:

  1. overall, do people prefer lists that are based on recent page views, or recent edits?
  2. which list contains more articles around topics that people are familiar with from external news sources?
  3. do people perceive a difference in recency/up-to-dateness between the two sources of recommendations?
  4. are any of the factors above mediated by time of day, country of origin, level of Wikipedia use, or daily news consumption?

Please add more tags/subscribers as desired.

Restricted Application added a subscriber: Aklapper. Apr 25 2017, 7:29 PM
Capt_Swing triaged this task as Normal priority. Apr 27 2017, 10:35 PM

I've got everything set up and plan to run the first survey batch tomorrow at 9am PDT. See the research page for more details.

@cmadeo @JMinor I've been hit with some unexpected work and it's eating up my time. How urgent is it that you know the full results of the study? I believe I can get initial data analysis done next week. Not a sure thing, though. That analysis may or may not be enough to say "switching to Trending is/is not a good idea", but it will at least help inform your decisions.

Capt_Swing updated the task description. May 30 2017, 12:15 PM

I've updated the study page with some initial results. So far, I've been able to demonstrate that people based in India are less likely to be familiar with topics that appear in a 'top read'-based feed than a 'trending'-based feed.

Starting the week of June 20, I will:

  • complete the analysis of the current dataset
  • complete the report write-up
  • schedule a time to discuss these results and implications with the product team, and determine whether more research is desired

Thank you so much for sharing these initial results @Capt_Swing

Great, thanks so much for the update! 👍

Hey! @JMinor only just pointed me at this and there's a potential problem with the A/B test you ran.

  • Firstly, the trending prototype currently shows things that are not trending but have had high edit activity (e.g. things with a score of 0). This means there is a high likelihood that many of those pages are not "known", if that's the metric being tested.
  • Secondly, I didn't know a test was going on, so I've been tinkering with the algorithm, and that may have impacted your results.
  • I think the times make a difference here, so I'd be interested in what times you tested at. I've not quite looked into it, but readers' habits in India may not match up to edit habits. For instance, I've noticed that later in the day in San Francisco, Australian topics tend to trend, but by San Francisco morning there's definitely a distinct European flavour. If it's not clear, you can ask the trending prototype to limit results to the last 8 hours to increase the likelihood of relevance. I've definitely seen results which are highly skewed to India (e.g. Bollywood movies), so I'd expect time of day/day of week to impact any tests.

Please can you include me in any future research? I'd love to help out.

@Jdlrobson thanks for the heads-up. It definitely sounds like we need to coordinate if there are future tests. I was operating under the assumption that the API was providing consistent results.

I tested at different times of day, roughly corresponding to US (Pacific) morning/India evening and India morning/US evening, to correct for the source of error you outlined here, as well as the relative 'freshness' of the pageview-based recommendations. The workflow wasn't fully automated, so testing times weren't exact, but I kept a roughly equal distribution between the two time periods.

As for alignment between reading and editing habits within a region, that's an interesting observation. Would Indian readers be more interested in articles that were trending among Indian editors?

Question: when you say the trending prototype shows things with a "score of 0", do you mean that the 'trendiness' value was (close to) 0? I sorted the results by trendiness and used the top 5 in each batch.

In terms of future research: I'd like to do more research on this. This first test ended up being too small a sample to answer the questions I wanted to answer. When @JMinor and I discussed this study, it sounded like the team was considering swapping out 'top read' in favor of 'trending'. Is that (still) the plan? If so, I think it makes sense to do more user testing before implementation, because the change will have consequences in terms of reader engagement, and IMO we don't have a good idea of what those consequences are at this point. If the plan is to refine the trending algorithm a bit more before deployment, then I'm happy to hold off further testing until there's a stable release candidate.

I definitely want to run more studies like this one. It's a good method for evaluating the potential impact of features based on content recommendations. My goal is to make a kit that we can use to run these kinds of tests across many products, cheaply and easily. I'll follow up via email or IRC to gauge interest and discuss next steps.

BTW, do you have a link to a description of how the trending algorithm works?

Thanks @Capt_Swing

Yes, we'd still like to potentially swap editing trends in for "most read". I've paused our work, as I too would like to take the time to do a second pass at the study.

I'll look for your follow-up email, but I'm also happy to set up a meeting to check in on where we are, both from the app POV and for the API iteration, discuss the points @Jdlrobson raises, and plan out potential next steps.

> @Jdlrobson thanks for the heads-up. It definitely sounds like we need to coordinate if there are future tests. I was operating under the assumption that the API was providing consistent results.

Most notably, there was an issue with reverts from bots not being counted, meaning lots of vandalised articles were showing high up in the results.

> I tested at different times of day, roughly corresponding to US (Pacific) morning/India evening and India morning/US evening, to correct for the source of error you outlined here, as well as the relative 'freshness' of the pageview-based recommendations. The workflow wasn't fully automated, so testing times weren't exact, but I kept a roughly equal distribution between the two time periods.

> As for alignment between reading and editing habits within a region, that's an interesting observation. Would Indian readers be more interested in articles that were trending among Indian editors?

> Question: when you say the trending prototype shows things with a "score of 0", do you mean that the 'trendiness' value was (close to) 0? I sorted the results by trendiness and used the top 5 in each batch.

That's good to know and would have reduced the likelihood of issues. If trendiness is 0 then the article is definitely not trending, but we never defined a baseline score for "notable". A score of at least 20 at the moment seems like a reliable way to measure trendiness (although I'd recommend 30 to play it safe).
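For illustration only, the selection discussed here (sort by trendiness, keep the top 5 per batch, and drop anything below a minimum score) could be sketched roughly as follows. The field names `title` and `trendiness`, the function name `pick_top_trending`, and the sample data are all assumptions made for this sketch; they are not the trending service's actual response format or code.

```python
from typing import Dict, List

# Threshold suggested in the comment above (20, or 30 to play it safe).
# This is an illustrative constant, not a value defined by the service.
MIN_TRENDINESS = 20.0


def pick_top_trending(pages: List[Dict], min_score: float = MIN_TRENDINESS,
                      limit: int = 5) -> List[Dict]:
    """Drop pages below min_score, then return the `limit` highest-scoring ones."""
    eligible = [p for p in pages if p["trendiness"] >= min_score]
    return sorted(eligible, key=lambda p: p["trendiness"], reverse=True)[:limit]


# Made-up batch, with hypothetical titles and scores.
sample_batch = [
    {"title": "Article A", "trendiness": 0.0},
    {"title": "Article B", "trendiness": 45.2},
    {"title": "Article C", "trendiness": 19.9},
    {"title": "Article D", "trendiness": 31.0},
    {"title": "Article E", "trendiness": 20.0},
]

for page in pick_top_trending(sample_batch):
    print(page["title"], page["trendiness"])
```

Passing `min_score=30` applies the more conservative cutoff. A batch with fewer than five eligible pages simply yields a shorter list, which a survey workflow would need to handle.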

> In terms of future research: I'd like to do more research on this. This first test ended up being too small a sample to answer the questions I wanted to answer. When @JMinor and I discussed this study, it sounded like the team was considering swapping out 'top read' in favor of 'trending'. Is that (still) the plan? If so, I think it makes sense to do more user testing before implementation, because the change will have consequences in terms of reader engagement, and IMO we don't have a good idea of what those consequences are at this point. If the plan is to refine the trending algorithm a bit more before deployment, then I'm happy to hold off further testing until there's a stable release candidate.

The algorithm is pretty stable for the time being - I just need to know the week you are doing tests so I know not to touch it.

> I definitely want to run more studies like this one. It's a good method for evaluating the potential impact of features based on content recommendations. My goal is to make a kit that we can use to run these kinds of tests across many products, cheaply and easily. I'll follow up via email or IRC to gauge interest and discuss next steps.

> BTW, do you have a link to a description of how the trending algorithm works?

Not exactly, but the talk page of https://www.mediawiki.org/wiki/Reading/Web/Projects/Edit_based_trending_service is the best place to ask questions that I can fold into the article.

@JMinor @kaldari I'm ready to run another round of the study, but I've been waiting for budget confirmation before I spend any more $. Should have the answer tomorrow hopefully, in which case I can start the study again on Monday. Is the trending API in a stable place right now?

@Capt_Swing no work is being done right now… just to be sure:

@Jdlrobson can you make sure nothing is changed or deployed during the upcoming study?

kaldari removed a subscriber: kaldari. Jul 14 2017, 5:33 PM

Of course, but you'll need to be more specific about what you are testing on (trending.wmflabs.org or the API endpoint?). Also, I'd need to know concretely what dates you would be looking at, so I can ensure no changes go out during that time.

I can of course hold off any changes during that time. I have been hoping to filter out protected pages and would like to do a patch before any test but it depends on the timeline.

Capt_Swing moved this task from Staged to In Progress on the Research board. Jul 17 2017, 9:22 PM

Sounds good @Jdlrobson. Running the first batch tomorrow AM.

ggellerman moved this task from In Progress to Radar on the Design-Research board. Jul 21 2017, 9:39 PM

Study complete!

Capt_Swing closed this task as Resolved. Aug 7 2017, 9:01 PM