Related Pages recommendations user study design
Closed, Resolved · Public

Description

Ellery has been working on an improved Read More (https://tools.wmflabs.org/readmore/).

The mobile teams want to know whether it is actually better; this will be a combined qualitative/quantitative study.

Research Page:
https://meta.wikimedia.org/wiki/Research:Evaluating_RelatedArticles_recommendations

Algorithm:
https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors

Event Timeline

ggellerman triaged this task as Medium priority.

@pizzzacat to meet with @JKatzWMF Aug 5 to explore

This research is waiting on a build; planning will take place once the build has been reviewed.

@pizzzacat Did you work out what platform you want to do this test on? Are you planning on doing the qualitative portion first?

Just so this isn't lost from the email thread: The potential uses for this service identified by the PMs were:

  1. Using this service to replace the existing "more like" API used to display suggestions in the feed and at the bottom of articles
  2. Adding a new feature to the apps for displaying general recommendations for a particular wiki (results without any search term)

@JMinor @Dbrant @dr0ptp4kt does that sound right to you?

When the S&D team implemented a new param for morelike queries (no boost on popularity), I ran some scripts on articles selected by the community and Jon Katz to compare the results.

https://www.mediawiki.org/wiki/Extension:RelatedArticles/CirrusSearchComparison
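
For anyone who wants to reproduce that kind of comparison, here is a minimal sketch (not the original scripts) of querying `morelike:` under two rescore profiles. It assumes the profiles are selected through the search API's `srqiprofile` parameter and that `classic` and `classic_noboostlinks` are enabled on the target wiki:

```python
# Rough sketch (not the scripts mentioned above): compare CirrusSearch
# "morelike:" results with and without the popularity/link boost by switching
# the rescore profile via the search API's `srqiprofile` parameter.
# Assumption: the `classic` and `classic_noboostlinks` profiles are enabled
# on the target wiki.
import requests

API = "https://en.wikipedia.org/w/api.php"

def morelike(title, profile, limit=5):
    """Return the top `limit` morelike results for `title` under a rescore profile."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"morelike:{title}",
        "srqiprofile": profile,
        "srlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    r = requests.get(API, params=params, timeout=10)
    r.raise_for_status()
    return [hit["title"] for hit in r.json()["query"]["search"]]

seed = "Will Self"
for profile in ("classic", "classic_noboostlinks"):
    print(profile, "->", morelike(seed, profile))
```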

It'd also be very interesting to compare them with the results of this service.

Ping me if you're interested; I'm happy to update the scripts and re-run them against this service too (I'd need some docs on how to consume it, though).

It may be a good first step to manually assess whether the results are better.

Another thing to consider is how we measure improvement. There was some legitimate questioning on the Related Pages talk page about whether the highest click-through percentage is really the measure we want. The argument put forward by the community was that we should favor less popular pages, pointing the user toward more obscure but interesting topics. It turned out that by removing popularity from the criteria, we get much more relevant results (as @Jhernandez links to above). I am not sure, however, that they lead to greater click-throughs.

For example, right now the author Will Self has JK Rowling and Bob Dylan as his related pages. When we remove popularity, they become a primary character from his books and two of his most famous works (see the comparison here: https://www.mediawiki.org/wiki/Extension:RelatedArticles/CirrusSearchComparison#Will_Self). I think Bob Dylan might have a higher click-through rate, but the novels are clearly more interesting and "related".

I don't know how useful this will be, but Discovery has a tool that humans can use to rate search results:

https://discernatron.wmflabs.org/login

Code is here:

https://github.com/wikimedia/wikimedia-discovery-discernatron

> Just so this isn't lost from the email thread: The potential uses for this service identified by the PMs were:
>
>   1. Using this service to replace the existing "more like" API used to display suggestions in the feed and at the bottom of articles
>   2. Adding a new feature to the apps for displaying general recommendations for a particular wiki (results without any search term)
>
> @JMinor @Dbrant @dr0ptp4kt does that sound right to you?

Yes, @Fjalapeno, I think you captured it. Productizing #2 for the web at some point (e.g., template/magic word injectable component, standalone feature exposed from an affordance, etc.) also seems sensible, although one step at a time.

Copying and pasting something I wrote from an email, for future selves:

Those search-less recommendations in his prototype are coming from a sorted random shuffle of pageview API popular pages, and I've no doubt we could easily replicate the logic in MCS.
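
A rough illustration of that logic (assumptions: the public Pageviews "top" endpoint and an ad-hoc filter for non-article pages; the prototype's actual code may differ):

```python
# Sketch of the "search-less" recommendations described above: take a day's
# most-viewed articles from the Pageviews API and return a shuffled sample.
# The endpoint and the filtering of non-article pages are assumptions, not
# the prototype's actual logic.
import random
import requests

TOP_URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
           "top/{project}/all-access/{year}/{month:02d}/{day:02d}")

def popular_shuffle(project="en.wikipedia", year=2016, month=10, day=1, n=10):
    """Fetch the daily top-viewed articles and return a random sample of n titles."""
    url = TOP_URL.format(project=project, year=year, month=month, day=day)
    r = requests.get(url, headers={"User-Agent": "related-pages-sketch/0.1"}, timeout=10)
    r.raise_for_status()
    articles = r.json()["items"][0]["articles"]
    # Drop a couple of non-article pages that always top the list.
    titles = [a["article"] for a in articles
              if a["article"] not in ("Main_Page", "Special:Search")]
    return random.sample(titles, min(n, len(titles)))

print(popular_shuffle())
```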

> Another thing to consider is how we measure improvement. There was some legitimate questioning on the Related Pages talk page about whether the highest click-through percentage is really the measure we want. The argument put forward by the community was that we should favor less popular pages, pointing the user toward more obscure but interesting topics.

@JKatzWMF totally agree here. As far as I know, the only metric we have ever used to measure the effectiveness of "related pages" is click-through. I can definitely see that the community has a valid point in removing the popularity boost, but I'm concerned that, after this point was raised and before we committed to making the change, we didn't define a new metric more aligned with the community's goals.

> It turned out that by removing popularity from the criteria, we get much more relevant results (as @Jhernandez links to above).

I think better relevancy is a good thing, but this seems very subjective and doesn't provide a clear path for us to iterate and improve the service. Do we have any metrics to support this?

I ask this not to refute the claim, but to make sure we can articulate to User Research what we as Reading think is important in a "related pages" service and what we want them to measure to help us improve.

Do we have any thoughts on a new metric? Total reading time? Time spent on a clicked page? Frequency of clicks instead of total number of clicks? Whatever we decide to optimize for, we should decide it before User Research starts working on this study.
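
To make the discussion concrete, here is an illustration of how a few of those candidate metrics could be computed; the event schema (session_id, clicked, dwell_seconds) is invented for the sketch and does not correspond to any existing instrumentation table:

```python
# Illustration only: candidate metrics over a made-up click/read event log.
import pandas as pd

events = pd.DataFrame({
    "session_id":    ["a", "a", "b", "c", "c", "c"],
    "clicked":       [True, False, False, True, True, False],  # clicked a related page?
    "dwell_seconds": [120, 0, 0, 45, 300, 0],                  # time spent on the clicked page
})

# Click-through rate: share of impressions that led to a click.
ctr = events["clicked"].mean()

# Time spent on pages reached via the feature (clicked impressions only).
mean_dwell = events.loc[events["clicked"], "dwell_seconds"].mean()

# Frequency of clicks: clicks per session rather than a raw total.
clicks_per_session = events.groupby("session_id")["clicked"].sum().mean()

print(f"CTR: {ctr:.2f}, mean dwell: {mean_dwell:.0f}s, clicks/session: {clicks_per_session:.2f}")
```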

Please work with Discovery on methods; assessing lists of ranked results is what they do.

Please add me to meetings that come up for this, if my help is useful. Thank you!

Supporting the "more like" functionality in CirrusSearch is part of Discovery's annual plan, so Discovery can potentially assist here with experimental design and analyst time. Let us know if that's useful.

One idea for comparing the two would be to run a quick experiment on Amazon Mechanical Turk. For some set of articles, generate recommendation sets from both systems and ask the Turkers to compare their quality/relevance. You would have to take some care in designing the question, but it could be something along the lines of: "Which set of recommended topics would you consider more relevant to the seed topic?" Then we can get a confidence interval over the fraction of users who prefer one version over the other. After choosing a set of seed articles and nailing down the question, this should be a pretty fast and cheap way to get a first assessment.
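
For the analysis step of that idea, a minimal sketch: given k of n pairwise comparisons favouring one system, compute a 95% Wilson score interval for the preference fraction (the counts below are made up):

```python
# Sketch: confidence interval for "what fraction of raters prefer system A".
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

k, n = 132, 200  # hypothetical: 132 of 200 comparisons favoured the new service
low, high = wilson_interval(k, n)
print(f"{k}/{n} = {k/n:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```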

That test might favor popularity over the relevance of the related articles, though.

In my manual comparisons I found myself recognizing more of the results from the popularity+links-boosted queries, but those results were less related. Without the boosting, the related list was composed of topics unknown to me, but when I visited the articles they turned out to be more accurately related to the previous page.

It depends on what we want to optimize for. Pageviews & eyeballs are usually at odds with quality and relevancy of recommendations.

Another interesting topic is how to increase interest/engagement once we have quality recommendations. A title with a short description may not always be the best way to lead the user into learning more about the topic. Maybe a Hovercards-like popup with the lead paragraph would be more informative.
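
As a sketch of what such a popup could draw on, the public REST page summary endpoint already returns a plain-text lead extract (this is illustrative, not a proposed implementation):

```python
# Sketch: fetch the lead-section extract that a Hovercards-like popup could show.
import requests

SUMMARY_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"

def lead_extract(title):
    """Return the plain-text lead extract for an article title."""
    url = SUMMARY_URL.format(title=title.replace(" ", "_"))
    r = requests.get(url, headers={"User-Agent": "related-pages-sketch/0.1"}, timeout=10)
    r.raise_for_status()
    return r.json().get("extract", "")

print(lead_extract("Will Self")[:200], "...")
```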

Bumping this -- any ideas of when/if this needs to happen?

Been discussing this with @JKatzWMF and @DarTar

Dario reiterated much the same as @ellery said above: there will be ways to incorporate qualitative analysis by letting testers compare results. Additionally, he had some ideas on quantitative metrics we can use that are better than just click-throughs.

To move this forward, we began a discussion around a way for Reading and Research to work together to come up with a testing and productization plan - potentially around a joint Q3 goal.

We need to hash out some details, so I'll set up a meeting where we can discuss this in more depth.

cc @Tnegrin

I mentioned this today during our Staff meeting. I think we could scope out a potential collaboration for Q3.

On our end, we would like to work out a general approach for the evaluation of recsys (whether contribution or consumption-focused) and bring in some best practices and methods from the (ample) literature on the topic. Work on recsys is part of our annual goals and we'd be happy to collaborate.

The main issue is the timeline and scope of the collaboration. We would certainly like to work on an approach that can be reused across different types of recommender systems, as opposed to doing tactical evaluative work on a specific algorithm implementation.

@Fjalapeno could you set up some time to connect with @Capt_Swing (who has already spent cycles working on this topic) and come up with a shared understanding of the nature and timeline of this project?

First time seeing this thread. Sounds fascinating! I look forward to learning more.

Update:

I had a great meeting with @Capt_Swing today. (Thanks for all the information!)

Jonathan had some good ideas on how we can proceed with user testing of this service. At the same time, we both want to do this in a way that can be reused for user testing across all the different recommendation/AI services that are being developed.

We have some concrete next steps to move forward:

  1. @Capt_Swing will set up a meeting with me and @ellery to get a better technical understanding of the new recommendation service.
  2. Following that meeting, we will work with Reading Product Managers and @pizzzacat to better understand the behaviors we would like to encourage. This will allow us to design a user study to evaluate those behaviors, as well as begin to develop quantitative measures of success for beta testing and production.

While I don't want to formally commit to a timeline yet, our current plan is to get all of this information over the next few weeks so we can push towards a Q3 goal.

We will update this task as soon as we have had the meeting, so we can move forward to the next stage.

Thanks @Fjalapeno. Also relevant, I've pitched a related Dev Summit session: T149373: Evaluating the user experience of AI systems

Please subscribe if you're interested in having a broader conversation around this type of evaluation in January!

Quick update…

@Capt_Swing and I met with @ellery yesterday and discussed the algorithm and methodology used in the new service. I think we have enough information to move forward and discuss the behaviors we are trying to encourage in our users.

Will be sending out a meeting invite/agenda soon

@pizzzacat and I discussed this and decided that we would collaborate on the study plan, since it's the first time WMF has done a UX research study of this type.

It would be nice for the study to address how the algorithmic solution compares to the manually curated solution i.e. categories, which are currently artificially hidden for many users (cf. T24660#2817112, T71984#1704390, T73966#2817082).

@Nemo_bis this is a fair point. I'll look into it.

Capt_Swing renamed this task from Plan study for new Read More algorithm to Related Pages recommendations user study design.Jan 26 2017, 11:32 PM

I'm curious whether the study will compare Wikipedia Navigation Vectors with the boosted or the non-boosted version of CirrusSearch 'morelike:'.

@bearND current plan is to test the classic_noboostlinks profile, since that is (I believe) the current implementation for RelatedPages.

Sounds good. Thanks. Just wanted to have this explicitly stated.

@Capt_Swing Will you continue the research on this topic? I'm asking because I plan to do a similar evaluation for a link-based recommendation algorithm, but instead of a user study I want to use the Android app to conduct an online evaluation. (See: T142477)

@mschwarzer I don't currently have any plans to do additional evaluations around article recommendations, but it's always possible that someone will request that work in future. Happy to talk about it if you want to.

@Capt_Swing I'm very interested to talk to you about it. But I guess this ticket is the wrong place for it. I already sent an email to you a few weeks ago: ms (a) mieo.de.

@mschwarzer Jonathan is on vacation. He'll pick this up when he's back.