Make it possible to A/B test different section headings on mobile web
Closed, Declined (Public)

Description

From the 2015/16 A/B test of collapsed/uncollapsed sections on mobile web, we got as a byproduct some very interesting data about the differing levels of interest by readers for the individual sections of a Wikipedia article, e.g.:

en_Pneumonia section opens.png (1×1 px, 92 KB)

In the Q&A of @Tbayer's recent Wikimania presentation about this and related data, it was suggested to take this a step further and enable Wikipedia editors to test which of two alternative headings for a given section works best for readers. @Doc_James gives the example of deciding between "epidemiology" and "frequency" in articles about diseases.

Making this possible will likely involve the following parts:

  • Reactivate Schema:MobileWebSectionUsage in some form, possibly in connection with the upcoming work on T199157: [Spike ??hrs] Sticky header instrumentation. Thanks to the new Hive-based EventLogging setup deployed last year, we should be able to use a higher sampling rate than back then, making this data usable on more (than just the highest-traffic) articles. It will likely still be too noisy for making solid conclusions about low traffic articles.
  • Update Schema:MobileWebSectionUsage to consider the mobile setting that allows users to have sections expanded by default (as this will impact the initial state of section expanding/collapsing)
  • Serve two different versions of a section heading as part of an A/B test, as specified by editors using e.g. a magic word or template (say {{SECTIONVERSIONS|Epidemiology|Frequency}} or such).

Developer notes

There are at least 3 tasks I (@Jdlrobson ) can see as part of this epic.

Restoring Schema:MobileWebSectionUsage

The client code was removed long ago (2 years ago):
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/267729/

Since we have a reference point from doing this before, it should be easier to restore this code than it would be to write it from scratch, but much of the code has changed since this patch. One would hope this to be a 5 point estimation.

Serve two different versions of a section heading as part of an A/B test straw man proposal

To run an A/B test against anonymous users, we are restricted to sample by page rather than user session.
I would thus recommend that we create a magic word to serve different headings.

Given a section heading "Prognosis", editors would need to opt certain pages into the A/B test like so:

== {{SectionAB:Prognosis}} ==

On the server side we might render this magic word as "Prognosis" 20% of the time and as "Likelihood of recovery" 20% of the time. The remaining 60% of the time (the control group), we would use the current treatment, e.g. "Prognosis". We would then let Schema:MobileWebSectionUsage know which group the user was placed in. From this we would be able to measure which section heading was more successful.
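
As a rough illustration of the split described above, here is a minimal Python sketch of page-based bucketing. It is not MediaWiki code: the function names, the bucket labels, and the use of a SHA-256 hash of the page ID are all assumptions made for the example. Hashing the page ID keeps the assignment stable, which fits the constraint that anonymous users can only be sampled by page.

```python
import hashlib
from typing import Tuple

# Shares taken from the proposal: 20% variant A, 20% variant B, 60% control.
# The bucket labels themselves are hypothetical.
BUCKETS = [("A", 20), ("B", 20), ("control", 60)]


def assign_bucket(page_id: int, test_name: str) -> str:
    """Map a page to a bucket deterministically, so a given page always gets the same bucket."""
    digest = hashlib.sha256(f"{test_name}:{page_id}".encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % 100  # integer in 0..99
    upper = 0
    for name, share in BUCKETS:
        upper += share
        if point < upper:
            return name
    raise AssertionError("bucket shares must sum to 100")


def heading_for(page_id: int, original: str, alternative: str) -> Tuple[str, str]:
    """Return (bucket, heading text); only bucket B shows the alternative heading."""
    bucket = assign_bucket(page_id, f"SectionAB:{original}")
    text = alternative if bucket == "B" else original
    return bucket, text
```

For example, `heading_for(12345, "Prognosis", "Likelihood of recovery")` returns the bucket together with the heading text to render; the bucket is the value that would be passed on to the instrumentation.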

To allow editors to run arbitrary tests, we'd need some way to register tests. I'd recommend doing so via an interface page, e.g. MediaWiki:SectionABTest, mapping section headings to alternative section headings:

Prognosis: Likelihood of recovery

Note we would need to ensure id and name attributes are retained so links continue to work.
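
A minimal Python sketch of the registry idea follows. It assumes the interface page holds one "Original: Alternative" pair per line, as in the example above, and that the anchor is derived from the original heading regardless of which text is shown. The function names are hypothetical and the anchor encoding is deliberately simplified compared with what MediaWiki actually does.

```python
import html
from typing import Dict


def parse_registry(page_text: str) -> Dict[str, str]:
    """Parse lines of the form 'Original: Alternative' into a mapping, skipping malformed lines."""
    mapping = {}
    for line in page_text.splitlines():
        original, sep, alternative = line.partition(":")
        if sep and original.strip() and alternative.strip():
            mapping[original.strip()] = alternative.strip()
    return mapping


def render_heading(original: str, shown: str) -> str:
    """Render an <h2> whose id comes from the original heading, so existing section links keep working."""
    # Simplified assumption: spaces become underscores; real anchor encoding handles more cases.
    anchor = original.strip().replace(" ", "_")
    return '<h2 id="{}">{}</h2>'.format(html.escape(anchor, quote=True), html.escape(shown))
```

With this shape, a link to `#Prognosis` resolves the same way whether the reader sees "Prognosis" or "Likelihood of recovery".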

Client-based solution?

Given the importance/visibility of sections for the table of contents, collapsed headings and section linking within articles, a client-based solution that samples by user, while viable, has a few more complications: the client-side table of contents and noticeable changes in the content of the page.

JS code only runs when the DOM is ready, which can come quite late, meaning there is a high possibility of a jarring/visible update to the page and to the table of contents on tablet (if it has been opened).

This will be particularly problematic/unavoidable if a section link is opened, as the browser will scroll to the heading prior to it changing.

We could possibly mitigate this by finding a creative client+server solution which hides/blurs the section heading/content until the A/B test has loaded.

Recommendation: Document A/B test results.

I (@Jdlrobson) would recommend that any outcomes from A/B tests resulting from this work be documented under a special heading at https://www.mediawiki.org/wiki/Recommendations_for_mobile_friendly_articles to guide future editors.

Open questions

Section numbers

I'm not sure if it would impact the A/B test, but it's possible (likely?) that the further down the page a section is, the less likely it will be read.
Thus, if we ran an A/B test on "Prognosis" vs "Likelihood of recovery" on a variety of pages where the section headings were positioned in different places, how would we remove noise relating to the position of the section heading?
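
One possible way to look at this, offered as an assumption rather than something settled in this task, is to compare open rates only within groups of sections that share the same position on the page. The Python sketch below uses a hypothetical event shape of (section position, variant, opened); it is not the Schema:MobileWebSectionUsage format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (section position on the page, variant label, whether the section was opened).
Event = Tuple[int, str, bool]


def open_rates_by_position(events: Iterable[Event]) -> Dict[int, Dict[str, float]]:
    """Compute the open rate per variant separately for each section position."""
    # position -> variant -> [opens, views]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for position, variant, opened in events:
        cell = counts[position][variant]
        cell[0] += int(opened)
        cell[1] += 1
    return {
        position: {variant: opens / views for variant, (opens, views) in variants.items()}
        for position, variants in counts.items()
    }
```

Comparing variants inside each position group keeps a heading that happens to sit near the top of its pages from looking better purely because of where it sits.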

Event Timeline

ovasileva subscribed.

Tagging this as an epic and moving to needs analysis. We should probably meet separately to discuss how we would want to split individual tasks out.

If needed I am happy to personally change the heading on EN WP from one to the other option as part of this test. We could maybe start by looking at 3 articles?

The other related question is "how much does the position of the section affect the chance it is opened?"

By the way, the answers to these types of questions would be of great value to our editors and our readers, as they would give editors the information they need to make Wikipedia better for our readers.

If needed I am happy to personally change the heading on EN WP from one to the other option as part of this test. We could maybe start by looking at 3 articles?

Cool - I think it will be rather easy to scale this after the first article, but the instrumentation and infrastructure would need to be in place already.

The other related question is "how much does the position of the section affect the chance it is opened?"

This won't need an A/B test and can in principle already be investigated using the old (2015/16) data, although data from the reactivated and revamped instrumentation will be preferable.

To run an A/B test against anonymous users, we are restricted to sample by page rather than user session.
I would thus recommend that we create a magic word to serve different headings.

This conclusion is based on the unstated premise that the selection of the heading has to be done by the server. On the client, we can base the choice on the page, the pageview, or the session.

(Moving a few things from the task description here into the comments, as they seem more like discussion contributions than something we are all ready to commit to as part of this task:)

Recommendation: Dashboard

We'd probably want to provide dashboards with help from analytics, so editors can run these A/B tests themselves without assistance from us - otherwise I can imagine this being a burden on Tilman's time.

I appreciate the consideration for my time, but it seems that this task is much more likely to become blocked on engineers, especially since (per T200810#4466093 ) this will rather start out as a pilot project; after all, it will be the first time we ever test Wikipedia textual content in this way.

And more importantly, such a dashboard seems quite an ambitious project, considering that we currently don't even have equivalent self-serve solutions for our existing A/B tests on the product/software development side. Even if we commit to it, I can see quite a few questions about what it would actually look like and how it would deal with limitations like the dynamic nature of wiki pages (which we can account for when running tests manually, but would need to automate in such a dashboard). Hence I would suggest splitting this into a followup task, if it is needed.

(Likewise moved here from the task description. It seems that these are thoughts about the interpretation of the resulting data and suggestions about what to take into account during its analysis, which is always valuable but seems off-topic for this task per se:)

Short term vs long term impact

Note we should be cautious about how long we run such an experiment, as new headings may arouse curiosity. It is possible that with a new heading, readers are more likely to click it to find out what kind of information they can find inside.

Such novelty effects may be possible, but they don't prevent us from running user interface A/B tests either, and because the ratio of repeat readers to the same page is likely rather low, I would expect them to be even less of a problem here.

It might be that rather than the heading, the content or the delivery of that content is a problem and over time the section headings themselves become associated with that content and are less preferred on mobile. The references section for example may be rarely used, not because of the title, but due to the fact that most mobile users know that clicking on an inline reference will show the associated reference.

Good point (I dwelled on it in my Wikimania presentation too), but I fail to see what it has to do with the implementation of the present task.

I would thus not recommend doing this A/B test for sections such as "External links" or "References", but rather for sections where technical words are used, where different language may lead to more accessible content.

Lowering the priority on this for the time being as it is growing larger in scope and not currently a part of our quarterly or annual plan. However, I can see us coming back to this sooner if we decide to tackle it at the same time as T151115: [EPIC] Improve in-article navigation. I can also envision tests like these potentially incorporated into the better use of data program - I'd be interested in seeing us develop a structure around how to A/B test portions of content (headings and more)

@ovasileva what is the state of this epic? Is it something we plan to work on within the next 2 years or can we decline/resolve it?

Let's keep this one open - I'm hoping we can get to it within the next 2 years.

ovasileva edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

Seems like two years later, we still haven't gotten to this. Declining for now but can re-open when it becomes a priority again.