Run A/B test to evaluate impact of topic subscriptions
Open, Medium, Public

Description

This analysis is intended to help us understand the impact Topic Subscriptions are having on the likelihood that contributors, across experience levels, will respond more quickly to comments that are posted on wikitext talk pages.

Decision to be made

Should Topic Subscriptions be offered to all logged in volunteers, at all Wikimedia wikis, by default (read: as an opt-out user preference)?

A/B test timing

  • Start date: Thursday, 2 June 2022
  • End date: per the conversation @MNeisler and @ppelberg had on 8 June, we plan to begin the analysis as early as Tuesday, 5 July 2022

Participating wikis

The list of wikis participating will be finalized once T304027 is resolved.

Hypotheses

To help evaluate the impact of Topic Subscriptions, we will analyze whether adding a way to be notified when new comments are added to discussions you have expressed an interest in causes the following...

ID | Hypothesis | Metric(s) for evaluation
KPI | Contributors, across experience levels, will respond more quickly to new topics and comments that are posted on wikitext talk pages because they will be made aware of these new comments and topics in real time. | For all comments and new topics with a response, the average time duration from "Person A" posting on a talk page to "Person B" posting a response, grouped by the experience level of "Person A". Reason: Topic Subscriptions are designed to shorten the amount of time it takes for people to receive the input they are seeking. As such, we are measuring "waiting" time from the perspective of the person who is expecting a response, who in this scenario is "Person A."
Curiosity #1 | Contributors, across experience levels, will post a greater number of comments to talk pages because they will be more aware of opportunities to offer support/guidance/expertise/etc. | Average number of comments or new topics posted on talk pages by individual contributors that edit a talk page, grouped by experience level. Note: if time allows, we are also curious to see the total number of comments people within the two test groups post.
Curiosity #2 | Contributors, across experience levels, will start a greater number of topics on talk pages because they will feel more confident the other person will respond. | Percent of contributors that edit a talk page and start a new topic, grouped by the number of topics they've started (e.g. 1-5, 6-10, 11-15, etc.) and by experience level
Guardrail #1 | Topic Subscriptions should not cause a significant increase in disruptive behavior. | A) Sharp increase in the number of notifications sent/contributor/day and B) Sharp increase in the percent of contributors that disable notifications
Guardrail #2 | Topic Subscriptions should not cause a sharp increase in the amount of time it takes for people to respond to comments and new topics that are posted to wikitext talk pages. | Average time between when a new comment or new topic is published and someone responding to said comment or topic
Guardrail #3 | Topic Subscriptions should not cause a significant (read: sharp) increase or decrease in the number of Senior Contributors editing talk pages. | Percent change in the number of Senior Contributors making edits to talk pages.
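Curiosity #2 groups contributors by how many topics they started, using ranges of five (1-5, 6-10, 11-15, ...). A minimal Python sketch of that bucketing, assuming equal-width ranges starting at 1 and a separate "0" label for contributors who started no topics; the `topic_bucket` helper is hypothetical and not taken from the analysis code:

```python
from collections import Counter


def topic_bucket(n_topics: int, width: int = 5) -> str:
    """Return a range label such as '1-5' or '6-10' for a topic count.

    Assumes equal-width ranges starting at 1; a count of 0 gets its own label.
    """
    if n_topics < 0:
        raise ValueError("n_topics must be non-negative")
    if n_topics == 0:
        return "0"
    lower = ((n_topics - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Hypothetical per-contributor counts of new topics started.
topics_started = [1, 1, 3, 6, 12, 0]
print(Counter(topic_bucket(n) for n in topics_started))
```

The resulting counts per bucket would then be split further by experience level, as the metric describes.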

Decision matrix

ID | Scenario | Plan of action
1 | People who have had access to topic subscriptions are "significantly" more likely to respond more quickly to new topics and comments that are posted on wikitext talk pages than people who have not had access to topic subscriptions. | Continue with plans to make Topic Subscriptions available to everyone who is logged in, at all projects, by default.
2 | People are "significantly" less likely to respond quickly to new topics and comments that are posted on wikitext talk pages. | Pause plans for wider deployment and prioritize an investigation into what could be contributing to the decrease in responsiveness.
3 | People who have had access to topic subscriptions are either "marginally" more or less likely to respond more quickly to new topics and comments that are posted on wikitext talk pages than people who do not have access to topic subscriptions. | Barring any significant negative qualitative feedback about topic subscriptions, continue with plans to offer the feature to everyone who is logged in, at all projects, by default.

Done

  • A report is published that evaluates the metrics listed above
  • A decision is made and documented about how the analysis's results will impact plans to offer topic subscriptions to more people at more projects.

References


Notes: In T290508, we decided that as part of this A/B test of Topic Subscriptions, we will look at whether there has been a stark decrease in the percentage of main namespace edits, as a proxy for whether Topic Subscriptions are decreasing people's engagement with their Watchlists and thus causing them to be less likely to notice and undo vandalism.

Event Timeline

ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
ppelberg moved this task from Untriaged to Next Quarter on the Editing-team board.

I'm not sure that I understand the "receive quicker responses" idea. Here's what happens:

  1. I post and subscribe.
  2. You respond to my post. You might or might not ping me in your response.
  3. I receive a notification about your response. (Maybe I reply to you, maybe I Special:Thank you, maybe I don't).

Is the "time to receiving the response" about step #2, or about step #3?

The fact that I subscribe to the topic in step #1 has no effect at all on when you respond in step #2. If you pinged me in step #2, then the fact that I was subscribed doesn't change anything (except which column Echo/Notifications chooses to show the notification in).

But if you're measuring for step #3, how do you figure out whether it's "quicker"? Quicker compared to what? Can we measure when people read the Talk pages they're subscribed to, even if they don't reply?

Good questions, @Whatamidoing-WMF.

This A/B test is measuring for "step #3" by looking at the time between when someone publishes a new topic or a comment on a talk page and when someone else posts a response to said new topic or comment. [i]

Quicker compared to what? Can we measure when people read the Talk pages they're subscribed to, even if they don't reply?

Hmm, can you say a bit more here? Why are you thinking it might be important/necessary to know when people read the Talk pages they're subscribed to in order to arrive at the "response time" measurement I shared above?


i. Note: we've defined a "response" in the context of this A/B test as someone publishing a comment. This means someone "thanking" an edit would NOT be considered a response. @MNeisler please comment if you've been thinking about this differently.

looking at the time between when someone publishes a new topic or a comment on a talk page and when someone else posts a response to said new topic or comment.

How does having Alice subscribe to the new topic change how quickly Bob posts a response to said new topic or comment?

This is the planned process, right?

  1. Alice posts. Alice is subscribed.
  2. Bob, who is not subscribed, responds to Alice.
  3. Alice receives a notification about Bob's response.

Proposed measurement: How quickly does Bob (who is not subscribed) reply to Alice? Right?

@Whatamidoing-WMF: I talked about this with @MNeisler today who shared language with me that I think will resolve the ambiguity you are referring to in T280897#7807095 and T280897#7841688.

Please ping Megan if the below leaves any questions open in your mind.

What Megan shared
Sherry, as you noted in T280897#7841688, there will be cases where we do NOT expect Topic Subscriptions to affect the speed at which people respond to each other. [i]

We are okay with the above because this A/B test is meant to learn what overall impact Topic Subscriptions have on the rates at which people receive responses to things they say on the wiki. Inherently, this means the A/B test will look at many kinds of interactions, not just the interaction you've described in this task to date.


i. E.g. In a case where Person A starts a new topic, we would NOT expect the presence of Topic Subscriptions to increase the speed at which Person B responds. At least not before T263821 is implemented...

Note: I've updated the task description's language in an effort to resolve the ambiguities surfaced in T280897#7807095 and T280897#7841688.

MNeisler triaged this task as Medium priority.
MNeisler added a project: Product-Analytics.
MNeisler moved this task from Triage to Current Quarter on the Product-Analytics board.

I reviewed the number of AB test events logged to date and confirmed that there is sufficient data to go ahead and begin the AB test analysis. Below is a summary of AB test sessions and events logged to date (24 June 2022) since the deployment of the AB test on 2 June 2022.

AB Test Distinct Users and Edit Attempts Across all Participating Wikis

experiment_group | distinct users | edit attempts
control | 17750 | 120065
test | 17535 | 117167

I also reviewed the number of AB test users that posted comments on talk pages, as these will be the primary data points reviewed in the analysis.

Talk Page Comments and Responses Posted by AB test users:

test_group | topic_type | users | attempts
control | comment | 205 | 309
control | response | 1900 | 9382
control | topic | 2035 | 6442
test | comment | 222 | 340
test | response | 1911 | 8886
test | topic | 2020 | 6132

Per Wiki Users and Edit Sessions
I also confirmed there are currently sufficient events logged on a per-wiki basis, except for the following 3 smaller wikis:

  • Omwiki (only 1 test user)
  • Amwiki (5 test users, 2 control users)
  • Arzwiki (19 test users, 23 control users)

Based on the current rate of events logged for these wikis, it is unlikely we will have sufficient events for them even if we run the test for an additional 2 weeks. I recommend that data for these 3 wikis be reviewed as part of the overall analysis but excluded from the per-wiki analysis.
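The per-wiki exclusion rule can be expressed as a small filter. This is a sketch only: the task does not state the minimum sample size that was used, so `MIN_USERS_PER_GROUP` below is a hypothetical threshold, and `examplewiki` is a made-up wiki. The amwiki and arzwiki counts come from the list above; omwiki has no control count listed, so it is treated as 0.

```python
# Hypothetical minimum number of users per experiment group; the actual
# threshold used in the analysis is not stated in this task.
MIN_USERS_PER_GROUP = 50


def wikis_for_per_wiki_analysis(users_by_wiki, min_users=MIN_USERS_PER_GROUP):
    """Return wikis where both the test and control groups reach min_users.

    users_by_wiki maps wiki -> {"test": int, "control": int}; a missing
    group counts as 0 users.
    """
    return sorted(
        wiki
        for wiki, counts in users_by_wiki.items()
        if min(counts.get("test", 0), counts.get("control", 0)) >= min_users
    )


users_by_wiki = {
    "omwiki": {"test": 1},                       # control count not listed
    "amwiki": {"test": 5, "control": 2},
    "arzwiki": {"test": 19, "control": 23},
    "examplewiki": {"test": 800, "control": 790},  # hypothetical larger wiki
}
print(wikis_for_per_wiki_analysis(users_by_wiki))  # ['examplewiki']
```

Wikis that fail the check remain in the overall analysis; only the per-wiki breakdown leaves them out.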

Moving this task to Doing.
CC @ppelberg

Excellent.

+1 to what you're proposing, @MNeisler: starting the analysis now – as opposed to waiting for an additional two weeks – sounds good to me.

Below is a summary of some preliminary results from the Topic Subscription AB Test analysis. The full report will be provided pending further analysis and QA.

KPI: Average time duration between post and response

Definition: For all comments and new topics with a response, the average time duration from "Person A" posting on a talk page and "Person B" posting a response, grouped by the experience level of "Person A".
Data: For all comments and topics posted by a user in the AB test with a response, I calculated the time difference between the comment and the response. Source: talk_page_edit and editattemptstep.
Note: We do not know whether all of the users in the AB test were actively subscribed to the topic at the time of their comment or response; however, we are interested in the overall impact Topic Subscriptions have on response times for each test group.
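The KPI calculation amounts to pairing each post with its reply and taking the difference between timestamps. A minimal sketch, assuming simplified in-memory records rather than the actual talk_page_edit and editattemptstep schemas; the record layout and the choice to use only the first reply to each post are assumptions of this example. Posts without a reply are left out, matching "for all comments and topics with a response".

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def response_times_minutes(posts, replies):
    """Minutes from each post to its first reply, keyed by post id.

    posts:   {post_id: (group, posted_at)}
    replies: iterable of (parent_id, replied_at)
    """
    first_reply = {}
    for parent_id, replied_at in replies:
        if parent_id in posts:
            if parent_id not in first_reply or replied_at < first_reply[parent_id]:
                first_reply[parent_id] = replied_at
    return {
        post_id: (replied_at - posts[post_id][1]).total_seconds() / 60
        for post_id, replied_at in first_reply.items()
    }


def median_by_group(posts, times):
    """Median response time per group (experiment group or experience level)."""
    by_group = defaultdict(list)
    for post_id, minutes in times.items():
        by_group[posts[post_id][0]].append(minutes)
    return {group: median(values) for group, values in by_group.items()}


# Made-up example data.
posts = {
    "c1": ("control", datetime(2022, 6, 3, 12, 0)),
    "c2": ("control", datetime(2022, 6, 3, 13, 0)),
    "t1": ("test", datetime(2022, 6, 3, 12, 0)),
    "t2": ("test", datetime(2022, 6, 4, 9, 0)),  # never answered: excluded
}
replies = [
    ("c1", datetime(2022, 6, 3, 13, 30)),  # 90 minutes
    ("c1", datetime(2022, 6, 3, 14, 0)),   # later reply, ignored
    ("c2", datetime(2022, 6, 3, 13, 30)),  # 30 minutes
    ("t1", datetime(2022, 6, 3, 12, 39)),  # 39 minutes
]
times = response_times_minutes(posts, replies)
print(median_by_group(posts, times))  # {'control': 60.0, 'test': 39.0}
```

Grouping by the experience level of "Person A" instead only requires storing that label in place of the experiment group.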

Summary statistics of response times across all participating Wikipedias

experiment_group | median_response_time (mins) | mean_response_time (mins) | 25th percentile (mins) | 50th percentile (mins) | 75th percentile (mins)
control | 90 | 2919 | 6 | 90 | 2713
test | 39 | 1860 | 2 | 39 | 2620

[Figure: median_time_response_overall (1).png]

There was a 51-minute decrease in the median[1] response time for the test group compared to the control group (90 → 39 minutes).
[1] We see a decrease in both the average and median response times for the test group; however, since the response time data is highly skewed by outliers (instances where the user took several days to respond), I recommend using the median instead of the average as a better indicator of the typical response time.
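To see why the median is the better summary for skewed response times, consider a small, made-up set of values in which a single reply arrives after 10 days. Python's `statistics` module is used for illustration; the analysis itself may have used a different percentile method.

```python
from statistics import mean, median, quantiles

# Made-up response times in minutes: seven fast replies and one that
# took 10 days (14,400 minutes).
minutes = [5, 10, 20, 30, 40, 60, 90, 14400]

print(mean(minutes))            # 1831.875 (pulled up by the single outlier)
print(median(minutes))          # 35.0 (close to the typical reply)
print(quantiles(minutes, n=4))  # [12.5, 35.0, 82.5] (25th/50th/75th percentiles)
```

One slow reply moves the mean by more than 1,800 minutes, while the median stays put.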

By experience group:

experiment_groupexperience_groupMedian response time (mins)Average response time (mins)
control0-100 edits62249
test0-100 edits3893
control101-500 edits347532
test101-500 edits551421
controlover 500 edits902240
testover 500 edits3112391

When we split by experience group, the results are much more varied. We see a significant decrease in median response times to comments posted by Junior Contributors in the test group, but an increase for comments posted by Senior Contributors. The averages show different trends. Further investigation is likely needed to clarify these results.
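The experience groups in these tables can be written as a simple mapping from cumulative edit count to label. A sketch, assuming the boundaries implied by the table labels (0-100, 101-500, over 500); the task does not say at what point in time edit counts were measured, so that is left to the caller.

```python
def experience_group(edit_count: int) -> str:
    """Map a cumulative edit count to the experience-group labels used in the report."""
    if edit_count < 0:
        raise ValueError("edit_count must be non-negative")
    if edit_count <= 100:
        return "0-100 edits"     # Junior Contributors
    if edit_count <= 500:
        return "101-500 edits"
    return "over 500 edits"      # Senior Contributors


for count in (0, 100, 101, 500, 501):
    print(count, experience_group(count))
```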

Curiosity #1: Average number of comments or new topics posted on talk pages by individual contributors that edit a talk page, grouped by experience level.

Total number of users and comments posted in each experiment group

experiment_group | n_users | n_comments
control | 4990 | 27258
test | 5021 | 26880

The average number of comments posted by a Contributor during the duration of the AB test

experiment_group | experience_group | avg_comments
control | 0-100 edits | 2.04
control | 101-500 edits | 3.80
control | over 500 edits | 10.53
test | 0-100 edits | 2.12
test | 101-500 edits | 3.78
test | over 500 edits | 10.12

We do not see a significant difference in the total number of users and comments posted in each experiment group or the average number of comments posted in each group.

Guardrail #3: Percent change in the number of Senior Contributors making edits to talk pages.

We see less than a 1% decrease in the number of distinct senior contributors following the deployment of the AB test.

[Figure: Senior_contrib_daily_edits.png]

See draft report for queries and further details.

cc @ppelberg

@ppelberg Here is the complete draft report for review.

Summaries for additional completed guardrails and curiosities below:

Curiosity 2: Percent of contributors that edit a talk page and start a new topic

  • We did not observe any significant differences in the percent of contributors that edit a talk page and start a new topic. Of the talk page contributors that made an edit, a slightly higher percentage of test group contributors started a new topic (38.7% → 38.8%; 0.4% ↑). Note: This includes all contributors that made an edit to a talk page, including corrective edits.
  • There are also no significant differences in the number of comments posted on a talk page by the bucketed users in the test and control groups. For both groups, the majority of contributors (60% in each experiment group) posted only one comment during the duration of the AB test.

[Figure: topic_contributors_bytopic.png]

[Figure: topic_contributors_group_byexp.png]

Guardrail #1: Topic Subscriptions should not cause a significant increase in disruptive behavior

Sharp increase in the number of notifications sent/contributor/day

  • The average number of daily notifications increased from about 3.6 to 4.2 per contributor per day following the deployment of the AB test (a 16% increase). This increase was neither significant nor sudden enough to indicate any disruption.

[Figure: avg_daily_topic_notifications.png]

Sharp increase in the percent of contributors that disable notifications
Method:

  • Manual Topic Subscriptions: Following the deployment of the AB test, 888 users across all participating wikis manually subscribed to a topic. None of those users had explicitly disabled that preference as of the end of the AB test (15 July 2022).

In comparison, 0.72% of users that manually subscribed to a topic on participating wikis prior to the AB test disabled that preference.

  • Automatic Topic Subscriptions:

About 19% of users on participating wikis who were automatically subscribed to a topic following the deployment of the AB test disabled the feature.

No users in the AB test were automatically subscribed to a topic prior to the deployment of the AB test; however, this percentage is similar to the overall rates of users that disabled the automatic topic notification preference found in the adoption metrics report (18%).

Guardrail #2: Average time between when a new comment or new topic is published and someone responds to said comment or topic

As observed in the response times identified in the KPI section above, we did not observe any sharp increase in the average time it takes for people to respond to comments and new topics that are posted to wikitext talk pages.

  • The median response time in the test group was 57% faster than the median response time in the control group.
  • In addition, there were fewer long (over 10 days) response times in the test group compared to the control group.
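The headline figures follow directly from the two medians in the summary statistics table (90 and 39 minutes). A quick check of the arithmetic; `percent_change` is a hypothetical helper, not code from the analysis:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative means a decrease)."""
    return (after - before) / before * 100


control_median, test_median = 90, 39  # minutes, from the summary statistics table

print(control_median - test_median)                        # 51 (minute decrease)
print(round(percent_change(control_median, test_median)))  # -57 (i.e. 57% faster)
```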

We did observe a few experience-level groups (Senior Contributors) and wikis where there was a higher median response time to comments and topics posted in the test group compared to the control group. Further investigation is needed to help clarify the source of these differences.

Additional KPI insights

  • There were fewer instances of long response times (over 10 days) in the test group. In the control group, 4.5% of responses were posted more than 10 days (240 hours) after the initial comment, compared with only 1.4% in the test group.
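The long-response share is the fraction of response times above a 10-day cutoff (240 hours, or 14,400 minutes). A sketch with made-up values; treating "over 10 days" as a strict comparison is an assumption here.

```python
TEN_DAYS_MINUTES = 10 * 24 * 60  # 240 hours = 14,400 minutes


def percent_over(times_minutes, threshold=TEN_DAYS_MINUTES):
    """Percent of response times strictly longer than `threshold` minutes."""
    if not times_minutes:
        return 0.0
    n_long = sum(1 for t in times_minutes if t > threshold)
    return 100 * n_long / len(times_minutes)


# Made-up response times in minutes.
sample = [30, 90, 600, 15000]
print(percent_over(sample))  # 25.0
```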

[Figure: response_time_histogram.png]

  • Results also vary on a per-Wikipedia basis: half of the participating wikis had a significant decrease in median response times and the other half had significant increases.
    • Wikis with decreases in median response times: Spanish Wikipedia, Japanese Wikipedia, Vietnamese Wikipedia, and French Wikipedia.
    • Wikis with increases in median response times: Portuguese Wikipedia, Persian Wikipedia, Italian Wikipedia, and Hebrew Wikipedia.

Further investigation may be needed to clarify these results.

Per the conversation @MNeisler and I had offline today, the results Megan shared in T280897#8129320 look great.

Next steps
The one remaining question we converged on answering is the following: "What percentage of comments get a response?"

MNeisler moved this task from Doing to Needs Sign-off on the Product-Analytics (Kanban) board.

Method: I reviewed data logged in EditAttemptStep and talk_page_edit to determine the percent of comments posted by users in the AB test that received a response. Comments with responses were identified by looking for any comment_id that also appeared as a comment_parent_id (indicating a response to that comment) or any topic_id that also appeared as a comment_parent_id (indicating a response to that topic).
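The matching rule above is a set-membership check: collect every comment_parent_id, then flag any comment or topic whose id is in that set. A minimal sketch, assuming each logged post has been reduced to a hypothetical `id` (its comment_id or topic_id) and `parent_id` (its comment_parent_id, or None for a new topic); these are not the real event fields. As written, it does not exclude people replying to themselves.

```python
def posts_with_response(posts):
    """Return the ids of posts (comments or topics) that another post replied to."""
    parent_ids = {p["parent_id"] for p in posts if p["parent_id"] is not None}
    return {p["id"] for p in posts if p["id"] in parent_ids}


# Made-up thread: a new topic, two replies to it, and one reply to a reply.
posts = [
    {"id": "topic-1", "parent_id": None},
    {"id": "c-1", "parent_id": "topic-1"},
    {"id": "c-2", "parent_id": "c-1"},
    {"id": "c-3", "parent_id": "topic-1"},
]
responded = posts_with_response(posts)
print(sorted(responded))                  # ['c-1', 'topic-1']
print(100 * len(responded) / len(posts))  # 50.0
```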

Results:

Overall across all participating Wikipedias

Experiment group | Number of comments | Number of comments with a response | Percent of comments with a response
control | 21289 | 6461 | 30.35%
test | 20821 | 6343 | 30.46%

Overall, there was only a very slight increase in the percent of comments with a response in the test group (30.35% → 30.46%, a 0.36% relative increase) when looking at comments posted by all users in the AB test.
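The percentages in the overall table can be reproduced from the raw counts:

```python
def percent_with_response(n_comments: int, n_with_response: int) -> float:
    """Share of comments that received at least one response, in percent."""
    return 100 * n_with_response / n_comments


counts = {  # (number of comments, number with a response), from the overall table
    "control": (21289, 6461),
    "test": (20821, 6343),
}
for group, (n_comments, n_with_response) in counts.items():
    print(group, round(percent_with_response(n_comments, n_with_response), 2))
# control 30.35
# test 30.46
```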

By Contributor's Experience Level

[Figure: response_pct_byexp.png]

When broken down by the user's experience level, access to topic subscriptions has a larger impact on the percentage of comments with a response among those posted by non-Senior Contributors (Contributors with 500 or fewer cumulative edits).

  • There was an 8.6% increase in the percent of comments posted by Junior Contributors (0-100 edits) that received a response.
  • There was a 15.3% increase in the percent of comments posted by Contributors with 101-500 edits that received a response.
  • In contrast, there was a slight 1.8% decrease in the percent of comments posted by Contributors with over 500 edits that received a response.

@ppelberg - Reassigning this to you for review and sign-off. Please let me know if you have any questions or suggestions for additional metrics or breakdowns to review.

Full Report