Run A/B test to evaluate impact of Reply tool
Closed, Resolved; Public

Description

This test is intended to help us understand what impact the Reply Tool is having on Junior Contributors' likelihood to start (activation) and continue (retention) participating on Wikipedia talk pages.

Decision to be made

The decision this analysis is intended to help us make:
Should the Reply tool be offered to all people, at all wikis, as an opt-out user preference?

Hypotheses

To help evaluate the impact of the Reply tool, we would like to analyze whether adding a more intuitive workflow for replying to specific comments to Wikipedia talk pages:

| ID | Hypothesis | Metric(s) for evaluation |
| --- | --- | --- |
| KPI | ...causes a greater percentage of Junior Contributors to publish the comments they start without a significant increase in disruption (see "Guardrail" below). | Comment completion rate: of the people who click the [ reply ] link (action = init), the percentage who successfully publish the comment they were drafting (action = saveSuccess). |
| Guardrail | ...does not cause a significant increase in the number of disruptive edits being made to talk pages. | The number of edits made to talk pages that are reverted within 48 hours. The number of editors who are blocked after making an edit to a talk page. |
| Curiosity #1 | ...causes a greater number of Junior Contributors to start participating productively on talk pages. | The number of distinct Junior Contributors who make at least one edit to a page in a talk namespace that is not reverted within 48 hours. |
| Curiosity #2 | ...causes a greater percentage of Junior Contributors to continue participating productively on talk pages. | The percentage of Junior Contributors who make at least one edit to a page in a talk namespace that is not reverted within 48 hours in each of the following time intervals: 2 to 7 days after making their first edit (read: within the first week), 8 to 14 days after making their first edit (read: within the second week), and 15 to 30 days after making their first edit (read: within the third or fourth weeks). |
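
For illustration only, here is a minimal sketch of how the KPI could be computed from a list of events. It assumes each event is a hypothetical `(user_id, action)` pair carrying the `init` and `saveSuccess` action values named above; the function name and sample data are made up, and this is not the query used in the actual analysis.

```python
def comment_completion_rate(events):
    """Share of people who started a comment (init) and published one (saveSuccess).

    events: iterable of (user_id, action) tuples. The tuple layout is an
    assumption for this sketch, not the real EditAttemptStep schema.
    """
    started = set()
    completed = set()
    for user_id, action in events:
        if action == "init":
            started.add(user_id)
        elif action == "saveSuccess":
            completed.add(user_id)
    if not started:
        return 0.0
    # Only count completions by people who were also seen starting a comment.
    return len(started & completed) / len(started)


# Hypothetical sample: four people start a comment, two of them publish.
events = [
    ("u1", "init"), ("u1", "saveSuccess"),
    ("u2", "init"),
    ("u3", "init"), ("u3", "init"), ("u3", "saveSuccess"),
    ("u4", "init"),
]
rate = comment_completion_rate(events)
print(f"{rate:.0%}")  # 50%
```

Note that a person is counted once no matter how many attempts they make, so `u3` with two attempts weighs the same as `u1` with one.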

Decision matrix

| ID | Scenario | Plan of action |
| --- | --- | --- |
| 1. | People are "meaningfully" more likely to publish edits using the Reply Tool than they are using full-page editing | Continue with plans to make the Reply Tool available at all Wikipedias, by default. See T269062 for more detail. |
| 2. | People are "meaningfully" less likely to publish edits using the Reply Tool than they are using full-page editing | Investigate where within the Reply Tool comment funnel people are dropping off and what could be contributing to this drop-off. In parallel, we will pause plans to make the Reply Tool available at all Wikipedias by default. |
| 3. | People are as likely to publish edits using the Reply Tool as they are using full-page editing | Continue with plans to offer the feature as an opt-out preference at all Wikipedias, considering we have meaningful qualitative feedback and quantitative data that suggest the tool is leading people to find participating on talk pages easier / more efficient.[ii] |

Open questions

  • 1. Should edits to non-talk namespace pages be included in this analysis?
  • 2. What wikis should be included in the test? See: T267379.

Done


i. Editor experience buckets

  • Logged out
  • 0 cumulative edits
  • 1-4 cumulative edits
  • 5-99 cumulative edits
  • 100-999 cumulative edits
  • 1000+ cumulative edits
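
As a rough sketch, the buckets above could be assigned with a small helper like the one below. The function name and the `logged_in` flag are hypothetical and only for illustration; the bucket boundaries are taken from the list.

```python
def experience_bucket(cumulative_edits, logged_in=True):
    """Return the editor experience bucket label for one editor.

    The function and its parameters are illustrative assumptions; only the
    bucket boundaries come from the list in the task description.
    """
    if not logged_in:
        return "Logged out"
    if cumulative_edits < 0:
        raise ValueError("cumulative_edits must be non-negative")
    if cumulative_edits == 0:
        return "0 cumulative edits"
    if cumulative_edits <= 4:
        return "1-4 cumulative edits"
    if cumulative_edits <= 99:
        return "5-99 cumulative edits"
    if cumulative_edits <= 999:
        return "100-999 cumulative edits"
    return "1000+ cumulative edits"


print(experience_bucket(42))  # 5-99 cumulative edits
```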

ii. An example of said "quantitative data": T247139

Event Timeline

Task description update
I've added a Hypotheses section to the task description. It contains the metrics we will use to compare the two test groups and, by extension, determine the impact the Reply Tool is having on Junior Contributor activation and retention.

Note: the above is the outcome of the conversation @MNeisler and I had on 4-Nov wherein we revisited the Reply Tool measurement plan and identified the metrics we will prioritize as part of this A/B test.

Task description update
I've updated the task description to reflect the updates to the test KPI @MNeisler and I decided upon during the meeting we had on 2-December.

ppelberg updated the task description.

Deployment update
The A/B test officially started today, 11-February-2021. [i]

This means the analysis can "start" as early as 25-February per the conversation @MNeisler and I had yesterday (10-February).


i. T273554#6825381

I read the A/B test has officially started. The task T273406 hasn't been updated yet. Is it okay if I inform Dutch Wikipedia?

I read the A/B test has officially started...Is it okay if I inform Dutch Wikipedia?

I'm sorry for the delayed response, @AdHuikeshoven. Yes, it is okay to inform Dutch Wikipedia.

Note: it looks like @Whatamidoing-WMF has already made an announcement at nl.wiki per T273406#6827497.

Question: should the KPI be “percentage of people” or “percentage of edits/posts”? If a person makes a mixture of successful and unsuccessful posts, do they get counted as a success or a failure overall?

You could do something funky like number of people weighted by their personal success rate. Person A posts 30 comments, all successful, is scored as a 1; person B makes two successful posts also scores 1; person C has one success and one fail scores 0.5. (It's people-focussed, and avoids the problem with edit counts where prolific commenters would skew the results.)
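
To make the weighting idea concrete, here is a small sketch using the three people from the example above. The function name and the dict-of-booleans input are assumptions made for illustration.

```python
def personal_success_rates(attempts_by_person):
    """Score each person by their own share of successful posts.

    attempts_by_person: dict mapping a person to a list of booleans
    (True = post published). People with no attempts are skipped.
    """
    return {
        person: sum(attempts) / len(attempts)
        for person, attempts in attempts_by_person.items()
        if attempts
    }


attempts = {
    "A": [True] * 30,     # 30 comments, all successful
    "B": [True, True],    # two successful posts
    "C": [True, False],   # one success, one fail
}
scores = personal_success_rates(attempts)
print(scores)                # {'A': 1.0, 'B': 1.0, 'C': 0.5}
print(sum(scores.values()))  # 2.5 (out of 3 people)
```

Person A's 30 comments contribute no more than person B's two, which is the point of weighting by person rather than by edit.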

Or maybe you’re only interested in their first n attempts before they learn the ropes? But if someone keeps trying and gets better with time (instead of giving up) then that retention factor is important to count in.

Meta
Per the conversation @MNeisler and I had today, we are going to break this analysis into two parts:

  1. Report on the KPIs
    • Components: KPI and Guardrail metrics defined in the Hypotheses section of the task description.
  2. Full analysis
    • Components: Curiosity metrics defined in the Hypotheses section of the task description.
MNeisler triaged this task as Medium priority. Mar 2 2021, 9:34 PM
MNeisler moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Question: should the KPI be “percentage of people” or “percentage of edits/posts”? If a person makes a mixture of successful and unsuccessful posts, do they get counted as a success or a failure overall?

@Pelagic: you identified the part of the test design @MNeisler and I discussed most.

Ultimately, we decided to take, as you described it, a "people-focused" approach to evaluating the impact of the Reply Tool. We chose this over an "edit-focused" approach to lower the likelihood that the behavior of a small and non-representative group of people could unduly skew the test results.

I should mention that in taking the approach that we have, we will lose insight into, as you described, the number of edit attempts that comprise a given individual's success rate. We are comfortable accepting this tradeoff because, for right now, we are more interested in learning whether people reach the point of successfully posting a comment than we are in the effort (read: number of attempts) required to reach that success.
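
A toy comparison may help show why the two approaches can diverge. Everything below is made-up data and hypothetical function names; "people-focused" is taken here to mean the share of people with at least one successful post, which is an assumption based on the description above.

```python
def edit_focused_rate(attempts_by_person):
    """Successful attempts divided by all attempts, pooled across people."""
    total = sum(len(a) for a in attempts_by_person.values())
    successes = sum(sum(a) for a in attempts_by_person.values())
    return successes / total if total else 0.0


def people_focused_rate(attempts_by_person):
    """Share of people (with at least one attempt) who published at least once."""
    people = [a for a in attempts_by_person.values() if a]
    if not people:
        return 0.0
    return sum(1 for a in people if any(a)) / len(people)


# Made-up data: one prolific commenter and three people who tried once and failed.
attempts = {
    "prolific": [True] * 30,
    "new_1": [False],
    "new_2": [False],
    "new_3": [False],
}
print(round(edit_focused_rate(attempts), 3))  # 0.909
print(people_focused_rate(attempts))          # 0.25
```

In this toy case the pooled rate looks healthy only because one person dominates the attempts, while three of the four people never managed to post.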

Please tell me if anything above prompts new thoughts/questions or leaves anything about what you asked in T252057#6827836 unanswered.

cc @MNeisler who can: A) correct any details I might've misconstrued and/or B) offer additional context.

Here is a quick look at the overall edit completion rate for replies on talk pages by editor type. Data is based on edit attempts by users in the AB test recorded in EditAttemptStep from 12 February 2021 through 10 March 2021. Note: This data currently reflects logged-in users across all experience levels. I will review edit completion rate by experience level when I complete the preliminary analysis on KPIs.

There's variance in the percent difference but editors using the Reply Tool have a higher completion rate across all AB participating wikis compared to the use of existing reply workflows ("Page Editing").

I am currently researching different models to correctly infer the impact of the reply tool on these completion rates. Such a model can help account for random effects due to the user and wiki, as well as variables such as user experience.

I'll follow up with details regarding the estimated impact and further insights when available.

Overall Edit Completion Rate by Reply Editor Type

| reply_type | n_users | n_users_completed | completion_rate |
| --- | --- | --- | --- |
| Page Editing | 2646 | 1361 | 51% |
| Reply Tool | 2037 | 1400 | 69% |

Edit Completion Rate by Participating Wiki and Reply Editor Type

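| wiki | reply_type | n_users | n_users_completed | completion_rate |
| --- | --- | --- | --- | --- |
| afwiki | Page Editing | 4 | 3 | 75% |
| afwiki | Reply Tool | 1 | 1 | 100% |
| arzwiki | Page Editing | 8 | 2 | 25% |
| arzwiki | Reply Tool | 3 | 2 | 67% |
| bnwiki | Page Editing | 28 | 16 | 57% |
| bnwiki | Reply Tool | 18 | 14 | 78% |
| eswiki | Page Editing | 423 | 198 | 47% |
| eswiki | Reply Tool | 322 | 230 | 71% |
| fawiki | Page Editing | 134 | 71 | 53% |
| fawiki | Reply Tool | 89 | 51 | 57% |
| frwiki | Page Editing | 394 | 228 | 58% |
| frwiki | Reply Tool | 422 | 325 | 77% |
| hewiki | Page Editing | 199 | 119 | 60% |
| hewiki | Reply Tool | 104 | 62 | 60% |
| hiwiki | Page Editing | 27 | 7 | 26% |
| hiwiki | Reply Tool | 13 | 6 | 46% |
| idwiki | Page Editing | 64 | 16 | 25% |
| idwiki | Reply Tool | 25 | 15 | 60% |
| itwiki | Page Editing | 398 | 220 | 55% |
| itwiki | Reply Tool | 355 | 239 | 67% |
| jawiki | Page Editing | 218 | 97 | 44% |
| jawiki | Reply Tool | 121 | 67 | 55% |
| kowiki | Page Editing | 41 | 17 | 41% |
| kowiki | Reply Tool | 19 | 9 | 47% |
| nlwiki | Page Editing | 109 | 56 | 51% |
| nlwiki | Reply Tool | 77 | 65 | 84% |
| plwiki | Page Editing | 148 | 79 | 53% |
| plwiki | Reply Tool | 126 | 72 | 57% |
| ptwiki | Page Editing | 144 | 65 | 45% |
| ptwiki | Reply Tool | 179 | 142 | 79% |
| swwiki | Page Editing | 3 | 2 | 67% |
| swwiki | Reply Tool | 1 | 1 | 100% |
| thwiki | Page Editing | 19 | 10 | 53% |
| thwiki | Reply Tool | 9 | 7 | 78% |
| ukwiki | Page Editing | 114 | 75 | 66% |
| ukwiki | Reply Tool | 50 | 34 | 68% |
| viwiki | Page Editing | 53 | 22 | 42% |
| viwiki | Reply Tool | 23 | 12 | 52% |
| zhwiki | Page Editing | 122 | 60 | 49% |
| zhwiki | Reply Tool | 83 | 46 | 55% |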

Notes:
(1) Reply Tool events sampled at 100%. Page Editing events (event.integration = 'page') sampled at a rate of 1/16, or 6.25%.
(2) Excludes edits to create new sections or pages.
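
For anyone who wants to recompute the completion_rate column, here is a small sketch. It uses three rows copied from the per-wiki table; the dict layout and function names are illustrative assumptions, and it ignores sampling and any modelling.

```python
# (n_users, n_users_completed) per wiki and reply type, copied from the table above.
rows = {
    ("eswiki", "Page Editing"): (423, 198),
    ("eswiki", "Reply Tool"): (322, 230),
    ("hewiki", "Page Editing"): (199, 119),
    ("hewiki", "Reply Tool"): (104, 62),
    ("nlwiki", "Page Editing"): (109, 56),
    ("nlwiki", "Reply Tool"): (77, 65),
}


def completion_rate(n_users, n_users_completed):
    """Fraction of users with at least one completed reply."""
    return n_users_completed / n_users if n_users else 0.0


def point_difference(wiki):
    """Reply Tool rate minus Page Editing rate, in percentage points."""
    page = completion_rate(*rows[(wiki, "Page Editing")])
    reply = completion_rate(*rows[(wiki, "Reply Tool")])
    return (reply - page) * 100


for wiki in ("eswiki", "hewiki", "nlwiki"):
    print(wiki, f"{point_difference(wiki):+.1f} points")
```

With small per-wiki counts (for example afwiki or swwiki), these raw differences are noisy, which is one reason to rely on a model rather than the table alone.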

Here is the draft report on the KPIs and guardrails for the Reply Tool AB Test for review.

A few highlights and key findings from the preliminary analysis:

  • Overall, across all participating Wikipedias, Junior Contributors had a significantly higher comment completion rate using the reply tool compared to using non-reply tool workflows. 72.9% of all Junior Contributors that made a comment attempt were able to successfully publish at least 1 comment with the reply tool, while only 27.6% of Junior Contributors successfully saved a non-reply tool comment. This represents a 164% observed increase in comment completion rate.
  • On a per-Wikipedia basis, the percent increases vary; however, Junior Contributors had a higher comment completion rate using the reply tool compared to non-reply tool editor interfaces on every participating Wikipedia. Indonesian, Japanese, Dutch, and Spanish Wikipedias saw the highest percent increases in comment completion rates with the reply tool. We observed the two lowest percent increases in the comment completion rate for Persian (42% increase) and Hebrew Wikipedias (55% increase). Note: These are both right-to-left languages, which might be worth exploring as a potential reason for the lower increases observed on these wikis.
  • To infer the impact of the reply tool on these comment completion rates, we used a Hierarchical Regression Model which accounts for any random effects due to the user and wiki. Based on estimates from the model, we found that there is an average 45.5% increase (maximum 49.4% increase) in the probability of a Junior Contributor publishing a comment when they use the reply tool instead of a non-reply tool editing interface.
  • I'm currently looking into the impact of experience level on the use of the reply tool. When looking at comment completion rates across all contributors' experience levels, the comment completion rates using the reply tool were not too different from the rates found for Junior Contributors. Overall, 68.9% of contributors across all experience levels were able to publish at least one comment using the reply tool. However, experience level appears to have a much more significant impact on the ability of a contributor to publish a comment using non-reply tool methods. 57.8% of contributors across all experience levels were able to complete at least one comment using non-reply tool editing interfaces, compared to only 27.6% of Junior Contributors. Further analysis will help quantify and confirm the impact.
  • Guardrail Analysis: Initial data does not indicate any significant disruption caused by the reply tool. For Junior Contributors using the reply tool, 1.65% of their comments were reverted within 48 hours and only 1.81% of these contributors were blocked. (NOTE: The guardrail analysis metrics rely on data available in mediawiki_history, which is updated monthly. At the time of this analysis, March 2021 editing attempts had not yet been logged, so the data in the section below reflects edits recorded from the start of the AB Test on 2021 February 12 through the end of February. I will update the data when available but do not anticipate any significant changes to the reported metrics.)
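
As a quick check on the arithmetic in the first bullet, the observed 164% figure is a relative increase between the two observed rates. The helper below is a hypothetical sketch, not code from the report, and it is separate from the model-based estimate in the third bullet.

```python
def relative_increase(baseline_pct, treatment_pct):
    """Relative change from baseline to treatment, as a fraction of the baseline."""
    return (treatment_pct - baseline_pct) / baseline_pct


# Junior Contributors: 27.6% completion without the reply tool, 72.9% with it.
observed = relative_increase(27.6, 72.9)
print(f"{observed:.0%}")  # 164%
```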

Remaining TODOs for Final Report:

  • Update the Guardrail Analysis once the March 2021 snapshot of mediawiki_history is available.
  • Complete analysis of completion rates across all contributors experience levels
  • Complete analysis of the curiosity metrics defined in the Hypotheses section of the task description.
  • Final cleanup of the report for publishing: format charts and tables, finalize the data synopsis, add a high-level summary of findings, etc.

@ppelberg - Please let me know if you have any questions or changes.

Source Code File

To show I read your report: in the fifth bullet it reads "reflect sedits" instead of "reflects edits". Nice results!

Here is the updated report.

Some key insights and conclusions:

  • Junior contributors had a much higher comment completion rate using the reply tool compared to page editing.
  • Using a regression model, we confirmed there is an average 45% increase in the probability of a Junior Contributor publishing a comment when they switch from using page editing to the reply tool. This model accounted for any random effects by the user and the wiki.
  • We found experience level has a significant effect on the comment completion rate of a contributor. The comment completion rate for Junior Contributors (defined as having under 100 edits) using page editing is significantly lower than the comment completion rate observed for non-junior contributors (defined as having over 100 edits) using page editing. However, using the reply tool, Junior Contributors' comment completion rate was roughly the same as the non-junior contributors' comment completion rate using page editing.
  • Overall, across all participating Wikipedias, we observed a 79.5% decrease in the revert rate for comments Junior Contributors made with the reply tool compared to page editing. The reply tool seems to enable Junior Contributors not only to successfully complete a comment but also to reduce the number of errors in the published comment that might lead to the comment being reverted.
  • In addition to a decrease in revert rate, under 2% of Junior Contributors using the reply tool were blocked after making a comment on a talk page indicating that the tool did not result in any significant increase in disruptive edits to talk pages.

@ppelberg - Please let me know if you have any questions.

Codebase

Here is the updated report.

Excellent

  • Overall, across all participating Wikipedias, we observed a 79.5% decrease in the revert rate for comments made with the reply tool compared to page editing. The reply tool seems to enable Junior Contributors not only to successfully complete a comment but also to reduce the number of errors in the published comment that might lead to the comment being reverted.
  • @MNeisler: to be doubly sure, would it be more accurate to say, "...we observed a 79.5% decrease in the revert rate for comments Junior Contributors made..." vs. "...we observed a 79.5% decrease in the revert rate for comments made..."?

@ppelberg - Please let me know if you have any questions.

Before resolving this task, can you please review the update I've posted on the Reply Tool project page [i] to ensure it is accurate?


i. https://www.mediawiki.org/w/index.php?title=Talk_pages_project/Replying&type=revision&diff=4542988&oldid=4541142&diffmode=source

@MNeisler: to be doubly sure, would it be more accurate to say, "...we observed a 79.5% decrease in the revert rate for comments Junior Contributors made..." vs. "...we observed a 79.5% decrease in the revert rate for comments made..."?

Yes, that's correct. I've added text to that statement to clarify.

Before resolving this task, can you please review the update I've posted on the Reply Tool project page [i] to ensure it is accurate?

Yes, I'll plan to review the update on Monday and post an update to this ticket once complete.

@ppelberg - I've reviewed the Reply Tool project page and made some edits [i] to clarify the results.


i. https://www.mediawiki.org/w/index.php?title=Talk_pages_project/Replying&diff=4546318&oldid=4542988

Great – thank you, @MNeisler. And with this edit, I think this task can be resolved.