[MILESTONE] Run an A/B test to evaluate impact of Tone Check
Closed, Resolved; Public

Description

This task involves the work of running an A/B test of the Tone Check.

Timeline

| Description | Delivery date | Responsible | Status |
| --- | --- | --- | --- |
| Announce A/B test at pt.wiki (T395154) | Thursday, 21 August | @Trizek-WMF | ✅ Done |
| Announce A/B test at ja.wiki (T395154) | Monday, 25 August | @Trizek-WMF | ✅ Done |
| Complete pre-deployment QA (T393817) | Tuesday, 26 August | Editing QA | ✅ Done |
| Deploy config to start A/B test at fr, ja, and pt (T389231) | Wed., 3 Sep | Editing Engineering | ✅ Done |
| Verify bucketing instrumentation (T394952) | 8 Sep 2025 | Editing QA + @MNeisler | ✅ Done |
| Verify test bucket balancing (T395090) | 8 Sep 2025 | Editing QA + @MNeisler | ✅ Done |
| Publish leading indicator analysis (T395090) | 24 Sep 2025 | ML + @MNeisler | ✅ Done |
| Deploy config change to STOP the Tone Check A/B experiment, and announce the end of the experiment (T411914) | date | Editing Engineering + @Trizek-WMF | |
| Begin final analysis // leading indicator analysis ready for discussion (T395090) | 1 Dec 2025 | @MNeisler | |
| Complete final analysis // leading indicator analysis ready for discussion (T395090) | 17 Dec 2025 | @MNeisler | |

Overarching hypothesis

If we prompt newcomers and junior contributors to reconsider the tone they are writing in when software detects them using what experienced volunteers would agree is non-neutral/peacock language, then we will decrease the percentage of new content edits newcomers publish that are reverted on the grounds of WP:NPOV (and related policies).

Decision to be made

This A/B test will help us make the following decision:

What – if any – changes in the Tone Check UX, and/or the model that enables it, will we make before we can be confident in the following...?

  1. Newcomers and Junior Contributors that encounter Tone Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
  2. Newcomers and Junior Contributors will intuitively interact with the Tone Check experience in ways that are NOT disruptive to them or the wikis

Open questions

  • 1. What – if any – ceiling will we place on the number of Tone Checks that people can see within a given edit session?
    • No ceiling will be placed on the number of Tone Checks shown. We'll learn whether such a constraint might be necessary via Guardrail Metric #2 listed below.

KPIs

The main outcomes we are trying to impact through this feature. These are what we are primarily using for evaluating the hypothesis and deciding whether to deploy an intervention more widely.

| ID | Hypothesis | Decision(s) to be made | Metric(s) for evaluation |
| --- | --- | --- | --- |
| KPI | The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language | Does showing people a prompt when using non-neutral language lower the likelihood that new content edits include non-NPOV language? | 1) Proportion of all new content edits published without biased language and 2) Proportion of new content edits that are not reverted. |
| KPI | Newcomers and Junior Contributors will experience Tone Check as encouraging because it will offer them more clarity about what is expected of the new information they add to Wikipedia | Does showing people a prompt discourage them from publishing would-be quality edits? | Proportion of new content edits started (defined as reaching the point that Tone Check was or would be shown) that are successfully published (not reverted). |

Secondary metrics

Used to learn about additional impact of Tone Check, but are not primary targets of the intervention. They reveal side effects (both positive and negative) of trying to improve the Primary Metric with the intervention.

| ID | Hypothesis | Metric(s) for evaluation |
| --- | --- | --- |
| Curiosity #1 | New account holders will be more likely to publish an unreverted edit to the main namespace within 24 hours of creating an account because they will be made aware the new text they're attempting to publish needs to be written in a neutral tone, when they don't first think/know to write in this way themselves | Constructive activation. Note: we'd need to break this out by platform. Reason: WE 1.2 is scoped to mobile-only. |
| Curiosity #2 | Newcomers and Junior Contributors will be more aware of the need to write in a neutral tone when contributing new text because the visual editor will prompt them to do so in cases where they have written text that contains non-neutral language. | The proportion of newcomers and Junior Contributors that publish at least one new content edit that does not contain non-neutral language. See T388716. |
| Curiosity #3 | Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that does not include non-neutral language because Tone Check will have caused them to realize when they are at risk of this not being true. | 1) Proportion of newcomers and Junior Contributors that publish an edit Tone Check was activated within and successfully return to make an unreverted edit to a main namespace during the identified retention period. 2) Proportion of newcomers and Junior Contributors that publish an edit Tone Check was activated within and return to make a new content edit without non-neutral language to a page in the main namespace during the identified retention period. |
| Curiosity #4 | Knowing the reasons why people do not elect to revise tone when the Check prompts them to do so (by platform) will help us to decide what (if anything) can be done to decrease the proportion of people on desktop who do so. // See discrepancy in dismissal rates by platform in leading indicator analysis | Distribution of decline reasons grouped by platform and experience level |

Leading indicators

T394463

Guardrails

Used to make sure that the new checks presented are not negatively impacting an editor’s experience completing an edit or causing disruption on the wikis. The scenarios named in the chart below emerged through T325851.

| ID | Name | Metric(s) for Evaluation | Notes |
| --- | --- | --- | --- |
| 1) | Edit quality decrease (T317700) | Proportion of published edits that add new content and are still reverted within 48 hours. Note: Will include a breakdown of the revert rate of published new content edits with and without non-neutral language. | |
| 2) | Edit completion rate drastically decreases | Proportion of new content edits started (defined as reaching the point that Tone Check was or would be shown) that are published. Note: Will include a breakdown by the number of checks shown to identify if a lower completion rate corresponds with a higher number of checks shown. | |
| 3) | Edit abandonment rate drastically increases | Proportion of contributors that are presented Tone Check and abandon their edits (indicated by event.action = abort and event.abort_type = abandon). | We'd like to look at how abandonment rate varies by # of Checks shown in the context of this finding from the leading indicators analysis: "The revert rate of edits in which multiple Checks are shown is higher (24.7%) than edits in which a single Check is shown (13.8%)." |
| 4) | People shown Tone Check are blocked at higher rates | Proportion of contributors blocked after publishing an edit where Tone Check was shown compared to contributors not shown the Tone Check | |
| 5) | High false positive rate | Proportion of contributors that decline revising the text they’ve drafted and indicate that it was irrelevant. | |
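
As a rough illustration of guardrail 3, the sketch below computes an abandonment rate per number of Checks shown. Only the `action = abort` and `abort_type = abandon` condition comes from the task description; the flat session dictionaries, the `checks_shown` field, the `publish` value, and the sample rows are hypothetical and do not reflect the real instrumentation schema.

```python
from collections import defaultdict


def abandonment_rate_by_checks(sessions):
    """Return {checks_shown: share of sessions abandoned}.

    Sessions in which no Check was shown are skipped.
    """
    shown = defaultdict(int)
    abandoned = defaultdict(int)
    for session in sessions:
        n = session["checks_shown"]  # hypothetical field name
        if n == 0:
            continue
        shown[n] += 1
        # Condition named in the guardrail definition.
        if session.get("action") == "abort" and session.get("abort_type") == "abandon":
            abandoned[n] += 1
    return {n: abandoned[n] / shown[n] for n in shown}


# Made-up sample sessions, for illustration only.
sessions = [
    {"checks_shown": 1, "action": "publish", "abort_type": None},
    {"checks_shown": 1, "action": "publish", "abort_type": None},
    {"checks_shown": 1, "action": "abort", "abort_type": "abandon"},
    {"checks_shown": 1, "action": "abort", "abort_type": "nochange"},
    {"checks_shown": 2, "action": "abort", "abort_type": "abandon"},
    {"checks_shown": 2, "action": "publish", "abort_type": None},
    {"checks_shown": 0, "action": "abort", "abort_type": "abandon"},
]

rates = abandonment_rate_by_checks(sessions)
print(rates)
```

Grouping by the number of Checks shown mirrors the note for guardrail 3, which asks how abandonment varies with that count.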

A/B Test: Decision Matrix

| ID | Scenario | Indicator(s) | Plan of Action |
| --- | --- | --- | --- |
| 1 | Tone Check is disrupting, discouraging, or otherwise getting in the way of volunteers. Read: people are less likely to publish the edits they start. | ≥20% decrease in edit completion rate in edit sessions where Tone Check is activated relative to edit sessions where Tone Check is not activated. | Pause scaling plans; if results indicate that significant decreases are only associated with a high number of edit checks shown, set a threshold for the maximum number of checks that can be shown within a single session. If we observe significant decreases for both single and multiple checks presented in a single session, investigate changes to the UX. |
| 2 | Tone Check is increasing the likelihood that people will publish destructive edits. | Increase in the proportion of published new content edits where Tone Check was activated that are reverted within 48 hours relative to edits that would have been shown Tone Check but were not. Increase in proportion of contributors blocked after publishing an edit where Tone Check is shown compared to contributors not shown Tone Check. | Pause scaling plans; review edits to try to identify any patterns in abuse and propose changes to UX to mitigate them. |
| 3 | Tone Check is causing people to publish edits that align with project policies and that are not reverted. | Increase in the proportion of new content edits Tone Check was activated within that were published without biased language and are not reverted within 48 hours relative to edits that would have been shown Tone Check but were not. | Move forward with scaling plans |
| 4 | Tone Check is effective at causing people to publish new content edits without biased language, but those edits are still reverted. | Increase in the proportion of new content edits Tone Check was activated within that were published without biased language AND increase in the proportion of these edits that are reverted within 48 hours relative to edits that would have been shown Tone Check but were not. | Pause scaling plans; further investigation into tools used to detect biased language (e.g. might the false negative rate be too high); analysis and manual review of reverted edits to understand why those edits were still reverted. |
| 5 | Tone Check is not effective at causing people to publish new content edits without biased language but is not disruptive to volunteers. | No change or decrease in the proportion of new content edits Tone Check was activated within that were published without biased language AND A) no significant drop in edit completion rate or B) no significant spike in block or revert rates. | Pause scaling plans in order to investigate what could explain Tone Check having a null effect |
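
The Scenario 1 indicator can be read as a relative-change threshold. The sketch below is one possible reading, not the analysis code used for the experiment: the function names and the sample completion rates are hypothetical, and only the 20% threshold comes from the matrix above.

```python
def relative_change(baseline_rate, comparison_rate):
    """Relative change of comparison_rate versus baseline_rate (e.g. -0.24 = 24% lower)."""
    return (comparison_rate - baseline_rate) / baseline_rate


def completion_scenario_triggered(not_activated_rate, activated_rate, threshold=0.20):
    """True when the completion rate in sessions where Tone Check was activated
    is at least `threshold` lower, in relative terms, than in sessions where it was not."""
    return relative_change(not_activated_rate, activated_rate) <= -threshold


# Made-up completion rates, for illustration only.
large_drop = completion_scenario_triggered(0.50, 0.38)   # 24% relative decrease
small_drop = completion_scenario_triggered(0.50, 0.484)  # 3.2% relative decrease
print(large_drop, small_drop)
```

The comparison is relative rather than in percentage points, which matches the "≥20% decrease ... relative to" wording. Values that sit exactly on the threshold may need a small tolerance because of floating-point rounding.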

Event Timeline

I've completed the Tone Check AB test analysis for review. Please see a summary of some key findings below. Additional metrics and data results are available in the full report.

Summary of Results

New content edits published without biased language

  • Tone Check successfully decreases the frequency of non-neutral language in published content. People with access to Tone Check were 15.6% less likely to publish edits containing non-neutral language than the control group (falling from 9.6% to 8.1%; a 1.5 pp decrease). This result is statistically significant and we have 99.8% confidence that this improvement is directly attributable to the tool.
  • Tone Check’s level of impact depends heavily on the platform. Results confirm a highly significant impact on Desktop, where we observed the largest decrease in non-neutral edits (a 21% decrease). In contrast, there was no detectable effect yet on Mobile Web.
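
The summary reports changes both as relative percentages and as percentage points (pp). A minimal sketch of how the two relate, using the 9.6% and 8.1% figures quoted above; the helper name is hypothetical and this is not the analysis code behind the report.

```python
def describe_change(control_pct, test_pct):
    """Return (percentage-point change, relative change in %), each rounded to 1 decimal."""
    pp_change = test_pct - control_pct
    relative_change = pp_change / control_pct * 100
    return round(pp_change, 1), round(relative_change, 1)


# Share of published edits containing non-neutral language: control 9.6%, test 8.1%.
pp_change, relative_change = describe_change(9.6, 8.1)
print(pp_change, relative_change)
```

The percentage-point figure is the plain difference between the two rates, while the relative figure divides that difference by the control rate.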

New content edits revert rate

  • Edits made by users shown Tone Check are also 15% less likely to be reverted than eligible control edits (29.5% → 25.1%; a 4.4 pp decrease).
  • This decrease is primarily driven by Junior Contributors on desktop. While we observed a statistically significant 33% relative decrease [10.2 pp] in reverts for Junior Contributors, we did not confirm any change in the revert rate of newcomers (users completing their first edit on a wiki) or unregistered users. These trends indicate that Tone Check may be more effective for people who have already succeeded in completing at least one edit on a Wikipedia namespace. Since these users are more experienced, their edits are less likely to be reverted for other policy violations compared to registered users completing their first edit or unregistered users.

New Content edit revert rate: impact of removing non-neutral language

  • We also wanted to evaluate the impact of people successfully addressing the Tone Check prompt, as a portion of users (~37%) shown Tone Check decline it.
  • When a user removes non-neutral language in response to a Tone Check, the likelihood of that edit being reverted decreases significantly. Across both platforms, there was a 44.1% decrease in the revert rate for edits where the prompt was addressed. This confirms that Tone Check is highly effective at helping people identify and correct edits that would otherwise be reverted.

new_content_revert_rate_tone_addressed.jpg (540×960 px, 29 KB)

  • We observed decreases on both platforms, but there is a larger impact on desktop compared to mobile web. On desktop, we observed a significant 47% decrease [13.4 pp] in revert rate for people who revised their text in response to Tone Check.
  • On mobile web, there was a 14.8% decrease [4.8 pp] in revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently trickier for newcomers and are still more likely to be reverted compared to desktop edits, even when non-neutral language is removed.

Edit Completion Rate

  • Tone Check does not appear to be causing any significant disruption to most people’s editing experience. Edit completion rates for people shown Tone Check decreased only slightly, by 3.2% (1.6 percentage points). This decrease was primarily concentrated on Desktop (-2.6%), with no significant change on Mobile Web.
  • While completion rates slightly decreased for newcomers and unregistered users, they slightly increased for Junior Contributors, suggesting the check is encouraging and helps a portion of people complete their edit successfully.

Constructive Edit Rate

  • Tone Check improved the rate of constructive edits by +6.2% [4.4 percentage points]. We observed improvements in overall edit quality at each of the three partner Wikipedias.
  • Aligned with the revert rate findings, the magnitude of impact varies by platform. On desktop, constructive edit rate increased by +6.4% while we observed no statistically significant change in mobile web constructive edits.
  • Tone Check appears especially effective at increasing the constructive edit rate of registered Junior Contributors, where we observed a +14.8% increase [10.2 pp] in constructive edit rates.

contructive_edit_rate_overall.jpg (540×960 px, 32 KB)

Retention Rate

  • We further found that people shown Tone Check were more likely to return, indicating that the feature results in a positive editing experience for most contributors.
  • People who encountered Tone Check are 24% more likely to return to make a constructive edit in their second week. Retention rates increased from 5.8% to 7.2% when Tone Check was shown (+1.4 percentage points).
  • We observed increases for both mobile web and desktop users and across all user types as well.

| Experiment Group | Editors | Retained Editors | Retention Rate |
| --- | --- | --- | --- |
| Control (Eligible but not shown) | 1,995 | 115 | 5.8% |
| Test (Tone Check shown) | 2,309 | 167 | 7.2% |
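
For readers who want to check the table, the sketch below derives the Retention Rate column from the editor counts. The variable names are illustrative, and the percentage-point gain is computed from the rounded rates, which is an assumption about how the summary figure was produced; unrounded rates give a slightly different value.

```python
# Counts copied from the retention table: (editors, retained editors).
groups = {
    "Control (eligible but not shown)": (1995, 115),
    "Test (Tone Check shown)": (2309, 167),
}

# Retention rate in percent, rounded to one decimal as in the table.
rates = {
    name: round(retained / editors * 100, 1)
    for name, (editors, retained) in groups.items()
}

control_rate = rates["Control (eligible but not shown)"]
test_rate = rates["Test (Tone Check shown)"]

# Percentage-point gain, computed from the rounded rates (assumption).
pp_gain = round(test_rate - control_rate, 1)
print(rates, pp_gain)
```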

Guardrails. Tone Check is not causing significant disruption on either desktop or mobile web based on analysis of the identified guardrails. The decline rate is lower than that of other existing Edit Checks, and there was no spike in user blocks or revert rates.

cc @ppelberg

What – if any – changes in the Tone Check UX, and/or the model that enables it, will we make before we can be confident in the following...?

  1. Newcomers and Junior Contributors that encounter Tone Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
  2. Newcomers and Junior Contributors will intuitively interact with the Tone Check experience in ways that are NOT disruptive to them or the wikis

While we've uncovered a few technical issues since the A/B experiment concluded [1], the A/B experiment gives us confidence that Tone Check had an unambiguously positive effect. As such, we will move forward with plans to scale the feature to all Wikipedias that participated in the A/B experiment and to additional wikis as the model is verified to support them.

See full findings and conclusions on mediawiki.org: https://www.mediawiki.org/wiki/Edit_check/Tone_Check#Findings.


  1. E.g. T418173