
[MILESTONE] Run an A/B test to evaluate impact of Tone Check
Open, Needs Triage (Public)

Description

This task involves the work of running an A/B test of the Tone Check.

Timeline

| Description | Delivery date | Responsible | Status |
| --- | --- | --- | --- |
| Announce A/B test at pt.wiki (T395154) | Thursday, 21 August | @Trizek-WMF | ✅ Done |
| Announce A/B test at ja.wiki (T395154) | Monday, 25 August | @Trizek-WMF | ✅ Done |
| Complete pre-deployment QA (T393817) | Tuesday, 26 August | Editing QA | ✅ Done |
| Deploy config to start A/B test at fr, ja, and pt (T389231) | Wed., 3 Sep | Editing Engineering | ✅ Done |
| Verify bucketing instrumentation (T394952) | 8 Sep 2025 | Editing QA + @MNeisler | ✅ Done |
| Verify test bucket balancing (T395090) | 8 Sep 2025 | Editing QA + @MNeisler | ✅ Done |
| Publish leading indicator analysis (T395090) | 24 Sep 2025 | ML + @MNeisler | ✅ Done |
| Begin final analysis // leading indicator analysis ready for discussion (T395090) | 1 Dec 2025 | @MNeisler | |
| Complete final analysis // leading indicator analysis ready for discussion (T395090) | 17 Dec 2025 | @MNeisler | |
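
One common way to approach the "Verify test bucket balancing" step is a sample ratio mismatch check: compare the observed control and test counts against the intended split with a chi-square goodness-of-fit test. The sketch below is illustrative only. The function name, the 50/50 expected split, and the 0.001 alert threshold in the usage note are assumptions, not details taken from T395090.

```python
import math


def bucket_balance_p_value(control_n: int, test_n: int,
                           expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-bucket split.

    expected_control_share is an assumed design parameter (0.5 = even split).
    """
    if not 0 < expected_control_share < 1:
        raise ValueError("expected_control_share must be strictly between 0 and 1")
    total = control_n + test_n
    if total == 0:
        raise ValueError("no sessions to compare")

    expected_control = total * expected_control_share
    expected_test = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (test_n - expected_test) ** 2 / expected_test)

    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)), so no external library is needed.
    return math.erfc(math.sqrt(chi2 / 2))
```

For example, `bucket_balance_p_value(5000, 5400) < 0.001` would flag a split worth investigating, while perfectly even counts return a p-value of 1.0.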

Overarching hypothesis

If we prompt newcomers and Junior Contributors to reconsider the tone they are writing in when software detects them using what experienced volunteers would agree is non-neutral/peacock language, then we will decrease the percentage of new content edits newcomers publish that are reverted on the grounds of WP:NPOV (and related policies).

Decision to be made

This A/B test will help us make the following decision:

What – if any – changes in the Tone Check UX, and/or the model that enables it, will we make before we can be confident in the following...?

  1. Newcomers and Junior Contributors that encounter Tone Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
  2. Newcomers and Junior Contributors will intuitively interact with the Tone Check experience in ways that are NOT disruptive to them or the wikis.

Open questions

  1. What ceiling, if any, will we place on the number of Tone Checks that people can see within a given edit session?
    • No ceiling will be placed on the number of Tone Checks shown. We'll learn whether such a constraint might be necessary via Guardrail Metric #2 listed below.

KPIs

KPIs are the main outcomes we are trying to impact through this feature. They are what we primarily use to evaluate the hypothesis and decide whether to deploy the intervention more widely.

| ID | Hypothesis | Decision(s) to be made | Metric(s) for evaluation |
| --- | --- | --- | --- |
| KPI | The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language. | Does showing people a prompt when using non-neutral language lower the likelihood that new content edits include non-NPOV language? | 1) Proportion of all new content edits published without biased language and 2) Proportion of new content edits that are not reverted. |
| KPI | Newcomers and Junior Contributors will experience Tone Check as encouraging because it will offer them more clarity about what is expected of the new information they add to Wikipedia. | Does showing people a prompt discourage them from publishing would-be quality edits? | Proportion of new content edits started (defined as reaching the point that Tone Check was or would be shown) that are successfully published (not reverted). |
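
As a rough illustration of the two metrics in the first KPI row, the sketch below computes both proportions for one test bucket from a list of published new content edits. The record type, its field names, and the bucket labels are hypothetical stand-ins for whatever the analysis data actually provides.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PublishedEdit:
    bucket: str                # hypothetical labels: "control" or "test"
    has_biased_language: bool  # hypothetical flag for non-neutral language
    reverted: bool             # reverted within the observation window


def kpi_proportions(edits: Iterable[PublishedEdit],
                    bucket: str) -> Optional[Dict[str, float]]:
    """Return both KPI proportions for one bucket, or None if it has no edits."""
    in_bucket = [e for e in edits if e.bucket == bucket]
    if not in_bucket:
        return None
    n = len(in_bucket)
    return {
        # 1) Proportion of new content edits published without biased language.
        "without_biased_language": sum(not e.has_biased_language for e in in_bucket) / n,
        # 2) Proportion of new content edits that are not reverted.
        "not_reverted": sum(not e.reverted for e in in_bucket) / n,
    }
```

Calling `kpi_proportions(edits, "test")` and `kpi_proportions(edits, "control")` side by side gives the pair of values the KPI compares; any significance testing would sit on top of these counts.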

Secondary metrics

Secondary metrics are used to learn about additional impact of Tone Check, but are not primary targets of the intervention. They reveal side effects (both positive and negative) of trying to improve the Primary Metric with the intervention.

| ID | Hypothesis | Metric(s) for evaluation |
| --- | --- | --- |
| Curiosity #1 | New account holders will be more likely to publish an unreverted edit to the main namespace within 24 hours of creating an account because they will be made aware that the new text they're attempting to publish needs to be written in a neutral tone, when they don't first think/know to write in this way themselves. | Constructive activation. Note: we'd need to break this out by platform. Reason: WE 1.2 is scoped to mobile-only. |
| Curiosity #2 | Newcomers and Junior Contributors will be more aware of the need to write in a neutral tone when contributing new text because the visual editor will prompt them to do so in cases where they have written text that contains non-neutral language. | The proportion of newcomers and Junior Contributors that publish at least one new content edit that does not contain non-neutral language. See T388716. |
| Curiosity #3 | Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that does not include non-neutral language because Tone Check will have caused them to realize when they are at risk of this not being true. | 1) Proportion of newcomers and Junior Contributors that publish an edit Tone Check was activated within and successfully return to make an unreverted edit to the main namespace during the identified retention period. 2) Proportion of newcomers and Junior Contributors that publish an edit Tone Check was activated within and return to make a new content edit without non-neutral language to a page in the main namespace during the identified retention period. |
| Curiosity #4 | Knowing the reasons why people do not elect to revise tone when the Check prompts them to do so (by platform) will help us decide what (if anything) can be done to decrease the proportion of people on desktop who do so. (See discrepancy in dismissal rates by platform in the leading indicator analysis.) | Distribution of decline reasons grouped by platform and experience level. |

Leading indicators

T394463

Guardrails

Guardrails are used to make sure that the new checks presented are not negatively impacting an editor’s experience completing an edit or causing disruption on the wikis. The scenarios named in the chart below emerged through T325851.

| ID | Name | Metric(s) for Evaluation | Notes |
| --- | --- | --- | --- |
| 1) | Edit quality decrease (T317700) | Proportion of published edits that add new content and are still reverted within 48 hours. | Will include a breakdown of the revert rate of published new content edits with and without non-neutral language. |
| 2) | Edit completion rate drastically decreases | Proportion of new content edits started (defined as reaching the point that Tone Check was or would be shown) that are published. | Will include a breakdown by the number of checks shown to identify whether a lower completion rate corresponds with a higher number of checks shown. |
| 3) | Edit abandonment rate drastically increases | Proportion of contributors that are presented Tone Check and abandon their edits (indicated by event.action = abort and event.abort_type = abandon). | We'd like to look at how abandonment rate varies by # of Checks shown in the context of this finding from the leading indicators analysis: "The revert rate of edits in which multiple Checks are shown is higher (24.7%) than edits in which a single Check is shown (13.8%)." |
| 4) | People shown Tone Check are blocked at higher rates | Proportion of contributors blocked after publishing an edit where Tone Check was shown compared to contributors not shown the Tone Check. | |
| 5) | High false positive rate | Proportion of contributors that decline revising the text they’ve drafted and indicate that it was irrelevant. | |
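
Guardrails 2 and 3 both call for a breakdown by the number of checks shown. A minimal sketch of that grouping follows, assuming each edit session has already been reduced to one dictionary holding its final event. The `checks_shown` key and the `"saveSuccess"` action value used to mark a published edit are assumptions made for illustration; only `action = abort` and `abort_type = abandon` come from the table above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def rates_by_checks_shown(
    sessions: Iterable[Mapping[str, object]],
) -> Dict[int, Tuple[float, float]]:
    """Return {checks_shown: (completion_rate, abandonment_rate)}."""
    counts = defaultdict(lambda: {"total": 0, "published": 0, "abandoned": 0})
    for session in sessions:
        group = counts[session["checks_shown"]]  # hypothetical field name
        group["total"] += 1
        if session["action"] == "saveSuccess":  # assumed marker for a published edit
            group["published"] += 1
        elif session["action"] == "abort" and session.get("abort_type") == "abandon":
            group["abandoned"] += 1
        # Any other outcome counts toward the total only.

    return {
        shown: (g["published"] / g["total"], g["abandoned"] / g["total"])
        for shown, g in sorted(counts.items())
    }
```

Comparing the rates for sessions with one check against sessions with several would show whether a lower completion rate tracks a higher number of checks shown.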

A/B Test: Decision Matrix

| ID | Scenario | Indicator(s) | Plan of Action |
| --- | --- | --- | --- |
| 1 | Tone Check is disrupting, discouraging, or otherwise getting in the way of volunteers. Read: people are less likely to publish the edits they start. | ≥20% decrease in edit completion rate in edit sessions where Tone Check is activated relative to edit sessions where Tone Check is not activated. | Pause scaling plans. If results indicate that significant decreases are only associated with a high number of edit checks shown, set a threshold for the maximum number of checks that can be shown within a single session. If we observe significant decreases for both single and multiple checks presented in a single session, investigate changes to the UX. |
| 2 | Tone Check is increasing the likelihood that people will publish destructive edits. | Increase in the proportion of published new content edits where Tone Check was activated that are reverted within 48 hours relative to edits that would have been shown Tone Check but were not. Increase in the proportion of contributors blocked after publishing an edit where Tone Check is shown compared to contributors not shown Tone Check. | Pause scaling plans. Review edits to try to identify any patterns in abuse and propose changes to the UX to mitigate them. |
| 3 | Tone Check is causing people to publish edits that align with project policies and that are not reverted. | Increase in the proportion of new content edits Tone Check was activated within that were published without biased language and are not reverted within 48 hours relative to edits that would have been shown Tone Check but were not. | Move forward with scaling plans. |
| 4 | Tone Check is effective at causing people to publish new content edits without biased language, but those edits are still reverted. | Increase in the proportion of new content edits Tone Check was activated within that were published without biased language AND increase in the proportion of these edits that are reverted within 48 hours relative to edits that would have been shown Tone Check but were not. | Pause scaling plans. Further investigate the tools used to detect biased language (e.g. might the false negative rate be too high?). Analyze and manually review reverted edits to understand why those edits were still reverted. |
| 5 | Tone Check is not effective at causing people to publish new content edits without biased language but is not disruptive to volunteers. | No change or decrease in the proportion of new content edits Tone Check was activated within that were published without biased language AND A) no significant drop in edit completion rate or B) no significant spike in block or revert rates. | Pause scaling plans in order to investigate what could explain Tone Check having a null effect. |
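
Scenario 1 is the only row with a numeric threshold, so it is the easiest to express as a check. The sketch below reads "≥20% decrease" as a relative decrease in completion rate. That reading is an assumption, since the table does not say whether the figure is relative or in percentage points, and the function names are hypothetical.

```python
def relative_decrease(control_rate: float, test_rate: float) -> float:
    """Relative drop of test_rate versus control_rate (0.25 means 25% lower)."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (control_rate - test_rate) / control_rate


def triggers_scenario_1(control_rate: float, test_rate: float,
                        threshold: float = 0.20) -> bool:
    """True when the completion rate fell by at least `threshold`, relatively.

    Assumes the 20% in the decision matrix is a relative change.
    """
    return relative_decrease(control_rate, test_rate) >= threshold
```

Under this reading, a drop from a 50% to a 39% completion rate (22% relative) would trigger the "pause scaling plans" action, while a drop from 50% to 45% (10% relative) would not. Values sitting exactly on the threshold are sensitive to floating-point rounding, so a real analysis would likely compare confidence intervals rather than point estimates.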

Event Timeline

ppelberg renamed this task from "[MILESTONE] Run an A/B test to evaluate impact of Peacock Check" to "[MILESTONE] Run an A/B test to evaluate impact of Tone Check". (May 22 2025, 10:54 PM)
ppelberg updated the task description.