This task covers the work of running an A/B test of Tone Check.
=== Status
|Description|Delivery date|Responsible|Status
|---|---|---|---
| Announce A/B test at pt.wiki (T395154)|Thursday, 21 August| @Trizek-WMF | ✅ Done
|Complete pre-deployment QA (T393817)| **Tuesday, 26 August**|Editing QA|
| Announce A/B test at ja.wiki (T395154)|**Monday, 25 August**| @Trizek-WMF|
|Deploy config to start A/B test at fr, ja, and pt (T389231) | **Thursday, 28 August**| Editing Engineering|
|Verify bucketing instrumentation (T394952) | //TBD//| Editing QA + Product Analytics|
|Verify test bucket balancing (T395090)| //TBD//| Editing QA + Product Analytics|
|Complete first iteration of snapshot-based retraining (T398970)|//TBD//| ML / Research (?)|
=== Overarching hypothesis
If we prompt newcomers and junior contributors to reconsider the tone they are writing in when software detects them using what experienced volunteers would agree is non-neutral/peacock language, then we will decrease the percentage of new content edits newcomers publish that are reverted on the grounds of WP:NPOV (and related policies).
=== Decision to be made
This A/B test will help us make the following decision:
**What – if any – changes in the Peacock Check UX, and/or the model that enables it, will we make before we can be confident in the following...?**
# Newcomers and Junior Contributors that encounter Peacock Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
# Newcomers and Junior Contributors will intuitively interact with the Peacock Check experience in ways that are NOT disruptive to them or the wikis.
=== Open questions
- [x] 1. What – if any – ceiling will we place on the number of Peacock Checks that people can see within a given edit session?
-- No ceiling will be placed on the number of Peacock Checks shown. We'll learn whether such a constraint might be necessary via Guardrail Metric #2 listed below.
=== KPIs
//The main outcomes we are trying to impact through this feature. These are what we are primarily using for evaluating the hypothesis and deciding whether to deploy an intervention more widely.//
|ID|Hypothesis| Decision(s) to be made | Metric(s) for evaluation
|---|---|---|---
|**KPI**|The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain peacock language| //Does showing people a prompt when using non-neutral language lower the likelihood that new content edits include non-neutral language?//| 1) Proportion of all new content edits published without biased language and 2) Proportion of new content edits that are not reverted.
|**KPI**| Newcomers and Junior Contributors will experience Peacock Check as encouraging because it will offer them more clarity about what is expected of the new information they add to Wikipedia | Does showing people a prompt discourage them from publishing would-be quality edits?| Proportion of new content edits started (defined as reaching the point at which Peacock Check was or would be shown) that are successfully published (not reverted).
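
A minimal sketch of how the two proportions in the first KPI row could be computed, assuming each published new content edit is available as a record with two flags. The `Edit` shape and its field names are hypothetical and are not taken from the actual instrumentation.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """One published new content edit (hypothetical record shape)."""
    has_biased_language: bool  # assumption: flagged by the tone model at publish time
    reverted: bool             # assumption: reverted within the chosen revert window


def kpi_proportions(edits):
    """Return (share published without biased language, share not reverted)."""
    if not edits:
        raise ValueError("no edits to evaluate")
    total = len(edits)
    neutral = sum(1 for e in edits if not e.has_biased_language)
    kept = sum(1 for e in edits if not e.reverted)
    return neutral / total, kept / total
```

In the A/B test, these proportions would be computed separately for the control and test groups and then compared.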
=== Secondary metrics
//Used to learn about additional impact of Peacock Check, but are not primary targets of the intervention. They reveal side effects (both positive and negative) of trying to improve the Primary Metric with the intervention.//
|ID|Hypothesis|Metric(s) for evaluation
|---|---|---
|**Curiosity #1**| New account holders will be more likely to publish an unreverted edit to the main namespace within 24 hours of creating an account because they will be made aware that the new text they're attempting to publish needs to be written in a neutral tone, when they don't first think/know to write in this way themselves| Constructive activation. //Note: we'd need to break this out by platform. Reason: WE 1.2 is scoped to mobile-only.//
|**Curiosity #2**|Newcomers and Junior Contributors will be more aware of the need to write in a neutral tone when contributing new text because the visual editor will prompt them to do so in cases where they have written text that contains peacock language.|The proportion of newcomers and Junior Contributors that publish at least one new content edit that does not contain peacock language. //See T388716.//
|**Curiosity #3**|Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that does //not// include peacock language because Peacock Check will have caused them to realize when they are at risk of this not being true.|**1)** Proportion of newcomers and Junior Contributors that publish an edit Peacock Check was activated within and successfully return to make an unreverted edit to the main namespace during the identified retention period. **2)** Proportion of newcomers and Junior Contributors that publish an edit Peacock Check was activated within and return to make a new content edit without non-neutral language to a page in the main namespace during the identified retention period.
=== Leading indicators
T394463
=== Guardrails
//Used to make sure that the new checks presented are not negatively impacting an editor’s experience completing an edit or causing disruption on the wikis. The scenarios named in the chart below emerged through T325851.//
|ID|Name|Metric(s) for Evaluation
|---|---|---
|**1)**|Edit quality decrease (T317700)|Proportion of published edits that add new content and are still reverted within 48 hours. //Note: Will include a breakdown of the revert rate of published new content edits with and without non-neutral language.//
|**2)**|Edit completion rate drastically decreases|Proportion of new content edits started (defined as reaching the point at which Peacock Check was or would be shown) that are published. //Note: Will include a breakdown by the number of checks shown to identify whether a lower completion rate corresponds with a higher number of checks shown.//
|**3)**| Edit abandonment rate drastically increases |Proportion of contributors that are presented Peacock Check and abandon their edits (indicated by `event.action = abort` and `event.abort_type = abandon`).
|**4)**|People shown Peacock Check are blocked at higher rates|Proportion of contributors blocked after publishing an edit where Peacock Check was shown compared to contributors not shown the Peacock Check
|**5)**|High false positive rate|Proportion of contributors that decline revising the text they’ve drafted and indicate that it was irrelevant.
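
As an illustration of guardrails 2 and 3, the sketch below computes an abandonment rate and a completion rate broken down by the number of checks shown. It assumes one dictionary per edit session with the hypothetical keys `checks_shown`, `action`, and `abort_type`. Only the values `abort` and `abandon` come from the table above; the `publish` action value is an assumption. It also counts sessions rather than contributors, which is a simplification of guardrail 3.

```python
from collections import defaultdict


def abandonment_rate(sessions):
    """Share of sessions shown at least one check that ended in an abandoned edit."""
    shown = [s for s in sessions if s["checks_shown"] > 0]
    if not shown:
        return 0.0
    abandoned = sum(
        1 for s in shown
        if s["action"] == "abort" and s.get("abort_type") == "abandon"
    )
    return abandoned / len(shown)


def completion_rate_by_checks_shown(sessions):
    """Map number of checks shown -> share of those sessions that were published."""
    counts = defaultdict(lambda: [0, 0])  # checks_shown -> [published, total]
    for s in sessions:
        bucket = counts[s["checks_shown"]]
        bucket[1] += 1
        if s["action"] == "publish":  # assumption: hypothetical "published" marker
            bucket[0] += 1
    return {k: pub / total for k, (pub, total) in sorted(counts.items())}
```

A completion rate that drops as `checks_shown` increases would be the pattern the note on guardrail 2 is looking for.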
=== A/B Test: Decision Matrix
| ID | Scenario | Indicator(s) | Plan of Action |
|----|----------|--------------|----------------|
| 1 | Peacock Check is disrupting, discouraging, or otherwise getting in the way of volunteers. //Read: people are less likely to publish the edits they start.// | ≥20% decrease in edit completion rate in edit sessions where Peacock Check is activated relative to edit sessions where Peacock Check is not activated. | **Pause** scaling plans; if results indicate that significant decreases are only associated with a high number of edit checks shown, set a threshold for the maximum number of checks that can be shown within a single session. If we observe significant decreases for both single and multiple checks presented in a single session, investigate changes to the UX. |
| 2 | Peacock Check is increasing the likelihood that people will publish destructive edits. | Increase in the proportion of published new content edits where Peacock Check was activated that are reverted within 48 hours relative to edits that //would have been shown// Peacock Check but were not. Increase in proportion of contributors blocked after publishing an edit where Peacock Check is shown compared to contributors not shown Peacock Check. | **Pause** scaling plans; review edits to try to identify any patterns in abuse and propose changes to UX to mitigate them. |
| 3 | Peacock Check is causing people to publish edits that align with project policies and that are not reverted. | Increase in the proportion of new content edits Peacock Check was activated within that were published without biased language and are not reverted within 48 hours relative to edits that would have been shown Peacock Check but were not. | **Move forward** with scaling plans |
| 4 | Peacock Check is effective at causing people to publish new content edits without biased language, but those edits are still reverted. | Increase in the proportion of new content edits Peacock Check was activated within that were published without biased language AND increase in the proportion of these edits that are reverted within 48 hours relative to edits that would have been shown Peacock Check but were not. | **Pause** scaling plans; Further investigation into tools used to detect biased language (e.g. might the false negative rate be too high); Analysis and manual review of reverted edits to understand why those edits were still reverted. |
| 5 | Peacock Check is not effective at causing people to publish new content edits without biased language but is not disruptive to volunteers. | No change or decrease in the proportion of new content edits Peacock Check was activated within that were published without biased language AND A) no significant drop in edit completion rate or B) no significant spike in block or revert rates. | **Pause** scaling plans in order to investigate what could explain Peacock Check having a `null` effect. |
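
The indicator for scenario 1 can be sketched as a simple threshold check. This assumes the "≥20% decrease" is a relative decrease against the control completion rate rather than a drop of 20 percentage points, which the matrix does not specify. The function names are hypothetical, and a real analysis would also need to test whether the difference is statistically significant.

```python
def relative_decrease(control_rate, treatment_rate):
    """Relative drop of the treatment rate against the control rate (0.3 means 30%)."""
    if control_rate <= 0:
        raise ValueError("control rate must be positive")
    return (control_rate - treatment_rate) / control_rate


def triggers_scenario_1(control_rate, treatment_rate, threshold=0.20):
    """True when the completion rate dropped by at least `threshold` (assumed relative)."""
    return relative_decrease(control_rate, treatment_rate) >= threshold
```

Because the comparison uses floating-point rates, values that land exactly on the threshold can fall on either side of it. Computing the rates from raw counts with `fractions.Fraction` is one way to avoid that.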