
[A/B Test] Report on Tone Check leading indicators
Closed, Resolved · Public

Description

≥2 weeks after the start of the Tone Check A/B Test (T387918), we will check on a set of leading indicators (outlined below).

We will use this ticket to scope and conduct this analysis.

Analysis timing

Target completion date: Wednesday, 24 Sep 2025

Decision(s) to be made

What – if any – adjustments/investigations will we prioritize for us to be confident moving forward with evaluating the Peacock Check's impact in T387918?

Leading indicators

Metrics

ID | Name | Owner | Metric(s) for Evaluation | Conclusion
1. | Newcomers and Junior Contributors are not encountering Peacock Check | Editing | Proportion of new content edits Peacock Check is shown within |
2. | Newcomers and Junior Contributors are not understanding the feature | Editing | Proportion of contributors that are presented Peacock Check and abandon their edits |
3. | People deem Peacock Check irrelevant | Editing | Proportion of edits wherein people elect to dismiss/not change the text they've added |
4. | Peacock Check is causing disruption | Editing | 1) Proportion of people blocked after publishing an edit where Peacock Check was shown and 2) Proportion of published edits that add new content and are reverted within 48 hours |
5. | Model is not able to evaluate tone of published edit quickly enough | Editing | Proportion of edits that are published before the model is able to return an evaluation. See T388716. |
6. | Model service availability | ML | Service Availability SLO: 95% of all requests return a 200/300/400 response code. See T390706 | 99.99% (T405338)
7. | Model is not delivering responses quickly enough | ML | Proportion of all requests that return a response within 1000 milliseconds | 97.04% (T405338)
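
For concreteness, the two ML-owned indicators (rows 6. and 7.) can be read as standard Prometheus-style SLIs. Below is a minimal sketch of how they might be expressed as PromQL, assuming the service exports a conventional request counter and a latency histogram; the metric and label names are illustrative placeholders, not the actual LiftWing metrics:

```python
# Hypothetical PromQL for the two ML-owned SLIs. The metric name
# (inference_request_total) and its labels are illustrative placeholders,
# not the actual LiftWing/Prometheus metric names.

# Row 6 -- availability: share of requests answered with a 2xx/3xx/4xx code.
AVAILABILITY_SLI = '''
sum(rate(inference_request_total{code=~"2..|3..|4.."}[30d]))
/
sum(rate(inference_request_total[30d]))
'''

# Row 7 -- latency: share of requests completed within 1000 ms, read from the
# cumulative le="1.0" bucket of a request-duration histogram.
LATENCY_SLI = '''
sum(rate(inference_request_duration_seconds_bucket{le="1.0"}[30d]))
/
sum(rate(inference_request_duration_seconds_count[30d]))
'''

# Targets: 95% availability per the table above; the latency SLI is compared
# against its own threshold (90%, per the SLI quoted in the comments below).
AVAILABILITY_TARGET = 0.95
LATENCY_TARGET = 0.90
```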

Details

Other Assignee
MNeisler

Related Objects

Event Timeline

ppelberg renamed this task from "[A/B Test] Report on Peacock Check (References) leading indicators" to "[A/B Test] Report on Peacock Check leading indicators".
Aklapper renamed this task from "[A/B Test] Report on Peacock Check leading indicators" to "[A/B Test] Report on Tone Check leading indicators". May 28 2025, 11:43 AM
ppelberg added a subscriber: isarantopoulos.

Next step(s)
Per today's ML-Editing meeting, @isarantopoulos is going to update the leading indicators the Machine Learning team is responsible for (rows "6." and "7.") to align with the SLO (and the dashboard we'll use to monitor the service) in T390706.

I have a follow-up question regarding row 6:
We have the following SLI to monitor model latency: 90% of all requests return a response within 1000 milliseconds.
However, row 6 states:

Proportion of edits that are published before the model is able to return an evaluation.

There is no way for the ML service to know when the edit is published. If we need to calculate this metric, it will happen on the VisualEditor side. As far as I can see, the work to log this has already been implemented in https://phabricator.wikimedia.org/T388716#10872915.
@ppelberg shall I add an 8th row to capture the latency SLI related to the LiftWing model I mention above?

Oh, yes. Please! Thank you, @isarantopoulos (and I'm sorry for the lag).
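
Once the T388716 instrumentation is in place, the publish-vs-evaluation race reduces to a per-edit timestamp comparison. A minimal sketch, assuming each edit-attempt row carries a save timestamp and an (optional) model-response timestamp; these field names are hypothetical, not the actual schema:

```python
# Minimal sketch: deriving "edits published before the model responded" from
# client-side event rows. Field names (save_ts, model_response_ts) are
# hypothetical, not the actual T388716 schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EditAttempt:
    save_ts: float                      # when the user published the edit
    model_response_ts: Optional[float]  # None if no evaluation ever arrived

def published_before_evaluation(edits: list[EditAttempt]) -> float:
    """Proportion of published edits saved before the model returned."""
    if not edits:
        return 0.0
    raced = sum(
        1 for e in edits
        if e.model_response_ts is None or e.save_ts < e.model_response_ts
    )
    return raced / len(edits)

# Example: one of three published edits beat the model's response.
sample = [
    EditAttempt(save_ts=10.0, model_response_ts=9.2),
    EditAttempt(save_ts=10.0, model_response_ts=11.5),
    EditAttempt(save_ts=10.0, model_response_ts=9.8),
]
print(published_before_evaluation(sample))  # -> 0.333...
```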

5. | People are encountering Peacock Check too late in the editing journey | Editing | Proportion of Peacock Checks shown Mid-Edit compared to Pre-save

Per today's offline meeting with @MNeisler, we're removing the above from the scope of this task.

Reason: as currently designed, someone cannot encounter a Tone Check in the Pre-save moment without having first seen it during the Mid-edit moment.

Per offline discussion with @isarantopoulos, what's needed to report on metrics related to the Tone Check model's availability and latency is in place (metrics 5., 6., and 7. above).

We have faced some difficulties reporting on 6. and 7., related to how the Prometheus metrics and functions are defined. One can read more about the challenges on the Wikitech SLO page.
In T405338: Calculate tone check model service metrics for fixed calendar window, we managed to extract results for these indicators covering the previous 21 days the experiment has been running. Although they are not 100% accurate due to the aforementioned issues, they still provide good insight into how the service behaved during that period, and both indicators are above their defined thresholds.
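
For illustration, a fixed-calendar-window extraction like the one in T405338 can be done with a single instant query against the Prometheus HTTP API, evaluated at the window's end. A minimal sketch; the endpoint URL and metric names are placeholders, not the production ones:

```python
# Sketch of pulling the availability SLI over a fixed 21-day calendar window
# via the Prometheus HTTP API. Endpoint and metric names are placeholders.
import requests
from datetime import datetime, timezone

PROMETHEUS = "https://prometheus.example.org/api/v1/query"  # placeholder URL

end = datetime(2025, 9, 29, tzinfo=timezone.utc)
window_hours = 21 * 24  # fold the whole 21-day window into one range selector

# Availability over the window, evaluated at the window's end.
query = (
    f'sum(increase(inference_request_total{{code=~"2..|3..|4.."}}[{window_hours}h]))'
    f' / sum(increase(inference_request_total[{window_hours}h]))'
)

resp = requests.get(PROMETHEUS, params={"query": query, "time": int(end.timestamp())})
resp.raise_for_status()
availability = float(resp.json()["data"]["result"][0]["value"][1])
print(f"Availability over the fixed window: {availability:.2%}")
```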

I've completed an initial analysis of Leading Indicators 1-5. See a summary of key findings below and the full report for details on other breakdowns and methodology.

Note: Results are based on initial A/B test data logged between 8 September and 22 September. More event data will be needed to confirm statistical significance for many of these findings, especially for per-user-experience or per-Wikipedia breakdowns. We will review the complete A/B test data (based on the full two-week duration) as part of the analysis in T387918.

Summary of Results

  1. Proportion of edits Tone Check is shown within
    • Tone Check has been shown within 421 editing sessions across all three partner Wikipedias over the reviewed two-week timeframe. This represents only about 0.1% of all edit attempts. When limited to saved edits, Tone Check has been shown in 9% of all published new content edits (125 of 1,377 published edits in the test group) since the A/B test started.
    • It appears slightly more frequently in desktop published edits than in mobile ones: 9.5% of all published new content edits on desktop versus 7.6% on mobile.
  2. Proportion of contributors that are presented Tone Check and complete their edits
    • We’ve observed a slight -2.3% decrease in the edit completion rate for edits shown Tone Check compared to eligible edits in the control group. In the test group, 66.7% of all edits shown Tone Check were successfully completed, compared to 68.3% in the control group. So far, there have been no significant decreases in edit completion rate by experience level, Wikipedia, or for editing sessions where multiple tone checks were shown. (A sketch of the significance test this comparison calls for follows this list.)
  3. Proportion of edits wherein people elect to dismiss/not change the text they've added
    • A little over half of all published edits where Tone Check was shown (57%) included at least one tone check that the user dismissed. This is similar to the rates observed for Reference Check.
    • Tone checks are dismissed more frequently on desktop than on mobile: 63.8% of all published desktop edits where Tone Check was shown include at least one check that was dismissed, compared to 39% of all published mobile edits.
  4. Proportion of published edits that add new content and are reverted within 48 hours
    • There have been no significant changes in the revert rate of all new content edits overall, by platform, or by Wikipedia. However, we’ve observed decreases in revert rate when limiting to edits where Tone Check was shown or eligible to be shown.
    • We’ve observed a -5.3% decrease in the revert rate of desktop edits and a -19% decrease in the revert rate of mobile edits where Tone Check was shown at least once in an editing session, compared to eligible edits in the control group.
    • For edits shown Tone Check where the text was revised to address the issue, we're currently seeing a revert rate nearly half that of eligible control edits. More published edit data is needed to confirm impacts on a per-Wikipedia and per-platform basis.
  5. Proportion of people blocked after publishing an edit where Tone Check was shown
    • Less than 1% of users have been blocked after publishing an edit where at least one tone check was shown.
  6. Proportion of edits that are published before the model is able to return an evaluation
    • Only about 0.6% of all published edits (264 edits) in the A/B test were saved before the model returned an evaluation. The majority of these edits occurred in the control group and on desktop.
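
The completion-rate gap in finding 2 (66.7% test vs 68.3% control) is small enough that its significance depends on sample size. Below is a minimal sketch of the two-proportion z-test the full data would support; the counts are placeholders chosen only to reproduce the observed rates, not the real denominators:

```python
# Two-proportion z-test for the completion-rate gap (66.7% vs 68.3%).
# Counts below are placeholders that reproduce the observed rates; the real
# denominators come from the full A/B test data in T387918.
from math import sqrt, erf

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Return (z, two-sided p) for H0: the two proportions are equal."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # standard normal CDF at |z|
    return z, 2 * (1 - phi)

# Hypothetical counts: 281/421 ~ 66.7% (test), 1366/2000 ~ 68.3% (control).
z, p = two_proportion_z(281, 421, 1366, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With placeholder samples of this size the gap sits well within noise (p ≈ 0.5), which matches the caution above about needing more event data.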

cc @ppelberg

This data is incredibly useful. Wonderfully done, @MNeisler.

A clarifying question for you...

Tone Check has been shown in 9% of all published new content edits (125 of 13,777 published edits in the test group) since the AB test started.

The above is a quote from the report. Might you have intended to say, "...(125 of 1,377 published edits in the test group)..." ?

...what's currently written suggests Tone Check has been shown in 0.9% of all published new content edits.


Re: next steps, we are discussing these findings internally and will decide within the next two weeks whether and how to adjust the experience in response to this early data.

The above is a quote from the report. Might you have intended to say, "...(125 of 1,377 published edits in the test group)..." ?

@ppelberg
Yes, this was a typo. That statement should read "125 of 1,377 published edits in the test group". I've updated the relevant text in the summary provided in T394463#11211521 and in the final report. Thank you for catching this!

Excellent, ok. And you bet.

Path forward

What – if any – adjustments/investigations will we prioritize for us to be confident moving forward with evaluating the Tone Check's impact in T387918?

Per what we [i] discussed offline, we have decided not to make any adjustments to the Tone Check user experience or experiment design right now.

Instead, we will consider these adjustments in a future experiment (T407537).


It's important to note that we did consider evaluating the potential impact of lowering the Tone Check model's confidence threshold, to see whether doing so could broaden the feature's reach without diminishing the positive effects it appears to be having in the ~9% of edits it is currently activated within.

Ultimately, we decided not to do so for the following reasons:

  1. It would be most helpful to review any would-be changes in production
  2. The Tone Check UX is only available in production at the three wikis where an A/B experiment is currently running
  3. Adjusting the model at said "three wikis" would negatively impact our ability to draw meaningful conclusions on the effect of the intervention (via T387918) before Q2's end
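
For context, evaluating the threshold question above would start with an offline sweep of how coverage and precision move as the confidence threshold drops. A minimal sketch on synthetic scores; the distributions and thresholds are illustrative only, and real numbers would come from scored edits with human judgments:

```python
# Offline sweep of the coverage/precision trade-off as the confidence
# threshold drops. Scores and labels here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=10_000)     # synthetic model confidence scores
labels = rng.random(10_000) < scores     # synthetic "truly promotional" labels

for threshold in (0.9, 0.8, 0.7, 0.6, 0.5):
    flagged = scores >= threshold
    coverage = flagged.mean()            # share of edits the check would fire on
    precision = labels[flagged].mean() if flagged.any() else float("nan")
    print(f"threshold {threshold:.1f}: coverage {coverage:5.1%}, precision {precision:5.1%}")
```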

i. Megan and members of WE1.1 teams