
[A/B Test] Report on Reference Check (en.wiki) leading indicators
Closed, Resolved · Public

Description

≥2 weeks after the start of the Reference Check A/B Test (T400101), we will check on a set of leading indicators (outlined below).

We will use this ticket to scope and conduct this analysis.

Analysis timing

Analysis can begin on 18 November 2025

NOTE: analysis will begin ≥2 weeks after the start of the Reference Check A/B Test

Decisions to be made

  • 1. What – if any – UX adjustments/investigations will we prioritize for us to be confident moving forward with evaluating Reference Check's impact in T400101?
    • None. See the "Conclusion" column in the Leading indicators table below.
  • 2. What – if any – adjustments will we make to the experiment's design to ensure enough newcomers are encountering Reference Check for us to draw statistically significant conclusions about it?
    • None. See the "Conclusion" column in the Leading indicators table below.

Leading indicators

Metrics

| ID | Name | Owner | Metric(s) for Evaluation | Conclusion |
| --- | --- | --- | --- | --- |
| 1 | Newcomers are not encountering Reference Check | Editing | ⭐ Proportion of new content edits Reference Check is shown within | Reference Check is shown within a sufficient number of new-content edits: it was shown at least once in 42.4% of all published new-content edits by newer editors in the test group. For reference, this rate is higher than the rates observed for Tone Check (9%) and Paste Check (36%). |
| 2 | Newcomers are not understanding the feature | Editing | ⭐ Proportion of contributors that are presented Reference Check and abandon their edits | Edits shown Reference Check are completed at a lower rate (87.1%) than eligible edits not shown Reference Check (90.6%), a 4% relative decrease. This slight decrease is not surprising, as we are introducing an extra step in the workflow; because it is well below a 10% relative difference, we do not see signs of concern at this time. |
| 3 | People deem Reference Check irrelevant | Editing | Proportion of edits wherein people elect NOT to cite the text they are attempting to add | |
| 4 | Reference Check is causing disruption | Editing | 1) Proportion of published edits that add new content and are reverted within 48 hours; 2) Proportion of people blocked after publishing an edit where Reference Check was shown | 1) Published new-content edits shown Reference Check are reverted less frequently, a 13.7% relative decrease compared to eligible edits not shown the check (29.3% control vs. 25.3% treatment). |

⭐ = Metrics we will consider required and prioritize work on first
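
For clarity on how the starred proportion in row 1 is computed, here is a minimal sketch assuming a hypothetical event-level table; the column names (is_published, adds_new_content, reference_check_shown) are illustrative placeholders, not the actual logging schema.

```python
# Minimal sketch of metric 1 (proportion of published new-content edits shown
# Reference Check at least once). The DataFrame columns used here are
# illustrative placeholders, not the actual event schema.
import pandas as pd

def reference_check_show_rate(edits: pd.DataFrame) -> float:
    """Share of published new-content edits shown Reference Check at least once."""
    eligible = edits[edits["is_published"] & edits["adds_new_content"]]
    if eligible.empty:
        return float("nan")
    return eligible["reference_check_shown"].mean()
```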

Done

Event Timeline

ppelberg updated the task description.

@ppelberg: I’ve completed an analysis of Reference Check priority leading indicators as described in T405421. See the summary of findings below; the full report (GitLab) includes expanded metrics and breakdowns by platform, number of checks, and user experience. Note: Data reflects events logged in the first two weeks of the Reference Check A/B test on English Wikipedia. As with other Edit Check leading-indicator assessments, additional event volume will be required to confirm statistical significance. Full A/B test results will be reviewed in the forthcoming analysis for T400101.

Reference links (referenced below):
Tone Check Leading Indicators
Paste Check Leading Indicators
Multi-Check Indicators & 2025 Multiple Reference Check A/B test
2024 Reference Check A/B test

TL;DR: Early indicators suggest Reference Check appears fairly often, may slightly lower edit completion rates, and is associated with early reductions in revert rates across all user types.

Brief Summary
Early indicators suggest Reference Check is shown fairly often—more than Paste Check or Tone Check, but less frequently than earlier multi-check estimates—and it fires more on mobile web than desktop. While the check may slightly lower edit completion rates (especially when many checks appear), it is also associated with reduced revert rates across all user groups. Effects vary by experience level: newcomers see more checks, and although completion rates rise for unregistered and newcomer editors, they fall for junior contributors.

Summary
Reference Check Frequency: Reference Check was shown at least once in 42.4% of all published new-content edits by newer editors in the test group. This is higher than the rate observed for Paste Check (36%), higher than Paste Check's initial estimates in T403861 for published edits, and much higher than the rate for Tone Check in the Leading Indicators Analysis (9%). This frequency is lower than trends observed in the Multi-Check Indicators Analysis, where Reference Check was presented in about 78% of published new-content edits.

  • By platform: A notably higher proportion of mobile web edits were shown Reference Check (76.3%) compared to desktop (38.2%). This contrasts with the patterns in the Paste Check Leading Indicators report, where desktop edits were more frequently shown Paste Check (39%) than mobile web edits (24%).
  • By user experience: Reference Check appears slightly more frequently for newcomers: newcomer new-content edits are 2.5% more likely to be shown Reference Check relative to unregistered users, and 38.8% more likely relative to junior contributors. In the 2025 Multiple Reference Check A/B test we observed a noticeably stronger effect for newcomers than for unregistered users; however, junior contributors in the treatment group showed the highest overall rates of adding references when exposed to multiple reference checks.

Edit Completion Rate: Edits shown Reference Check are completed at a lower rate (87.1%) than eligible edits not shown Reference Check (90.6%), a 4% relative decrease (re-derived in the sketch after the bullets below). This is directionally consistent with the 2024 Reference Check A/B test, where showing Reference Check produced a 10% decrease in edit completion rate relative to control.

  • By platform: Completion decreased modestly on both platforms. On mobile web there was a 1.5% relative decrease for the treatment group (79.7%) compared to the control (80.9%). Desktop saw a 5.9% relative decrease for the treatment group (89.2%) compared to the control (94.8%). In the 2024 Reference Check A/B test, the pattern was more dramatic on mobile (–24.3%) than desktop (–3.1%). Early trends here look milder by comparison but consistent in direction.
  • By user experience: Edit completion rates increased for unregistered editors (control: 80.8%, treatment: 84.7%) and for newcomers (control: 84.3%, treatment: 87.1%). Junior contributors in the treatment group (87.8%) saw a 6.4% relative decrease compared to their control-group counterparts (93.8%). This variation across user experience levels echoes differences seen in the 2024 Multi-Check Leading Indicators Analysis, where unregistered editors were largely unaffected, newcomers showed small declines when multiple checks were presented, and junior contributors in the treatment group showed a slight relative increase (3%) compared to the control.
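
As a quick consistency check, a small helper re-deriving the relative changes from the completion rates quoted above; the rates are copied from this comment, nothing else is assumed.

```python
# Re-derive the relative changes quoted above from the reported completion rates.
def relative_change(control: float, treatment: float) -> float:
    """Relative change of treatment vs. control, as a fraction (negative = decrease)."""
    return (treatment - control) / control

print(f"overall:             {relative_change(0.906, 0.871):+.1%}")  # -3.9%, the ~4% above
print(f"mobile web:          {relative_change(0.809, 0.797):+.1%}")  # -1.5%
print(f"desktop:             {relative_change(0.948, 0.892):+.1%}")  # -5.9%
print(f"junior contributors: {relative_change(0.938, 0.878):+.1%}")  # -6.4%
```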

Revert Rate: Published new-content edits shown Reference Check are reverted less frequently, with a 13.7% relative decrease compared to eligible edits not shown the check (29.3% for the control and 25.3% for the treatment). This is steeper than the 8.6% relative decrease observed in the 2024 Reference Check A/B test, and higher than revert rates in the 2025 Multiple Edit Checks A/B test, where control and treatment estimates were around 22.5–23.6%. Important note: these revert rates include edits where the final published text may not include a reference. We plan to review, in the A/B test analysis, the proportion of new-content edits shown (or eligible to be shown) Reference Check that include a reference.

  • By platform: Both desktop (24.4%) and mobile web (29.1%) treatment groups show improved revert rates relative to their controls (desktop: 25.3%, mobile web: 45.8%). Mobile web editors in the treatment group saw a 36.5% relative decrease compared to the control group while desktop editors saw a 3.6% relative decrease. This is consistent with earlier findings: In the 2024 Reference Check A/B test, relative revert rates decreased on both platforms (desktop –9.4%, mobile –5.9%) and in the 2025 Multiple Reference Check A/B test, mobile web treatment group edits also tended to show higher revert rates than desktop treatment group edits.
  • By user experience: Revert rates decreased across all user experience types. Unregistered editors saw a decrease from 36.8% in the control to 32.2% in the treatment, newcomers from 42.3% to 39.6%, and junior contributors from 25.1% to 20.4%. This pattern is consistent with the 2024 Reference Check A/B test and, in part, with the 2025 Multiple Reference Checks A/B test analyses, where revert rates decreased for newcomers and slightly increased for junior contributors and unregistered editors when comparing treatment to control groups. The 2025 Multiple Reference Checks A/B test highlights, "Results vary slightly based on the type of user completing the edit but none of the observed changes were statistically significant." (A sketch of the kind of significance test involved follows below.)
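
On the significance caveat quoted above: a minimal sketch of a two-proportion z-test one could run per comparison once event volume is sufficient. The 29.3%/25.3% revert rates are the ones reported in this comment, but the per-arm sample sizes are hypothetical placeholders.

```python
# Two-proportion z-test sketch for a control-vs-treatment revert-rate comparison.
# The counts below are hypothetical; only the 29.3%/25.3% rates come from above.
from math import erf, sqrt

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal-approximation tail
    return z, p_value

# e.g. 293/1000 control reverts vs. 253/1000 treatment reverts
z, p = two_proportion_z_test(293, 1000, 253, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # at n=1000 per arm: z ≈ 2.01, p ≈ 0.045
```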

On dismissal data: We expected every dismissal to include one of four required reasons (Other, Irrelevant, Uncertain, or Common Knowledge), since users must select a reason when dismissing Reference Check. In practice, we see cases where Reference Check was shown and the user dismissed without a reason provided.

Edge-case note
Observed: Of the sessions where users dismissed a Reference Check (action-reject), 11% have no corresponding reason event in the logs.
Impact: This does not affect any primary reporting for this ticket. It would only affect optional dismissal-reason breakouts, which are out of scope.
Decision: Proceed with the primary scope of the report. Dismissal-reason breakouts are excluded from this analysis.
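
For reproducibility, a rough sketch of how the 11% figure can be derived from session logs; session_id, action, and reason are assumed field names, with only the action-reject value taken from the note above.

```python
# Sketch: share of Reference Check dismissal sessions lacking a reason event.
# Column names are assumed; only the "action-reject" value comes from the note.
import pandas as pd

def missing_reason_rate(events: pd.DataFrame) -> float:
    """Fraction of action-reject sessions with no reason event logged."""
    reject_sessions = set(events.loc[events["action"] == "action-reject", "session_id"])
    reason_sessions = set(events.loc[events["reason"].notna(), "session_id"])
    if not reject_sessions:
        return float("nan")
    return len(reject_sessions - reason_sessions) / len(reject_sessions)
```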

@Iflorez: thank you for bringing this all together! A small question for you below.

Before that...

Mobile web editors in the treatment group saw a 36.5% relative decrease compared to the control group while desktop editors saw a 3.6% relative decrease.

Wow. So Reference Check is having ~10x higher impact on reducing revert rates on mobile than it is on desktop?

Question

This frequency is lower than trends observed in the Multi-Check Indicators Analysis, where Reference Check was presented in about 78% of published new-content edits.

Might you be able to share where specifically this 78% is coming from? I didn't see this metric when giving the Multi-Check analysis a quick read.


Meta: I very much appreciate how you accompanied each finding with what we saw/are seeing with Tone Check, Paste Check, and previous Reference Check experiments (Multi-Check included).

On dismissal data:
We expected every dismissal to include one of four required reasons (Other, Irrelevant, Uncertain, or Common Knowledge), since users must select a reason when dismissing Reference Check. In practice, we see cases where Reference Check was shown and the user dismissed; 11% of dismissals have no reason provided.

Noted. We're investigating this in T412129.

ppelberg updated the task description.

@ppelberg Thank you for the questions, happy to clarify.

Question

This frequency is lower than trends observed in the Multi-Check Indicators Analysis, where Reference Check was presented in about 78% of published new-content edits.

Might you be able to share where specifically this 78% is coming from?

The referenced 78% comes from the Multi-Check Indicators Analysis, specifically the section "Published new content edits shown at least one reference check by experiment group". I'll also add this link directly to the summary comment on this ticket.

Wow. So Reference Check is having ~10x higher impact on reducing revert rates on mobile than it is on desktop?

Short answer: Not necessarily.
In more detail: Early indicator data suggests Reference Check is linked to lower revert rates on both desktop and mobile web. The relative reduction appears more pronounced on mobile web and may be influenced by mobile’s higher baseline revert rate in this slice. This doesn’t show that Reference Check works better on mobile web; this result is directional and correlational. We’ll evaluate revert rates in the A/B analysis and share those results once available.
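
To make the baseline point concrete, here are the same platform numbers framed two ways; the rates are copied from the summary above, nothing else is assumed.

```python
# Same revert-rate numbers, two framings: mobile web's larger relative decrease
# sits on a much higher baseline than desktop's.
rates = {"desktop": (0.253, 0.244), "mobile web": (0.458, 0.291)}  # (control, treatment)

for platform, (control, treatment) in rates.items():
    absolute_pp = (treatment - control) * 100        # percentage points
    relative = (treatment - control) / control       # fraction of the baseline
    print(f"{platform}: {absolute_pp:+.1f} pp absolute, {relative:+.1%} relative")

# desktop:    -0.9 pp absolute,  -3.6% relative
# mobile web: -16.7 pp absolute, -36.5% relative
```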

Appreciate that feedback, and I'm glad the comparisons across Tone Check, Paste Check, and previous Reference Check experiments (Multi-Check included) were helpful. Those were included to make the directional context and tradeoffs clear for decision-making.