
[WE 1.1] Run a controlled experiment to evaluate impact of Reference Check at en.wiki
Closed, Resolved · Public

Description

This task involves the work of running a controlled experiment of Reference Check that is isolated to English Wikipedia.

This test:

  • Builds on the previous A/B test we ran of Reference Check at 15 Wikipedias (T342930)
  • Is a response to:
    • Reference Check not yet being available at en.wiki
    • Volunteers there making multiple inquiries/proposals about enabling Reference Check at en.wiki (1, 2)

Experiment timeline

Milestone / Target Completion Date / Responsible / Notes

  • Start test: 6 Nov. 2025 (Editing Engineering)
  • Verify test bucket balancing (T406134): 13 Nov. 2025 (Editing QA + @MNeisler)
  • Publish leading indicators report (T405421): 3 Dec. 2025 (@Iflorez). Analysis can begin ~2 weeks after test starts.
  • End test: 4 Dec. 2025 (Editing Engineering). Analysis can begin ~6 weeks after test starts.
  • Publish final report: 19 Dec. 2025 (@Iflorez). ~2 weeks after analysis starts.
  • Share statistically significant conclusions on-wiki: January 2026 (@Sdkb)
NOTE: timeline was last updated 12 Nov 2025 following offline conversation between Megan and Peter

Decision(s) To Be Made

  • 1. Will Reference Check be enabled by default for newcomers editing with VE at en.wiki? If so, how, if at all, would experienced volunteers like the Check's default configuration to be changed?
    • Enabled: Yes
    • Configuration: TBD; @Sdkb is preparing an announcement to start a discussion about this

Hypotheses

ID / Hypothesis / Metric(s) for evaluation

  • KPI: The number of constructive edits newcomers publish will increase because a greater percentage of edits that add new content will include a reference or an explicit acknowledgement as to why these edits lack references.
    Metrics: 1) Proportion of published edits that add new content and include a reference or an explicit acknowledgement of why a citation was not added; 2) Proportion of published edits that add new content (T333714) and are constructive (read: NOT reverted within 48 hours).
  • Curiosity #1: Newcomers will be more aware of the need to add a reference when contributing new content because the visual editor will prompt them to do so in cases where they have not done so themselves.
    Metric: Increase in the proportion of newcomers that publish at least one new content edit that includes a reference.
  • Curiosity #2: Newcomers will be more likely to return to publish a new content edit in the future that includes a reference because Reference Check will have caused them to realize references are required when contributing new content to Wikipedia.
    Metrics: 1) Proportion of newcomers that publish an edit Reference Check was activated within and return to make an unreverted edit to a main namespace during the identified retention period; 2) Proportion of newcomers that publish an edit Reference Check was activated within and return to make a new content edit with a reference to a main namespace during the identified retention period.

Leading indicators

T405421: [A/B Test] Report on Reference Check (en.wiki) leading indicators

Guardrails

This section describes the metrics we will use to make sure other important parts/dimensions of the "editing ecosystem" are not being negatively impacted by Reference Check. The scenarios named in the chart below emerged through T325851.

ID / Name / Metric(s) for Evaluation

  1) Edit quality decrease (T317700): Proportion of published edits that add new content and are still reverted within 48 hours. Will include a breakdown of the revert rate of published edits with and without a reference added.
  2) Edit completion rate drastically decreases: Proportion of edits that reach the point Reference Check was shown (or would be shown) that are successfully published (event.action = saveSuccess).
  3) People shown Reference Check are blocked at higher rates: Proportion of contributors blocked after publishing an edit where Reference Check was shown.
  4) High false positive or false negative rates: A) Proportion of new content edits published without a reference and without being shown Reference Check (indicator of false negatives); B) Proportion of contributors that dismiss adding a citation and select "I didn't add new information" or another indicator that their edit doesn't require a citation.

A/B Test: Decision Matrix

ID / Scenario / Indicator(s) / Plan of Action

  1) Scenario: Reference Check is disrupting, discouraging, or otherwise getting in the way of volunteers who are attempting to make edits in good faith (read: people are less likely to publish the edits they start).
     Indicators: Significant drop in edit completion and spike in edit abandonment in edit sessions where Reference Check is activated. Will include a breakdown to review edits where the reference reliability check was included.
     Plan of action: Pause scaling plans; investigate changes to UX.
  2) Scenario: Reference Check is increasing the likelihood that people will publish destructive edits.
     Indicators: Increase in the proportion of contributors blocked after publishing an edit where Reference Check is activated; increase in the proportion of published edits where Reference Check was activated that are reverted within 48 hours, relative to new content edits Reference Check was NOT activated within.
     Plan of action: Pause scaling plans; review edits to try to identify patterns in abuse and propose changes to UX to mitigate them.
  3) Scenario: Reference Check is causing people to publish edits that align with project policies.
     Indicators: Increase in the proportion of edits Reference Check was activated within that include a reference and are not reverted within 48 hours, relative to new content edits without a reference that Reference Check was NOT activated within.
     Plan of action: Move forward with scaling plans.
  4) Scenario: Reference Check is effective at causing people to accompany new content edits with a reference, but those references are unreliable.
     Indicators: Increase in the proportion of published edits Reference Check was activated within that include a reference, and an increase in the proportion of these edits that are reverted within 48 hours.
     Plan of action: Block scaling plans and consider mitigations to address reference reliability (e.g. T276857).
  5) Scenario: Reference Check is not effective at causing people to accompany new content edits with a reference, but is not disruptive to volunteers.
     Indicators: No change or a decrease in the proportion of published edits Reference Check was activated within that include a reference, and A) no significant drop in edit completion or rise in abandonment rate, or B) no significant spike in block or revert rate.
     Plan of action: Move forward with scaling plans.

Done

Related Objects

Event Timeline

ppelberg updated the task description.

@ppelberg and @Iflorez, when the final report is available, can you share a link so that I can draft an update for sharing on en-WP and update the Asana task?

@ppelberg: I've completed an analysis reviewing the impact of Reference Check based on the results of the A/B test collected from 8 November 2025 through 8 December 2025. See the summary of results below and the full report for additional details on methodology and metrics by various dimensions (including number of checks shown, platform, and user experience).

TL;DR: When Reference Check was shown, edits were much more likely to add a new reference, edits were more often constructive, and reverts declined, with a slight reduction in edit completion; these patterns are especially evident on mobile web.

Brief Summary: When Reference Check is shown, edits are significantly more likely to add a reference, especially on mobile web. Edits shown Reference Check are directionally more constructive and less likely to be reverted within 48 hours, with the strongest and most consistent improvements on mobile web. While Reference Check slightly reduces edit completion, the decrease is modest.

Summary of Results
References Added or Acknowledged KPI #1
When Reference Check was shown, edits were far more likely to either add a reference or clearly explain why they didn’t.
Given the current KPI #1 definition, direct test–control comparisons are difficult to interpret because the "Decline" option is available only in treatment. To ensure a fair comparison across test and control groups, we focus on KPI #1b, which compares control and treatment without the decline option. We also report KPI #1b on an ITT basis for comparability with the 2024 report.

Reference Added KPI #1b: Proportion of published edits that add new content, are constructive (not reverted within 48 hours), and include at least one net new reference; shown-only in test vs eligible-not-shown control.
When Reference Check was shown, edits were much more likely to add a new reference. This effect is large and statistically significant in the pooled adjusted model (adjusted for platform). How big is the change:

  • Desktop: Edits were ~2.2× more likely to add a new reference (30.7% → 68.2%).
  • Mobile-web: Edits were ~17.5× more likely to add a new reference (2.8% → 48.9%).

The increase in references added is substantial on both platforms.
Across both adjusted models and simpler comparisons, the evidence is clear: Reference Check materially increases the likelihood that edits add a new reference.

Note: KPI #1 and KPI #1b above compare edits where Reference Check was shown with edits that were eligible but not shown (exposure-style, per-protocol). For KPI #1b, this is further limited to constructive new-content edits that were not reverted within 48 hours.

In one sentence: Among constructive new‑content edits (not reverted within 48 hours), edits where Reference Check was shown were ~2.2× more likely on desktop (30.7%→68.2%) and ~17.5× more likely on mobile web (2.8%→48.9%) to include at least one net new reference compared to eligible edits where Reference Check was not shown.

Reference Added KPI #1b (Availability / ITT; comparable to 2024)
We also report KPI #1b by test vs. control assignment, regardless of whether Reference Check was shown (intent-to-treat). This includes all published new-content edits in the target population and is directly comparable to 2024 Reference Check study data. Under this lens, Reference Check still shows a benefit: edits in the test group were more likely to be constructive new-content edits that included a net new reference. How big is the change (ITT):

  • Overall: 56.3% → 68.3% (+12.1 pp, +21.5% relative) (~1.2× more likely to add a reference)
  • Desktop: 60.5% → 70.6% (+10.2 pp, +16.8% relative) (~1.17× more likely to add a reference)
  • Mobile web: edits were ~2.2× more likely to add a reference (22.0% → 47.8%)

Even under the conservative ITT lens (not restricted to shown/eligible), edits in the test group were more likely to be constructive new-content edits that included a net new reference, especially on mobile-web; this lift is statistically significant in the adjusted ITT model.
Note: In the 2024 Reference Check report, Users [i.e., based on an edit-level (edit session) comparison—not unique-user aggregation] were "2.2 times more likely to publish a new content edit that included a reference and was constructive (not reverted within 48 hours) when reference check was shown to eligible edits". "On mobile, new content edits by contributors [i.e., based on an edit-level (edit session) comparison—not unique-user aggregation] were 4.2 times more likely to include a reference and not be reverted when reference check was shown to eligible edits."
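As a sanity check on how the three summary numbers in this report relate to each other, here is a small illustrative Python helper (not part of the analysis pipeline). Note that recomputing from the rounded percentages above can differ from the report's figures by roughly 0.1–0.2, since the report works from unrounded underlying counts:

```python
def lift(control: float, test: float) -> dict:
    """Express a change in proportions the three ways used in this report:
    percentage-point change, relative lift, and the 'x times' multiplier."""
    return {
        "pp": round((test - control) * 100, 1),               # percentage points
        "relative_pct": round((test - control) / control * 100, 1),
        "multiplier": round(test / control, 2),               # "~Nx more likely"
    }

# Overall ITT proportions from the summary above: 56.3% -> 68.3%
print(lift(0.563, 0.683))
# Mobile-web ITT proportions: 22.0% -> 47.8%
print(lift(0.220, 0.478))
```

For example, the mobile-web multiplier comes out to ~2.2× because 47.8 / 22.0 ≈ 2.17.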

Constructive Edits (Not Reverted Within 48 Hours), KPI #2
Edits shown Reference Check were more likely to remain constructive, especially on mobile-web. How big is the change:

  • Desktop: Edits were 3.2% (relative lift) more likely to be constructive (75.5% → 77.9%).
  • Mobile-web: Edits were 18.2% (relative lift) more likely to be a constructive edit (56.4% → 66.7%).

Constructive outcomes trend higher when Reference Check is shown across both platforms, with a much larger improvement on mobile-web. While the most conservative adjusted model cannot fully rule out chance at this sample size, simpler, less conservative comparisons point in the same direction: the test group performs better. Both overall results and mobile-web-only analyses show meaningful improvements, with strong gains on mobile web. Although we cannot definitively conclude that mobile improvements are larger than desktop, the results consistently suggest this pattern. When we account for whether an edit added a new reference, the mobile-web advantage becomes smaller and no longer statistically clear, indicating that part of the benefit may come from increased reference inclusion. Overall, the evidence suggests Reference Check improves constructive editing outcomes, especially on mobile web.

Constructive Edits (Not Reverted Within 48 Hours), KPI #2 [Mobile-Web Only]
On mobile-web, Reference Check meaningfully increases the likelihood that an edit is constructive: edits were 18.2% (relative) more likely to be constructive (56.4% → 66.7%).

Across analyses, results are consistent: on mobile web, edits shown Reference Check are more likely to remain constructive. The signal is clearest when we look at mobile web alone and aligns with simpler comparisons. When accounting for reference addition, the mobile web difference narrows, suggesting part of the improvement is driven by increased reference inclusion.

Revert Rate Within 48 Hours (Lower Is Better), Guardrail 1
Edits shown Reference Check were less likely to be reverted within 48 hours. How big is the change:

  • Desktop: revert rates declined 9.8% relative (24.5% → 22.1%).
  • Mobile-web: revert rates declined 23.6% relative (43.6% → 33.3%).

Revert rates declined on both platforms when Reference Check was shown, with a notably larger reduction on mobile web. While the most conservative model cannot fully attribute the decrease to the treatment alone, simpler comparisons consistently favor the test group. Importantly, edits that include a new reference are much less likely to be reverted, supporting the benefit of including references.
Note: In the 2024 Reference Check report, New content edit revert rate decreased by 8.6% if reference check was available. While some nonconstructive new content edits with a reference were introduced by this feature (5 percentage point (pp) increase), there was a higher proportion of constructive new content edits with a reference added (23.4 pp increase).

Edit Completion (SaveIntent → SaveSuccess), Guardrail 2
We did not observe any drastic decreases in edit completion rate. Reference Check slightly reduces the likelihood that an edit is completed, and this effect is statistically significant. How big is the change:

  • Desktop: Completion decreased 6.8% relative (94.0% → 87.6%).
  • Mobile-web: Completion decreased 6.3% relative (74.1% → 69.4%).

This guardrail shows a real and statistically meaningful decrease in completion. Overall, Reference Check introduces measurable friction that leads to lower completion rates, but this trade-off coincides with higher-quality outcomes: more references added, fewer reverts, and improved constructive edits on mobile-web.

Note: In the 2024 Reference Check report, there was a 10% decrease in edit completion rate for edits where Reference Check was shown compared to the control group. The decrease was larger on mobile than on desktop: on mobile, edit completion rate decreased by 24.3% (-13.5 pp), while on desktop it decreased by only 3.1% (-2.3 pp). Note: the completion rates reported in the 2024 report include saved edits that were reverted.

cc @Sdkb

Thank you for bringing this all together, @Iflorez. Some clarifying questions for you...

Question #1 How might you combine the three statements quoted below into one sentence? Asked another way: would it be accurate for me to think the 3 statements below 'combine' into something like the following: "On mobile web, new content edits made in the test group were ~17.5× more likely to include a reference and not be reverted within 48 hours than new content edits made in the control group."

Desktop: Editors were ~2.2× more likely to add a new reference (30.7% → 68.2%).
Mobile-web: Editors were ~17.5× more likely to add a new reference (2.8% → 48.9%).
For KPI #1b, this is further limited to constructive new-content edits that were not reverted within 48 hours.

Question #2 KPI 1.2 refers to the "...proportion of published edits that add new content (T333714) and are constructive..." although T400101#11479377 refers to editors. Might this simply be a typo? Might this reveal a discrepancy between the KPI we defined going into the experiment and what we ended up reporting on? Something else?

Question #3 To be doubly sure I'm following the "Reference Added KPI #1b (Availability / ITT; comparable to 2024)" piece, could you please share what (if anything) about the below is out of alignment with what the experiment proved?

"In order to both directly compare the results of this experiment to those we published in 2024 and to make a more conservative estimate of Reference Check's impact, we compared the test and control groups regardless of whether Reference Check was actually shown (or had the potential to show) during an edit session. In doing so, we found that edits in the test group were more likely to be constructive (not reverted within 48 hrs) and include a net new reference."

Question #4 Could you please share what if (anything) about the below is missing/inaccurate?

Metric: Increase in likelihood of publishing a new content edit that included a reference when Reference Check was shown
  • 2024 experiment: Desktop: 2.2x; Mobile: 4.2x
  • 2025 experiment (en.wiki): Desktop: 1.2x; Mobile: 2.2x

Question #5 What (if anything) about the following interpretation of the bit I've quoted below is out of sync with what the experiment proved? Edits in which Reference Check was shown were 3.2% and 18.2% more likely on desktop and mobile respectively to be constructive regardless of whether someone included a new reference in said edits.

When we account for whether an edit added a new reference, the mobile-web advantage becomes smaller and no longer statistically clear, indicating that part of the benefit may come from increased reference inclusion. Overall, the evidence suggests Reference Check improves constructive editing outcomes, especially on mobile web.

@ppelberg
Thank you for the questions, happy to clarify.

Question #1 How might you combine the three statements quoted below into one sentence?

Reference Added KPI #1b
In one sentence: Among constructive new‑content edits (not reverted within 48 hours), edits where Reference Check was shown were ~2.2× more likely on desktop (30.7%→68.2%) and ~17.5× more likely on mobile web (2.8%→48.9%) to include at least one net new reference compared to eligible edits where Reference Check was not shown.

Question #2 Clarification on KPI unit of analysis:

KPI #1 and KPI #2 are defined and analyzed at the edit (editing session) level, not the unique-editor level. The denominator is a count of published edits (editing sessions) in the eligible population, and the numerator is the subset of those edits that meet the metric definition. Any references to “editors” are shorthand for “edits made by editors,” not a change in the unit of analysis. For the 2024 referenced data points, I will add a brief clarification note to make this explicit and reduce confusion. This does not change the metric definition, results, or conclusions. No further action needed for the current scope.

Question #3 following the "Reference Added KPI #1b (Availability / ITT)" piece
Question #4 what if (anything) about the below is missing/inaccurate?

Comments on 3 & 4 are generally aligned; some tweaks for precision in case it's helpful and to reduce any potential misunderstanding:
To compare with the 2024 report, we compared edits assigned to the test versus control group—whether or not Reference Check was actually shown on a given edit. We found that published new‑content edits in the test-assigned group were more likely to both include a net new reference and remain constructive (not reverted within 48 hours).

Reference Added KPI #1b (Availability / ITT)

Metric: Increase in likelihood of publishing a new-content edit that included a reference AND was unreverted within 48 hours (constructive), per edit, by experiment assignment bucket
  • 2024 experiment (11 wikis): Desktop: 2.2x; Mobile: 4.2x
  • 2025 experiment (en.wiki): Desktop: 1.2x; Mobile: 2.2x

Notes:

  1. 2024 aggregates results across 11 participating wikis; 2025 reports English Wikipedia only.
  2. The 2024 analysis was tag-based and the 2025 analysis uses the newer consolidated event datasets; the definitions are aligned, but the underlying measurement pipelines differ.

Question #5

Edits shown Reference Check had higher constructive rates in this experiment (+3.2% desktop; +18.2% mobile-web, relative). When we account for whether a new reference was included, the treatment–control difference attenuates, especially on mobile-web, and is no longer statistically clear, suggesting that part of the constructive benefit operates through increased reference inclusion.

In short, consistent with a path where: Reference Check → references added → edits are constructive
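To make the attenuation point concrete, here is a toy numeric sketch. All counts are invented for illustration (they are not experiment data): if treatment mainly shifts edits into the "reference added" stratum, and that stratum has a higher constructive rate in both arms, the crude treatment–control gap can be sizable even while the within-stratum gaps are small.

```python
def constructive_rate(constructive: int, total: int) -> float:
    return constructive / total

# Hypothetical counts: (constructive, total) per arm, split by whether
# the edit added a new reference. Ref-added edits are more constructive
# in BOTH arms; the arms barely differ within each stratum.
data = {
    ("test", "ref"):       (80, 100),
    ("test", "no_ref"):    (55, 100),
    ("control", "ref"):    (78, 100),
    ("control", "no_ref"): (54, 100),
}

def overall(group: str, ref_share: float) -> float:
    """Marginal constructive rate given the share of edits adding a reference."""
    with_ref = constructive_rate(*data[(group, "ref")])
    without = constructive_rate(*data[(group, "no_ref")])
    return ref_share * with_ref + (1 - ref_share) * without

# Treatment shifts edits toward the high-quality "ref added" stratum
# (60% vs 30% here), producing a large crude gap...
crude_gap = overall("test", ref_share=0.6) - overall("control", ref_share=0.3)
# ...while the within-stratum (reference-adjusted) gap is small.
within_ref = (constructive_rate(*data[("test", "ref")])
              - constructive_rate(*data[("control", "ref")]))

print(round(crude_gap, 3), round(within_ref, 3))
```

This is the pattern Reference Check → references added → edits are constructive would produce: the effect largely flows through the mediator (reference inclusion), so conditioning on it shrinks the remaining difference.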

@ppelberg
Thank you for the questions, happy to clarify.

Question #1 How might you combine the three statements quoted below into one sentence?

Reference Added KPI #1b
In one sentence: Among constructive new‑content edits (not reverted within 48 hours), edits where Reference Check was shown were ~2.2× more likely on desktop (30.7%→68.2%) and ~17.5× more likely on mobile web (2.8%→48.9%) to include at least one net new reference compared to eligible edits where Reference Check was not shown.

Understood. Thank you for clarifying.

Question #2 Clarification on KPI unit of analysis:

KPI #1 and KPI #2 are defined and analyzed at the edit (editing session) level, not the unique-editor level. The denominator is a count of published edits (editing sessions) in the eligible population, and the numerator is the subset of those edits that meet the metric definition. Any references to “editors” are shorthand for “edits made by editors,” not a change in the unit of analysis. For the 2024 referenced data points, I will add a brief clarification note to make this explicit and reduce confusion. This does not change the metric definition, results, or conclusions. No further action needed for the current scope.

Makes sense and adding the clarification you described sounds helpful. Thank you.

Question #3 following the "Reference Added KPI #1b (Availability / ITT)" piece
Question #4 what if (anything) about the below is missing/inaccurate?

Comments on 3 & 4 are generally aligned; some tweaks for precision in case it's helpful and to reduce any potential misunderstanding:
To compare with the 2024 report, we compared edits assigned to the test versus control group—whether or not Reference Check was actually shown on a given edit. We found that published new‑content edits in the test-assigned group were more likely to both include a net new reference and remain constructive (not reverted within 48 hours).

Got it. Thank you.

Reference Added KPI #1b (Availability / ITT)

Metric: Increase in likelihood of publishing a new-content edit that included a reference AND was unreverted within 48 hours (constructive), per edit, by experiment assignment bucket
  • 2024 experiment (11 wikis): Desktop: 2.2x; Mobile: 4.2x
  • 2025 experiment (en.wiki): Desktop: 1.2x; Mobile: 2.2x

Notes:

  1. 2024 aggregates results across 11 participating wikis; 2025 reports English Wikipedia only.
  2. The 2024 analysis was tag-based and the 2025 analysis uses the newer consolidated event datasets; the definitions are aligned, but the underlying measurement pipelines differ.

Understood.

Question #5

Edits shown Reference Check had higher constructive rates in this experiment (+3.2% desktop; +18.2% mobile-web, relative). When we account for whether a new reference was included, the treatment–control difference attenuates, especially on mobile-web, and is no longer statistically clear, suggesting that part of the constructive benefit operates through increased reference inclusion.

In short, consistent with a path where: Reference Check → references added → edits are constructive

Assuming it's accurate for me to interpret this as meaning the below, this all makes sense to me. Thank you, Irene.

New content edits in which people include references are more likely to be constructive.

@ppelberg Yes, in this experiment’s data, new‑content edits that included a net new reference were more likely to be constructive (association, not proof of causation).

Also, edits where Reference Check was shown were more likely to include a net new reference, and edits that included a net new reference were more likely to remain constructive (association, not proof of causation).


ppelberg updated the task description.