[BASELINE] What percentage of edits are reverted because of copyvio risk?
Closed, Resolved (Public)

Description

In T359107, we are exploring the viability of a Check that would prompt people pasting content into the visual editor to decide whether they think the content they're pasting is at risk of creating a copyright violation.

This task involves the work of learning what percentage of newcomer edits are reverted because of this behavior (split out by mobile and desktop) so that, ultimately, we can decide how to surface feedback of this sort.

Decision(s) to be made

  1. How will we treat copyvio-related feedback and ultimately, present it to people?
    • E.g. Do we consider feedback of this sort essential to communicate to people before they publish an edit because NOT doing so would likely result in the edit being reverted? Do we consider feedback of this sort something that could be presented as a suggestion later on? Both?
  2. What percentage of newcomer edits to the Wikipedia main namespace are reverted because of the presence of potential copyright violations?

Requirements

  1. Review a sample of 100 edits to the main namespace made by newcomers from the Special:Log/delete at ar.wiki, en.wiki, and fr.wiki, and document the number of times a page is deleted because of a copyright-related issue
  2. Review a sample of 100 reverted edits made by newcomers to the main namespace from Special:RecentChanges at ar.wiki, en.wiki, and fr.wiki and document the number of times an edit is reverted because of a copyright-related issue
  3. Document the edits you review through "1." and "2." (above) in this spreadsheet: Edit Check

Event Timeline

ppelberg updated the task description.
ppelberg moved this task from Backlog to Movement Communications on the Editing-team (Tracking) board.

I'm assigning this task over to Benoît, to coordinate with @Geugeor and @Dyolf77_WMF about completing the Requirements described in the task description.

A quick note that another way to investigate this data for pages, at least on English Wikipedia, could be to parse deletion reasons - on en.wiki most pages deleted for copyvio will have "G12" in the summary.
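
As a rough illustration of that suggestion, here is a minimal sketch of flagging deletion summaries that cite G12. It assumes the summaries have already been pulled from the deletion log into plain strings; the function names are hypothetical, and the pattern only catches summaries that literally contain "G12".

```python
import re

# Assumption: en.wiki copyvio deletions cite speedy-deletion criterion "G12"
# somewhere in the log summary, as suggested in the comment above.
G12_PATTERN = re.compile(r"\bG12\b", re.IGNORECASE)


def is_g12_deletion(summary):
    """Return True if a deletion summary mentions criterion G12."""
    return bool(summary) and G12_PATTERN.search(summary) is not None


def g12_share(summaries):
    """Fraction of deletion summaries that mention G12 (0.0 for empty input)."""
    summaries = list(summaries)
    if not summaries:
        return 0.0
    return sum(is_g12_deletion(s) for s in summaries) / len(summaries)
```

Deletions for copyright reasons that do not cite G12 in the summary would be missed by a check like this.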

Yes, I mentioned this during our meeting yesterday, along with the other criterion that I can't find again. Not to mention the deleted edits that aren't tagged at all! :D

It is good to highlight this, as the ambassadors who will review the samples aren't familiar with English Wikipedia.

@ppelberg, none of us is skilled in extracting the data we need. Can we ask an engineer to help there?

Per what @Trizek-WMF and I discussed offline today, we'll start with requirement #2 (looking at Special:RecentChanges) while we work on parsing the Special:Log/delete per the suggestion @Samwalton9-WMF helpfully made in T376064#10191123.

ppelberg renamed this task from [SPIKE] What percentage of edits are reverted because of copyvio risk? to [BASELINE] What percentage of edits are reverted because of copyvio risk? (Jul 21 2025, 6:57 PM)
ppelberg removed Trizek-WMF as the assignee of this task.

Per what @MNeisler and I discussed offline today, we could also consider an approach of calculating the revert rate of edits on the grounds of WP:Copyvio in the following way:

  1. Introduce a new tag that gets appended to edits that include unmodified pasted text from a non-Wikipedia HTML source
  2. Calculate the rate at which edits with this would-be tag are reverted.

Note, this approach makes some notable assumptions and omissions:

  1. Assumes volunteers reverting an edit with unmodified pasted content directly corresponds to a potential copyright violation
  2. Omits edits reverted on the grounds of WP:COPYVIO that do not include someone pasting text into a VE edit session
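
To make the tag-based calculation concrete, here is a minimal sketch. It assumes each edit is available as a record with a list of tags and a reverted flag; the record shape and the tag name passed in are hypothetical, since the tag itself does not exist yet.

```python
def tagged_revert_rate(edits, tag):
    """Revert rate among edits carrying `tag`, or None if no edit has it.

    Each edit is assumed to be a dict with two hypothetical keys:
    "tags" (list of tag names) and "reverted" (bool).
    """
    tagged = [edit for edit in edits if tag in edit["tags"]]
    if not tagged:
        return None  # no tagged edits, so the rate is undefined
    return sum(1 for edit in tagged if edit["reverted"]) / len(tagged)
```

Returning None rather than 0 for an empty group keeps "no tagged edits" distinct from "no tagged edits were reverted".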

Documenting another quantitative approach that can be used to calculate the revert rate:

  1. Use a query to pull a sample of reverted edits from mediawiki_history
  2. Parse the edit summaries to identify reverted edits that include WP:COPYVIO (or other identified copyright policy or text).
  3. Calculate the percentage of reverted edits in the sample that were reverted on the grounds of WP:COPYVIO.

Some assumptions and omissions with this approach:

  • This will omit any edits where an editor reverted an edit due to copyright violations but did not specifically include WP:COPYVIO (or other identified copyright policy or text) in their revert reason.
  • The frequency of identifying the copyright policy in the edit comment may vary based on practices at each wiki. However, we can review baseline rates per wiki as part of this task to help identify any discrepancies.
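
The summary-parsing steps above could be sketched roughly as follows. This assumes the reverted edits have already been queried and joined with the comment of the revert that undid them; the record keys "wiki" and "revert_comment" are hypothetical, and the term list contains only the one term named in this task.

```python
from collections import defaultdict

# Only "WP:COPYVIO" comes from the discussion; per-wiki terms would be added here.
COPYRIGHT_TERMS = ("wp:copyvio",)


def mentions_copyright(comment, terms=COPYRIGHT_TERMS):
    """Case-insensitive check for any copyright term in a revert comment."""
    text = (comment or "").casefold()
    return any(term in text for term in terms)


def copyright_share_by_wiki(reverted_edits):
    """Per wiki, the share of reverted edits whose revert comment cites a term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for edit in reverted_edits:
        wiki = edit["wiki"]
        totals[wiki] += 1
        if mentions_copyright(edit["revert_comment"]):
            hits[wiki] += 1
    return {wiki: hits[wiki] / totals[wiki] for wiki in totals}
```

Reporting the share per wiki makes it easier to spot wikis where the policy is rarely named in summaries, which is the discrepancy the second caveat describes.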

Per today's offline discussion, @MNeisler and I decided to move forward with the approach Megan proposed in T376064#11062409.

The primary reason: the data required to complete this analysis is immediately available (read: it is not blocked by the work to introduce any new tags (T379843)).

Once work on T379843 is complete, we will consider prioritizing an analysis that takes the approach T376064#11041545 outlines and comparing the two results.

I'm currently working on calculating a baseline by parsing the revert comments available in mediawiki_history (see the approach outlined in T376064#11062409). There are a few open questions noted below that need to be resolved prior to completing this task:

(1) What specific copyright-related terms or policies should be searched for in addition to "WP:COPYVIO"?
(2) What Wikipedias should be included in this initial baseline analysis? (The list of wikis will likely be needed to inform the terms to be defined in Question 1).

cc @ppelberg

(1) What specific copyright-related terms or policies should be searched for in addition to "WP:COPYVIO"?

Good question; I'll report back with what I learn during tomorrow's (18 Aug) Product Ambassador meeting.

(2) What Wikipedias should be included in this initial baseline analysis? (The list of wikis will likely be needed to inform the terms to be defined in Question 1).

To start, let's include wikis that A) see a relatively large population of newcomers each month and B) we know are edited by the newcomers we're centering in the Edit Check work (those editing from within Sub-Saharan Africa). In list form:

  1. English Wikipedia
  2. Spanish Wikipedia
  3. French Wikipedia
  4. Persian Wikipedia
  5. Japanese Wikipedia
  6. Portuguese Wikipedia
  7. Russian Wikipedia
  8. German Wikipedia
  9. Italian Wikipedia
  10. Chinese Wikipedia
  11. Korean Wikipedia
  12. Indonesian Wikipedia
  13. Arabic Wikipedia
  14. Ukrainian Wikipedia
  15. Polish Wikipedia
  16. Turkish Wikipedia
  17. Dutch Wikipedia
  18. Hebrew Wikipedia
  19. Vietnamese Wikipedia
  20. Czech Wikipedia

Approach
I completed a review of the copyright-related revert rate using the following two approaches: (1) parsing revert summaries of recent revisions in the main namespace and (2) parsing deletion reasons of deleted pages for mentions of any copyright-related terms identified in T402601.

I reviewed data from January 2025 through July 2025 and limited the review to Wikipedias where we were able to identify a number of copyright-related terms (as collected in T402601). Data was also limited to edits completed by people with 100 or fewer edits and by anonymous editors, to align with the target audience for Paste Check.
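
For readers who want to reproduce the scoping, here is a minimal sketch of the filter described above. It assumes the window runs from 1 January to 31 July 2025 inclusive and that each edit carries the editor's edit count and an anonymous flag; the exact boundaries and the parameter names are assumptions, not taken from the analysis code.

```python
from datetime import date

# Assumed inclusive window for "January 2025 through July 2025".
WINDOW_START = date(2025, 1, 1)
WINDOW_END = date(2025, 7, 31)
MAX_EDITS = 100  # "100 or fewer edits"


def in_scope(edit_date, editor_edit_count, is_anonymous):
    """True if an edit falls in the review window and the target audience."""
    if not (WINDOW_START <= edit_date <= WINDOW_END):
        return False
    if is_anonymous:
        return True  # anonymous editors are included regardless of count
    return editor_edit_count is not None and editor_edit_count <= MAX_EDITS
```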

Summary of results:
Only a small proportion of edits by new(er) editors are reverted or deleted with a direct mention of a copyright violation-related term or policy in the revert or deletion summary. Without further investigation and/or instrumentation, it is unclear whether these findings arise because (1) reverts due to copyright issues occur less frequently than other revert types or (2) revert and/or deletion summaries for these types of edits often do not include an explicit mention of the identified copyright policy or terms.

Data:

  • Across all reviewed Wikipedias, 0.3% of all published new content edits were reverted due to a copyright violation. This accounts for 1% of all reverted new content edits.

By Platform:

  • Mobile Web: 0.2%
  • Desktop: 0.3%

Proportion of all published edits reverted with mention of a copyright issue, by Wikipedia:

| wiki | Proportion reverted |
| --- | --- |
| cswiki | 0.36% |
| dewiki | 0.12% |
| enwiki | 0.52% |
| eswiki | 0.21% |
| frwiki | 0.2% |
| itwiki | 0.13% |
| plwiki | 0.17% |
| ptwiki | 0.43% |
| viwiki | 0.21% |
| zhwiki | 0.61% |

Proportion of all deleted pages whose deletion summary mentions a copyright violation, by Wikipedia:

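| wiki | Proportion deleted |
| --- | --- |
| cswiki | 3.88% |
| dewiki | 0.45% |
| enwiki | 0.82% |
| eswiki | 0.43% |
| frwiki | 0.49% |
| idwiki | 0.22% |
| itwiki | 1.87% |
| nlwiki | 1.39% |
| plwiki | 0.4% |
| ptwiki | 0.9% |
| viwiki | 1.52% |
| zhwiki | 9.12% |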

cc @ppelberg for review and confirmation of next steps

Nothing further to pursue at this time; thank you, Megan.