
[Analysis] What percentage of edits that add new information include ≥1 reference
Closed, Resolved (Public)

Description

With the ability to:

  1. Detect whether an edit involves someone adding a reference (T325713) and
  2. Detect whether an edit involves someone adding new information (T333714)

...we'd like to use this ticket to learn what proportion of edits that involve people adding new information also include them adding a reference.

Decision(s) to be made

Knowing the proportion of edits that involve people adding new information and a reference will enable the Editing Team to decide:

  • What percentage change in the proportion of edits that involve people adding new information that also include a reference should we expect Edit Check to cause?

Research questions

  1. What percentage of main namespace Wikipedia edits that involve people adding new information include a reference? How do these percentages vary by wiki, editor experience level, and editing interface?
  2. What percentage of these edits are reverted? How does that compare to the revert rate of the new content edits that do not include a reference?

Event Timeline

ppelberg moved this task from Backlog to Analytics on the Editing-team (Tracking) board.
ppelberg moved this task from Backlog to Triaged on the EditCheck board.
MNeisler triaged this task as Medium priority.
MNeisler added a project: Product-Analytics.
MNeisler moved this task from Triage to Upcoming Quarter on the Product-Analytics board.

Currently blocked on the deployment of edit tags in T333714 and T325713.

The new edit tags to identify edits that involve people adding new content (editcheck-newcontent) and edits that include a reference (editcheck-newreference) have been deployed. I completed an initial QA check today and confirmed that both of these tags are currently being recorded in the database.

I plan to start this task next week, after we log about a week's worth of data.

@ppelberg Results from this analysis are summarized below. Please let me know if you have any questions.

Methodology
I reviewed data logged in the mediawiki_revision_tags_change table to identify the proportion of published edits tagged with the new hidden change tag editcheck-newcontent that also carry the change tag editcheck-newreference (see tag definitions below). The results below reflect data logged from 7 July 2023 (after both tags were deployed) to 22 July 2023 in the main namespace across all Wikipedias. Bots were excluded.

  • editcheck-newreference: Implemented in T325713 to identify all edits made using the visual editor to pages in the main namespace in which people add a new (non-reused) reference. Deployed on July 6th. Note: The way this is currently implemented does not count the re-use of an existing reference.
  • editcheck-newcontent: Implemented in T333714 to identify all edits made using the visual editor that meet the edit check heuristic conditions defined in T324733 with the exception that this tag does not consider whether a new reference was added as part of the edit in question. Deployed on July 3rd.
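
The tag-based measurement described in the methodology can be sketched as follows. This is a minimal illustration, not the query used for this analysis: the `Revision` record and its fields are hypothetical and do not reflect the schema of mediawiki_revision_tags_change. Only the two tag names come from this task.

```python
from dataclasses import dataclass, field

NEW_CONTENT_TAG = "editcheck-newcontent"
NEW_REFERENCE_TAG = "editcheck-newreference"


@dataclass
class Revision:
    """Hypothetical record for one published edit (not the real table schema)."""
    rev_id: int
    tags: frozenset = field(default_factory=frozenset)
    is_bot: bool = False


def reference_share(revisions):
    """Return (new_content_edits, with_new_reference, proportion).

    Bot edits are skipped, mirroring the methodology. The proportion is
    None when there are no new content edits to divide by.
    """
    new_content = [
        r for r in revisions
        if not r.is_bot and NEW_CONTENT_TAG in r.tags
    ]
    with_ref = [r for r in new_content if NEW_REFERENCE_TAG in r.tags]
    if not new_content:
        return 0, 0, None
    return len(new_content), len(with_ref), len(with_ref) / len(new_content)


# Illustrative input only; these are not real revisions.
sample = [
    Revision(1, frozenset({NEW_CONTENT_TAG, NEW_REFERENCE_TAG})),
    Revision(2, frozenset({NEW_CONTENT_TAG})),
    Revision(3, frozenset({NEW_CONTENT_TAG}), is_bot=True),
    Revision(4, frozenset({NEW_REFERENCE_TAG})),
]
print(reference_share(sample))  # (2, 1, 0.5)
```

The denominator is restricted to edits that carry editcheck-newcontent, so an edit tagged only with editcheck-newreference (revision 4 in the sample) does not affect the proportion.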

Results
Overall:
20% of new content edits made with VE include a new reference.

Total new content edits | Total new content edits with a new reference | Proportion of new content edits with a new reference
88523 | 18011 | 20.3%

By Editor Logged In Status:
A higher proportion of new content edits by registered editors include a new reference compared to unregistered editors

User status | Total new content edits | Total new content edits with a new reference | Proportion of new content edits with a new reference
registered | 70305 | 15613 | 22.2%
unregistered | 18228 | 2400 | 13.2%

By Editor Experience:
Newcomers and junior editors are less likely to add a new reference with their new content edit compared to more senior editors. As shown in the chart below, the proportion of new content edits with a reference increases as user experience increases.

new_content_edits_byuserexp.png (549×886 px, 47 KB)

12% of new content edits completed by newcomers (users completing their first edit as a registered user) include at least one new reference while 26% of new content edits completed by senior editors include at least one new reference.

By Wiki
Results vary per wiki and range from a high of 36% of new content edits that include a reference at Urdu Wikipedia (urwiki) to a low of 2.4% at Malagasy Wikipedia (mgwiki). Note: I limited the per-wiki analysis to wikis that had over 100 new content edits in the reviewed time period.
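
As a rough sketch of the per-wiki cutoff described in the note, the filter below drops any wiki that does not have over 100 new content edits before computing its share. The function name, the input shape, and the sample wiki names are assumptions made for illustration; only the threshold of 100 comes from the note.

```python
MIN_NEW_CONTENT_EDITS = 100  # threshold stated in the note


def per_wiki_shares(counts, min_edits=MIN_NEW_CONTENT_EDITS):
    """Map wiki -> share of new content edits that include a new reference.

    `counts` maps wiki -> (new_content_edits, edits_with_new_reference).
    Wikis with `min_edits` or fewer new content edits are dropped.
    """
    return {
        wiki: with_ref / total
        for wiki, (total, with_ref) in counts.items()
        if total > min_edits
    }


# Made-up counts for placeholder wikis, for illustration only.
sample_counts = {"wiki_a": (400, 100), "wiki_b": (100, 30), "wiki_c": (40, 20)}
print(per_wiki_shares(sample_counts))  # {'wiki_a': 0.25}
```

Reading "over 100" as a strict comparison is an interpretation of the note, which is why a wiki with exactly 100 edits is excluded here.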

Some other results for mid-size to larger wikis:
English Wikipedia: 23.4%
French Wikipedia: 18.9%
Portuguese Wikipedia: 23.4%
Hausa Wikipedia: 19.0%
Catalan Wikipedia: 34%
Spanish Wikipedia: 18.9%

By Revert Status
Note: This includes edits reverted within 48 hours. Some edits may have been reverted past that time.

What proportion of new content edits that include a reference are reverted?

Total new content edits with a new reference | Number of these edits reverted | Proportion of these edits reverted
18011 | 1120 | 6.2%

Only about 6.2% of new content edits with at least one new reference are reverted.

How does that compare to the revert rate of new content edits that do not include a new reference?

Total new content edits without a new reference | Number of these edits reverted | Proportion of these edits reverted
70519 | 8127 | 11.5%

A higher proportion (11.5%) of new content edits that do not include a reference were reverted during the reviewed time period.
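
The two revert rates can be reproduced from the counts reported in this comment. This is a small arithmetic check and not part of the original analysis code; the helper name is an assumption.

```python
def revert_rate(reverted, total):
    """Share of edits reverted, as a fraction between 0 and 1."""
    return reverted / total


# Counts copied from the two revert tables in this comment.
with_ref = revert_rate(1120, 18011)
without_ref = revert_rate(8127, 70519)

print(f"with a new reference:    {with_ref:.1%}")     # 6.2%
print(f"without a new reference: {without_ref:.1%}")  # 11.5%
print(f"gap: {100 * (without_ref - with_ref):.1f} percentage points")  # 5.3
```

The gap is expressed in percentage points (an absolute difference), which is distinct from the relative percent decrease used in the later per-wiki comparison.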

This looks great, @MNeisler. Per what we talked about offline today, as a next step we're going to add per wiki breakdowns [i] for:

  1. The proportion of new content edits that include a new reference by editor experience
  2. How the revert rate of new content edits varies between those new content edits that do and do not include references
    • Note: if we can see this broken out by experience level, that would be wonderful.

I also wonder: do you think it would be feasible to report on an additional metric that we hadn't previously discussed as being within the scope of this task (see below)?

Proposed additional metric: proportion of all edits to the main namespace made with VE [ii] that involve people adding new content, overall and broken out by project and experience level?


i. The wikis where we'd like to see project-specific breakdowns include: en.wiki, fr.wiki, sw.wiki, ar.wiki, pt.wiki, ha.wiki, ig.wiki, af.wiki, yo.wiki, de.wiki. via Superset

Screenshot 2023-07-26 at 3.32.42 PM.png (644×1 px, 120 KB)

ii. Read: the same conditions we've used to compute other metrics on this task

@ppelberg
Please see results for additional per wiki breakdowns below:

The proportion of new content edits that include a new reference by editor experience and wiki

new_content_edits_wiki_exp3.png (546×1 px, 113 KB)

Note: swwiki, afwiki, igwiki and yowiki did not have sufficient events for analysis. At least one more month of data is likely needed to include per wiki breakdowns for these wikis.

When broken down by project, we see roughly similar overall trends with new editors least likely to publish a new content edit with a new reference and senior contributors more likely. Some initial observations:

  • Hawiki did not have any new editors that published an edit on a main namespace during the reviewed period and only 8% of edits by junior contributors on this project included at least one new reference (compared to 22% at English Wikipedia).
  • At both arwiki and ptwiki, senior editors with over 500 edits are 3 times more likely to publish an edit with a new reference compared to new editors. This difference is higher than the difference observed between senior and new editors at the other reviewed wikis.
  • At ptwiki, we observed the highest proportion of new content edits with new references by senior editors (34% for editors with 100-500 edits and 32% for editors with over 500 edits).
  • The other larger wikis enwiki, dewiki and frwiki follow overall trends with the percent of new content edits that include a reference consistently increasing with experience.
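
For readers who want to reproduce the experience breakdown, the groups named in this thread (first edit, 1-99 edits, 100-500 edits, over 500 edits) could be assigned roughly as below. This is a sketch under stated assumptions: the function name and labels are hypothetical, and whether the count includes the edit being classified is not stated in the task; here it is assumed to be the number of edits made before it.

```python
def experience_bucket(prior_edits):
    """Bucket an editor by the number of edits made before the current one.

    Boundaries follow the groups mentioned in the thread; the exact
    handling of the current edit is an assumption.
    """
    if prior_edits < 0:
        raise ValueError("edit count cannot be negative")
    if prior_edits == 0:
        return "new editor (first edit)"
    if prior_edits < 100:
        return "junior (1-99 edits)"
    if prior_edits <= 500:
        return "senior (100-500 edits)"
    return "senior (over 500 edits)"


for n in (0, 42, 100, 501):
    print(n, "->", experience_bucket(n))
```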

How does the revert rate of new content edits vary between those new content edits that do and do not include references?

Overall by experience level

new_content_edits_reverts_exp4.png (560×1 px, 81 KB)

  • For each editor experience level, new content edits that include at least one new reference are reverted less than new content edits without a new reference.
  • For new editors and junior editors (under 100 edits), there's a larger percent difference between the revert rate of new content edits with a reference and without a reference. For example, the revert rate of new content edits by junior editors (under 100 edits) decreased by about 42% (15% → 8.8%) when at least one new reference was included, compared to a decrease of 28% (1% → 0.75%) for senior editors (over 500 edits).

By Wiki

new_content_edits_reverts_wiki2.png (548×1 px, 83 KB)

For each reviewed project, new content edits that include a new reference are reverted less than new content edits that do not include a new reference but the percent decrease varies per project.

  • Ptwiki has the lowest revert rate (0.63%) of new content edits that include a reference. This is about a 92% decrease (8.8% → 0.63%) from new content edits that do not include a reference.
  • At arwiki, the inclusion of a reference had a smaller impact on the likelihood of an edit being reverted. At this wiki, there was a 20.5% decrease (12.9% → 10.25%) in the proportion of new content edits reverted when a new reference was included.
  • Hawiki did not have any new content edits reverted during the reviewed timeframe, so it is not shown in the chart above.
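
The percent decreases quoted for ptwiki and arwiki are relative changes, computed as (rate without reference - rate with reference) / rate without reference. A minimal sketch of that calculation follows, using the rounded rates quoted in the bullets. The helper name is an assumption, and because the inputs are already rounded, results may differ slightly from figures computed on the raw counts.

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined when the baseline is 0")
    return 100 * (before - after) / before


# Rounded revert rates (in percent) quoted for each wiki.
ptwiki = percent_decrease(8.8, 0.63)
arwiki = percent_decrease(12.9, 10.25)

print(f"ptwiki: {ptwiki:.1f}% decrease")
print(f"arwiki: {arwiki:.1f}% decrease")
```

A zero baseline is rejected explicitly; a wiki with no reverted edits without a reference would have no defined relative change.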

By wiki and experience level
To understand how an editor's experience level affects the likelihood of their edit being reverted, I also reviewed revert rates by experience level for each project.

The chart below shows the proportion of new content edits reverted by whether a new reference was included for each wiki (represented by each row) and experience level (represented by each column).

new_content_edits_revert_wikiexp3.png (826×1 px, 121 KB)

  • Enwiki, frwiki, and dewiki had similar trends, with new content edits by new editors reverted more frequently than those by more senior editors. If the new content edit included a new reference, then the proportion of edits reverted decreased for each experience level. Higher percent decreases were observed for new editors and junior contributors compared to senior editors.
  • At enwiki, 40.4% of new content edits by new editors without a reference are reverted. This is a 50% increase compared to the rate observed on other wikis.
  • At ptwiki, a very small proportion (0.6%) of the 500 new content edits that included a reference were reverted. These few reverted edits were completed by junior editors with 1-99 edits and by senior editors.
  • At arwiki, a high proportion (35.7%) of new content edits by junior editors that included a reference were reverted. In contrast to other per wiki and overall trends, this is higher than the observed revert rate (14.8%) of new content edits that did not include a reference.

Note on per wiki revert rate analysis:

  • This analysis reviewed edits reverted within 48 hours. Some of the reviewed wikis may take a longer time to revert new content edits, which may account for some of the observed per wiki variation in revert rates.
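
The 48-hour revert window could be applied along the following lines. This is a hedged sketch, not the logic used in the analysis: the function and parameter names are hypothetical, and treating a revert at exactly 48 hours as inside the window is an assumption.

```python
from datetime import datetime, timedelta

REVERT_WINDOW = timedelta(hours=48)


def reverted_within_window(edit_time, revert_time, window=REVERT_WINDOW):
    """True if the edit was reverted no later than `window` after it was saved.

    `revert_time` is None for edits that were never reverted.
    """
    if revert_time is None:
        return False
    return timedelta(0) <= revert_time - edit_time <= window


# Illustrative timestamps only.
saved = datetime(2023, 7, 10, 12, 0)
print(reverted_within_window(saved, saved + timedelta(hours=47)))  # True
print(reverted_within_window(saved, saved + timedelta(hours=49)))  # False
print(reverted_within_window(saved, None))                         # False
```

An edit reverted after 49 hours counts as not reverted under this rule, which is how wikis that are slower to revert could show lower measured revert rates.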

Full Analysis Report

I also wonder: do you think it would be feasible to report on an additional metric that we hadn't previously discussed as being within the scope of this task (see below)?

Proposed additional metric: proportion of all edits to the main namespace made with VE [ii] that involve people adding new content, overall and broken out by project and experience level?

Yes this is feasible and would not require too much additional work as it involves just a small change to the existing query. I wonder if it would be worthwhile to track as a separate task since it is a different scope/different question. If ok, I can create one and will track work on that analysis there.

Pursuing the answer to this question in a new task sounds great – thank you, @MNeisler.

@ppelberg
Please see results for additional per wiki breakdowns below...

Excellent – this additional analysis offers precisely what we were seeking...thank you, @MNeisler.

I've posted these results on-wiki: https://www.mediawiki.org/wiki/Edit_check#11_August_2023.