Explore citations included with revisions by editor experience and revert rate
Closed, ResolvedPublic

Description

Generate a sample of citations added by revisions across several different wikis and include data on editor experience and the revert rate or revision risk of revisions with that citation.

This dataset will be used to explore citation usage by different editor types. Are there common trends in the types of citations that newcomers add vs more experienced users? What types of citations are frequently reverted?

This analysis can provide insights into source reliability and help identify policies that would be useful to surface to newcomers when they add a citation to an edit.

Event Timeline

Methodology

Updated the notebook used in T346982 to obtain a list of references (domains and URLs) included with new content edits. To identify new content edits, I used the editcheck-newcontent tag, which is applied to any VisualEditor edit that meets the conditions defined in T324730 and codified in editcheck/modules/init.js.
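
As a rough illustration of the domain extraction step (not the notebook's actual code), the sketch below reduces a reference URL to the scheme-plus-host form used in the tables in this comment. The function name reference_domain and the sample URLs are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def reference_domain(url: str) -> Optional[str]:
    """Return 'scheme://host' for a reference URL, or None if no host is found."""
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.hostname:
        return None
    # urlsplit lower-cases the scheme and hostname; hostname excludes any port.
    return f"{parts.scheme}://{parts.hostname}"


# Hypothetical sample URLs, for illustration only.
sample_urls = [
    "https://www.youtube.com/watch?v=abc",
    "http://dx.doi.org/10.1000/xyz",
    "not a url",
]
sample_domains = [reference_domain(u) for u in sample_urls]
```

Keeping the scheme means http://dx.doi.org and https://doi.org are counted as separate domains, which matches how they appear in the tables.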

Using data from mediawiki_history, I also queried other data associated with each new content edit, including the wiki, the user's experience level, and whether the edit was reverted within 48 hours. I also recorded whether Reference Check was shown during the edit (Reference Check is the first iteration of Edit Check; it prompts a user to add a citation before publishing if they have not already done so).

Wikis Reviewed: arwiki, afwiki, eswiki, frwiki, itwiki, jawiki, ptwiki, swwiki, yowiki, viwiki, zhwiki, enwiki.
I included the wikis in the Reference Check AB test, plus English Wikipedia as a comparison point. This provides a good mix of different-sized wikis for the analysis.

Timeframe: 18 April 2024 through 31 March 2024

Definitions
Number of domain occurrences: how often each URL domain appeared in that wiki overall.
Reverts: edits reverted within 48 hours after being published.
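
A minimal sketch of the 48-hour revert definition, assuming each edit has a publish timestamp and an optional revert timestamp. The function name and arguments are hypothetical, and treating a revert at exactly 48 hours as inside the window is my assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

REVERT_WINDOW = timedelta(hours=48)


def reverted_within_window(published: datetime, reverted: Optional[datetime]) -> bool:
    """True if the edit was reverted no more than 48 hours after publishing."""
    if reverted is None:
        # The edit was never reverted.
        return False
    elapsed = reverted - published
    # Assumption: the boundary (exactly 48 hours) counts as within the window.
    return timedelta(0) <= elapsed <= REVERT_WINDOW
```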

Experience levels:
I reviewed contributors by the following experience levels based on cumulative edits:

  • Newcomers: 1 edit
  • Junior Contributors: 100 or fewer edits
  • Senior Contributors: Over 100 edits
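
The buckets above can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's code: the cumulative edit count is assumed to include the edit being classified, unregistered users are assumed to have no count (None), and Junior Contributors are assumed to cover 2 to 100 edits.

```python
from typing import Optional


def experience_level(cumulative_edits: Optional[int]) -> str:
    """Map a cumulative edit count to one of the experience levels listed above."""
    if cumulative_edits is None:
        # Assumption: unregistered users carry no cumulative edit count.
        return "Unregistered"
    if cumulative_edits <= 1:
        return "Newcomer"
    if cumulative_edits <= 100:
        return "Junior Contributor"
    return "Senior Contributor"
```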

Note: I'd like to research better ways to account for editing activity and also identify other factors that may impact a user's experience level.

Findings

Overall across reviewed wikis

Overall top referenced domains based on number of new content edits

domain | num_domain_occurrences | n_urls | n_revisions
https://www.youtube.com778390513340
https://books.google.com2186677589332
https://www.nytimes.com630026383306
https://www.jstor.org2310872330267

Overall top 5 reverted domains
Note: limited to domains with over 50 logged edits.

Domain | Proportion of new content edits with the domain that are reverted
http://dx.doi.org | 30%
https://en.wikipedia.org | 29%
https://fr.wikipedia.org | 22%
https://www.nature.com | 21%
https://twitter.com | 15%

Note: This does not necessarily mean the edit was reverted because of the domain. Some may have been reverted due to the content of the edit itself. We would need to investigate the cause of revert to understand more.
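
For reference, a per-domain revert rate with a minimum-volume cutoff could be computed along these lines. This is a sketch only: the input shape (one (domain, was_reverted) pair per new content edit and domain) and the name revert_rates are assumptions, not the notebook's actual structure.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def revert_rates(
    edits: Iterable[Tuple[str, bool]], min_edits: int = 50
) -> List[Tuple[str, float]]:
    """Return (domain, revert rate) pairs, highest rate first.

    Only domains with more than `min_edits` edits are kept, mirroring the
    "over 50 logged edits" cutoff noted above.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for domain, was_reverted in edits:
        totals[domain] += 1
        if was_reverted:
            reverted[domain] += 1
    rates = {
        domain: reverted[domain] / count
        for domain, count in totals.items()
        if count > min_edits
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

The cutoff matters because a domain with only a handful of edits can show a very high revert rate by chance.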

It seems likely that edits citing twitter.com and the Wikipedia domains were reverted because the reference was not appropriate. Tweets and references to other Wikipedia articles are typically considered unacceptable as sources for new article content.

dx.doi.org is a site that links to published content using the digital object identifier (DOI). DOIs are accepted on enwiki to link to published content per https://en.wikipedia.org/wiki/Wikipedia:Citing_sources. It's possible that it appears on this list simply because it is frequently used as a reference, and that these edits were reverted due to the content of the edit itself (or potentially there is an issue with how these references are generated). I would need to investigate to confirm.

Domains used across multiple wikis
The majority of domains were used on only one of the reviewed wikis. However, 10% of the identified domains were included as references on more than one wiki. Here are the top 3 domains referenced on multiple wikis.

The above domains were referenced on 10 out of the 12 reviewed wikis.
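
A minimal sketch of how the cross-wiki counts could be derived, assuming the data is available as (wiki, domain) pairs. Both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wikis_per_domain(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct wikis on which each domain was referenced."""
    seen = defaultdict(set)
    for wiki, domain in pairs:
        seen[domain].add(wiki)
    return {domain: len(wikis) for domain, wikis in seen.items()}


def share_on_multiple_wikis(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of distinct domains that appear on more than one wiki."""
    counts = wikis_per_domain(pairs)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)
```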

Newcomer vs Senior User Trends
  • Newcomers are 9 times more likely than Senior Contributors to publish a new content edit with a reference that is reverted. Overall, 18% of new content edits with a reference by newcomers were reverted during the reviewed timeframe, compared to 2% for Senior Contributors (and 16% for unregistered users).
  • Senior editors also reference more distinct domains than newcomers. During the reviewed timeframe, senior editors referenced 16,112 distinct domains compared to 1,856 distinct domains added by newcomers.
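
The "9 times" figure is the ratio of the two revert rates quoted above. A small sketch of that arithmetic, using a hypothetical helper name and the percentages from this comment:

```python
def relative_revert_rate(rate_a: float, rate_b: float) -> float:
    """Return how many times larger rate_a is than rate_b."""
    if rate_b == 0:
        raise ValueError("rate_b must be non-zero")
    return rate_a / rate_b


# Revert rates in percent: 18% for newcomers, 2% for Senior Contributors.
newcomer_vs_senior = relative_revert_rate(18, 2)
```

This compares the two rates directly; it does not control for wiki or for the content of the edit.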

Top domains included with newcomer new content edits

Domain | Proportion of new content edits that include reference | Revert Rate
https://www.youtube.com | 1.7% | 17%
http://dx.doi.org | 1.5% | 32%
https://doi.org | 1.2% | 3.6%
https://www.jstor.org | 0.82% | 10%
https://onlinelibrary.wiley.com | 0.74% | 5.5%

YouTube is the domain most frequently referenced by newcomers. It is considered generally unreliable per enwiki's perennial sources list. Guidance at other projects may vary. About 17% of edits by newcomers that include this reference are reverted.

Let's check the domains that are most frequently included with reverted new content edits by newcomers to see how this compares:
Top domains included with newcomer new content edits that are reverted

Domain | Proportion of new content edits with the domain that are reverted
https://twitter.com | 36%
http://dx.doi.org | 32%
https://www.instagram.com | 23%
https://en.wikipedia.org | 17%
https://www.youtube.com | 17%

A number of the domains listed would generally be considered unreliable. Except for YouTube and dx.doi.org, these domains are not frequently referenced by newcomers, but the list helps highlight the types of references that cause their edits to be reverted (e.g. references to social sites and links to other Wikipedia articles).

Now let's compare to Senior Contributors.

Top domains included with senior new content edits

Domain | Proportion of new content edits that include reference | Revert Rate
https://books.google.com | 1.1% | 0.7%
https://www.theguardian.com | 1% | 2.7%
https://www.bbc.com | 0.9% | 3.7%
https://www.nytimes.com | 0.8% | 3%
https://www.youtube.com | 0.7% | 5%

Except for YouTube, all the top domains included by senior editors differ from those of newcomers. Edits by senior contributors that add YouTube as a reference are reverted significantly less often than edits by newcomers that do so (5% compared to 17%).

Top domains included with senior new content edits that are reverted
Note: Limited to domains with over 50 new content edits

Domain | Proportion of new content edits with the domain that are reverted
https://twitter.com | 7%
https://www.youtube.com | 5%
https://www.cnn.com | 3.7%
https://www.bbc.com | 3.7%
https://www.nytimes.com | 3%

Overall there are very few new content edits with a reference by senior contributors that are reverted. All of these domains are used in less than 1% of new content edits by senior contributors.

As with newcomers, Twitter is the most frequently reverted domain for senior editors; however, the revert rate is much lower. 7% of new content edits by senior contributors that included a reference to Twitter were reverted, compared to 36% for newcomers.

Per wiki Analysis

The results above largely reflect English Wikipedia, as it accounts for the majority of new content edits published during the reviewed timeframe. I also reviewed individual wiki trends to identify any per-wiki differences or similarities.

Revert rate of new content edits with a reference by experience level and wiki

[Figure: references_newcontent_revert_wiki_exp.png]

New content edits with a reference by newcomers have a much higher revert rate than senior editors across all reviewed wikis. Italian and Vietnamese Wikipedia have an especially high revert rate for newcomers; however, both of these wikis only had around 50 new content edits with a reference by newcomers during the reviewed time frame.

Top Domains Referenced By Newcomers vs Senior Contributors

The top domains referenced in newcomer new content edits on each wiki during the reviewed timeframe were similar to the overall results. YouTube, doi.org, and references to Wikipedia articles were common among newcomers, while senior editors on these wikis typically added references to news sites.

Summary

  • Newcomers are 9 times more likely than Senior Contributors to publish a new content edit with a reference that is reverted.
  • References that are commonly added by newcomers and reverted include references to YouTube, other Wikipedia articles, and social network sites. Additional guidance for newcomers on the policies for citing these types of sources could help reduce the revert rate. I’ll reference this finding in the research being done in T354303 to identify future types of Edit Checks.
  • New content edits by newcomers that reference dx.doi.org are also frequently reverted. This may be due to an issue with the content of the edits themselves and/or issues with citation generation by Citoid. Further investigation is needed to confirm.
  • Most of the top domains referenced by senior editors differ from the ones added by newcomers and have significantly lower revert rates. Top domains included books.google.com and news sites, with revert rates under 4%.
  • Senior editors also included references to YouTube, but these edits are reverted significantly less often than edits by newcomers that added YouTube as a reference (5% compared to 17%).

Code Repo