
Generate a list of references people cite when adding new content
Closed, ResolvedPublic

Description

In service of establishing a baseline for the reliability of the references people add alongside new content (T346981), we'll first need to know what references people are adding to the wikis.

The second part of the above – "...what references people are adding to the wikis..." – is what this task is intended to help us learn, and it will serve as the denominator we use in T346981.

Requirements

  1. For each edit made within the past 90 days at the wikis listed below that involves people adding new content with a reference using the VisualEditor [i], we'd like to know:
    • What type of source was being added [ii]
    • The name of said source
  2. Ideally, the output of "1." could be presented as aggregate counts (overall, by wiki, and by experience level)

Focus wikis

For the purposes of this analysis, we'd like to include edits made to the following wikis [iii]:

  1. af.wiki
  2. ar.wiki
  3. de.wiki
  4. en.wiki
  5. fr.wiki
  6. ha.wiki
  7. ig.wiki
  8. pt.wiki
  9. sw.wiki
  10. yo.wiki

i. As defined by both of the following edit tags being appended to said edits: editcheck-newreference and editcheck-newcontent.
ii. https://www.mediawiki.org/wiki/Citoid/itemTypes
iii. This list of wikis is borrowed from T332848#9046454

Event Timeline

MNeisler triaged this task as Medium priority.
MNeisler moved this task from Triage to Upcoming Quarter on the Product-Analytics board.

Suggested approach:

  1. Use the MediaWiki action API to retrieve a random list of articles from the identified Wikipedias (including page ID and title). The sample will be limited to articles in the main namespace (ns = 0).
  2. Query the MediaWiki action API again to retrieve article data, including the URL and article content.
  3. Query the Citoid API to find citation metadata for the list of URLs found in the previous step.
  4. Parse the content to retrieve citations and associated metadata within the article.
  5. Use the list of revision IDs to identify which ones meet the following requirements:
    • Made with Visual Editor
    • Identified as adding new content (editcheck-newcontent) that includes a reference (editcheck-newreference)
    • Published within the last 90 days
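The API-facing parts of the approach above could be sketched roughly as follows. The action API list=random module and the Citoid REST citation endpoint are standard public interfaces; the helper names and the fields retained here are illustrative:

```python
import json
import urllib.parse
import urllib.request


def random_pages_params(limit=10):
    """Step 1: action API parameters for a random sample of main-namespace pages."""
    return {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,   # main namespace only (ns = 0)
        "rnlimit": limit,
        "format": "json",
    }


def citoid_endpoint(wiki_host, url):
    """Step 3: Citoid REST endpoint for one target URL (URL-encoded once)."""
    return (
        f"https://{wiki_host}/api/rest_v1/data/citation/mediawiki/"
        + urllib.parse.quote(url, safe="")
    )


def parse_citoid(items):
    """Keep just the source type and title from a Citoid response (a JSON list)."""
    if not items:
        return None
    return {"itemType": items[0].get("itemType"), "title": items[0].get("title")}


def fetch_json(endpoint):
    """Tiny helper for the actual HTTP calls in steps 2 and 3."""
    with urllib.request.urlopen(endpoint, timeout=30) as resp:
        return json.load(resp)
```

In practice the random-page call would be made once per wiki, and the Citoid lookups batched or rate-limited, since each URL is a separate request.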

@MNeisler per our conversation yesterday, I put together the start of a notebook for collecting stats on URLs being added as flagged by edit check. I didn't do anything with the Citoid side but did join in stats on how often each URL domain appeared in that wiki overall. I just ran it on French Wikipedia for July/August/September (October data not available yet via mediawiki_wikitext_history) but adding in the other wikis should be trivial and presumably it could be expanded out to meet other needs without too much additional work.

Notebook: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/references/reference-changes.ipynb
Or alternatively stat1008:/home/isaacj/qual_model/reference-changes.ipynb

Thank you @Isaac! This was very useful. I was able to modify the notebook to include all the identified focus wikis and some additional parameters from mediawiki. I also expanded the notebook to include some earlier analysis I did to retrieve metadata on the source type and source title where available from the Citoid API.

Per my conversation with @ppelberg, the next steps are for me to explore the resulting dataset and pull together a summary of the following details aggregated by domain and wiki:

  • Domain
  • Wiki
  • Count of edits that cited domain
  • Count of pages that use the domain as a reference
  • Number of domain occurrences: how often the URL appears in the wiki overall (stat already available using the notebook from @Isaac)
  • Revert rate: of the edits made during the 90-day window that cited the domain as a source, what proportion were reverted (by wiki)
  • Optional: add how long a reference in the sample remains on the page after publishing (persistence as a metric of reliability)
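A sketch of how those per-domain aggregates could be computed with pandas, assuming one row per (revision, domain) pair; the column names are illustrative rather than the notebook's actual schema:

```python
import pandas as pd

# One row per (revision, domain) pair; column names are illustrative.
rows = pd.DataFrame(
    [
        ("frwiki", "example.org", 101, 9001, False),
        ("frwiki", "example.org", 102, 9002, True),
        ("frwiki", "youtube.com", 103, 9003, True),
    ],
    columns=["wiki", "domain", "rev_id", "page_id", "was_reverted"],
)

summary = (
    rows.groupby(["wiki", "domain"])
    .agg(
        edits=("rev_id", "nunique"),           # count of edits that cited the domain
        pages=("page_id", "nunique"),          # count of pages using the domain
        revert_rate=("was_reverted", "mean"),  # share of those edits reverted
    )
    .reset_index()
)
```

The overall domain-occurrence count would come from a separate join (e.g. against an externallinks snapshot), since it measures the whole wiki rather than just the sampled edits.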

This aggregate data will be reviewed to help identify an approach for measuring source reliability in T346983 in addition to input from volunteers on the reliability of certain domains.

Notebook: https://gitlab.wikimedia.org/mneisler/edit-check-references-2023/-/blob/main/Queries/collect_revision_citation_data.ipynb

> This was very useful. I was able to modify the notebook to include all the identified focus wikis and some additional parameters from mediawiki. I also expanded the notebook to include some earlier analysis I did to retrieve metadata on the source type and source title where available from the Citoid API.

@MNeisler glad to hear! One thought after reading through the notebook: I wonder for the Citoid itemtypes if it's important to retain the full URL because e.g., while un.org would be a webpage, the reference might be to a report hosted by the UN? Hopefully expanding from domains to full URLs doesn't totally explode the number of calls to Citoid (if so could presumably just do a sample and generalize). I think it'd be pretty straightforward to just retain the full URL in getNewDomains and store that in the final table instead of the domain (or alongside if desired).
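A minimal sketch of that suggestion, assuming the real getNewDomains is adapted to keep the full URL alongside the domain, with a fixed-seed sample as the fallback if the number of Citoid calls gets too large; the function names here are illustrative:

```python
import random
from urllib.parse import urlsplit


def get_new_urls(added_links):
    """Keep the full URL alongside its domain, rather than the domain alone."""
    return [(urlsplit(u).netloc, u) for u in added_links]


def sample_for_citoid(urls, k, seed=0):
    """If querying every full URL means too many Citoid calls, take a reproducible sample."""
    urls = list(urls)
    rng = random.Random(seed)
    return urls if len(urls) <= k else rng.sample(urls, k)
```

This preserves the distinction Isaac raises: un.org as a domain would resolve to a generic webpage, but the full URL of a hosted report can return a more specific item type.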

I've finished retrieving all citation data that accompanied new content edits made in the last 90 days. Per @Isaac's suggestion, I modified the notebook and resulting dataset to retain full URLs in addition to the domain. This helped provide more specific and accurate source type details for URLs where a Citoid response was available.

@ppelberg - Please see the updated spreadsheet, which now includes a summary of domain stats by wiki (with the details specified in T346982#9339179) and a sample of citations at each specified wiki for review.

  • The first worksheet ("Domain summary stats") includes a summary of domain stats by wiki (with the details specified in T346982#9339179)
  • The second worksheet ("Citation sample") includes a sample of new content revisions and associated citations for review (I included a random sample of 100 revisions per wiki as a start, but this can be extended if helpful).

Some initial findings from exploration of the dataset:

  • The most frequently cited domain by junior editors (< 100 edits) across all the identified wikis was youtube.com. About 3.9% of all new content edits by Junior Contributors across all identified wikis (excluding enwiki) included at least one reference to this domain. There were 5159 new content edits by Junior Contributors during the reviewed timeframe and 200 of these included at least one reference to youtube.
  • New content edits that included references to other Wikipedia articles were also frequently reverted: 21% of the 207 new content edits that included a reference to another Wikipedia article were reverted.
  • Reverted edits appear far more likely to link to domains with fewer occurrences elsewhere on the wiki (at the time of the externallinks snapshot). This was true overall and by wiki.

Data Description
Found all domains added by edits that met the following requirements:

  • Made with Visual Editor
  • Identified as adding new content (editcheck-newcontent) that include a reference (editcheck-newreference)
  • Published within July-September 2023
  • Bots excluded
  • Occurred in the main namespace (page namespace = 0)
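Taken together, the filters above amount to a predicate along these lines; the revision field names are illustrative, and the actual notebook applies equivalent conditions when querying the wikitext history data:

```python
from datetime import datetime

REQUIRED_TAGS = {"editcheck-newcontent", "editcheck-newreference"}
START, END = datetime(2023, 7, 1), datetime(2023, 10, 1)  # July-September 2023


def qualifies(rev):
    """Apply the data-description filters to one revision record (fields illustrative)."""
    return (
        "visualeditor" in rev["tags"]            # made with VisualEditor
        and REQUIRED_TAGS <= set(rev["tags"])    # new content that includes a reference
        and START <= rev["timestamp"] < END      # published within July-September 2023
        and not rev["is_bot"]                    # bots excluded
        and rev["namespace"] == 0                # main namespace only
    )
```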

Potential next steps (if needed):

  • Add in duration a reference in the sample remains on the page after publishing
  • Further analysis using the resulting citation dataset from this task to help identify an approach for measuring source reliability in T346983

Notebook

> I've finished retrieving all citation data that accompanied new content edits made in the last 90 days. Per @Isaac's suggestion, I modified the notebook and resulting dataset to retain full URLs in addition to the domain. This helped provide more specific and accurate source type details for URLs where a Citoid response was available.

Thank you for pulling this together, @MNeisler. A couple of responses in-line below.

Before that, are you able to expand/update this dataset and the analysis of it to include en.wiki? I realized just now that I neglected to include en.wiki in the task description.

> @ppelberg - Please see the updated spreadsheet, which now includes a summary of domain stats by wiki (with the details specified in T346982#9339179) and a sample of citations at each specified wiki for review.
>
>   • The first worksheet ("Domain summary stats") includes a summary of domain stats by wiki (with the details specified in T346982#9339179)
>   • The second worksheet ("Citation sample") includes a sample of new content revisions and associated citations for review (I included a random sample of 100 revisions per wiki as a start, but this can be extended if helpful).

Might it be possible to expand this spreadsheet to include all edits?

> Some initial findings from exploration of the dataset:
>
>   • The most frequently cited domain by junior editors (< 100 edits) across all the identified wikis was youtube.com. About 23% of all new content edits by Junior Contributors included a reference to this domain. Edits that included a reference to this domain were also more frequently reverted. There were 200 new content edits that referenced youtube and 16% were reverted.

Extrapolating from the above, would it be accurate for me to think that among all of the new content edits people made within the past 90 days using the visual editor at the wikis listed in the task description (excluding en.wiki), ~870 of those new content edits were made by junior contributors (200/.23)?

Related: would it be accurate for me to think that if I sum the "Number of new content edits that included domain" column in the "Domain summary stats" sheet I'll arrive at the total number of new content edits people made within the past 90 days using the visual editor at the wikis listed in the task description (excluding en.wiki) (45,203 new content edits with VE)?

@ppelberg
I'm working on making the adjustments identified above.

Please see responses to questions below:

> Extrapolating from the above, would it be accurate for me to think that among all of the new content edits people made within the past 90 days using the visual editor at the wikis listed in the task description (excluding en.wiki), ~870 of those new content edits were made by junior contributors (200/.23)?

Sorry, the above statement incorrectly looked at a subset of the data. See the corrected statement below (and as revised in T346982#9364769), which accurately reflects all reviewed wikis (excluding enwiki).

"The most frequently cited domain by junior editors (< 100 edits) across all the identified wikis was youtube.com. About 3.9% of all new content edits by Junior Contributors across all identified wikis (excluding enwiki) included at least one reference to this domain. There were 5159 new content edits by Junior Contributors during the reviewed timeframe and 200 of these included at least one reference to youtube."

Edits that included a reference to this domain were also more frequently reverted. 29 of the 200 new content edits by Junior Contributors that included a reference to youtube were reverted (14.5%).

> Related: would it be accurate for me to think that if I sum the "Number of new content edits that included domain" column in the "Domain summary stats" sheet I'll arrive at the total number of new content edits people made within the past 90 days using the visual editor at the wikis listed in the task description (excluding en.wiki) (45,203 new content edits with VE)?

Not exactly, as a single revision can contain references to multiple domains. For the reviewed timeframe, there were 23,919 distinct new content edits made within the past 90 days using the visual editor at the wikis listed in the task description (excluding en.wiki). About 35% of these new content edits referenced more than one domain.
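The double-counting described above can be reproduced on a toy dataset: summing per-domain edit counts overstates the total because a revision citing several domains appears once per domain, while a distinct count does not. A minimal sketch with illustrative data:

```python
from collections import defaultdict

# One row per (revision, domain) pair, as in the summary dataset (illustrative).
rows = [(1, "a.org"), (1, "b.org"), (2, "a.org"), (3, "c.org")]

domains_per_edit = defaultdict(set)
for rev_id, domain in rows:
    domains_per_edit[rev_id].add(domain)

summed_domain_counts = len(rows)        # 4: double-counts revision 1
distinct_edits = len(domains_per_edit)  # 3: the actual number of edits
multi_domain = sum(len(d) > 1 for d in domains_per_edit.values())
multi_domain_share = multi_domain / distinct_edits  # share of edits citing >1 domain
```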

I will make adjustments to the sample spreadsheet to clarify the above.

@ppelberg

> Before that, are you able to expand/update this dataset and the analysis of it to include en.wiki? I realized just now that I neglected to include en.wiki in the task description.

No problem. I've expanded the dataset to include enwiki.

> Might it be possible to expand this spreadsheet to include all edits?

Yes, but I recommend viewing the data within Superset as it's too large for Google Sheets to handle. The complete dataset is saved in mneisler.references_selectwikis_2023_09, which can be accessed and queried via Superset.

To assist with viewing and further exploration, I've created an initial Superset dashboard from this dataset that includes an aggregate table of the "domain summary stats for select wikis" and a table with the raw dataset ("new content revisions that include a reference for select wikis"). From here, it should be fairly easy to query the data and add charts to view different groupings or metrics.

Note that these tables represent just an initial summary view of the data. Let me know if there are any specific aggregations or analyses that would be helpful to provide insight into the types of citations added with new content and to inform our approach to determining source reliability.

Note: I've linked the dashboard generated in this task to T346981 where I think additional analysis on this dataset might be helpful.

Recommend resolving this task for now as the dataset has been generated.