
Make editors aware when they are attempting to add unreliable sources to an article.
Open, Needs TriagePublic

Description

This task is about making editors aware before they add unreliable sources to an article.

Background

As noted in T265163, inexperienced editors often make edits that violate the policies and guidelines of the project they are editing. One such policy we see new editors break is the requirement to cite reliable sources. [i]

This task is about making editors aware before they violate said sourcing policies.

Components

🌱 This section is a draft.
Work on this task depends on some not-yet-existing "components":

  1. A way for volunteers to define, on a per-project basis and in a machine-readable way, which sources the community has reached consensus on considering reliable or unreliable.
    • Think: unit test analogy and 9-March-2021 conversation with @Esanders.
  2. A way for volunteers to add to and edit the "list" described in "1."
  3. A way for the editing interface to check a source someone is attempting to add "against" the "list" described in "1."
  4. A way to make the person editing aware, in real time, when they have added a source that violates the project's policies.
  5. Optional: a way for the person attempting to add a source to quickly: A) learn why the source they are attempting to add likely violates the project's policies and/or B) talk with someone about why they think the source they are trying to include does belong in the encyclopedia.
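As a rough illustration of component "3" above, the check an editing interface would run could be as simple as a domain lookup against the per-project list described in "1." The list format, domains, and status labels below are hypothetical placeholders, not an existing Wikimedia data structure:

```python
from urllib.parse import urlparse

# Hypothetical per-project list mapping a domain to a community-assigned
# reliability status; the real format and status vocabulary would be
# defined by each project's volunteers (component "1").
RELIABILITY_LIST = {
    "example-tabloid.com": "unreliable",
    "example-journal.org": "reliable",
}

def check_source(url: str) -> str:
    """Return the community-assigned status for a URL's domain,
    or 'unknown' if the community has not evaluated it."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    return RELIABILITY_LIST.get(domain, "unknown")

print(check_source("https://www.example-tabloid.com/story"))  # unreliable
print(check_source("https://new-site.net/article"))           # unknown
```

Note the explicit "unknown" result: as discussed further down in this thread, a source missing from the list should probably not be presented to the editor as either approved or disapproved.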

Links

Relevant conversations

Source lists


i. https://en.wikipedia.org/w/index.php?title=User_talk:ENieves1&type=revision&diff=1009512219&oldid=1009437193&diffmode=source

Event Timeline

ppelberg added subscribers: Trizek-WMF, Whatamidoing-WMF.

Thank you @Trizek-WMF and @Whatamidoing-WMF for the links you shared during today's Monday morning meeting. I've added them to the task description's Links section.

Samwalton9 renamed this task from Make editors aware when they are attempting to add unrelaible sources to an article. to Make editors aware when they are attempting to add unreliable sources to an article..Mar 9 2021, 1:34 PM

A more fundamental learning might be that citations are needed at all. We could also consider alerting editors when they add new content but don't add a citation. The research team developed a Citation Needed model that could do the heavy lifting on understanding whether a citation is needed for a given piece of text.

Communities already have two tools to block some links: the local blocklist or AbuseFilter. At the moment, it is not possible to know if your link will be blocked before you hit "publish", and, when your edit is blocked, the faulty link is not highlighted.

The spam blacklist tells you which link caused your edit to be prevented, but doesn't show you where it is in the article, and it's buried under various other text:

image.png (231×589 px, 23 KB)

ppelberg added a subscriber: MMiller_WMF.

Task description update

A more fundamental learning might be that citations are needed at all. We could also consider alerting editors when they add new content but don't add a citation.

Great spot and agreed.

The research team developed a Citation Needed model that could do the heavy lifting on understanding whether a citation is needed for a given piece of text.

@Samwalton9 this is the first time I'm hearing of this...can you confirm this is the research you were referring to https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements ?

@Samwalton9 this is the first time I'm hearing of this...can you confirm this is the research you were referring to https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements ?

That's the one :)

Regarding the machine-readable storing of reliable/unreliable classifications, I have a couple of thoughts. First, @Newslinger has been working on a tool to take the English Wikipedia's table and turn it into a more usable format - looks like you can read more about that here.

Second, I've been feeling cautious about the idea of telling users explicitly what is or isn't reliable as they edit. On the one hand it seems like an obvious thing to do - the community already has this table of encoded community conventions, which we could make available to new users more readily. On the other hand, we would risk strengthening Wikipedia projects' biases around sourcing. We might want to make sure we design such a feature in a way that doesn't actively discourage adding sources which aren't in the list yet (i.e. don't train editors to look for a sign of approval that a source is definitely reliable). We already have a problem with editors not understanding which sources are reliable in different languages/countries/contexts (see an effort to alleviate this issue at Wikipedia:New page patrol source guide). Maybe this is an inherent tension with attempting to codify fuzzy norms and practices.

Thanks for the ping, @Samwalton9.

Here is an example of what the machine-readable data for the English Wikipedia's perennial sources list looks like in JSON form:

https://api.sourceror.org/v1/all_entries

The data is scraped and parsed from the perennial sources list. This format can be adapted for equivalent source lists on other Wikipedias. The Wikidata entry links to several non-English lists, two of which can be parsed to a machine-readable format:
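To sketch how an editing surface might consume such a machine-readable list, here is a minimal example. The JSON sample below is hypothetical - invented field names and entries in the spirit of the scraped perennial sources data, not the actual sourceror.org schema:

```python
import json

# Hypothetical sample entries; the real field names and status values
# come from the perennial sources list and may differ.
sample = json.loads("""
[
  {"source": "Example Daily", "status": "generally unreliable",
   "domains": ["exampledaily.com"]},
  {"source": "Example Review", "status": "generally reliable",
   "domains": ["examplereview.org"]}
]
""")

# Build a domain -> status lookup table that an editing interface
# could query whenever a citation URL is added.
status_by_domain = {
    domain: entry["status"]
    for entry in sample
    for domain in entry["domains"]
}

print(status_by_domain["exampledaily.com"])  # generally unreliable
```

A per-domain lookup like this would let the client check a pasted URL in constant time, which matters if the check is to run in real time as described in the task's Components section.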

On the English Wikipedia, the AbuseFilter and blocklist features are able to handle some use cases for this data (deprecation and blacklisting, respectively). However, there are limitations that reduce the effectiveness and hinder community acceptance of these technical measures:

  • As of February 2021, the Wikipedia apps for Android and iOS are not able to display edit filters, according to the table in this noticeboard discussion.
  • There is currently no way to apply blocklist entries to a selection of pages. All patterns on the blocklist apply to all pages on the project.
  • Neither the AbuseFilter nor the blocklist provides a simple way to target or ignore content additions to particular sections of a page.
  • As Samwalton9 mentioned, the messaging associated with these technical measures could be improved. The community can handle some of this, but it would be helpful to have more data available that could be incorporated in the template messages. For example, when a user adds a link that is either deprecated or blacklisted, the error or warning message should ideally show the paragraph surrounding the link.

Peter and I have been discussing using the Spam blacklist as a starting point for this, as entries there are more easily categorisable as obviously undesirable, whereas the perennial sources list has many entries with edge cases and nuances, and is as yet a few steps removed from something the editor could easily parse (especially because it doesn't exist on all wikis). I did some investigation today, looking through the English Wikipedia blacklist log, and found that entries can be broadly categorised into the following (often overlapping) buckets: spam, URL shorteners, and unreliable sources. The following are some notes on how ~200 randomly selected log entries broke down:

  • Spam (40%): These were entries clearly designed to lead readers to some website selling a product, hosting a suspicious file, or otherwise of no encyclopedic value whatsoever.
  • URL shorteners (35%): These are entries which introduced links to websites like bit.ly, youtu.be, or Google Amp. These are on the Spam Blacklist because they can disguise their destination, but I was surprised at the volume of hits this category receives. It's worth pointing out that many of these links might have been to spam sites, but I'm sure many were good faith edits.
  • Unreliable sources (25%): These links appeared like they could be useful references for articles. While I'm sure many have been spammed or aren't even remotely reliable, I could imagine most of these link additions having been made in good faith by a new user.
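The three buckets above could eventually drive different guidance messages. A rough sketch of that categorisation step, assuming the communities maintained per-category domain sets (the spam and unreliable-source domains below are invented placeholders; only the shortener examples come from this thread):

```python
from urllib.parse import urlparse

# Real URL shorteners mentioned above, used here as illustrative entries.
SHORTENERS = {"bit.ly", "youtu.be", "tinyurl.com"}
# Hypothetical category sets; in practice the category would have to be
# recorded alongside each blacklist entry by the community.
SPAM_DOMAINS = {"buy-pills-now.example"}
UNRELIABLE_DOMAINS = {"example-tabloid.com"}

def categorise(url: str) -> str:
    """Map a blocked link to one of the guidance buckets."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain in SHORTENERS:
        return "url-shortener"
    if domain in SPAM_DOMAINS:
        return "spam"
    if domain in UNRELIABLE_DOMAINS:
        return "unreliable-source"
    return "uncategorised"

print(categorise("https://bit.ly/abc123"))  # url-shortener
```

Each returned bucket could then select the matching guidance path described below (expand the shortened URL, find a more reliable source, or reconsider the link entirely).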

I'm posting this here because I think this backs up the idea of the spam blacklist being a sensible place to start - if 90%+ of hits had clearly come from spam bots I might have suggested another approach, but as much as 60% of spam blacklist hits are at least potentially made in good faith, and more blacklist entries point to unreliable sources than I had previously thought.

We could imagine three lanes of guidance based on this categorisation: explaining that the user's edit won't be saved successfully, and then providing guidance to move away from a URL shortener, to use a more reliable source, or to check that a link isn't complete garbage. This is prompting me to think about how we could facilitate categorisation on the Spam blacklist to match entries to these specific guidance paths; I'm not sure how that would work right now.