
[Signal] Identify cases where reference does not support published claim
Open, Needs Triage, Public

Description

This task involves the work of introducing a signal that would enable people to identify cases where a reference does not support the published claim it is purported to verify.

References

  • "More than two-thirds of [178] articles failed verification. That means the article contained a plausible-sounding sentence, cited to a real, relevant-sounding source. But when you read the source it’s cited to, the information on Wikipedia does not exist in that specific source. When a claim fails verification, it’s impossible to tell whether the information is true or not. For most of the articles Pangram flagged as written by GenAI, nearly every cited sentence in the article failed verification."
  • https://wiki-verifiqator.pages.dev/ via @Alaexis (standalone app for non-editors, see below for user scripts for Wikipedia editors)
  • en:User:The Anome describing a similar-sounding experience:
    • "Here's the idea: a bot reads a Wikipedia articles, and retrieves all the cited sources that are fetchable at that moment, point by point, it compares each paragraph/sentence in the article with the cited sources. If it's all fine, it just marks the article with a review template that states that the article has been auto-reviewed, and when. If any material is either unsupported by the cited material or contradicted by it, it surrounds that material with some variation of {{citation needed span}}, with parameters that specify when it was auto-reviewed and what's wrong with it. Maybe from a small range of choices: "source disagrees", "source does not support", and with a free-text comment. Perhaps it also puts in a short checksum (say 6 hex digits) of the enclosed content, so that changes to that content are easily detectable in later scans. The article is also marked by an invisible template in the same way as above. It could also generate "source unavailable" annotations, or edit URLs if sources get moved." | source
  • Implementing "ChatBot Validation" for sentences of Wikipedia
  • User:Phlsph7/SourceVerificationAIAssistant.js via @Polygnotus
  • User:Alaexis/AI_Source_Verification - newer version of the above with a free open-source LLM option
  • en:Template:Failed_verification via @Sdkb
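The short checksum from The Anome's proposal above (a few hex digits of the reviewed span, so later edits are easy to detect on a subsequent scan) is simple to implement. A minimal sketch; the whitespace normalisation is an assumption of this sketch, not part of the proposal:

```python
import hashlib

def content_checksum(text: str) -> str:
    """Short 6-hex-digit checksum of a reviewed text span, so that later
    edits to the span are easy to detect on a subsequent scan.
    Whitespace is normalised so trivial reflowing does not change the hash."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:6]

# A later scan recomputes the checksum and compares:
before = content_checksum("The bridge was completed in 1932.")
after = content_checksum("The bridge was completed in 1933.")
```

Six hex digits keep the template parameter short; collisions are possible but harmless here, since a collision only means a changed span goes unflagged until the next full review.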

Event Timeline

Per an offline discussion with @Sucheta-Salgaonkar-WMF, we think a specialized model would be needed to enable something like what this task is "asking" for.

Just dropping a few quick thoughts in case they're helpful if work is picked up in this space; they've been sitting in my head for a bit and I'm happy to now have somewhere to put them:

  • Relevant lit:
  • Considerations:
    • This is a classification task, and fine-tuned smaller models still tend to perform comparably to or better than LLMs when good data is available for fine-tuning. The {{failed_verification}} template mentioned above could supply that data. But because this is a relatively generic task (not particularly wiki-specific) and Wikipedia is so central to the fact-checking sphere (it often appears in fact-checking datasets), my understanding is that more generic fine-tuned models should be fairly appropriate for our context even without training on our own very Wikipedia-specific examples. So we wouldn't necessarily need a huge dataset of those.
    • I've played around a bit in this space, and my experience was that data pre-processing is just as important as choosing the right model, if not more so. For example:
      • There's the question of extracting the claim with its appropriate context from Wikipedia. This is a lot easier if it's an Edit Check, i.e. in-context in VisualEditor, where we can capture specific instances of e.g. a new sentence added + citation. But if you're applying a model to existing content (Suggested Edit), it's a bit trickier to capture the specific claim with enough context to understand it but not so much that you're really fact-checking multiple claims. Folks vary between using basic heuristics -- e.g., grabbing just the sentence, the whole paragraph, etc. -- and using LLMs to extract the specific claim and adapt to varying levels of context. The latter is probably more effective -- see Dense X Retrieval: What Retrieval Granularity Should We Use?.
      • There's the question of fetching the text of the source being cited -- with AI etc. degrading the open web, we're seeing a lot more paywalled content, and we'd have to make sure we don't return a ton of false positives just because the external website blocked the request to some degree. Relying on Internet Archive links can potentially help with this, but I've heard that websites have started to block those as well (e.g., Reddit's block).
      • There's the question of cleaning the source HTML and extracting just the relevant text, not all the boilerplate, menus, etc. This generally isn't a big deal in this context because you only need to find one statement that supports (or contradicts) the claim, so some noisy text is tolerable, but it can slow things down if you're also processing a lot of it with LLMs. Models with longer context windows also reduce the importance of this step.
      • There's the question of ranking the potential evidence for what's most relevant, so that not all of it has to be checked. Mostly I see folks recommend a basic similarity-based ranking followed by more complex re-ranking of the top few candidates with an LLM.
  • Suggested first steps:
    • If you all want to pick this up, I'd start with building a small-ish dataset of positive and negative examples (even just starting with 20 of each would probably be okay, though 50 of each would be better).
      • If it's for Edit Check, I'd grab a random sample of recent content adds and manually check them. If you're having trouble finding failed-verification examples, I'd narrow down to those that were reverted on the assumption that there'd be more failed-verifications in those. For each negative example, I'd grab a positive example from the same article.
      • If it's for a Suggested Edit, I'd grab some of the {{failed_verification}} sentences and a few claims with citations in those articles that don't have that template applied (so you have a semi-balanced dataset of positive and negative claims).
    • Once you have the set of claims + citations, I'd then scrape the sources of all of those and see how effective that is. That should already give you a good sense qualitatively of the scale of the challenge. And then run that small dataset through a few LLMs or existing fine-tuned language models to see how they do. That should hopefully be reasonably quick and give a decent idea of what level of accuracy you can expect with a basic setup.
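The "basic heuristics" approach to claim extraction described above (grab the sentence a citation is attached to, plus a little preceding context) could look roughly like this sketch. The naive sentence splitting and the `[1]`-style citation marker are simplifying assumptions; a real implementation would use a proper sentence segmenter and the wiki's actual citation markup:

```python
import re

def claim_for_citation(paragraph: str, marker: str = "[1]") -> str:
    """Basic heuristic: take the sentence a citation marker is attached
    to, plus the preceding sentence for context. Sentence splitting here
    is naive (punctuation or closing bracket followed by whitespace)."""
    sentences = re.split(r"(?<=[.!?\]])\s+", paragraph)
    for i, sentence in enumerate(sentences):
        if marker in sentence:
            start = max(0, i - 1)  # include one sentence of preceding context
            return " ".join(sentences[start:i + 1])
    return ""  # marker not found
```

The trade-off the bullet above describes shows up directly in the `start = max(0, i - 1)` line: widen the window and you risk bundling several claims into one check; narrow it and the claim may lack the context needed to verify it.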
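The two-stage evidence ranking mentioned above (cheap similarity scoring first, expensive LLM re-ranking only for the top few candidates) can be sketched with a plain bag-of-words cosine as the first stage; a real pipeline would more likely use embedding vectors, but the shape of the code is the same:

```python
import math
from collections import Counter

def rank_passages(claim: str, passages: list[str], top_k: int = 3) -> list[str]:
    """First-stage ranking: score source passages by bag-of-words cosine
    similarity to the claim, so only the top_k passages need the more
    expensive LLM re-ranking step."""
    def vec(text: str) -> Counter:
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    claim_vec = vec(claim)
    return sorted(passages, key=lambda p: cosine(claim_vec, vec(p)), reverse=True)[:top_k]
```

This also limits how much of the noisy scraped text (the HTML-cleaning consideration above) ever reaches the LLM, since only the highest-scoring passages are re-ranked.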

Tagging @Trokhymovych who has more applied experience than myself in case he has anything to add/correct (and @diego who is out at the moment but it's worth noting that he has expertise in this space too).

Adding a few more examples of volunteer-developed tools (and corresponding discussions on their talk pages) that are relevant:

I've been using AI source verification (one of the tools mentioned above) and have also built a standalone version for non-wiki editors: https://wiki-cite-checker.replit.app/

My takeaways

  1. Generic LLMs usually do a good job here. Frontier capabilities are not needed: simpler models like the open-source Apertus (https://www.swiss-ai.org/apertus) also work well.
  2. I have a data set of about 100 checks (article, claim, source, match %) and an even smaller dataset of manual edits that were made to fix the problems found by the tool. Happy to share if it can help.
  3. Re data processing, I've seen two approaches to identifying the claim supported by a given reference: LLM-based, and a deterministic algorithm (start at a tag and go back, ignoring certain markup, to the previous tag or the beginning of the paragraph/section). Both work pretty well.
  4. Exception to the previous item: often several references each support different parts of a claim. The existing tools can't handle that.
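The deterministic algorithm in takeaway 3 can be sketched as follows. An assumption of this sketch: the "tags" are `<ref>...</ref>` markers in wikitext, and "certain stuff" is reduced to just the previous reference and paragraph boundaries:

```python
def claim_span(wikitext: str, ref_start: int) -> str:
    """Deterministic claim extraction: starting at the position of a
    reference, walk back to the end of the previous reference or the
    start of the paragraph, whichever is closer, and treat everything
    in between as the claim that reference supports."""
    before = wikitext[:ref_start]
    prev_ref = before.rfind("</ref>")
    prev_para = before.rfind("\n\n")
    start = max(prev_ref + len("</ref>") if prev_ref != -1 else 0,
                prev_para + 2 if prev_para != -1 else 0)
    return before[start:].strip()
```

Takeaway 4 is exactly what this misses: when two references jointly support one sentence, the walk-back stops at the first reference and splits the claim in half.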

Question: suppose a dedicated model is developed, or an existing model is found to be adequate. What then? How can this model be made available to users? (both for Suggested Edit and Edit Check use cases)

I've tested a few models and written up the results here: https://github.com/alex-o-748/citation-checker-script/blob/main/Citation%20Verification%20-%20LLM%20Benchmarking.md

TL;DR: Claude is the best, with a specificity (true negative rate) of ~70% and fewer than 15% false positives. The open-source models were a bit worse, but not by much.
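For reference, a sketch of the textbook confusion-matrix definitions behind these metrics. How the linked write-up counts "false positives" is an assumption here; if it means the share of flagged-as-supported claims that are wrong, that is the false discovery rate rather than the false positive rate:

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics for a claim-verification model,
    where 'positive' means 'the source supports the claim'."""
    return {
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (fp + tn),   # complement of specificity
        "false_discovery_rate": fp / (fp + tp),  # wrong share of "supported" verdicts
        "recall": tp / (tp + fn),
    }
```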

Alaexis updated the task description.