
Exploration of automated verifiability checks
Open, Needs Triage, Public

Description

This task aims to evaluate the feasibility of running automated checks on the verifiability of existing claim+citation pairs on Wikipedia. To this end it will provide a dataset of citation+claim pairs for testing different models/approaches along with insights about claim extraction and accessibility of sources.

Basic workflow

The focus of this task is existing claim+citations on Wikipedia. It will not assess the verifiability of claims that lack citations, which would require an additional step of locating potential sources. Steps:

  • For a given citation in a Wikipedia article, extract:
    • The claim associated with the citation. This has various options for how to process the Wikipedia content to get "clean" text. And then there are multiple potential approaches for identifying the specific claim:
      • Simple heuristics for collecting a fixed-size chunk of content preceding the citation
      • Slightly more complex grabbing the sentence where the citation appears
      • More complex using a language model to isolate the atomic claim most likely associated with the citation
    • The URL of the external source (if it exists). This requires some basic logic to link an inline citation to its reference (simple) and then extract the URL(s) to assess. This will range from no external URL to a single link to multiple (e.g., original + archived).
  • For a given external source URL, extract:
    • The content of the page -- i.e. raw HTML or PDF if it points to a file. In today's internet, this is far from assured. Many sites block non-human traffic (quite understandably) or put the actual content behind paywalls.
    • The "cleaned" text from the page content. HTML can be very verbose and contain a lot of noise (styling etc.) so you probably don't want to feed this directly into any language model. Instead you want just the actual text. While this is straightforward for some sites, in others the text might be buried in tabs or flashy visual elements. And you likely want to remove generic boilerplate text (menu options, legalese, etc.) from the text to also reduce the amount of noise.
  • For a given claim + external source text:
    • Assess whether the source supports the claim. Again, there are multiple potential approaches:
      • Binary: supports or does not support.
      • Three-way classification: supports, explicitly refutes the claim, or not enough information to support or refute.
      • Even finer-grained assessments that allow for partial support -- e.g., if a claim has multiple parts and some are supported but not all.

Potential production connections

Event Timeline

Updates (goes back a few weeks):

  • I built a simple prototype (code) for extracting claims from a Wikipedia article, scraping the source, and assessing whether the source supports the claim via a 770M language model. It's behind basic username/password auth to prevent abuse, but the API is hosted at: https://ref-check.wmcloud.org/docs
  • I got started with creating a dataset that would be useful for evaluating the feasibility of these sorts of models for the Edit Check scenario. Essentially that means a small dataset of recent claims added via Visual Editor (where their tooling is deployed) that is stratified by editor experience (many checks are gated to only run for editors of a certain experience level as determined via edit count). Then for each of these recently-added claims, I extracted the actual claim text (one per revision even if there were multiple), supporting URL, and scraped source (if possible). Code can be found in: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/tree/master/fact-checking?ref_type=heads. I'm working on coding that sample for how successful the processing was and whether the claim is indeed supported (for later evaluation of models). Initial results:
    • Of the 300 edits that I started with (English Wikipedia; tagged with editcheck-newreference in December 2025): 8 failed HTML extraction -- i.e. the revision has since been deleted -- bringing the dataset to 292 revisions.
    • Of the 292 revisions, an additional 41 failed citation+URL extraction. That broke down as follows:
      • 8 (~20%) lacked a clear claim -- generally this is because the new citation was in an infobox or table, or was just sitting in its own paragraph with no easy way to associate it with a claim.
      • 20 (~50%) were citations without associated URLs -- e.g., references to a book.
      • 13 (~30%) were revisions that didn't seem to have a new citation despite the editcheck-newreference tag. A bit of spot-checking suggested this was the result of the editor moving existing citations around in a way that likely triggered the edit tag but did not actually introduce a new citation into the revision.
    • Of the 251 claim+URL pairs, the HTTP statuses were as follows (I'm also now manually evaluating the success of the 200s):
      • 192 (~75%) had a 200 HTTP status, suggesting successful scraping.
      • 46 (~20%) had a 4xx HTTP status, suggesting my request was blocked. Most of these were 403s (Forbidden) but 5 were 404s (Not Found). Presumably the 404s would be almost non-existent if the scraping were done right after the source was added, but they would go up quite a bit depending on how long you waited between the source being added and the source being checked (relevant for a Suggested-Edits-type approach).
      • The other 13 (~5%) just hung and did not return content.
      • Note: I did this from my personal laptop with a transparent user-agent. In my experience, the results are not perfectly stable (though I didn't attempt to quantify this) and presumably would vary based on the type of IP address being used (is it known as a cloud or personal IP; how many other requests has the domain gotten from it) and user-agent (allowed scraper? browser request? etc.).
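The scraping-outcome buckets above can be sketched with the standard library; the user-agent string and function names here are placeholders, not the ones actually used:

```python
import urllib.request
import urllib.error

# Transparent user-agent, as discussed above (contact address is a placeholder)
HEADERS = {"User-Agent": "citation-check-research/0.1 (contact: someone@example.org)"}

def classify_status(code: int) -> str:
    """Bucket an HTTP status the way the counts above do."""
    if code == 200:
        return "ok"                    # content returned; may still need manual validation
    if 400 <= code < 500:
        return "blocked_or_missing"    # 403 = blocked, 404 = link rot
    return "other"

def fetch_source(url: str, timeout: float = 15.0):
    """Fetch a cited URL; a timeout corresponds to the 'just hung' bucket."""
    req = urllib.request.Request(url, headers=HEADERS)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status), resp.read()
    except urllib.error.HTTPError as e:
        return classify_status(e.code), None
    except (urllib.error.URLError, TimeoutError):
        return "hung_or_unreachable", None
```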

I added a column in my data for approximately how many components a given claim has. This is essentially to estimate the possible in-between states where a source might back up one part of a claim but not the claim in its entirety. For example, the dataset contains the claim Soundgarden was among one of the first grunge bands to be signed to a major label when they signed to A&M Records in 1989. To me, that has four separate components (you might disagree but I suspect most folks would put it at 3-5 at least):

  • Soundgarden signed with A&M records
  • Soundgarden signed with A&M records in 1989
  • A&M records is a major label
  • Soundgarden is a grunge band and they were one of the first to sign to a major label in 1989

You could easily imagine a source saying Soundgarden was one of the first grunge bands to sign to a major label when they signed with A&M (but that would leave out verification of 1989). Or that they signed with A&M Records in 1989 but not mentioning the "one of the first" piece. Or all of that but just saying it was surprising/unprecedented that A&M Records signed them, without tying that to them being a grunge band. Or any number of other combinations. All to say, there's a lot between a source not being about Soundgarden and a source having these full details. In the first fifty claims, the average number of components is 4.5, so the above example is pretty standard. Some go above 10 when they, e.g., list out a series of awards or roles that someone had. Very few sentences are atomic, though, in the sense of yielding a clean "yes" or "no" as to whether the source provides evidence for them.
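One way to make the components idea concrete: represent each claim as a list of components with a per-component verdict and roll them up into an overall label. This is an illustrative sketch (the class and label names are my own, not from the dataset); per-component verdicts would come from an annotator or a model:

```python
from dataclasses import dataclass

@dataclass
class ClaimComponent:
    text: str
    supported: bool  # per-component verdict from an annotator or model

def support_summary(components: list[ClaimComponent]) -> str:
    """Roll per-component verdicts up into the labels discussed above."""
    n_supported = sum(c.supported for c in components)
    if n_supported == len(components):
        return "SUPPORTED"
    if n_supported > 0:
        return "PARTIALLY_SUPPORTED"
    return "NOT_SUPPORTED"

# The Soundgarden example, with a hypothetical source that omits the year
soundgarden = [
    ClaimComponent("Soundgarden signed with A&M Records", True),
    ClaimComponent("Soundgarden signed with A&M Records in 1989", False),
    ClaimComponent("A&M Records is a major label", True),
    ClaimComponent("Soundgarden was one of the first grunge bands to sign to a major label", True),
]
```

A real system would also need a "source contradicts this component" state, which a single boolean can't express.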

With that in mind, here's where the data stands on whether I think the source backs up the claim, based on the first 51 that I've evaluated. Because the sample is ordered by edit count, this only reflects claims being added by users with <10 edits at this point:

  • ~30% unambiguously support the full claim
  • ~25% partially support the claim -- often just a minor detail is missing.
  • ~25% do not support the claim
  • ~10% go to non-English content (in my experience, when using translation it more or less mapped to the percentages above for whether it supported)
  • ~10% go to sites with a paywall or that are not loading or that are otherwise broken/redirected presumably as a function of a few weeks passing between the citation being added and me checking it.

Some miscellaneous observations:

  • Verifying the claim sometimes requires metadata that isn't really in the body of the source (most commonly the date it was published, but sometimes the author). That's generally doable but is important context to not filter out.
  • I think this claim-components count speaks well to the challenges that arise when moving from a human-enforced rule to a machine-evaluated one. In the Soundgarden example, one has to ask what the expected level of evidence in the source is. I suspect many editors would be comfortable with A&M Records signing them in 1989 and some indication that that was non-standard. Then they'd allow other evidence from other sources to support the grunge piece and that A&M Records is a major label. But if you actually break the claim into its atomic pieces, then a language model might reasonably conclude that the source does not support much of the claim. Something to consider in the design of a final system. Much of this is handled by passing the outcomes to an editor to actually make the call, but it could easily generate a ton of noise if not well-calibrated to the actual norms on-wiki.

Thanks for sharing the results, very interesting!

You've shared the model's assessment, right? Have you checked whether it's correct? Would be great to see a confusion matrix.

I've done a similar analysis using generic commercial and open-source LLMs (here are the findings). I think it would be interesting to see how the MiniCheck model compares. I used this dataset for benchmarking. The rate of non-supported claims is much lower in my dataset and I hope it's closer to the general non-supported claim rate on Wikipedia :)

User:Polygnotus and I also built a tool for checking sources, so if the MiniCheck model works well we can use it there.

I wasn't able to make further progress on my components this week. I'm very excited about what you shared though @Alaexis and am wading through your data (I have shared mine separately). I agree that your non-supported claims rate is likely much closer to the general rate on Wikipedia -- my dataset so far has been skewed towards very new editors and eventually I should also add a field that indicates whether the edit was reverted or not. I think the more samples we have, the better here. I'll do my best to figure out how aligned our claim extraction was too so it's easier to merge them (and whether any differences would have any noticeable impact on performance).

Two questions for you @Alaexis in the meantime:

  • Do you think it'd be feasible for your user script to be set up so folks could "donate" their data? Essentially allowing individuals to choose whether they want to log the citations they check, the model outputs, and their own decision? I wouldn't want to log automatically for privacy reasons but it might make building larger groundtruth samples easier. No worries if it'd be too complicated, there are other ways we could do it.
  • What's your take on how to handle the "partial" supports and where you draw the line? I have found a fair number in my sample. I appreciate that you have the multiple metrics to handle them in the evaluation, but I'm curious how you see them in practice? I didn't try to find sources to "complete" the verification myself though most of my partials felt like the statement likely was true, just not verifiable by that particular citation. For the ones that felt like they were more in the no-original-research category, I think I marked as "no" even if there was partial support. For example, the claim "Future missions may use radiation-resistant fungi-derived paints, to coat the walls of spacecraft." has some support from its citation in the sense that the article talks about radiation-resistant fungi-derived paints in the context of space travel but I didn't see any evidence supporting the "future missions" part so marked it as a straight "no".

A few notes looking ahead:

  • I'd like to wait on T412357 being resolved before running some of the larger models. It seems good progress is being made and it would be relatively easy to spin up an alternative 3rd-party service if it comes to that, but I prefer using our infrastructure as it makes it easier to scale out the inference/evaluations as this progresses.
  • I have a meeting set for February 18th to get a bit more WMF Product perspective on this.

Re 1. it's certainly feasible technically. I thought about it but didn't do it to avoid having to think about privacy. I only see the number of requests to my proxy, it's about 20-30 per day and I'm not entirely sure they are all real. I'd rather not make it opt-in since most people won't click on it, and given the total volumes we'll hardly get any data. Do you think that collecting data by default would be okay if I don't store any PII-adjacent stuff (no user names, only verification results) and add a disclaimer to the UI? Or maybe you have better ideas for how to solve this.

I'll answer your second question later.

@Isaac , the short answer to your second question is that everything that I couldn't classify as "Supported" or "Not Supported" ended up as "Partially supported"

Most of the time, the text contained several claims and the source supported only some of them -- e.g., in https://en.wikipedia.org/w/index.php?title=Immigration_to_the_United_States&oldid=1331476438, reference 11, "the lowest three-year increase in decades" is not supported.

Sometimes the source used attribution or hedged language while the claim was made in wikivoice. I remember seeing this but I'm not sure I have such cases in the benchmarking sample.

Re 1. it's certainly feasible technically. I thought about it but didn't do it to avoid having to think about privacy. I only see the number of requests to my proxy, it's about 20-30 per day and I'm not entirely sure they are all real. I'd rather not make it opt-in since most people won't click on it, and given the total volumes we'll hardly get any data. Do you think that collecting data by default would be okay if I don't store any PII-adjacent stuff (no user names, only verification results) and add a disclaimer to the UI?

@Alaexis that makes sense and agreed on not collecting any PII-adjacent stuff. Maybe to start we can create a second copy of the script that's intended for data collection, to at least test it out and use ourselves for building up the dataset before considering whether we want to share more widely. One way to keep it very simple might be just posting the data to a Google Form. For example something like this form, where you just extract the various question codes and can post responses to it (Python example below). But I'm open to other solutions such as server-side logging, which would mean missing the opportunity to get the editor's input on whether the AI was correct or not but would keep things simple.

import requests

# using your first example in benchmarks though truncating the text for readability purposes
# one caveat: I'm not sure what the limit on Google Forms fields is for text length, so that could potentially be an issue
data = {
  "entry.1069670916": "https://en.wikipedia.org/w/index.php?title=Immigration_to_the_United_States&oldid=1331476438",  # Article URL
  "entry.2113657510": "Immigration to the United States", # Article Title
  "entry.1610310131": "1", # Citation Number
  "entry.1248094404": "Immigration has been a major source of population growth...", # Claim Text
  "entry.1981845263": "Immigration has been a major source of population growth...", # Claim Container
  "entry.1334551845": "https://cis.org/Report/ForeignBorn-Number-and-Share-US-Population-AllTime-Highs-January-2025", # Source URL
  "entry.2083782944": "Foreign-Born Number and Share of U.S. Population at All-Time Highs in January 2025....", # Source Text
  "entry.2082276032": "SUPPORTED", # AI Prediction
  "entry.1696057733": "SUPPORTED" # Groundtruth (assuming we add a button for folks to indicate whether they accept the prediction or not)
}

form_url = "https://docs.google.com/forms/u/0/d/e/1FAIpQLSfVe-HTiEN8uxdF1os_Z7SgZ7FuJkgUSE562ptzwqMn5P52fQ/formResponse"
headers = {'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'}
resp = requests.post(form_url, data=data, headers=headers, timeout=10)
resp.raise_for_status()  # raises if the form rejected the submission

Most of the time, the text contained several claims and the source supported only some of them

Yeah, I saw a fair bit of this too. I guess as we build out the sample, we'll have to see just how common the partial support is and decide whether it needs to be further broken down or not.

And now some thoughts after comparing your claims+source data with my approach to extracting claims+sources:

  • Claim extraction:
    • You extract the text between the citation and previous citation (or start of paragraph). I am extracting the full sentence regardless of where the citation appears and whether there are other citations in it. Short of using an LLM to extract the atomic facts, I'm not sure there's an obvious answer to the trade-off between clarity of claim and providing as much context as possible to the model. Something I'll keep thinking about though. You also save the "container" -- i.e. the text for the full paragraph -- but I don't think you're providing this to the LLM when it makes its assessment. I could see it potentially being valuable to give more context when the claim itself lacks that and LLMs should be able to handle that three-way input (claim text, context, source).
  • Plaintext extraction:
    • You are doing some basic cleaning of the HTML and truncating to 12000 characters. I'm using the Python trafilatura library, which has a bunch more rules for cleaning/filtering the content and no character limit. I re-scraped the 66 unique sources that are in your benchmark and were scraped effectively by you. Out of them, 7 were blocked with trafilatura (403s and one 202 which also means no text extracted). The 59 remaining sources were successfully scraped by both, though 33 are Internet Archive links so that's not as many unique domains as it sounds like. As expected, your source plaintext (except where truncated) is generally longer than what I got from trafilatura (median is about 30% longer). For larger LLMs, that additional text isn't an issue I suspect but it gives me some sense of how much removal trafilatura is doing (which when I inspect, usually looks pretty reasonable).
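For a sense of what the "cleaned" text step means here, a much cruder, stdlib-only stand-in for trafilatura might look like the following (trafilatura itself does far more: boilerplate heuristics, metadata/date extraction, language handling, etc. -- this is just to illustrate the idea):

```python
from html.parser import HTMLParser

# Tags whose contents are almost always boilerplate rather than body text
SKIP = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    """Crude stand-in for trafilatura: keep body text, drop obvious boilerplate."""
    def __init__(self):
        super().__init__()
        self.depth_skip = 0   # >0 while inside a SKIP tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth_skip += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth_skip:
            self.depth_skip -= 1

    def handle_data(self, data):
        if not self.depth_skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```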

My next steps:

  • I'll run the MiniCheck models on your dataset to see how they do.
  • I'm happy to more thoroughly mock up what a script that does google form logging would be like. Your code looks quite easy to build on (thank you) so I suspect I shouldn't have too much in the way of issues there and I'll just create a copy in my own user space.

@Isaac

Thanks for the detailed response

Looking forward to seeing the results of Minicheck!

Data collection: I think server-side logging is the simplest path. It avoids the logistics of Google Forms (account ownership, sharing, field length limits). I'll just record the verification metadata -- article URL, citation number, verdict, confidence, source URL -- and let the user know that this data is collected.

Ground truth: collecting it from users isn't practical given the UX flow -- by the time someone has investigated a flagged citation, they've navigated away. One thing we could try as a proxy though: if a user makes a tagged edit to a citation after checking it, that's a signal that something was off.

Looking forward to seeing the results of Minicheck!

This is still pending some upgrades to the GPU drivers on our machines unfortunately. But in the meantime I got access to some inference services via HuggingFace. MiniCheck isn't supported out-of-the-box there, but I did run a small-scale test of gpt-oss-20B with the 37 groundtruth statements from my dataset that were effectively scraped (there's also a 120B model which presumably does better, but honestly 20B performed well within reasonable expectations). I used your system prompt (thanks!) and while I haven't implemented your accuracy metrics yet, initial findings/vibes:

  • gpt-oss-20B is pretty good about adhering to the JSON format, though one output was missing a quotation mark, which can likely be easily fixed post-hoc.
  • It didn't match "Confidence" consistently with the verdict in "NOT SUPPORTED" cases. It seems to be giving the confidence that the claim is not supported, as opposed to a numerical value for support that it then maps to a label. I think this could be easily adjusted in the prompt to just be confidence, and then we infer the verdict from there.
  • Decent match with groundtruth -- good on No's; mixed on Partial/Unclear but leans conservative; conservative on Yes's. For the five mismatches on Yes's:
    • Two were reasonable but the dates were missing from the source text, so it made incorrect assumptions about them. This can likely be fixed by pulling over the date (which trafilatura is pretty good about extracting).
    • Two were likely the model being overly conservative in also wanting to verify the more minor detail (Sabrina Carpenter being American; the character's father's name) but also reasonable catches and something that a moderator who is familiar with the topic could easily verify.
    • One was borderline between NOR and being a reasonable reformulation of the content (a few examples were given, does this constitute "many"?).
  • Explanations are pretty coherent/strong
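A post-hoc repair for the occasional malformed JSON output could be as simple as a regex fallback. This is a sketch; the "verdict" field name is an assumption about the prompt's output schema:

```python
import json
import re

def parse_verdict(raw: str):
    """Parse the model's JSON verdict; fall back to a regex that tolerates
    a missing quotation mark (the glitch noted above)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Hypothetical schema: pull out just the verdict label, quotes or not
        m = re.search(r'"verdict"\s*:\s*"?([A-Z _]+)', raw)
        return {"verdict": m.group(1).strip()} if m else None
```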

Data collection: I think server-side logging is the simplest path.

Sounds good. I'll check with our Legal folks to make sure they don't have concerns about me using this data but it's your user script and what you describe sounds reasonable to me.

Ground truth: collecting it from users isn't practical given the UX flow -- by the time someone has investigated a flagged citation, they've navigated away. One thing we could try as a proxy though: if a user makes a tagged edit to a citation after checking it, that's a signal that something was off.

Yeah, I'll think a bit about this but also we don't need so much data that we can't hand-label after the fact. In our experience, the larger models are quite good so I suspect this should be pretty quick for us to validate.

Updates on product conversations:

  • After speaking with @ppelberg and @Sucheta-Salgaonkar-WMF, these are the questions that we can ideally answer by the end of March to help guide Product decisions:
    • A clear assessment of each stage in this pipeline (identification of citation -> extraction of associated claim -> extraction of associated URL -> scraping the external source -> model prediction) and what our precision/recall looks like at each stage. I think we already have a decent sense of most parts, and really it's just getting a better sense of how small a model we can get away with. And then putting it all together to get a sense of what coverage a tool could provide and how to think about contextualizing the results.
    • A sense of whether these outputs would be appropriate to put in front of newcomers (as a nudge to fix a citation they have added) or if most of the value would be for patrollers (prioritizing edits to verify).
  • I'm going to speak with a few folks and then aim to get clearer answers to next steps by the end of next week.

Weekly updates:

  • I've scoped out what the remaining work for this would look like before the end of March. Essentially it would have two parallel components:
    • Expand out the evaluation datasets further
    • Expand out the model evaluation to include more of these mid-sized models
  • There are some stretch components that I would love but that aren't necessary in this first pass -- e.g., expanding to more languages; testing out smaller, BERT-like language models (probably with an LLM still used, but only for splitting claims into their atomic components).
  • My summary to date on what we know about the different stages of an automated-verification pipeline based on my annotated dataset of newcomer edit data from enwiki in Dec '25:
    • Step 1) Citation Extraction: approximately 10% of citations can't be evaluated (mostly because they lack a URL, but some because the claim is unclear). Otherwise we can extract a claim + URL for most newly-added citations. My sense is that use of an LLM in the final prediction stage allows us to not further refine the claims -- i.e. attempt to split them into individual components.
    • Step 2) Source Extraction: approximately 25% of sites block scraping. Another 10% or so look successful but aren't really -- e.g., paywalls, content embedded in elements that aren't easy to process, or an additional click required to get from a paper abstract to the full text or to expand content.
    • Step 3) Source+Claim Verification: of note, before we even reach this stage, we're looking at almost 50% of citations where we can't return an evaluation. While that's pretty high and likely to go higher given that GenAI is breaking the open web, this still does leave a lot of references that we can evaluate. Performance here varies for larger models, but generally I find them (even as "small" as gpt-oss-20B) to be well-calibrated and return useful explanations. I have not tested smaller language models like the 7B MiniCheck though their 0.7B-parameter model did not seem effective to me.
  • We're working on getting more support for me so this can move more quickly than I'm able to push it!
  • I spoke with @Samwalton9-WMF about the relevance of this work to moderator workflows with the idea of e.g., a filter in something like RecentChanges that we could provide to folks where they can quickly identify edits that failed some aspect of this automated verification check so they can look into it more deeply. He confirmed that anecdotally checking the references is one of the more time-consuming aspects of review, especially in spaces like articles-for-creation where you're not just checking one citation+claim but potentially many references and comparing against a lot of different claims. Relevant ticket: T409059: Surface noteworthy edits to editors
  • I noticed the notification about logging in your script @Alaexis -- excited that the server-side data collection is happening now too! Our Legal folks are at an offsite this week but I still have a TODO to connect with them over my usage of that data.

@Isaac thanks for sharing the statistics and impressions. Good to get confirmation that this is solving a real problem for editors reviewing recent changes or articles for creation. It's tempting to imagine an edit filter, but it needs to be considered carefully -- there will be false positives and errors, and we don't want to discourage new editors. Another modality is a bot that flags citations that fail verification by adding tags or by posting on the talk pages.

Regarding the logging, I've just realised something broke down on Feb 22 and nothing was written to the db since then. It's very frustrating as I see growing usage, so I hope to fix it soon.

I've fixed the logging so now we'll start collecting data.

Weekly updates:

  • @Trokhymovych has joined the project! We've started collecting models to test on the task. The set we have right now:
    • openai/gpt-oss-20b (mid-sized but oriented for policy-related tasks)
    • google/gemma-3n-E4B-it and gemma-3-4b-it (smaller and not fully open-source, but said to be good at multilingual tasks, so worth evaluating)
    • Qwen3.5-397B-A17B (open-weights but huge; presumably an upper limit on performance, similar to the 3rd-party models in the user script)
    • Qwen3.5-9B and Qwen3.5-4B (smaller open-weights models with reportedly good performance)
  • Still discussing specific hypothesis language but hopefully will officially register that soon.
  • We had a question about potential volume if this was run as something like Suggestion Mode so I computed some data: December 2025 had 78,460 new edits on English Wikipedia that added a citation via VisualEditor. So even if we can only cover half of that with the new system after removing sources w/o URLs and ones where the sources can't be scraped, that's still 40,000 edits that could be assessed per month, which is approximately one per minute. I'd have to look at how much this goes up if you include wikitext editor etc. (so run it on every edit regardless of how it's created).
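The per-minute figure works out as follows (a back-of-envelope check using the numbers above):

```python
# Rough volume estimate for a Suggestion-Mode deployment
monthly_edits = 78_460   # enwiki, Dec 2025: VisualEditor edits that added a citation
coverage = 0.5           # rough share left after dropping URL-less / unscrapable sources

assessable_per_month = monthly_edits * coverage     # ~39,000 edits
minutes_per_month = 31 * 24 * 60                    # 44,640 minutes
rate_per_minute = assessable_per_month / minutes_per_month  # ~0.9, i.e. about one per minute
```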

I've fixed the logging so now we'll start collecting data.

Great!

Great news! Running it like Suggestion Mode would be amazing. Is it something that WMF infrastructure can support now? This would really move the needle. The reach of user scripts is limited (mine has 14 active users).

One more failure mode I've encountered is context window overflow (e.g., https://www.hudson.org/sahelian-or-littoral-crisis-examining-widening-nigerias-boko-haram-conflict). I haven't had time to investigate it - when I calculated the approximate number of tokens it seemed like it should fit in the 65k window but somehow it doesn't. I suppose the models you plan to check support larger context windows?
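On context windows: a quick pre-flight check along these lines might help catch overflows before calling the model. The ~4-characters-per-token ratio is a rough heuristic for English prose; a real check should use the model's own tokenizer, and the function names here are illustrative:

```python
def rough_token_count(text: str) -> int:
    # Very rough heuristic: ~4 characters per token for English prose;
    # non-English text and markup-heavy pages can deviate substantially,
    # which may explain overflows that "should" have fit.
    return len(text) // 4

def fits_context(source_text: str, claim: str,
                 window: int = 65_536, reserve: int = 2_048) -> bool:
    """Reserve headroom for the system prompt and the model's own output."""
    return rough_token_count(source_text) + rough_token_count(claim) + reserve <= window
```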