
Regularly run constraint checks for all items
Open, MediumPublic

Description

As a user, I would like to query all constraint violations in WDQS, even for items not visited by logged-in users, so that I can find and fix the violations.

At the moment, constraint checks are only run when a logged-in user visits an item or when they are explicitly requested via the API.
Therefore only some constraint violations are queryable in WDQS.
To be able to query all constraint violations, we need to run constraint checks on all existing items regularly.

Related Objects

Event Timeline

Jonas triaged this task as Medium priority.
Addshore changed the task status from Open to Stalled. Jun 22 2019, 10:48 PM

Can I make two related requests? Not sure how to post them as separate tasks related to this task, can someone from WMD do that?

  1. Change the query to return the offending value rather than the offending statement, which is more useful, e.g. to generate a QuickStatements script to remove offending values. E.g. for P2088 "distinct values", change SPARQL (new) to this:
SELECT DISTINCT ?item ?itemLabel ?value WHERE {
	?statement wikibase:hasViolationForConstraint wds:P2088-DD4CDCEA-B3F6-4F02-9CFB-4A9E312B73A8 .
	?item p:P2088 ?statement .
	?statement ps:P2088 ?value.
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  2. WMD should take over the operation of KrBot (which generates the Constraint Violation pages), because it now requires huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See the discussion at User_talk:Ivan_A._Krestinin and on Telegram

Can I make two related requests? Not sure how to post them as separate tasks related to this task, can someone from WMD do that?

You can create a new task. After saving click "edit related tasks" and add this one as parent :)

  2. WMD should take over the operation of KrBot (which generates the Constraint Violation pages), because it now requires huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See the discussion at User_talk:Ivan_A._Krestinin and on Telegram

I created a task for this: https://phabricator.wikimedia.org/T290635

@Addshore Resolved T204031 on 11 Aug: "Deploy regular running of wikidata constraint checks using the job queue: These now run after every edit"

In addition to "after every edit", I think "regular" means that periodic running should also be scheduled.

E.g., I'm now cleaning up Crunchbase IDs that were inserted over many years (and in a big batch a couple of months ago).

I'm reopening this task.

It sounds like we need a definition of regularly!

Here's the goal: a SPARQL query should return all violations of a certain kind, with a possible data lag of a few hours.
So you need:

  • a baseline of having processed all items (TODO)
  • processing of changed items (DONE)
  • periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

DBpedia Live uses a similar flow:

  • each save of a Wikipedia page causes a work item to be posted to a queue
  • but all remaining pages are also scheduled for processing (at lower priority) to capture slow changes to:
    • the Extraction Framework or
    • the DBpedia Mappings

Do you have a picture of how many violations, and violations split by type / kind exist on Wikidata at any given time?

periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

How to approach this one is still an open question.
We could probably gather some data for this by looking at the distribution of latest edit times for all Items, to see how much work this would actually be.
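That distribution could, in principle, be gathered with a query along these lines. This is only a sketch: the public WDQS endpoint would almost certainly time out scanning every Item, so it is assumed to run against a dump or a more scalable engine.

```sparql
# Sketch (assumption: run against a dump or an engine that can scan all
# Items; the public WDQS endpoint would time out). Counts Items by the
# year of their last edit, using the schema:dateModified timestamp that
# the Wikidata RDF mapping attaches to each entity.
SELECT ?year (COUNT(?item) AS ?items) WHERE {
  ?item wikibase:sitelinks ?sitelinkCount ;  # restricts ?item to Items
        schema:dateModified ?modified .
  BIND(YEAR(?modified) AS ?year)
}
GROUP BY ?year
ORDER BY ?year
```

Items last edited many years ago would be the ones a periodic re-check pass actually has to cover.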

Do you have a picture of how many violations, and violations split by type / kind exist on Wikidata at any given time?

Looks like we’re averaging about 1⅔ constraint violations per item (excluding redirects, if I’m not mistaken):

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
853
$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
806
$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
811
$ units -t '(853 + 806 + 811) / 1500'
1.6466667

Properties have substantially more:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=120&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r title; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=${title#*:}&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
2987
$ units -t '2987 / 500'
5.974

Lexemes less:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=146&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r title; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=${title#*:}&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
187
$ units -t '187 / 500'
0.374

That would suggest some 155 million constraint violations in total (almost all of them on items):

$ units -t '((853 + 806 + 811) * 94504342 / 1500) + (2987 * 9166 / 500) + (187 * 578263 / 500)' 'million'
155.88818

That would suggest some 155 million constraint violations in total (almost all of them on items):

$ units -t '((853 + 806 + 811) * 94504342 / 1500) + (2987 * 9166 / 500) + (187 * 578263 / 500)' 'million'
155.88818

In that case it sounds like "a SPARQL query should return all violations of a certain kind" is probably not a realistic goal, and we want something other than a SPARQL query for such a report.

Then I think I misunderstood your question? We certainly don’t have 155 million violations “of a certain kind”. I’m guessing most reports that users would want would be much smaller; on Wikidata:Database reports/Constraint violations/Summary, the highest “real” number of violations (split by property and constraint type) is 1993092, a bit under two million. ([...document.querySelectorAll('tbody td:nth-child(n+3):not(:empty)')].map(n => parseInt(n.textContent.trim())).filter(Number.isFinite).sort((a, b) => a - b).slice(-10); the top two numbers are “one of” constraints that are deprecated, i.e. they’re really just meant for suggestions, so I’m not counting them here.)

@Lucas, congratulations on your command-line wizardry! I know "jq" but not nearly to that extent, and how did I not know about "units"?

Here's the goal: a SPARQL query should return all violations of a certain kind, with a possible data lag of a few hours.
So you need:

  • a baseline of having processed all items (TODO)
  • processing of changed items (DONE)
  • periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

Why not just run a SPARQL query to *find* the violations directly from the data? With a decent SPARQL engine (e.g., QLever) this should be possible, but perhaps only for a single property at a time for some constraints. The lag would be the lag of the service, which for the public QLever-based service is not very large, as far as I know.

This would not involve any of the infrastructure that currently detects, records, and reports on violations.

If you have a particular set of violations that you want to find, I should be able to craft a query to find the violations, provided that the constraint description provides adequate information to determine just what constitutes a violation.
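As a sketch of what such a direct query could look like, here is the P2088 "distinct values" example from earlier in this thread, checked straight against the data rather than against recorded check results. Constraint exceptions are deliberately ignored here, and using the truthy `wdt:` predicate means deprecated statements are skipped.

```sparql
# Sketch: find "distinct values" violations for P2088 directly from the
# data, with no constraint-check infrastructure involved. Any value
# shared by two Items is reported as a violation; constraint exceptions
# are NOT handled, and wdt: only sees truthy (non-deprecated) statements.
SELECT DISTINCT ?item1 ?item2 ?value WHERE {
  ?item1 wdt:P2088 ?value .
  ?item2 wdt:P2088 ?value .
  FILTER(STR(?item1) < STR(?item2))  # report each conflicting pair once
}
```

Other constraint types ("format", "one of", "inverse", …) would each need their own hand-written query of this shape.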

Not every constraint type can be reasonably checked in SPARQL. (For instance, “allowed units” needs access to the unit conversion configuration, which is a JSON file in the MediaWiki config repository.) And also –

This would not involve any of the infrastructure that currently detects, records, and reports on violations.

Put differently, you’re proposing to build an independent third implementation of constraint checks. I think it’s bad enough that we have two already (WikibaseQualityConstraints and KrBot) which sometimes give different answers; I don’t think adding another alternative, with its own subtle differences in what is and isn’t a violation, would be an improvement.

Given that KrBot is a third-party closed-source (?) tool, I would be happy having it replaced.

Given that there are constraints that appear to me to be specified as SPARQL queries, this would only be extending the reach of SPARQL in the constraint system.

I don't see why there couldn't be RDF information made available for the unit conversion configuration.

Is there a good description of how the constraint system is implemented, preferably including the role of third-party tools?

Given that KrBot is a third-party closed-source (?) tool, I would be happy having it replaced.

See T290961.