Regularly run constraint checks for all items
Open, Medium, Public

Description

As a user I would like to query all constraint violations in WDQS even if the item is not visited by logged in users, so I can find violations and fix them.

At the moment, constraint checks are only run when a logged-in user visits an item or when the item is explicitly checked via the API.
Therefore only some constraint violations are queryable in WDQS.
To be able to query all constraint violations, we need to run constraint checks on all existing items regularly.

Event Timeline

Jonas triaged this task as Medium priority. Aug 3 2018, 9:22 AM
Jonas created this task.

Can I make two related requests? Not sure how to post them as separate tasks related to this task, can someone from WMD do that?

  1. Change the query to return the offending value rather than the offending statement, which is more useful, e.g. for generating a QS script that removes offending values. E.g. for P2088 Distinct Values, change the SPARQL (new) to this:
SELECT DISTINCT ?item ?itemLabel ?value WHERE {
	?statement wikibase:hasViolationForConstraint wds:P2088-DD4CDCEA-B3F6-4F02-9CFB-4A9E312B73A8 .
	?item p:P2088 ?statement .
	?statement ps:P2088 ?value.
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  2. WMD should take over the operation of KrBot (that generates Constraint Violation pages) because it now takes huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See discussion at User_talk:Ivan_A._Krestinin and at Telegram
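
For the first request, here is a minimal sketch of how rows from such a query could be turned into QuickStatements removal commands. It assumes the rows are the `results.bindings` list of the standard SPARQL JSON result format, and that a line prefixed with `-` removes a statement in QuickStatements V1 syntax. The function name `qs_removal_lines` and the sample row are hypothetical, and the output is meant to be reviewed by a human before submission.

```python
def qs_removal_lines(bindings, prop="P2088"):
    """Build one QuickStatements V1 removal line per (item, value) row.

    Assumes string-like values (e.g. external IDs) that contain no tabs
    or double quotes; other datatypes would need different formatting.
    """
    lines = []
    for row in bindings:
        # ?item comes back as a full entity URI; keep only the Q-id.
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        value = row["value"]["value"]
        lines.append(f'-{qid}\t{prop}\t"{value}"')
    return lines


# Made-up example row (Q4115189 is the sandbox item), not real query output.
sample = [
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q4115189"},
        "value": {"type": "literal", "value": "example-org"},
    }
]
print("\n".join(qs_removal_lines(sample)))
```

Note that for a Distinct Values violation every statement sharing the value is reported, so blindly removing all returned values would also remove the one legitimate use.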

Can I make two related requests? Not sure how to post them as separate tasks related to this task, can someone from WMD do that?

You can create a new task. After saving click "edit related tasks" and add this one as parent :)

  1. WMD should take over the operation of KrBot (that generates Constraint Violation pages) because it now takes huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See discussion at User_talk:Ivan_A._Krestinin and at Telegram

I created a task for this: https://phabricator.wikimedia.org/T290635

@Addshore Resolved T204031 on 11 Aug: "Deploy regular running of wikidata constraint checks using the job queue: These now run after every edit"

In addition to "after every edit", I think "regular" means that periodic running should also be scheduled.

Eg I'm now cleaning up Crunchbase IDs that were inserted over many years (and a big batch a couple months ago).

I'm reopening this task.

It sounds like we need a definition of regularly!

Here's the goal: a SPARQL query should return all violations of a certain kind, with a possible data lag of a few hours.
So you need:

  • a baseline of having processed all items (TODO)
  • processing of changed items (DONE)
  • periodic processing of every item because constraint definitions or implementations can change globally (TODO?)
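
The three cases above can be sketched as a single staleness rule. This is only an illustration under stated assumptions: it supposes that a last-edit time and a last-check time are available per item and that there is one global timestamp for the latest change to constraint definitions or implementations. `ItemState` and `needs_check` are hypothetical names, not part of any existing extension.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ItemState:
    last_edit: datetime
    last_check: Optional[datetime]  # None means the item was never checked


def needs_check(state: ItemState, constraints_changed_at: datetime) -> Optional[str]:
    """Return why an item needs a constraint check, or None if it is up to date."""
    if state.last_check is None:
        return "baseline"   # never processed at all
    if state.last_edit > state.last_check:
        return "changed"    # edited since the last check
    if constraints_changed_at > state.last_check:
        return "periodic"   # constraint definitions or code changed since the last check
    return None


changed_at = datetime(2021, 8, 1)
print(needs_check(ItemState(datetime(2020, 1, 1), None), changed_at))
print(needs_check(ItemState(datetime(2021, 9, 1), datetime(2021, 8, 15)), changed_at))
print(needs_check(ItemState(datetime(2020, 1, 1), datetime(2021, 7, 1)), changed_at))
```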

DBpedia Live uses a similar flow:

  • each save of a Wikipedia page causes a work item to be posted to a queue
  • but all remaining pages are also scheduled to be processed (with lower priority) to capture slow changes to:
    • the Extraction Framework or
    • the DBpedia Mappings
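
A two-priority flow of this kind can be sketched with an in-memory heap. This is not how DBpedia Live or the Wikidata job queue is actually implemented; `CheckQueue` and the `EDIT`/`SWEEP` priorities are hypothetical names used only to show edits overtaking the background sweep.

```python
import heapq
import itertools

EDIT, SWEEP = 0, 1  # lower number is served first


class CheckQueue:
    """Priority queue where fresh edits overtake the low-priority sweep."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority
        self._pending = {}               # item id -> best priority currently queued

    def push(self, item_id, priority):
        current = self._pending.get(item_id)
        if current is not None and current <= priority:
            return  # already queued with the same or higher urgency
        self._pending[item_id] = priority
        heapq.heappush(self._heap, (priority, next(self._order), item_id))

    def pop(self):
        while self._heap:
            priority, _, item_id = heapq.heappop(self._heap)
            # Skip stale entries that were superseded by a more urgent push.
            if self._pending.get(item_id) == priority:
                del self._pending[item_id]
                return item_id
        return None


queue = CheckQueue()
for item_id in ["Q1", "Q2", "Q3"]:
    queue.push(item_id, SWEEP)   # background sweep over every item
queue.push("Q3", EDIT)           # Q3 was just edited, so it jumps ahead
served = [queue.pop() for _ in range(4)]
print(served)
```

Each item is served at most once per queued request, so an edit that arrives during the sweep does not cause the same item to be checked twice.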

Do you have a picture of how many violations, and violations split by type / kind exist on Wikidata at any given time?

periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

How to approach this one is still an open question.
We could probably gather some data for this by looking at the distribution of latest edit times for all Items to see how much work this would actually be.
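
As a rough sketch of what "looking at the distribution" could mean, the following buckets last-edit timestamps by age. Where the timestamps come from (a dump, a database replica) is left open here; `age_histogram` and the bucket edges of 1, 30 and 365 days are hypothetical choices for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def age_histogram(last_edits, now, edges_days=(1, 30, 365)):
    """Count items per age bucket, where age is the time since the last edit."""
    counts = Counter()
    for ts in last_edits:
        age = (now - ts).days
        for edge in edges_days:
            if age < edge:
                counts[f"< {edge} d"] += 1
                break
        else:
            # Older than the largest edge: only a periodic sweep reaches these.
            counts[f">= {edges_days[-1]} d"] += 1
    return counts


# Made-up timestamps, not real Wikidata data.
now = datetime(2021, 9, 1)
edits = [now - timedelta(hours=2), now - timedelta(days=10),
         now - timedelta(days=100), now - timedelta(days=400)]
hist = age_histogram(edits, now)
print(dict(hist))
```

The size of the oldest buckets gives a first idea of how many items would only ever be reached by a periodic run rather than by the after-edit checks.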

Do you have a picture of how many violations, and violations split by type / kind exist on Wikidata at any given time?

Looks like we’re averaging about 1⅔ constraint violations per item (excluding redirects, if I’m not mistaken):

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
853
$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
806
$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
811
$ units -t '(853 + 806 + 811) / 1500'
1.6466667

Properties have substantially more:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=120&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r title; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=${title#*:}&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
2987
$ units -t '2987 / 500'
5.974

Lexemes less:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=146&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r title; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=${title#*:}&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
187
$ units -t '187 / 500'
0.374

That would suggest some 155 million constraint violations in total (almost all of them on items):

$ units -t '((853 + 806 + 811) * 94504342 / 1500) + (2987 * 9166 / 500) + (187 * 578263 / 500)' 'million'
155.88818
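
The same extrapolation, spelled out per entity type so the shares are visible. The sample counts and entity totals are the ones used in the commands above; the `SAMPLES` table and the `extrapolate` helper are just an illustrative restatement, and with samples of 500 to 1500 random entities the result carries noticeable sampling error.

```python
# (violations counted, entities sampled, total entities of that type),
# all taken from the figures quoted in the commands above.
SAMPLES = {
    "items": (853 + 806 + 811, 1500, 94_504_342),
    "properties": (2987, 500, 9_166),
    "lexemes": (187, 500, 578_263),
}


def extrapolate(samples):
    """Scale each sample's violation count up to the full entity population."""
    return {
        kind: found * total / sampled
        for kind, (found, sampled, total) in samples.items()
    }


estimates = extrapolate(SAMPLES)
total = sum(estimates.values())
for kind, n in estimates.items():
    print(f"{kind}: {n / 1e6:.3f} million ({n / total:.2%} of total)")
print(f"total: {total / 1e6:.5f} million")
```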

That would suggest some 155 million constraint violations in total (almost all of them on items):

$ units -t '((853 + 806 + 811) * 94504342 / 1500) + (2987 * 9166 / 500) + (187 * 578263 / 500)' 'million'
155.88818

In that case, it sounds like "a SPARQL query should return all violations of a certain kind" is probably not a realistic goal, and we want something other than a SPARQL query for such a report.

Then I think I misunderstood your question? We certainly don’t have 155 million violations “of a certain kind”. I’m guessing most reports that users would want would be much smaller; on Wikidata:Database reports/Constraint violations/Summary, the highest “real” number of violations (split by property and constraint type) is 1993092, a bit under two million. ([...document.querySelectorAll('tbody td:nth-child(n+3):not(:empty)')].map(n => parseInt(n.textContent.trim())).filter(Number.isFinite).sort((a, b) => a - b).slice(-10); the top two numbers are “one of” constraints that are deprecated, i.e. they’re really just meant for suggestions, so I’m not counting them here.)

@Lucas, congratulations on your command line wizardry! I know "jq" but not nearly to that extent, and how did I not know about "units"?