
Regularly run constraint checks for all items
Open, MediumPublic

Description

As a user, I would like to query all constraint violations in WDQS, even for items not visited by logged-in users, so that I can find and fix the violations.

At the moment, constraint checks are only run when a logged-in user visits an item or when they are explicitly requested via the API.
Therefore only some constraint violations are queryable in WDQS.
To be able to query all constraint violations, we need to run constraint checks on all existing items regularly.

Related Objects

Event Timeline

Jonas triaged this task as Medium priority.
Addshore changed the task status from Open to Stalled. Jun 22 2019, 10:48 PM

Can I make two related requests? Not sure how to post them as separate tasks related to this task, can someone from WMD do that?

  1. Change the query to return the offending value rather than the offending statement, which is more useful, e.g. to generate a QuickStatements script to remove offending values. E.g. for P2088 "distinct values", change SPARQL (new) to this:
SELECT DISTINCT ?item ?itemLabel ?value WHERE {
	?statement wikibase:hasViolationForConstraint wds:P2088-DD4CDCEA-B3F6-4F02-9CFB-4A9E312B73A8 .
	?item p:P2088 ?statement .
	?statement ps:P2088 ?value.
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  2. WMD should take over the operation of KrBot (which generates the Constraint Violation pages), because it now requires huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See the discussion at User_talk:Ivan_A._Krestinin and on Telegram

Can I make two related requests? Not sure how to post them as separate tasks related to this task, can someone from WMD do that?

You can create a new task. After saving click "edit related tasks" and add this one as parent :)

  2. WMD should take over the operation of KrBot (which generates the Constraint Violation pages), because it now requires huge resources ("Now the bot requires 106 GB of memory to load and process all data") and Ivan Krestinin cannot cope. See the discussion at User_talk:Ivan_A._Krestinin and on Telegram

I created a task for this: https://phabricator.wikimedia.org/T290635

@Addshore Resolved T204031 on 11 Aug: "Deploy regular running of wikidata constraint checks using the job queue: These now run after every edit"

In addition to "after every edit", I think "regular" means that periodic running should also be scheduled.

E.g., I'm now cleaning up Crunchbase IDs that were inserted over many years (and in a big batch a couple of months ago).

I'm reopening this task.

It sounds like we need a definition of regularly!

Here's the goal: a SPARQL query should return all violations of a certain kind, with a possible data lag of a few hours.
So you need:

  • a baseline of having processed all items (TODO)
  • processing of changed items (DONE)
  • periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

DBpedia Live uses a similar flow:

  • each save of a Wikipedia page causes a work item to be posted to a queue
  • but all remaining pages are also scheduled for processing (at lower priority) to capture slow changes to:
    • the Extraction Framework or
    • the DBpedia Mappings

Do you have a picture of how many violations, and violations split by type / kind exist on Wikidata at any given time?

periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

How to approach this one is still an open question.
We could probably gather some data for this by looking at the distribution of latest edit times for all Items, to see how much work this would actually be.
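That distribution could, in principle, be gathered with a query along these lines. This is only a sketch: the public WDQS endpoint would almost certainly time out scanning every Item, so it is assumed to run against a dump or a more scalable engine.

```sparql
# Sketch (assumption: run against a dump or an engine that can scan all
# Items; the public WDQS endpoint would time out). Counts Items by the
# year of their last edit, using the schema:dateModified timestamp that
# the Wikidata RDF mapping attaches to each entity.
SELECT ?year (COUNT(?item) AS ?items) WHERE {
  ?item wikibase:sitelinks ?sitelinkCount ;  # restricts ?item to Items
        schema:dateModified ?modified .
  BIND(YEAR(?modified) AS ?year)
}
GROUP BY ?year
ORDER BY ?year
```

Items last edited many years ago would be the ones a periodic re-check pass actually has to cover.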

Do you have a picture of how many violations, and violations split by type / kind exist on Wikidata at any given time?

Looks like we’re averaging about 1⅔ constraint violations per item (excluding redirects, if I’m not mistaken):

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
853
$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
806
$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r id; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=$id&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
811
$ units -t '(853 + 806 + 811) / 1500'
1.6466667

Properties have substantially more:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=120&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r title; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=${title#*:}&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
2987
$ units -t '2987 / 500'
5.974

Lexemes less:

$ curl -s 'https://www.wikidata.org/w/api.php?action=query&list=random&rnnamespace=146&rnlimit=500&format=json&formatversion=2' | jq -r '.query.random | .[] | .title' | while IFS= read -r title; do curl -s "https://www.wikidata.org/w/api.php?action=wbcheckconstraints&id=${title#*:}&format=json&formatversion=2" | jq -c '.. | .results? | select(length > 0) | .[] | 1'; done | wc -l
187
$ units -t '187 / 500'
0.374

That would suggest some 155 million constraint violations in total (almost all of them on items):

$ units -t '((853 + 806 + 811) * 94504342 / 1500) + (2987 * 9166 / 500) + (187 * 578263 / 500)' 'million'
155.88818

That would suggest some 155 million constraint violations in total (almost all of them on items):

$ units -t '((853 + 806 + 811) * 94504342 / 1500) + (2987 * 9166 / 500) + (187 * 578263 / 500)' 'million'
155.88818

In that case it sounds like "a SPARQL query should return all violations of a certain kind" is probably not a realistic goal, and we want something other than a SPARQL query for such a report.

Then I think I misunderstood your question? We certainly don’t have 155 million violations “of a certain kind”. I’m guessing most reports that users would want would be much smaller; on Wikidata:Database reports/Constraint violations/Summary, the highest “real” number of violations (split by property and constraint type) is 1993092, a bit under two million. ([...document.querySelectorAll('tbody td:nth-child(n+3):not(:empty)')].map(n => parseInt(n.textContent.trim())).filter(Number.isFinite).sort((a, b) => a - b).slice(-10); the top two numbers are “one of” constraints that are deprecated, i.e. they’re really just meant for suggestions, so I’m not counting them here.)

@Lucas, congratulations on your command-line wizardry! I know "jq" but not nearly to that extent, and how did I not know about "units"?

Here's the goal: a SPARQL query should return all violations of a certain kind, with a possible data lag of a few hours.
So you need:

  • a baseline of having processed all items (TODO)
  • processing of changed items (DONE)
  • periodic processing of every item because constraint definitions or implementations can change globally (TODO?)

Why not just run a SPARQL query to *find* the violations directly from the data? With a decent SPARQL engine (e.g., QLever) this should be possible, but perhaps only for a single property at a time for some constraints. The lag would be the lag of the service, which for the public QLever-based service is not very large, as far as I know.

This would not involve any of the infrastructure that currently detects, records, and reports on violations.

If you have a particular set of violations that you want to find, I should be able to craft a query to find the violations, provided that the constraint description provides adequate information to determine just what constitutes a violation.
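As a sketch of what such a direct query could look like, here is the P2088 "distinct values" example from earlier in this thread, checked straight against the data rather than against recorded check results. Constraint exceptions are deliberately ignored here, and using the truthy `wdt:` predicate means deprecated statements are skipped.

```sparql
# Sketch: find "distinct values" violations for P2088 directly from the
# data, with no constraint-check infrastructure involved. Any value
# shared by two Items is reported as a violation; constraint exceptions
# are NOT handled, and wdt: only sees truthy (non-deprecated) statements.
SELECT DISTINCT ?item1 ?item2 ?value WHERE {
  ?item1 wdt:P2088 ?value .
  ?item2 wdt:P2088 ?value .
  FILTER(STR(?item1) < STR(?item2))  # report each conflicting pair once
}
```

Other constraint types ("format", "one of", "inverse", …) would each need their own hand-written query of this shape.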

Not every constraint type can be reasonably checked in SPARQL. (For instance, “allowed units” needs access to the unit conversion configuration, which is a JSON file in the MediaWiki config repository.) And also –

This would not involve any of the infrastructure that currently detects, records, and reports on violations.

Put differently, you’re proposing to build an independent third implementation of constraint checks. I think it’s bad enough that we have two already (WikibaseQualityConstraints and KrBot) which sometimes give different answers; I don’t think adding another alternative, with its own subtle differences in what is and isn’t a violation, would be an improvement.

Given that KrBot is a third-party closed-source (?) tool, I would be happy having it replaced.

Given that there are constraints that appear to me to be specified as SPARQL queries, this would only be extending the reach of SPARQL in the constraint system.

I don't see why there couldn't be RDF information made available for the unit conversion configuration.

Is there a good description of how the constraint system is implemented, preferably including the role of third-party tools?

Given that KrBot is a third-party closed-source (?) tool, I would be happy having it replaced.

See T290961.