Expose constraint violations to WDQS
Open, Normal · Public · 8 Story Points

Description

We could expose constraint violations to Wikidata-Query-Service, so that we could query them.
For this we need an interface that allows us to write to Wikidata-Query-Service from Wikibase-Quality.
When running a constraint check for a specific item, we could then delete its existing violations and create new ones in Wikidata-Query-Service.
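
As a very rough sketch (not an agreed design), such a write could look like the following SPARQL Update; the ex:hasViolation predicate and the item/statement IDs are placeholders only:

PREFIX ex: <http://example.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>

# Sketch only: drop the previously stored violations for the checked item's statements ...
DELETE { ?statement ex:hasViolation ?oldConstraint . }
WHERE {
  wd:Q42 ?anyClaim ?statement .                  # nodes linked from the checked item (its statement nodes, in this model)
  ?statement ex:hasViolation ?oldConstraint .
};
# ... and write the freshly computed violations.
INSERT DATA {
  wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A ex:hasViolation wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301 .
}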

This would allow queries like:
Give me all

  • mandatory constraint violations
  • for IMDb ID (P345)
  • for actors that live in Germany and are born before 1945
Jonas created this task. Apr 19 2018, 4:00 PM
Jonas triaged this task as Normal priority.
Gehel added a subscriber: Gehel. Apr 20 2018, 1:00 PM
Gehel added a comment. Apr 20 2018, 1:08 PM

A few notes (without much thinking, so don't read too much into them):

  • WDQS nodes are operated in a "share nothing" mode, which means that each node is responsible for pulling its data from different places (Wikidata, categories dumps, ...). This total independence between the nodes is a very nice property in terms of reliability, ease of management, etc., but it also means that there is no single write interface to any of our WDQS clusters.
  • As part of reimaging a WDQS node, we do a full data reimport, starting with a dump of Wikidata and replaying updates from the date of the dump (either from recent changes or from Kafka). Any additional data source needs to have a way to be replayed as well.

The data model for a constraint violation would be a single triple pointing from the violating statement to the constraint it violates.

A few notes on load / scale: last week (2018-04-12 – 2018-04-19), there were some 121.3K requests to the wbcheckconstraints API, according to Grafana. Assuming the vast majority of them were for a single entity, that comes out to approximately 17K writes per day, or one every five seconds. We don’t have data on how many violations there are per constraint check, but I would estimate the average to be somewhere between 10 and 100.
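
Spelled out (assuming a seven-day window and roughly one write per request):

\frac{121{,}300 \text{ requests}}{7 \text{ days}} \approx 17{,}300 \text{ writes/day}, \qquad \frac{86{,}400 \text{ s/day}}{17{,}300 \text{ writes/day}} \approx 5 \text{ s between writes}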

At the moment, we can’t make any promises about the completeness of this data anyway, so it might be okay if we don’t have the data after a reimport. Alternatively – might it be possible to get the data from another query service node? The data should be easy to identify – all triples with a special predicate (something like wikibase:violatesConstraint).
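
For illustration, copying the data from another query service node could then be as simple as exporting all triples with that predicate. A sketch, assuming the (still hypothetical) wikibase:violatesConstraint predicate:

CONSTRUCT { ?statement wikibase:violatesConstraint ?constraint . }
WHERE     { ?statement wikibase:violatesConstraint ?constraint . }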

So, if you want to put a dataset into the database, here are the questions to answer:

  1. What kind of queries are we planning to run? Which use cases would they support? I am not sure I am clear on the use case for "constraint violations for actors that live in Germany and are born before 1945" - what use case would produce such a query?
  2. Do we need to have the data in WDQS at all? We have the MWAPI gateway, maybe we could just query a suitable API?
  3. Is this data set separate from Wikidata data, or does it need to be in the same namespace (depends on cross-querying needs)?
  4. What is the data model (it would be nice to have a wiki page describing it)?
  5. How are the data updated - when does an update happen, what triggers it, which data are updated, how soon do we need the updates, etc.? Note that there is no external push write interface to the database, by design, and having one would involve significant security hurdles to clear - to ensure that only authorized clients can modify the data, and only the part of the data they are authorized to. As Blazegraph does not support users/roles and other access controls, we may have to find some solution for this.
  6. How would these data be imported/reimported if a node is reimaged? Right now WDQS is designed as secondary data storage - i.e. it does not store any data which does not have a primary source, and it can be cleaned up and restored from external sources.
Jonas added a comment. Apr 23 2018, 7:58 AM

So, if you want to put a dataset into the database, here are the questions to answer:

  1. What kind of queries are we planning to run? Which use cases would they support? I am not sure I am clear on the use case for "constraint violations for actors that live in Germany and are born before 1945" - what use case would produce such a query?

The use case is similar to the existing maintenance queries. A user wants to keep their domain or project clean and meet the 100% criterion.

  2. Do we need to have the data in WDQS at all? We have the MWAPI gateway, maybe we could just query a suitable API?

Storing it somewhere else would not let us scale as easily while staying flexible with the queries at the same time.

  3. Is this data set separate from Wikidata data, or does it need to be in the same namespace (depends on cross-querying needs)?

It needs to be in the same namespace; yes, cross-querying is needed.

  4. What is the data model (it would be nice to have a wiki page describing it)?

@Lucas_Werkmeister_WMDE, could you please provide a draft?

  5. How are the data updated - when does an update happen, what triggers it, which data are updated, how soon do we need the updates, etc.? Note that there is no external push write interface to the database, by design, and having one would involve significant security hurdles to clear - to ensure that only authorized clients can modify the data, and only the part of the data they are authorized to. As Blazegraph does not support users/roles and other access controls, we may have to find some solution for this.

When a constraint check is executed for an item, the result will be stored and the old result for that item will be deleted.
Access will only be allowed from within the cluster.

  6. How would these data be imported/reimported if a node is reimaged? Right now WDQS is designed as secondary data storage - i.e. it does not store any data which does not have a primary source, and it can be cleaned up and restored from external sources.

It will never be imported and it will never be complete.
It is just a snapshot.

When a constraint check is executed for an item, the result will be stored and the old result for that item will be deleted.

So, constraint checks are bound to existing items? I wonder if it would be possible to expose them in the same way page properties are exposed.

It will never be imported and it will never be complete.

It is just a snapshot.

But a snapshot of what? Right now we can (and do) delete the whole database, reimage any host, and restore it from the latest dump + RC feed + category dump. However, neither of those contains constraint check data. So how would this be managed?

Access will only be allowed from within the cluster.

Still, this means we need to create a non-local write interface that previously didn't exist, and put access controls (by IP or otherwise) on it. We will need to research how easy that would be...

So, constraint checks are bound to existing items? I wonder if it would be possible to expose them in the same way page properties are exposed.

Yes, kinda. The problem is they are very expensive, and it can take multiple seconds or even a minute for them to be executed.
That is why we cannot put them into the page properties, AFAIK.
We currently execute them when a logged-in user visits the item page.
It is done with a widget that calls the wbcheckconstraints API.
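
For reference, the request the widget sends is roughly of this form (Q42 is just a placeholder here, and the exact parameters may differ):

https://www.wikidata.org/w/api.php?action=wbcheckconstraints&format=json&id=Q42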

But a snapshot of what? Right now we can (and do) delete the whole database, reimage any host, and restore it from the latest dump + RC feed + category dump. However, neither of those contains constraint check data. So how would this be managed?

It is OK if they are deleted. They will be repopulated over time as users execute constraint checks.

So from what I understood, you would prefer a pull mechanism similar to recent changes. Is that true?
Pulling should not be a problem. We have an API, and we could notify you via the event bus.
The important thing would be that you only pull cached results, because calculating them is very expensive.
Does that sound good?

It is OK if they are deleted. They will be repopulated over time as users execute constraint checks.

That would only be true for the entities that have been visited by logged-in users since the deletion. It could take weeks or even months to have the required entities in the cache to be able to find and fix all the constraint violations "for actors that live in Germany and are born before 1945" using the queries. Even if the violations are stale for some time, it would be nice to have a job or something going through the entities, creating and caching reports for the entities that don't have one stored, at least in the place where WDQS will be pulling that info from.

Jonas added a comment (edited). Apr 25 2018, 6:30 PM

That would only be true for the entities that have been visited by logged-in users since the deletion. It could take weeks or even months to have the required entities in the cache to be able to find and fix all the constraint violations "for actors that live in Germany and are born before 1945" using the queries. Even if the violations are stale for some time, it would be nice to have a job or something going through the entities, creating and caching reports for the entities that don't have one stored, at least in the place where WDQS will be pulling that info from.

Sorry, I know it is not optimal, but would you rather wait for the ideal solution or have the querying now?
We can still fix the restore part later, and there are multiple workarounds we could use in the meantime...

Micru added a subscriber: Micru. Apr 26 2018, 7:09 AM

The data model would be fairly simple – a single triple, as I already mentioned.

wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A wikibase:hasViolationForConstraint wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301.

Both the subject and the object are statement nodes. The exact name of the predicate is still up for discussion. To keep the data model simple, I think we can squash constraint violations for the main snak, qualifiers, and references all into the same predicate (which is why I’m proposing “has violation” and not “violates” now), and people can see where exactly the violation is when they visit the entity.

Example query (the one in the task description: mandatory constraint violations for “IMDb ID” on actors living in Germany born before 1945):

SELECT ?item ?itemLabel ?constraintTypeLabel WHERE {
  wd:P345 p:P2302 ?constraint.
  ?constraint ps:P2302 ?constraintType;
              pq:P2316 wd:Q21502408.
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q33999;
        wdt:P551/wdt:P17 wd:Q183;
        wdt:P569 ?dob.
  FILTER(?dob < "1945-01-01"^^xsd:dateTime)
  ?item p:P345/wikibase:hasViolationForConstraint ?constraint.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

There are some more requested example queries in T172380: Query constraint violations with WDQS (I just made that task a parent task of this one).

abian added a subscriber: abian. May 1 2018, 12:52 PM
hoo added a subscriber: hoo. May 8 2018, 1:44 PM

After discussion with @Jonas, here's what we can do now without very major effort:

  1. We make the wbcheckconstraints API produce an RDF representation (also TBD) of the checks.
  2. We create a parameter for wbcheckconstraints to only deliver results if they can be delivered fast (e.g. already cached).
  3. The WDQS Updater, when updating an edited item, will also pull the API above, load the constraints data, and join it with the rest of the data.

This means it will only be updated when the item is edited, and only if the constraint check can prepare the cached data by the time the Updater gets to it. It also has a race condition: one server could hit Wikidata before the constraints are ready and another one after, so the servers would have different data. We will need to see whether that is a real concern in production. But we could at least try this as a prototype.
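
To make step 1 a bit more concrete (purely as a sketch, since the RDF representation is explicitly TBD): the per-item payload might contain little more than the violation triples proposed earlier in this task, which the Updater would load and later replace alongside the item's regular data, e.g.

wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A wikibase:hasViolationForConstraint wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301 .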

This means it will only be updated … if the constraint check can prepare the cached data by the time the Updater gets to it.

The updater usually doesn’t take more than a few seconds to reach an item, right? I’m skeptical whether this will be possible…

Also, the updater needs to learn how to remove the old constraints data. (I guess it already knows how to remove other old data from the item, so hopefully that shouldn’t be too difficult.)

The updater usually doesn’t take more than a few seconds to reach an item, right?

Yes.

I’m skeptical whether this will be possible…

Then we need a different model.

Also, the updater needs to learn how to remove the old constraints data.

That's not a problem, we do the same thing for the rest of the RDF data.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. May 8 2018, 8:18 PM
Multichill added a subscriber: Multichill.
Lucas_Werkmeister_WMDE set the point value for this task to 8. May 15 2018, 1:39 PM

Change 434015 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] [WIP] [DNM] Add loading constraints data

https://gerrit.wikimedia.org/r/434015

Smalyshev moved this task from Next to Doing on the User-Smalyshev board. May 20 2018, 10:27 AM
Smalyshev moved this task from Doing to Next on the User-Smalyshev board. May 29 2018, 7:49 PM
Smalyshev moved this task from Next to Doing on the User-Smalyshev board. Wed, Jun 27, 11:53 PM
Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board.
Smalyshev moved this task from Doing to In review on the User-Smalyshev board. Sat, Jun 30, 5:02 AM

Change 434015 merged by jenkins-bot:
[wikidata/query/rdf@master] Add loading constraints data

https://gerrit.wikimedia.org/r/434015

Change 445454 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable fetching constraints for Updater

https://gerrit.wikimedia.org/r/445454