Expose constraint violations to WDQS
Open, Normal · Public · 8 Story Points

Description

We could expose constraint violations to Wikidata-Query-Service, so that we could query them.
For this we need an interface that allows us to write to Wikidata-Query-Service from Wikibase-Quality.
When running a constraint check for a specific item, we could then delete its existing violations and create new ones in Wikidata-Query-Service.
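
As a very rough sketch (not an agreed design), such a write could look like the following SPARQL Update; the ex:hasViolation predicate and the item/statement IDs are placeholders only:

PREFIX ex: <http://example.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>

# Sketch only: drop the previously stored violations for the checked item's statements ...
DELETE { ?statement ex:hasViolation ?oldConstraint . }
WHERE {
  wd:Q42 ?anyClaim ?statement .                  # nodes linked from the checked item (its statement nodes, in this model)
  ?statement ex:hasViolation ?oldConstraint .
};
# ... and write the freshly computed violations.
INSERT DATA {
  wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A ex:hasViolation wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301 .
}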

This would allow queries like:
Give me all

  • mandatory constraint violations
  • for IMDb ID (P345)
  • for actors that live in Germany and are born before 1945
Jonas created this task. Apr 19 2018, 4:00 PM
Jonas triaged this task as Normal priority.
Gehel added a subscriber: Gehel. Apr 20 2018, 1:00 PM
Gehel added a comment. Apr 20 2018, 1:08 PM

A few notes (without much thinking, so don't read too much into them):

  • WDQS nodes are operated in a "share nothing" mode, which means that each node is responsible for pulling its data from different places (Wikidata, categories dumps, ...). This total independence between the nodes is a very nice property in terms of reliability, ease of management, etc., but it also means that there is no single write interface to any of our WDQS clusters.
  • As part of reimaging a WDQS node, we do a full data reimport, starting with a dump of Wikidata and replaying updates from the date of the dump (either from recent changes or from Kafka). Any additional data source needs to have a way to be replayed as well.

The data model for a constraint violation would be a single triple pointing from the violating statement to the constraint it violates.

A few notes on load / scale: last week (2018-04-12 – 2018-04-19), there were some 121.3K requests to the wbcheckconstraints API, according to Grafana. Assuming the vast majority of them were for a single entity, that comes out to approximately 17K writes per day, or one every five seconds. We don’t have data on how many violations there are per constraint check, but I would estimate the average to be somewhere between 10 and 100.
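
Spelled out (assuming a seven-day window and roughly one write per request):

\frac{121{,}300 \text{ requests}}{7 \text{ days}} \approx 17{,}300 \text{ writes/day}, \qquad \frac{86{,}400 \text{ s/day}}{17{,}300 \text{ writes/day}} \approx 5 \text{ s between writes}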

At the moment, we can’t make any promises about the completeness of this data anyway, so it might be okay if we don’t have the data after a reimport. Alternatively – might it be possible to get the data from another query service node? The data should be easy to identify – all triples with a special predicate (something like wikibase:violatesConstraint).
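
For illustration, copying the data from another query service node could then be as simple as exporting all triples with that predicate. A sketch, assuming the (still hypothetical) wikibase:violatesConstraint predicate:

CONSTRUCT { ?statement wikibase:violatesConstraint ?constraint . }
WHERE     { ?statement wikibase:violatesConstraint ?constraint . }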

So, if you want to put a dataset into the database, here are the questions to answer:

  1. What kind of queries are we planning to run? Which use cases would they support? I am not sure I am clear on the use case for "constraint violations for actors that live in Germany and are born before 1945" - what use case would produce such a query?
  2. Do we need to have the data in WDQS at all? We have the MWAPI gateway, maybe we could just query a suitable API?
  3. Is this data set separate from Wikidata data, or does it need to be in the same namespace (depends on cross-querying needs)?
  4. What is the data model (it would be nice to have a wiki page describing it)?
  5. How are the data updated - when does an update happen, what triggers it, which data are updated, how soon do we need the updates, etc.? Note that there is no external push write interface to the database, by design, and having one would involve significant security hurdles to clear - to ensure that only authorized clients can modify the data, and only the part of the data they are authorized to. As Blazegraph does not support users/roles and other access controls, we may have to find some solution for this.
  6. How would these data be imported/reimported if a node is reimaged? Right now WDQS is designed as secondary data storage - i.e. it does not store any data which does not have a primary source, and it can be cleaned up and restored from external sources.
Jonas added a comment. Apr 23 2018, 7:58 AM

So, if you want to put a dataset into the database, here are the questions to answer:

  1. What kind of queries are we planning to run? Which use cases would they support? I am not sure I am clear on the use case for "constraint violations for actors that live in Germany and are born before 1945" - what use case would produce such a query?

The use case is similar to the existing maintenance queries. A user wants to keep their domain or project clean and meet the 100% criterion.

  2. Do we need to have the data in WDQS at all? We have the MWAPI gateway, maybe we could just query a suitable API?

Storing it somewhere else would not let us scale as easily while staying flexible with the queries at the same time.

  3. Is this data set separate from Wikidata data, or does it need to be in the same namespace (depends on cross-querying needs)?

It needs to be in the same namespace; yes, cross-querying is needed.

  4. What is the data model (it would be nice to have a wiki page describing it)?

@Lucas_Werkmeister_WMDE, could you please provide a draft?

  5. How are the data updated - when does an update happen, what triggers it, which data are updated, how soon do we need the updates, etc.? Note that there is no external push write interface to the database, by design, and having one would involve significant security hurdles to clear - to ensure that only authorized clients can modify the data, and only the part of the data they are authorized to. As Blazegraph does not support users/roles and other access controls, we may have to find some solution for this.

When a constraint check is executed for an item, the result will be stored and the old result for that item will be deleted.
Access will only be allowed from within the cluster.

  6. How would these data be imported/reimported if a node is reimaged? Right now WDQS is designed as secondary data storage - i.e. it does not store any data which does not have a primary source, and it can be cleaned up and restored from external sources.

It will never be imported and it will never be complete.
It is just a snapshot.

When a constraint check is executed for an item, the result will be stored and the old result for that item will be deleted.

So, constraint checks are bound to existing items? I wonder if it would be possible to expose them in the same way page properties are exposed.

It will never be imported and it will never be complete.

It is just a snapshot.

But a snapshot of what? Right now we can (and do) delete the whole database, reimage any host, and restore it from the latest dump + RC feed + category dump. However, neither of those contains constraint check data. So how would this be managed?

Access will only be allowed from within the cluster.

Still, this means we need to create a non-local write interface that previously didn't exist, and put access controls (by IP or otherwise) on it. We will need to research how easy that would be...

So, constraint checks are bound to existing items? I wonder if it would be possible to expose them in the same way page properties are exposed.

Yes, kinda. The problem is they are very expensive, and it can take multiple seconds or even a minute for them to be executed.
That is why we cannot put them into the page properties, AFAIK.
We currently execute them when a logged-in user visits the item page.
It is done with a widget that calls the wbcheckconstraints API.
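
For reference, the request the widget sends is roughly of this form (Q42 is just a placeholder here, and the exact parameters may differ):

https://www.wikidata.org/w/api.php?action=wbcheckconstraints&format=json&id=Q42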

But a snapshot of what? Right now we can (and do) delete the whole database, reimage any host, and restore it from the latest dump + RC feed + category dump. However, neither of those contains constraint check data. So how would this be managed?

It is OK if they are deleted. They will be repopulated over time as users execute constraint checks.

So from what I understood, you would prefer a pull mechanism similar to recent changes. Is that true?
Pulling should not be a problem. We have an API, and we could notify you via the event bus.
The important thing would be that you only pull cached results, because calculating them is very expensive.
Does that sound good?

It is OK if they are deleted. They will be repopulated over time as users execute constraint checks.

That would only be true for the entities that have been visited by logged-in users since the deletion. It could take weeks or even months to have the required entities in the cache to be able to find and fix all the constraint violations "for actors that live in Germany and are born before 1945" using the queries. Even if the violations are stale for some time, it would be nice to have a job or something going through the entities, creating and caching reports for the entities that don't have one stored, at least in the place where WDQS will be pulling that info from.

Jonas added a comment (edited). Apr 25 2018, 6:30 PM

That would only be true for the entities that have been visited by logged-in users since the deletion. It could take weeks or even months to have the required entities in the cache to be able to find and fix all the constraint violations "for actors that live in Germany and are born before 1945" using the queries. Even if the violations are stale for some time, it would be nice to have a job or something going through the entities, creating and caching reports for the entities that don't have one stored, at least in the place where WDQS will be pulling that info from.

Sorry, I know it is not optimal, but would you rather wait for the ideal solution or have the querying now?
We can still fix the restore part later, and there are multiple workarounds we could use in the meantime...

Micru added a subscriber: Micru. Apr 26 2018, 7:09 AM

The data model would be fairly simple – a single triple, as I already mentioned.

wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A wikibase:hasViolationForConstraint wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301.

Both the subject and the object are statement nodes. The exact name of the predicate is still up for discussion. To keep the data model simple, I think we can squash constraint violations for the main snak, qualifiers, and references all into the same predicate (which is why I’m proposing “has violation” and not “violates” now), and people can see where exactly the violation is when they visit the entity.

Example query (the one in the task description: mandatory constraint violations for “IMDb ID” on actors living in Germany born before 1945):

SELECT ?item ?itemLabel ?constraintTypeLabel WHERE {
  wd:P345 p:P2302 ?constraint.
  ?constraint ps:P2302 ?constraintType;
              pq:P2316 wd:Q21502408.
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q33999;
        wdt:P551/wdt:P17 wd:Q183;
        wdt:P569 ?dob.
  FILTER(?dob < "1945-01-01"^^xsd:dateTime)
  ?item p:P345/wikibase:hasViolationForConstraint ?constraint.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

There are some more requested example queries in T172380: Query constraint violations with WDQS (I just made that task a parent task of this one).

abian added a subscriber: abian. May 1 2018, 12:52 PM
hoo added a subscriber: hoo. May 8 2018, 1:44 PM

After discussion with @Jonas, here's what we can do now without very major effort:

  1. We make the wbcheckconstraints API produce an RDF representation (also TBD) of the checks.
  2. We create a parameter for wbcheckconstraints to only deliver results if they can be delivered fast (e.g. already cached).
  3. The WDQS Updater, when updating an edited item, will also pull the API above, load the constraints data, and join it with the rest of the data.

This means it will only be updated when the item is edited, and only if the constraint check can prepare the cached data by the time the Updater gets to it. It also has a race condition: one server could hit Wikidata before the constraints are ready and another one after, so the servers would have different data. We will need to see whether that is a real concern in production. But we could at least try this as a prototype.
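
To make step 1 a bit more concrete (purely as a sketch, since the RDF representation is explicitly TBD): the per-item payload might contain little more than the violation triples proposed earlier in this task, which the Updater would load and later replace alongside the item's regular data, e.g.

wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A wikibase:hasViolationForConstraint wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301 .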

This means it will only be updated … if the constraint check can prepare the cached data by the time the Updater gets to it.

The updater usually doesn’t take more than a few seconds to reach an item, right? I’m skeptical whether this will be possible…

Also, the updater needs to learn how to remove the old constraints data. (I guess it already knows how to remove other old data from the item, so hopefully that shouldn’t be too difficult.)

The updater usually doesn’t take more than a few seconds to reach an item, right?

Yes.

I’m skeptical whether this will be possible…

Then we need a different model.

Also, the updater needs to learn how to remove the old constraints data.

That's not a problem, we do the same thing for the rest of the RDF data.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. May 8 2018, 8:18 PM
Multichill added a subscriber: Multichill.
Lucas_Werkmeister_WMDE set the point value for this task to 8. May 15 2018, 1:39 PM

Change 434015 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] [WIP] [DNM] Add loading constraints data

https://gerrit.wikimedia.org/r/434015

Smalyshev moved this task from Next to Doing on the User-Smalyshev board. May 20 2018, 10:27 AM
Smalyshev moved this task from Doing to Next on the User-Smalyshev board. May 29 2018, 7:49 PM
Smalyshev moved this task from Next to Doing on the User-Smalyshev board. Wed, Jun 27, 11:53 PM
Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board.
Smalyshev moved this task from Doing to In review on the User-Smalyshev board. Sat, Jun 30, 5:02 AM

Change 434015 merged by jenkins-bot:
[wikidata/query/rdf@master] Add loading constraints data

https://gerrit.wikimedia.org/r/434015

Change 445454 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable fetching constraints for Updater

https://gerrit.wikimedia.org/r/445454