
Constraint checks assume that query service has latest data
Open, Needs Triage, Public

Description

As a Wikidata user, I want constraint check results to be up to date immediately after an edit.
As Wikidata developers, we want to enable running constraint checks after each edit, and store the result persistently (T204024).

Problem:
Some queries in SparqlHelper (Extension:WikibaseQualityConstraints) assume that the query service already has the latest entity data. This is not true if a constraint check is run immediately after an edit, before the query service updater has run. More to the point, we would eventually like to run (almost) all constraint checks before the query service updater runs, so that the updater can pull in the results and people can query for constraint violations in the query service.

Example:
I believe this affects several queries:

hasType()
ASK {
  BIND(wd:$id AS ?item)
  VALUES ?class { $classesValues }
  ?item wdt:$subclassOfId* ?class.$gearingHint
}

This should probably instead get the item’s classes from the data we have, and query if any of them are subclasses of the target classes.
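
A rough sketch of what such a query could look like, in the same template style as above. $itemClassesValues is a hypothetical placeholder (not an existing SparqlHelper variable) for the classes read from the statements of the entity data we already hold. Whether $gearingHint still helps when there are several starting nodes is untested.

```sparql
ASK {
  # Hypothetical: classes taken from the entity data we already have,
  # not looked up in the query service.
  VALUES ?itemClass { $itemClassesValues }
  VALUES ?class { $classesValues }
  # Only the class hierarchy is read from the query service.
  ?itemClass wdt:$subclassOfId* ?class. $gearingHint
}
```

With this shape the item itself never appears in the query, so a stale copy of it in the query service no longer matters (the class hierarchy can of course still be stale).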

findEntitiesWithSameStatement()
SELECT DISTINCT ?otherEntity WHERE {
  BIND(wds:$guid AS ?statement)
  BIND(p:$pid AS ?p)
  BIND(ps:$pid AS ?ps)
  ?entity ?p ?statement.
  ?statement ?ps ?value.
  ?otherStatement ?ps ?value.
  ?otherEntity ?p ?otherStatement.
  FILTER(?otherEntity != ?entity)
  $deprecatedFilter
}
LIMIT 10

(See below.)

findEntitiesWithSameQualifierOrReference()
SELECT DISTINCT ?otherEntity WHERE {
  BIND(wd:$eid AS ?entity)
  BIND($value AS ?value)
  ?entity ?p ?statement.
  ?statement $path ?value.
  ?otherStatement $path ?value.
  ?otherEntity ?otherP ?otherStatement.
  FILTER(?otherEntity != ?entity)
  $deprecatedFilter
}
LIMIT 10

This one serializes the value into RDF, but then also links it to the subject in the query (?entity ?p ?statement. ?statement $path ?value.). I believe the reason for this is historical. If memory serves, findEntitiesWithSameStatement() was implemented first, and we didn’t want to (re)implement the serialization at the time. So instead of querying for “any other item with value X”, we queried for “any other item with value X, which is the P123 value of item Y”, where Y is the item whose constraints we are checking. In other words, we used that item as a way to bind the value we were looking for to a SPARQL variable, rather than writing out that value as a literal, because there are many different value types and writing code to serialize each of them as a literal seemed cumbersome. Later, we needed to implement findEntitiesWithSameQualifierOrReference() as well, and at that point we wrote the code that writes out a value as a literal, but we also kept the other part of the query, which links the value to item Y. Nowadays I think we should just remove that part.

We should also reconsider whether we really want to keep SparqlHelper::getRdfLiteral(), which I now think is effectively a partial reimplementation of Wikibase’s own RDF export; for example, it seems to me that you’ll get an exception if you try to have a “distinct values” constraint on a Math property.
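
For illustration, findEntitiesWithSameQualifierOrReference() without the part that links the value to the checked entity might look roughly like this. It is only a sketch built from the placeholders of the template above; it assumes $deprecatedFilter refers only to ?otherStatement, and it has not been run against the query service.

```sparql
SELECT DISTINCT ?otherEntity WHERE {
  # The value is written out as a literal, so the checked entity's own
  # statements are not needed to bind ?value.
  BIND($value AS ?value)
  ?otherStatement $path ?value.
  ?otherEntity ?otherP ?otherStatement.
  # The checked entity is only used to exclude itself from the results.
  FILTER(?otherEntity != wd:$eid)
  $deprecatedFilter
}
LIMIT 10
```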

As for findEntitiesWithSameStatement(), it still doesn’t write out the value as a literal and only gets the value through its relationship to the item; I think we should fix that as well. The salient point is that Wikibase is perfectly capable of writing out values as literals (Turtle literals, strictly speaking, but SPARQL and Turtle literals are sufficiently similar), so there’s really no good reason to avoid doing that.
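
A possible shape for findEntitiesWithSameStatement() once the value is written out as a literal, again only as an untested sketch. $value (the serialized literal) and $eid (the ID of the checked entity) are assumed parameters here, since the current template only receives $guid and $pid; $deprecatedFilter is assumed to refer only to ?otherStatement.

```sparql
SELECT DISTINCT ?otherEntity WHERE {
  # Assumed: $value is the main snak value serialized as an RDF literal or IRI.
  BIND($value AS ?value)
  ?otherStatement ps:$pid ?value.
  ?otherEntity p:$pid ?otherStatement.
  # Assumed: $eid is the entity whose constraints are being checked.
  FILTER(?otherEntity != wd:$eid)
  $deprecatedFilter
}
LIMIT 10
```

Neither the statement GUID nor any other data of the checked entity has to exist in the query service for this version to work, which is what the acceptance criterion asks for.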

Acceptance criteria:

  • SparqlHelper’s SPARQL queries no longer rely on the data of the item whose constraints are being checked.
