Page MenuHomePhabricator

Support "separators" parameter for distinct value constraints
Closed, ResolvedPublic8 Estimated Story Points

Description

As an editor I want to not have unique value constraint violations for some specific cases in order to not trigger unnecessary constraint violations.

Problem:
The unique value constraint currently doesn't allow for a separator parameter like the single-value constraint does. This is useful for cases where two Items can intentionally share the same external identifier and those identifiers are separated by a specific qualifier.

Example:

BDD
GIVEN two entities with the same external identifiers for the same external identifier Property
AND the Property has a unique value constraint
AND the unique value constraint has a separator qualifier
WHEN the external identifier statements have a qualifier with the right separator Property
AND the value for the separator qualifier is different on the entities
THEN no constraint violation is triggered

Acceptance criteria:

  • unique value constraints can take into account a separator to not trigger constraint violations in some cases
  • this works on all entity types, not just Items

Notes:

  • We consider it enough if either of the statements has the separating qualifier on it

Original report:
There is often not quite a one-to-one mapping between entities and external identifiers which causes entities to create constraint violations for the single value and distinct value constraints.

When a single entity has multiple identifiers, we can use the separator property on the constraint to specify which qualifiers can be used to make the entity not trigger a constraint violation (see T173594). I would like to have a similar thing for the distinct value constraint which would allow us to use qualifiers like identifier shared with on the statements instead of having to list the entities as exceptions to the constraint.

Event Timeline

I’m not convinced by the Duden ID example – “identifier shared with” isn’t really semantically a separator, in my opinion – but there are some other properties with separators on distint-values constraints already, some of which make more sense to me, such as mobile country code (separated by start time, end time, point in time) or PRONOM file format identifier (separated by software version).

I created this ticket is because I want a way to resolve the constraint violations for the Duden ID property. I don't care whether we use the existing "separator" property (renaming it if necessary) or whether we add a new property which does the same thing, but I think a proposal for a new property would probably be rejected.

I don't understand what the semantic distinction you're making is - why would PRONOM make sense if Duden doesn't?

It could be interesting to get this to work for identifiers that are re-assigned, e.g. ISO country codes or airport codes.

Start/end date qualifiers differentiate them. An "identifier shared with" qualifier could work too.

Addshore set the point value for this task to 8.Aug 18 2021, 10:43 AM

Some tech notes:

First of all, the distinct values constraint type is supported on qualifiers and references (T175561), but I assume the separators parameter only applies on the main snak.

I’m also not sure how this should best be implemented. I can see two options: in SPARQL, or after SPARQL.

Currently, we find items with the same value using a SPARQL query (SparqlHelper::findEntitiesWithSameStatement()), with a hard-coded LIMIT 10: the constraint violation message will display at most ten other items with the same value. (IIRC, lists of items in constraint violation messages also generally get truncated at some number, possibly 10 as well. Maybe the SPARQL limit should be 11 instead of 10 so that we can distinguish between “exactly 10” and “more than 10”…) We could take those up to 10 other items, load them in PHP, and check if they have different qualifiers for the separator property; if yes, then there’s no violation. The problem with this is that there could be many items with the same value but different qualifiers, and also some items that are genuine constraint violations, but unless those few items happen to be returned in the first 10 query results, we won’t find them. (On the bright side, this is a false positive – a missing constraint violation – which is the preferable kind of error here. A false negative, constraint violation when everything’s actually okay, is worse.)

The other option would be to check the separators directly in the SPARQL query; the downside is that this makes the query more complicated, especially if there are several separator properties. I feel like it will be pretty tricky to ensure it behaves exactly the same as the PHP implementation of separators (which is in MainSnakContext::getSnakGroup() and related methods).

Hm, maybe it’s possible in SPARQL after all. I think adding this snippet for each separator property ID ought to be equivalent to the MainSnakContext methods (in particular haveSameQualifiers):

MINUS {
  ?statement pq:$separator ?qualifier.
  MINUS {
    ?otherStatement pq:$separator ?qualifier.
  }
}
MINUS {
  ?otherStatement pq:$separator ?qualifier.
  MINUS {
    ?statement pq:$separator ?qualifier.
  }
}
MINUS {
  ?statement a wdno:$separator.
  MINUS {
    ?otherStatement a wdno:$separator.
  }
}
MINUS {
  ?otherStatement a wdno:$separator.
  MINUS {
    ?statement a wdno:$separator.
  }
}

That is, if the current statement has a non-“no value” qualifier and the other statement doesn’t have the same qualifier, or vice versa, or the current statement has a “no value” qualifier and the other statement doesn’t, or vice versa, then the statements are considered different, are not returned from the query, and don’t generate a violation. (“Unknown value” is always considered to be different from other “unknown value”, both in the query service and in the constraints extension, so an “unknown value” qualifier automatically means the statement is separated from all other statements, which is what we want.)

I haven’t tested this query yet; in particular, we’d need to make sure if nested MINUS works this way. Or maybe use FILTER NOT EXISTS instead.

I guess the second half, for “no value”, could also be:

FILTER(EXISTS { ?statement a wdno:$separator. }
    == EXISTS { ?otherStatement a wdno:$separator. })

Change 715707 had a related patch set uploaded (by Tobias Andersson; author: Tobias Andersson):

[mediawiki/extensions/WikibaseQualityConstraints@master] Add separator

https://gerrit.wikimedia.org/r/715707

Change 715707 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Add separators to UniqueValueChecker

https://gerrit.wikimedia.org/r/715707