Page MenuHomePhabricator

Update `UniqueValueChecker` to query a list of endpoints
Closed, ResolvedPublic

Description

Problem
With the current work going on around testing the graph split, we investigated which of the constraint checks that are using SPARQL queries are affected.

In an investigation (see T355298 for details) we identified that if we do nothing, UniqueValueChecker will produce some false negatives.

So in order for this to run as required, we need to make UniqueValueChecker query a list of endpoints (and configure that list in production).

Acceptance Criteria

  • UniqueValueChecker runs as normal and produces no false negatives after the graph split

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change #1058598 had a related patch set uploaded (by Audrey Penven; author: Audrey Penven):

[mediawiki/extensions/WikibaseQualityConstraints@master] UniqueValueChecker queries additional endpoints

https://gerrit.wikimedia.org/r/1058598

Change #1058598 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] UniqueValueChecker queries additional endpoints

https://gerrit.wikimedia.org/r/1058598

The code is updated and should hopefully work; once we actually want to use the split graph servers, we’ll still need to update the production config accordingly. (Which might not be 100% trivial, by the way – currently, wgWBQualityConstraintsSparqlEndpoint is set to http://localhost:6009/sparql, not to a public URL like https://query.wikidata.org/sparql. So I’m guessing we’ll need to configure additional envoy proxies?) But at the moment, I don’t think the servers are up and running yet, see T364363.

I’m not sure where this belongs on the board in the meantime. Maybe waiting for deployment window?

Oh, or I guess it should stay in peer review (or in development?) while the strict_types patch is still open?

Change #1062965 had a related patch set uploaded (by Audrey Penven; author: Audrey Penven):

[mediawiki/extensions/WikibaseQualityConstraints@master] Use strict_types in recently modified files

https://gerrit.wikimedia.org/r/1062965

Change #1062965 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Use strict_types in recently modified files

https://gerrit.wikimedia.org/r/1062965

I think it makes the most sense to just close this task and make a separate one for the deployment (without which the task can’t really be verified): T374021: Make WikibaseQualityConstraints use split-graph query service

dcausse subscribed.

Re-opening because the approach taken might not work if the duplicates are spread across the list of endpoints, please see T374021#10458231 for a possible solution.

Hm, I think one consequence of the current broken query is also that distinct-values constraints are not checked on Commons (or rather, never return a result). Depending on how we implement the code changes, it might start to return Wikidata entities with the same value on Commons constraint check – but, unless we configure an authentication-less WCQS endpoint in the production config, it can never find other Commons (MediaInfo) entities.

My feeling is that such results would almost certainly not be useful. (Finding other Commons entities with the same value might be useful in some cases, though that’s maybe also worth discussing on-wiki.) So for now I think it might be best to completely disable the distinct-values constraint type on Commons, by mapping it to a non-existing item ID (similar to what we do for type constraints – see wgWBQualityConstraintsTypeConstraintId in IS.php). What do you think?

Hm, I think one consequence of the current broken query is also that distinct-values constraints are not checked on Commons (or rather, never return a result). Depending on how we implement the code changes, it might start to return Wikidata entities with the same value on Commons constraint check – but, unless we configure an authentication-less WCQS endpoint in the production config, it can never find other Commons (MediaInfo) entities.

My feeling is that such results would almost certainly not be useful. (Finding other Commons entities with the same value might be useful in some cases, though that’s maybe also worth discussing on-wiki.) So for now I think it might be best to completely disable the distinct-values constraint type on Commons, by mapping it to a non-existing item ID (similar to what we do for type constraints – see wgWBQualityConstraintsTypeConstraintId in IS.php). What do you think?

Good catch, I completely overlooked that constraint checks are running on commons too... I have the impression that we never expected this check to actually work on commons otherwise we'd have created an internal endpoint for it. So I'm all for explicitly disabling it to avoid future confusion.

Change #1112019 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/mediawiki-config@master] Disable distinct-values constraint checks on Commons

https://gerrit.wikimedia.org/r/1112019

Change #1112019 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable distinct-values constraint checks on Commons

https://gerrit.wikimedia.org/r/1112019

Mentioned in SAL (#wikimedia-operations) [2025-01-16T15:07:50Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1112019|Disable distinct-values constraint checks on Commons (T369079)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-16T15:13:36Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde: Backport for [[gerrit:1112019|Disable distinct-values constraint checks on Commons (T369079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-16T15:18:42Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1112019|Disable distinct-values constraint checks on Commons (T369079)]] (duration: 10m 51s)

Change #1112198 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseQualityConstraints@master] WIP: Replace distinct-values SPARQL queries

https://gerrit.wikimedia.org/r/1112198

Change #1112242 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseQualityConstraints@master] Improve SPARQL query construction in SparqlHelper

https://gerrit.wikimedia.org/r/1112242

Unfortunately it looks like T369079 wasn’t enough, we’ll have to patch the code some more. Currently, the queries in findEntitiesWithSameStatement() and findEntitiesWithSameQualifierOrReference() assume that the subject entity can be found in the graph that we’re querying:

SELECT DISTINCT ?otherEntity WHERE {
  BIND(wds:$guidForRdf AS ?statement)
  BIND(p:$pid AS ?p)
  BIND(ps:$pid AS ?ps)
  ?entity ?p ?statement.
  ?statement ?ps ?value.
  ?otherStatement ?ps ?value.
  ?otherEntity ?p ?otherStatement.
  FILTER(?otherEntity != ?entity)
  MINUS { ?otherStatement wikibase:rank wikibase:DeprecatedRank. }
  $finalSeparatorFilter
}
LIMIT 10

I believe it was originally done this way so that we didn’t have to implement serializing the statement value into the query (though later I added getRdfLiteral() anyway, which we should definitely remove and instead use Wikibase’s RdfBuilder infrastructure). But this doesn’t work for a split query service. We need to refactor the query to not rely on the original entity, and instead directly serialize the statement value into it after all.

It turns out I already wrote much the same thing some four years ago: T269859: Constraint checks assume that query service has latest data

Also, in the above WIP change, I’m trying to construct queries that refer to value nodes by their hash (for the three data value types that get value nodes: globe coordinate, quantity, time), but I’ve now discovered that Wikibase and WDQS disagree about the prefix for those value nodes (v: vs. wdv:), which is going to make it hard to correctly generate those SPARQL queries: T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes

To test this locally, I used this code (in the dev tools console) to create a set of properties, one per data type:

1mw.loader.using( 'mediawiki.api' ).then( async () => {
2 let api = new mw.Api();
3 for ( const dataType of [
4 'commonsMedia',
5 'entity-schema',
6 'external-id',
7 'wikibase-form',
8 'globe-coordinate',
9 'geo-shape',
10 'wikibase-item',
11 'wikibase-lexeme',
12 'math',
13 'monolingualtext',
14 'musical-notation',
15 'time',
16 'wikibase-property',
17 'quantity',
18 'wikibase-sense',
19 'string',
20 'tabular-data',
21 'url',
22 ] ) {
23 await api.postWithEditToken( {
24 action: 'wbeditentity',
25 new: 'property',
26 data: JSON.stringify( {
27 type: 'property',
28 datatype: dataType,
29 labels: { en: { language: 'en', value: `unique ${dataType} property` } },
30 aliases: { en: [ { language: 'en', value: `${dataType} unique property` } ] },
31 } ),
32 } );
33 }
34} ).catch( console.error );

I then (manually) created two items with the same values for all those properties, loaded a full RDF export of my wiki into a local Blazegraph, configured WBQC to query that blazegraph, and checked constraints on one of those items via Special:ConstraintReport;
image.png (652×1 px, 343 KB)

Change #1112242 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Improve SPARQL query construction in SparqlHelper

https://gerrit.wikimedia.org/r/1112242

Change #1112198 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@master] Replace distinct-values SPARQL queries

https://gerrit.wikimedia.org/r/1112198

We’ll see if this works better than the previous version when we try to make WBQC use the split graph service again.

Change #1126949 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.20] Improve SPARQL query construction in SparqlHelper

https://gerrit.wikimedia.org/r/1126949

Change #1126950 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.20] Replace distinct-values SPARQL queries

https://gerrit.wikimedia.org/r/1126950

Change #1126951 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.19] Improve SPARQL query construction in SparqlHelper

https://gerrit.wikimedia.org/r/1126951

Change #1126952 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.19] Replace distinct-values SPARQL queries

https://gerrit.wikimedia.org/r/1126952

Change #1126949 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.20] Improve SPARQL query construction in SparqlHelper

https://gerrit.wikimedia.org/r/1126949

Change #1126950 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.20] Replace distinct-values SPARQL queries

https://gerrit.wikimedia.org/r/1126950

Change #1126951 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.19] Improve SPARQL query construction in SparqlHelper

https://gerrit.wikimedia.org/r/1126951

Change #1126952 merged by jenkins-bot:

[mediawiki/extensions/WikibaseQualityConstraints@wmf/1.44.0-wmf.19] Replace distinct-values SPARQL queries

https://gerrit.wikimedia.org/r/1126952

Mentioned in SAL (#wikimedia-operations) [2025-03-12T14:42:25Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-12T14:45:30Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-12T14:55:24Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1126949|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126950|Replace distinct-values SPARQL queries (T369079)]], [[gerrit:1126951|Improve SPARQL query construction in SparqlHelper]], [[gerrit:1126952|Replace distinct-values SPARQL queries (T369079)]] (duration: 12m 58s)