
[SPIKE] Making constraint violations queryable on the Query Service
Open, Needs Triage, Public

Description

Currently, constraint violations cannot be queried, as they require too much computation to be registered in the query service when an edit is made.

One proposed solution is to store WikibaseQualityConstraints check data in persistent storage instead of in the cache (T204024 and T214362).

But as this solution was proposed in 2018, we should have an investigation that looks into the problem space and into whether there are any other approaches to get constraint violations onto the Query Service.

Acceptance Criteria

  • A well prepared meeting which reports on the problem space and any proposed solutions
  • Follow up tasks created and attached as subtasks

Note: One developer should prepare a summary of the situation to then bring to a group session. During the group session we'll identify which options will need (or should get) additional investigation, and an additional timebox will be created for this.

Timebox for initial summary: 2-4 hours?

Event Timeline

Arian_Bozorg renamed this task from Making constraint violations queryable on the Query Service to [SPIKE] Making constraint violations queryable on the Query Service. Thu, Feb 5, 11:54 AM

Meeting scheduled for Wednesday Feb 18

Notes for discussion on 18.02.2026

the current situation

  • WikibaseQualityConstraints
    • stores the results of its checks in a cache, and seemingly *only* in a cache. There is no guarantee about how long they remain in the cache
    • Constraints are queryable by item id or claim id using the wbcheckconstraints api. In this api call, it first checks the cache. If nothing is found (or out of date info is found? unclear to me) it performs checks for the specified items/claims and stores the results in the cache.
    • constraint checks are triggered by edits to an item and its statements
    • presumably constraint checks are also triggered by changes to constraint statements themselves on properties. For this to be complete, all statements for the affected property must be found and then checked.
    • there is also an api for checking the parameters of constraints: wbcheckconstraintparameters
  • Query Service
    • it's not clear to me exactly how data becomes queryable by the query service, but this statement from [T214362] suggests that dumps are a crucial step in how it happens: "Dump constraint check data from the persistent storage to allow for dumping to file and loading into WDQS."
  • Constraint report
    • This report is generated by a bot, and is used by editors to find constraint violations to fix: Wikidata:Database reports/Constraint violations. It may provide some insight into how users will want to query constraints in the query service.
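To make the shape of the wbcheckconstraints data more concrete, here is a minimal sketch of pulling violations out of an API response. The nesting (entity id → claims → mainsnak → results) and the field names are assumptions based on the API's general structure, not a verified schema:

```python
# Hedged sketch: extract violations from a wbcheckconstraints-style response.
# The exact response shape assumed here (entity id -> claims -> mainsnak ->
# results) is for illustration only; consult the API's own documentation.

def extract_violations(response):
    """Return (entity_id, property_id, constraint_id) for each violation."""
    violations = []
    for entity_id, entity in response.get("wbcheckconstraints", {}).items():
        for property_id, statements in entity.get("claims", {}).items():
            for statement in statements:
                for result in statement.get("mainsnak", {}).get("results", []):
                    if result.get("status") == "violation":
                        violations.append(
                            (entity_id, property_id, result["constraint"]["id"])
                        )
    return violations

# Minimal hand-made example response (hypothetical ids):
sample = {
    "wbcheckconstraints": {
        "Q42": {
            "claims": {
                "P31": [{"mainsnak": {"results": [
                    {"status": "violation", "constraint": {"id": "P31$guid-1"}},
                    {"status": "compliance", "constraint": {"id": "P31$guid-2"}},
                ]}}]
            }
        }
    }
}
violations = extract_violations(sample)
```

Whatever the real schema turns out to be, this kind of flattening step is what a table-population job (discussed below) would need to do.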

my thoughts on this so far
It seems to me that if we want a persistent store of constraint violations, a database table makes sense. There's a note in the description that we should investigate whether there are other approaches we should explore. It's not at all clear to me what those other approaches could or should be. That said, I didn't manage to wrap my head around how the Query Service gets its data, so maybe there is something obvious I'm missing.

The following section maps out what I imagine the next steps are, assuming that we do want to add a database table for constraint violations.

where do we go from here?
deal with the database

  • define the schema of a new table
  • write tickets for the steps for doing a migration, both technically and process-wise. (this page looks helpful: wikitech: Schema changes)
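As a starting point for the schema discussion, here is a sketch of what such a table could look like. SQLite is used only to make the example self-contained (production would presumably be MariaDB), and every table and column name here is a hypothetical suggestion, not a settled design:

```python
import sqlite3

# Hypothetical schema sketch for the proposed violations table. SQLite is used
# for a runnable illustration; names and columns are assumptions, not a design.
schema = """
CREATE TABLE wbqc_constraint_violations (
    wcv_id            INTEGER PRIMARY KEY,
    wcv_entity_id     TEXT NOT NULL,      -- e.g. 'Q42'
    wcv_statement_id  TEXT NOT NULL,      -- GUID of the violating statement
    wcv_constraint_id TEXT NOT NULL,      -- id of the constraint definition
    wcv_revision_id   INTEGER NOT NULL,   -- entity revision the check ran on
    wcv_checked_at    TEXT NOT NULL       -- timestamp of the check
);
CREATE INDEX wcv_entity ON wbqc_constraint_violations (wcv_entity_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
conn.execute(
    "INSERT INTO wbqc_constraint_violations "
    "(wcv_entity_id, wcv_statement_id, wcv_constraint_id, wcv_revision_id, wcv_checked_at) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Q42", "Q42$abc-123", "P31$def-456", 1234567, "2026-02-18T00:00:00Z"),
)
rows = conn.execute(
    "SELECT wcv_entity_id, wcv_statement_id FROM wbqc_constraint_violations"
).fetchall()
```

Storing the revision id is one possible way to give the query service the per-revision idempotency mentioned later in the discussion; whether that column is needed is itself an open question.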

sort out making the constraints accessible to the query service

  • investigate how exactly data is made accessible to the query service (or someone who already knows this can explain to the rest of us)
  • define the steps necessary to take data from a database table and make it accessible to the query service (does this happen automatically if we have a table?)

populate the new table with all existing constraint violations

  • make a plan and tickets for how to do this.
    • I imagine it's something like: run constraint checks for every item (the api limits to 50 item ids at once), transform the results to match the table schema, and insert those into the table
    • open questions: Do we have precedents for this? Where does such a thing run? How long would it take? How would it handle edits that occur (potentially changing the violations) while it's running?
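The batching step above can be sketched as follows; the 50-id limit matches the API restriction mentioned in the bullet, and the helper name is just illustrative:

```python
# Hedged sketch of the backfill batching: wbcheckconstraints accepts at most
# 50 item ids per request, so a full run must chunk the id list.

def batches(item_ids, batch_size=50):
    """Yield successive chunks of at most batch_size ids."""
    for i in range(0, len(item_ids), batch_size):
        yield item_ids[i:i + batch_size]

item_ids = [f"Q{n}" for n in range(1, 121)]  # 120 hypothetical item ids
chunks = list(batches(item_ids))             # -> batches of 50, 50, 20
```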

modify WBQC to use the database instead of (just) the cache

  • determine if it should still use the cache (I assume yes), and if anything about the cache needs to change when there's also a persistent store
  • when a constraint check runs, save the result to the database
  • when querying constraint violations, use the database (or cache + database)
  • plan backwards-compatibility in the code rollout. The code changes likely need to initially work both with and without the database table in place; the backwards-compatibility code can then be removed after the database migration is complete.

I can add a little bit:

Constraints are queryable by item id or claim id using the wbcheckconstraints api. In this api call, it first checks the cache. If nothing is found (or out of date info is found? unclear to me) it performs checks for the specified items/claims and stores the results in the cache.

Yes, there’s a check whether the data is outdated or not.

presumably constraint checks are also triggered by changes to constraint statements themselves on properties. For this to be complete, all statements for the affected property must be found and then checked.

Nothing is proactively triggered here, but the “outdated” check includes whether any of the constraints were edited since the cached result was stored. So the next wbcheckconstraints check that comes in (whenever it happens) will discard the cached data for that item and recheck the constraints with the new constraint definition.
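A minimal sketch of that staleness logic, assuming the cache records when a result was stored and compares it against last-modified markers for both the entity and its constraint definitions (the names and types here are illustrative, not the actual WBQC implementation):

```python
from dataclasses import dataclass

# Hedged sketch of the "outdated" check described above. Field names and the
# timestamp comparison are assumptions, not the actual WBQC implementation.

@dataclass
class CachedResult:
    value: dict
    stored_at: int  # time (or revision counter) when the result was cached

def is_outdated(cached, entity_last_modified, constraints_last_modified):
    # Discard the cached result if either the entity itself, or any of the
    # constraint definitions it was checked against, changed after caching.
    return (entity_last_modified > cached.stored_at
            or constraints_last_modified > cached.stored_at)

cached = CachedResult(value={}, stored_at=100)
stale_after_constraint_edit = is_outdated(cached, 90, 120)  # constraint edited later
still_fresh = is_outdated(cached, 90, 95)                   # nothing changed since caching
```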

it's not clear to me exactly how data becomes queryable by the query service, but this statement from [T214362] suggests that dumps are a crucial step in how it happens: "Dump constraint check data from the persistent storage to allow for dumping to file and loading into WDQS."

The old system was that, whenever the query service updater got new data for an edited item, it would also check ?action=constraintsrdf. This action would check whether up-to-date constraint check results were in the cache and, if so, return them (otherwise it would return nothing rather than do a fresh constraint check, so that the action stays fast). So this was a best-effort system. To somewhat improve the chance that the query service updater would see cached data that could be added to the query service, some percentage of edits would trigger a constraint check in the backend. However, the query service parts of this were removed some years ago as part of the move to a new updater, so ?action=constraintsrdf still exists but is currently unused.

This report is generated by a bot

Just to spell it out: this bot is completely separate from our implementation and has nothing to do with WBQC. (It predates WikibaseQualityConstraints, but also is missing some features WBQC has, such as checking constraints on qualifiers and references.)

The query service likes (needs?!) data to be idempotent for a certain revision of an entity (for facilitating updates by difference).

it's not clear to me exactly how data becomes queryable by the query service

Several years ago we just pushed some triples (using wikibase:hasViolationForConstraint) to WDQS from cached constraint check results after every edit via the old updater (though constraint checks were not always run after every edit before August 2021, so the old WDQS updater might read an outdated cached result or no cached result at all). This was added in T192567 (2018) and removed in T274982 (2021).
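For illustration, generating such a triple might look like the sketch below. The predicate name comes from the comment above; the statement-node URI prefix and the id forms are assumptions, not the exact RDF mapping:

```python
# Hedged sketch: emit a wikibase:hasViolationForConstraint triple linking a
# statement node to the constraint-definition statement node. The URI prefix
# and id forms are assumptions for illustration only.
WDS = "http://www.wikidata.org/entity/statement/"

def violation_triple(statement_id, constraint_id):
    return (
        f"<{WDS}{statement_id}> "
        f"wikibase:hasViolationForConstraint "
        f"<{WDS}{constraint_id}> ."
    )

triple = violation_triple("Q42-abc-123", "P31-def-456")
```

A violations table that stores just the two statement ids per row would be enough to regenerate triples of this shape, which is relevant to the "smaller table" idea in the meeting notes below.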

we want a persistent store of constraint violations, a database table makes sense.

Note this new table will have 155 million rows for a 2021 version of Wikidata (and even larger today). See T201150#7355393

The query service likes (needs?!) data to be idempotent for a certain revision of an entity (for facilitating updates by difference).

Constraint definitions on properties may change, so we still need to periodically rerun constraint checks on all items. See T201150#7351510. (Since the total number of items to check is huge, this would be better accomplished via something like Hadoop.)

meeting notes:

additional context

  • Data gets into the query service via the Query Service Updater
  • The updater used to import constraint violation data. The results were inconsistent and this was turned off
  • The assumption about when constraint violations for an item can change is incorrect. Violations can arise, resolve, or change for an unedited item based on changes to a completely different item. An example is the unique-values constraint, where a change anywhere in Wikidata could affect any other item
  • how are dumps involved? While changes are continually ingested into the query service via the updater, things sometimes go wrong, making it necessary to reload the data. This happens roughly once or twice a year, and the process can take about a week. Therefore, there will need to be dumps of violation data.
  • most checks are done in php without involving the query service at all
  • Having perfect, complete, and up-to-date violation info for all statements is not really possible. So, anything we do here will have to be a best-effort kind of thing.

ideas, thoughts, discussion

  • Ideally it would be possible to update violations in the query service without updating the item. For this to be possible the query service updater would need to know more than just the last time the item was edited. Use timestamps maybe?
  • idea for the database table: query service doesn't need the full results. Doesn't need the message or anything. Database table could just contain the triples and be a lot smaller. What it wouldn't have is all the caching information, but we wouldn't really want this. Might not need anything other than statement ids. (explored further here: [T417758#11627241])
  • List of all the constraints? Does it make sense to classify them in terms of ease of doing this? -> probably not. We want to build something that works for all constraint types

further investigation needed:

next steps