
Write down Lucas’ proposal for storing constraints and making them queryable on the query service
Open, Needs Triage, Public

Description

In the meeting mentioned in T416579#11625247, Lucas had a half-baked idea; bake that a bit more and write it down.

Event Timeline

My thinking is that we have two different “consumers” of constraint violations: human visitors (editors) of Wikidata, and the query service updater (which eventually makes the results available in the query service, but it’s more useful to think about it in the updater’s terms).

And I think they have quite different requirements: humans would like to see the latest results for an entity, with all the details in the constraint violation message; the query service updater needs consistent results (so that it can produce useful RDF diffs), but it only needs pairs of statement IDs (for the ?statement wikibase:hasViolationForConstraint ?constraintStatement triples) and not the rest of the information (unless we want to make more information available in the query service than before).

And we should tailor the new data structure (database table) towards the needs of the updater, as we already have a working system for humans.

This leads me to suggest the following table schema:

CREATE TABLE /*_*/wbqc_violations (
  wbqcv_entity_id VARBINARY(255) NOT NULL, -- type matches wbc_entity_usage.eu_entity_id
  wbqcv_timestamp BINARY(14) NOT NULL, -- type matches revision.rev_timestamp
  wbqcv_statement_id VARBINARY(63) NOT NULL, -- type matches wbqc_constraints.constraint_guid
  wbqcv_constraint_statement_id VARBINARY(63) NOT NULL, -- type matches wbqc_constraints.constraint_guid
  INDEX wbqcv_timestamp (wbqcv_timestamp),
  PRIMARY KEY(wbqcv_entity_id, wbqcv_timestamp, wbqcv_statement_id, wbqcv_constraint_statement_id) -- or a wbqcv_row_id BIGINT AUTO_INCREMENT NOT NULL?
) /*$wgDBTableOptions*/;
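To illustrate how the table would be used, here is a sketch using SQLite in place of MariaDB (column types simplified to TEXT; the statement GUIDs are made-up examples):

```python
import sqlite3

# Simplified SQLite stand-in for the proposed wbqc_violations table
# (MariaDB types like VARBINARY reduced to TEXT for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wbqc_violations (
        wbqcv_entity_id TEXT NOT NULL,
        wbqcv_timestamp TEXT NOT NULL,  -- MW timestamp, e.g. 20240101123456
        wbqcv_statement_id TEXT NOT NULL,
        wbqcv_constraint_statement_id TEXT NOT NULL,
        PRIMARY KEY (wbqcv_entity_id, wbqcv_timestamp,
                     wbqcv_statement_id, wbqcv_constraint_statement_id)
    )
""")

# Each constraint check appends one row per violation, keyed by the
# time of the check, never overwriting earlier results.
conn.execute(
    "INSERT INTO wbqc_violations VALUES (?, ?, ?, ?)",
    ("Q42", "20240101123456",
     "Q42$11111111-2222-3333-4444-555555555555",
     "P31$aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"),
)

# The updater only ever asks for the rows belonging to one entity
# at one specific check timestamp.
rows = conn.execute(
    "SELECT wbqcv_statement_id, wbqcv_constraint_statement_id"
    " FROM wbqc_violations"
    " WHERE wbqcv_entity_id = ? AND wbqcv_timestamp = ?",
    ("Q42", "20240101123456"),
).fetchall()
print(len(rows))  # 1
```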

The contents of this table are not sufficient to show violations to humans, so the existing cache of constraint check results will remain in place. Whenever constraints are checked (e.g. because a human visited an item and a wbcheckconstraints request was triggered, or because CheckConstraintsJob ran after an edit based on $wgWBQualityConstraintsEnableConstraintsCheckJobs + $wgWBQualityConstraintsEnableConstraintsCheckJobsRatio), WBQC does the following:

  1. Check if cached constraint check results are available and still up-to-date. If yes, return those and we’re done. (This only applies to wbcheckconstraints – in the job, the cached results are guaranteed to be outdated due to the edit that triggered the job.)
  2. Actually check constraints. (This is potentially quite slow.)
  3. Put the result in the cache.
  4. Append (only!) the violations, if any, to the wbqc_violations table. (This might need to happen in a deferred update, given that we might be in a GET request? Or we turn wbcheckconstraints into a needs-POST action, I suppose.) The wbqcv_timestamp represents when the constraint check was done, i.e. it’s the current time (unrelated to any revision timestamp, given that results can change at any time).
  5. Fire an EventBus event to inform the updater that new violations are available. (Do this even if the result was “no violations”, so that the updater can remove outdated violations.)
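The five steps above can be sketched roughly as follows; all the names here (run_check, the dict standing in for the result cache, the lists standing in for the table and the EventBus topic) are stand-ins for illustration, not real WBQC APIs:

```python
import time

cache = {}        # entity ID -> (revision ID, full results), like the WAN cache
violations = []   # stand-in for the wbqc_violations table
events = []       # stand-in for the EventBus topic

def run_check(entity_id, revision_id, from_job=False):
    # 1. Reuse cached results if still up to date (wbcheckconstraints only;
    #    in the job the cache is guaranteed stale due to the triggering edit).
    if not from_job:
        cached = cache.get(entity_id)
        if cached and cached[0] == revision_id:
            return cached[1]
    # 2. Actually check constraints (slow in reality; stubbed here).
    results = [("Q42$stmt-guid", "P31$constraint-guid")]  # pretend violations
    # 3. Put the full results in the cache for human consumers.
    cache[entity_id] = (revision_id, results)
    # 4. Append (only!) the violations, stamped with the check time.
    ts = time.strftime("%Y%m%d%H%M%S")
    for stmt, constraint in results:
        violations.append((entity_id, ts, stmt, constraint))
    # 5. Tell the updater, even when there were no violations.
    events.append({"entity_id": entity_id, "timestamp": ts})
    return results

run_check("Q42", 12345)
run_check("Q42", 12345)  # second call hits the cache, appends nothing
print(len(violations), len(events))  # 1 1
```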

The query service updater listens to this new event topic, in addition to mediawiki.revision-create (see docs). When it sees a “new violations” event, it makes requests to /w/index.php?title=Q42&action=constraintsrdf&timestamp=PREVIOUS + /w/index.php?title=Q42&action=constraintsrdf&timestamp=NEW, diffs the RDF received from those, and sends the required update to the query service proper. (The new timestamp comes from the event; the old timestamp is… part of the updater’s internal state, just like the old revision ID for an edit? or also part of the event?) action=constraintsrdf only has to copy the information from wbqc_violations to the response (and the response can be marked as very cacheable, by the way).
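The updater’s diffing step could look like this sketch, assuming action=constraintsrdf emits one ?statement wikibase:hasViolationForConstraint ?constraintStatement triple per matching wbqc_violations row (fetch_constraints_rdf is a hypothetical stand-in for the two HTTP requests):

```python
def fetch_constraints_rdf(entity_id, timestamp, table):
    """Stand-in for GET .../index.php?title=...&action=constraintsrdf&timestamp=...:
    returns the set of violation triples recorded at that check timestamp."""
    return {
        (stmt, "wikibase:hasViolationForConstraint", constraint)
        for eid, ts, stmt, constraint in table
        if eid == entity_id and ts == timestamp
    }

# Made-up table contents: one violation persists across both checks,
# a second one is new in the later check.
table = [
    ("Q42", "20240101000000", "Q42$old-stmt", "P31$c1"),
    ("Q42", "20240102000000", "Q42$old-stmt", "P31$c1"),
    ("Q42", "20240102000000", "Q42$new-stmt", "P569$c2"),
]

old = fetch_constraints_rdf("Q42", "20240101000000", table)
new = fetch_constraints_rdf("Q42", "20240102000000", table)

to_delete = old - new   # triples to remove from the query service
to_insert = new - old   # triples to add
print(len(to_delete), len(to_insert))  # 0 1
```

Because both timestamps stay queryable until the cleanup job runs, the diff is a plain set difference, with no need to store the old triples in the updater itself.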

For bootstrapping the query service during a data reload, we also need dumps of constraint violations. These will be based on the contents of wbqc_violations. (The dump contents will not just be the RDF triples of the constraint violations, as the updater will need to know the timestamp, right?)

Finally, a periodic job removes all wbqc_violations rows that are older than a certain time period (2 weeks? 2 months? depends on what the updater + data reload procedure need, I think). This is the only time the table is shrunk again – new constraint violation results are always appended, not overwriting anything, so that the updater can get both the old and new results (calling action=constraintsrdf with different timestamps) and diff between them. (In Wikimedia production, I expect this cleanup would be a systemd timer, but for third-party wikis we’d probably also want to include it in some MediaWiki job, just as a failsafe to prevent the table from growing forever if nobody set up the timer/cronjob.)
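The cleanup could be as simple as the following sketch (again SQLite standing in for MariaDB, and a two-week retention window assumed, since the actual period is still to be decided):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE wbqc_violations (
    wbqcv_entity_id TEXT, wbqcv_timestamp TEXT,
    wbqcv_statement_id TEXT, wbqcv_constraint_statement_id TEXT)""")
conn.executemany(
    "INSERT INTO wbqc_violations VALUES (?, ?, ?, ?)",
    [("Q42", "20240101000000", "s1", "c1"),   # old row, to be pruned
     ("Q42", "29990101000000", "s1", "c1")],  # recent row, kept
)

cutoff = (datetime.now(timezone.utc) - timedelta(weeks=2)).strftime("%Y%m%d%H%M%S")
# MediaWiki timestamps sort lexicographically, so a plain string
# comparison works; this is the query the wbqcv_timestamp index serves.
conn.execute("DELETE FROM wbqc_violations WHERE wbqcv_timestamp < ?", (cutoff,))
remaining = conn.execute("SELECT COUNT(*) FROM wbqc_violations").fetchone()[0]
print(remaining)  # 1
```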

Third-party tools may want to consume constraint violation data, and a way to distinguish the most recent data from older data would be welcome. (Public tables are replicated to Wikimedia Cloud Services.)

The dump contents will not just be the RDF triples of the constraint violations, as the updater will need to know the timestamp, right?

So we need a way to represent a check result with no violations. (We could add a new “meta” table recording, for each entity, the timestamp of its last check, but since there are 1.64 violations per item on average, that would result in a table of a size comparable to the currently proposed one.)

I’m not sure that’s necessary. My thinking was that during normal updating, the updater will see a “new violations for entity ID at timestamp” event, and ask action=constraintsrdf for the violations; if there are no rows with that timestamp in the table, the updater can interpret the empty response as “no violations”. And I think this should also work when setting up the query service from a dump – if I understand correctly, the updater works from the backlog of old events in that mode. But we should probably check this with the WMF folks who understand the updater and its needs better.

If we do need “no violations” represented in the database, we can always encode that as a table row where e.g. the wbqcv_statement_id is the empty string. (Or NULL, though that means the table needs more storage space. Given that we know a valid statement ID is never empty, IMHO it would be better to use an empty string as the “no violations” representation.)
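The empty-string sentinel could work like this sketch: a single row with an empty wbqcv_statement_id records “checked at this timestamp, nothing wrong”, which lets a consumer distinguish “no violations” from “no check recorded” (rows_to_triples is a hypothetical helper):

```python
# Valid statement GUIDs are never empty, so "" is free to use as a marker.
NO_VIOLATIONS = ""

def rows_to_triples(rows):
    """Turn the (statement, constraint statement) pairs stored for one
    (entity, timestamp) into RDF pairs for action=constraintsrdf."""
    return [
        (stmt, constraint)
        for stmt, constraint in rows
        if stmt != NO_VIOLATIONS  # sentinel rows produce no triples
    ]

# A check that ran but found nothing still leaves a row behind:
checked_clean = [(NO_VIOLATIONS, NO_VIOLATIONS)]
print(rows_to_triples(checked_clean))   # []
print(len(checked_clean) > 0)           # True: the check itself is recorded
```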

the updater can interpret the empty response as “no violations”.

The issue is that if we need to reload the data, we should make sure we load nothing when there are no violations for an item, rather than loading the latest non-empty constraint violations.