
Cache all constraint check results per-entity
Closed, Resolved · Public

Description

Caching constraint check results per-constraint is probably too inefficient to be of any use, so we should instead cache the full results of checking all constraints on one entity as a single blob.

If there’s a cache hit, we can go over the cached results and, for each result, check if it’s potentially stale: “commons link”, “type”, “value type”, and “unique value” results are always potentially stale (and the response we send should indicate that, see T179844); results for all other constraint types will be known to be fresh if we set up correct purging. (This means that for most constraint types, we don’t need to show the user “this result may be stale”.)

If there’s a cache miss, we do the normal constraint check process, store the result in the cache, and also set it up so that it will be purged when the entity is edited, and (for “inverse”, “symmetric” and “target requires claim”) when certain other entities are edited. I’m not sure how best to do this – should we register some hook that explicitly purges the cache when those entities are edited, or should we record their revision IDs inside the cached blob and check if it’s still fresh when retrieving it?

(If the purging on edits of other entities turns out to be too difficult, we can instead add “inverse”, “symmetric” and “target requires claim” to the constraint types whose results may be stale.)

Event Timeline

Oh, and another thing – I suppose we also need to invalidate the cached results somehow when a constraint statement is edited. I’m not sure what the most efficient way to do that is.

  • Explicitly purge cached results when a constraint statement is edited. This is almost certainly too expensive.
  • Store revision IDs of all properties with constraints in the cached result and reject the result if one of the properties has been edited since then. I’m not sure if this is efficient – constraint checks on Q42 seem to use 122 properties; can we get that many revision IDs in a single database SELECT? (See the sketch after this list.)
  • Keep a global “constraint revision ID” which is only incremented when constraints change, store the current “constraint revision ID” in the cached result, and reject it if it’s no longer up to date. However, due to T163465, we currently don’t know in UpdateConstraintsTableJob whether the constraints have actually changed, so we would increment that ID with every change to any property – which happens fairly frequently, according to current RecentChanges (about once every five minutes, on average). But I suppose we could also make UpdateConstraintsTableJob smarter: instead of unconditionally purging and re-importing the constraints, get the old ones, import the new ones, compare them, and only store an update if there’s a difference.
  • As a combination of the last two ideas, keep such a “constraint revision ID” per constraint (identified by its statement ID), instead of a single global one.
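
For the second option, a single SELECT for many revision IDs looks feasible – a minimal sketch against MediaWiki core’s page table (the page IDs are placeholders):

SELECT page_id, page_latest
FROM /*_*/page
WHERE page_id IN (101, 102, 103); -- placeholder IDs; one entry per property with constraints

Since page_id is the primary key, an IN list of ~122 IDs should be a cheap batched lookup; comparing the results against the stored revision IDs would then happen in PHP.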

Okay, my current notion of the caching is: we use just the entity ID as the cache key (plus some static WBQC prefix, of course). The value includes not just the serialized constraint check results, but also a map of page IDs to revision IDs. When retrieving a cached value, we don’t even care what those IDs are; we just check that they’re all still up to date, and otherwise delete the value from the cache. When computing a value to cache, that map includes the entity currently being checked, all entities referenced in “inverse”, “symmetric” and “target requires claim” constraints (perhaps with some reasonable limit, 100 or so), and perhaps also all properties used in constraints (but I’m not sure about that yet, see above). We also hook into the ArticlePurge hook to purge an entity’s cached constraint check results when that entity is purged, but we don’t purge any other entities in that hook.

Regarding invalidation on constraint statement edits – the median time since any given property was last edited is currently almost 18 days, so just using the revision ID of the property should be acceptable for now.

Do we really want to cache all constraint check results… or only those that we will actually show in the gadget: warning, violation and bad-parameters?

For reference: Berlin (Q64) currently has the following distribution of result statuses (P6320):

1344 "compliance"
 196 "not-main-snak"
  80 "todo"
  73 "warning"
  18 "deprecated"
   1 "violation"

Of those 1712 results, the gadget only cares about 74. If we remove the rest, we massively reduce the size of the JSON we transmit and cache, which I suddenly find very tempting.

Of course, for this caching to still be correct, the API request has to explicitly indicate that it doesn’t care about the other results, and API requests without such an indication would not benefit from caching. Is that acceptable?

Here are the result statuses of 100 randomly chosen items (P6333):

"bad-parameters" 15
"warning" 66
"not-main-snak" 1601
"todo" 26
"compliance" 5096

It appears the average item has plenty of checkable snaks, but very few violations (which is great!). So I’m very much leaning towards the optimization mentioned in the previous comment: to only store those results that the gadget will display (here: 81 out of 6800).

And in that case, we should seriously consider using the same storage mechanism for both T180582: List of all constraint violations and this task, since they’ll contain the same information. (That is: for now, cache constraint check results; later, add a special page that lists cached constraint check results of a property across all entities that use it; and perhaps later still, add a job that periodically checks constraints on random items, so that the constraint results of the special page don’t depend on human editors visiting items with the checkConstraints gadget enabled.) I’m just not sure how best to do that.

The task is: we have a JSON blob (or, if you will, a PHP array – the constraint check results, in any event), which we want to store for some time (with the ability to explicitly remove it, either on page purge or because we detect it’s no longer valid). Currently, we will access it by entity ID (one blob per entity), but in the future, we will also want to find it by a list of property IDs, and possibly constraint IDs (statement IDs), that it involves. That list can be determined from the blob, but not with a simple string search.

Okay, I discussed this with @Ladsgroup and this would be the current draft:

One table from entity ID to cached constraint check results, perhaps called wbqc_result:

result_entity_id | result_json
Q42              | {"v":{"Q42":12345,"Q43":123456,…},"r":{…}}

I’m not sure whether we should use the entity ID as the primary key or add another column (either the numeric entity ID or perhaps an auto-incrementing counter). This also assumes that we’re storing the (in)validation info (the map of page IDs to revision IDs, see the earlier comment) inside the JSON blob – I suppose it could also be stored separately, but I’m not sure whether that has any advantage.
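
In DDL, that might look as follows – a sketch only, since the column types, and using the entity ID as primary key, are assumptions pending the questions above:

CREATE TABLE IF NOT EXISTS /*_*/wbqc_result (
	-- entity ID, e.g. 'Q42' (VARBINARY(15) borrowed from the old wbq_violations.entity_id)
	result_entity_id   VARBINARY(15)   NOT NULL,
	-- serialized results plus the "v" map of page IDs to revision IDs; type is a guess
	result_json        MEDIUMBLOB      NOT NULL,
	PRIMARY KEY (result_entity_id)
) /*$wgDBTableOptions*/;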

Another table from entity ID to constraint IDs used in the constraint report. (I don’t have a good name for this one yet.)

result_entity_id | constraint_id
Q42              | 2
Q42              | 3
Q42              | 4
Q43              | 2

(Depending on the first “I’m not sure” above, result_entity_id may also be another column of wbqc_result.) constraint_id here is a numeric ID assigned to constraints, not the constraint_guid we currently have – we need to update the wbqc_constraints table to add this ID as a primary key, so that we don’t have to store and join against the >40-character string constraint GUID. See T180834: Add numeric primary key to wbqc_constraints table.
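
A sketch of that second table, under a placeholder name (wbqc_result_constraints is just my invention for now) and assuming the numeric constraint_id from T180834 exists:

CREATE TABLE IF NOT EXISTS /*_*/wbqc_result_constraints (
	-- same entity ID as in wbqc_result above
	result_entity_id   VARBINARY(15)   NOT NULL,
	-- numeric constraint ID, referencing the new primary key of wbqc_constraints (T180834)
	constraint_id      INT UNSIGNED    NOT NULL,
	PRIMARY KEY (result_entity_id, constraint_id)
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/wbqc_rc_constraint_id ON /*_*/wbqc_result_constraints (constraint_id);

The secondary index on constraint_id is what the per-constraint lookups described below would use.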

When checking constraints, we first get the cached results from wbqc_result (via result_entity_id). We parse the JSON and get the map of page IDs to revision IDs, and check the page table to see if the revision ID is still the page_latest for that page. If yes, we use the cached results, otherwise we do the whole constraint checking process and finally store the results back in wbqc_result.
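
As queries, that read path might look roughly like this (table and column names as sketched above; the entity and page IDs are placeholders):

-- 1. fetch the cached blob for the entity being checked
SELECT result_json
FROM /*_*/wbqc_result
WHERE result_entity_id = 'Q42';

-- 2. after decoding the map of page IDs to revision IDs from the JSON,
--    verify that each stored revision ID still matches page_latest
SELECT page_id, page_latest
FROM /*_*/page
WHERE page_id IN (12345, 123456); -- placeholder page IDs

If any page_latest differs from the stored revision ID, we discard the cached row, re-run the checks, and write the fresh results back.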

When looking for all violations of a constraint (which could be part of T180582: List of all constraint violations), we join wbqc_result, the unnamed second table from above, and wbqc_constraints, and get the wbqc_result.result_json where the wbqc_constraints.constraint_guid is the given constraint GUID.

When looking for all violations of a property (which is T180582: List of all constraint violations as originally stated), we do the same thing but look for all constraints where wbqc_constraints.pid is the given property ID.
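
Both lookups could share the same join shape – a sketch, again using the placeholder table name from above:

-- all cached results involving one specific constraint:
SELECT r.result_entity_id, r.result_json
FROM /*_*/wbqc_result r
JOIN /*_*/wbqc_result_constraints rc
	ON rc.result_entity_id = r.result_entity_id
JOIN /*_*/wbqc_constraints c
	ON c.constraint_id = rc.constraint_id
WHERE c.constraint_guid = 'P10$1529A843-91A8-4FEA-B43F-9A3258446A38';

-- all cached results involving any constraint of one property:
-- same joins, but with
-- WHERE c.pid = 10;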

For reference, this is the wbqc_constraints table we currently have:

constraint_guid | pid | constraint_type_qid | constraint_parameters
P10$1529A843-91A8-4FEA-B43F-9A3258446A38 | 10 | Q21502404 | {"P1793":{"0":{"snaktype":…
P10$28240BCD-348B-45BD-9654-3B5B8E101FD6 | 10 | Q21510852 | {"P2307":{"0":{"snaktype":…
P1000$3727DD2D-2ED9-4D53-98EB-5618039508E3 | 1000 | Q21510865 | {"P2308":{"0":{"snaktype":…

constraint_guid is a PRIMARY KEY according to the SQL schema (though not one we can actually use, according to Amir), and there’s an index on the pid. (T180834 is the task for adding a separate constraint_id.)

@hoo: can you perhaps check if this makes sense?

Note: we’ve decided to go ahead with the ObjectCache for now, and store what would be wbqc_result above in the cache instead: see T181060: Cache constraint check results per-entity in ObjectCache. We can migrate that to the database, as described above, in the future. In the meantime, review of the above schema is most welcome!

There is also some prior art: an older version of the WikibaseQuality extension contained this table, as found in T102992: [Task] Review WikidataQuality DB schema:

CREATE TABLE IF NOT EXISTS /*_*/wbq_violations (
	entity_id                     VARBINARY(15)     NOT NULL,
	pid                           VARBINARY(15)     NOT NULL,
	claim_guid                    VARBINARY(63)     NOT NULL,
	constraint_id                 VARBINARY(63)     NOT NULL,
	constraint_type_entity_id     VARBINARY(15)     NOT NULL,
	additional_info               TEXT              DEFAULT NULL,
	updated_at                    VARBINARY(31)     NOT NULL,
	revision_id                   INT(10) UNSIGNED  NOT NULL,
	status                        VARBINARY(31)     NOT NULL,
	PRIMARY KEY (claim_guid, constraint_id)
) /*$wgDBTableOptions*/;

CREATE INDEX /*i*/claim_guid ON /*_*/wbq_violations (claim_guid);
CREATE INDEX /*i*/constraint_id ON /*_*/wbq_violations (constraint_id);

Here, each constraint violation gets a database row of its own. In the above form, this table is tied to the notion that constraints are only checked on the main snak of a statement (via the claim_guid column), which is no longer true since T168532: Check constraints on qualifiers and references. However, we could probably construct some alternative “address” that would still, in combination with the constraint ID, be a unique key for such a table. But I’m not sure if that’s an advantage.