Only check per-item constraints once, instead of with each statement
Open, Medium, Public

Description

Some constraints, like “has type” (type) or “has statement” (item), are really constraints on the item, not on the statement; if multiple statements all introduce the same constraint on the item (for example, parent, sibling, date of birth, etc. might all require type: human), we can optimize the constraint check by only checking that constraint once.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 24 2017, 12:25 PM
thiemowmde triaged this task as Medium priority. · Jul 4 2017, 8:59 AM
thiemowmde added subscribers: Jonas, thiemowmde.

My current idea is to do this after T171743: Introduce constraint check context abstraction with a new kind of context, which contains only an entity ID and no snak. We’ll have some kind of “scheduler”(?) that takes a plan of contexts and constraints to run, and downgrades the context for some constraint types to an entity-only context and then deduplicates the plan.

Also, I think the full list of constraint types that this applies to is just:

  • conflicts with
  • item requires statement
  • type
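
To make the "downgrade, then deduplicate" idea concrete, here is a minimal Python sketch. It is not the ConstraintCheckPlan code from Gerrit. The names `PlannedCheck` and `deduplicate_plan`, the field layout, and the parameter strings are all hypothetical, and parameters are assumed to be already available as a canonical string.

```python
from typing import List, NamedTuple, Optional

# Constraint types from the list above that are assumed to depend only on
# the entity, not on the individual statement.
ENTITY_LEVEL_TYPES = frozenset({"conflicts with", "item requires statement", "type"})


class PlannedCheck(NamedTuple):
    """One planned constraint check (hypothetical structure)."""
    entity_id: str
    snak: Optional[str]      # None stands for an entity-only context
    constraint_type: str
    parameters: str          # assumed to be a canonical string already


def deduplicate_plan(plan: List[PlannedCheck]) -> List[PlannedCheck]:
    """Downgrade entity-level checks to entity-only contexts, then drop repeats."""
    seen = set()
    result = []
    for check in plan:
        if check.constraint_type in ENTITY_LEVEL_TYPES:
            # Forget which statement introduced the constraint.
            check = check._replace(snak=None)
        if check not in seen:
            seen.add(check)
            result.append(check)
    return result


# Three statements all require "type: human"; one has an unrelated constraint.
plan = [
    PlannedCheck("Q42", "parent", "type", "class=human"),
    PlannedCheck("Q42", "sibling", "type", "class=human"),
    PlannedCheck("Q42", "date of birth", "type", "class=human"),
    PlannedCheck("Q42", "date of birth", "range", "hypothetical-range-parameters"),
]
deduplicated = deduplicate_plan(plan)
print(len(plan), "planned checks ->", len(deduplicated), "after deduplication")
```

A real scheduler would also have to map the single result back to every statement that introduced the constraint, which this sketch leaves out.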

We probably also want to normalize constraint parameters before deduplication, by deterministically sorting the snaks (by property and value). I think this can be done without parsing the constraint parameters, just acting on the JSON as opaque blobs.
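
A rough sketch of such a normalization in Python, treating each snak as an opaque JSON value. The function name `normalize_parameters` and the `class`/`relation` keys are placeholders, not the real parameter property IDs. Sorting by the canonical JSON string is a stand-in for "by property and value", and it assumes the order of snaks within one parameter carries no meaning.

```python
import json


def normalize_parameters(parameters: dict) -> str:
    """Return a deterministic string for a {parameter: [snak, ...]} mapping.

    Snaks are never interpreted; they are only serialized and sorted.
    """
    def canonical(snak) -> str:
        return json.dumps(snak, sort_keys=True, ensure_ascii=False)

    normalized = {
        parameter: sorted(snaks, key=canonical)
        for parameter, snaks in parameters.items()
    }
    return json.dumps(normalized, sort_keys=True, ensure_ascii=False)


# Same constraint parameters, written in two different orders.
first = {
    "class": [{"value": "human"}, {"value": "fictional character"}],
    "relation": [{"value": "instance of"}],
}
second = {
    "relation": [{"value": "instance of"}],
    "class": [{"value": "fictional character"}, {"value": "human"}],
}
print(normalize_parameters(first) == normalize_parameters(second))
```

The resulting string can then serve as the deduplication key for otherwise identical checks.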

It should also be possible to deduplicate constraint checks on references. The same reference can be used in multiple statements, even across entities. (“imported from: English Wikipedia”, wdref:fa278ebfc458360e5aed63d5058cca83c46134f1 in RDF, is used in twelve million statements.)
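
One way this could look, as a hedged Python sketch: cache the check result per reference hash, so each distinct reference is checked only once. This assumes the result depends only on the reference itself and not on the surrounding statement. `check_references_once` and the `check` callback are hypothetical names, and the hashes below are made up.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def check_references_once(
    references: Iterable[Tuple[str, str]],
    check: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run `check` once per distinct reference hash and reuse the result.

    `references` yields (statement ID, reference hash) pairs.
    """
    cache: Dict[str, str] = {}
    results = []
    for statement_id, reference_hash in references:
        if reference_hash not in cache:
            cache[reference_hash] = check(reference_hash)
        results.append((statement_id, cache[reference_hash]))
    return results


calls = []


def fake_check(reference_hash: str) -> str:
    """Stand-in for an expensive constraint check on a reference."""
    calls.append(reference_hash)
    return "compliance"


references = [
    ("statement-1", "hash-a"),
    ("statement-2", "hash-a"),
    ("statement-3", "hash-b"),
    ("statement-4", "hash-a"),
]
results = check_references_once(references, fake_check)
print(len(results), "results from", len(calls), "actual checks")
```

Sharing results across entities, as suggested above, would need a cache that outlives a single request, which is outside the scope of this sketch.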

Change 374592 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseQualityConstraints@master] Introduce ConstraintCheckPlan class

https://gerrit.wikimedia.org/r/374592

Change 392054 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseQualityConstraints@master] Add tests for ConstraintCheckPlan

https://gerrit.wikimedia.org/r/392054

I tried to find out how many constraint checks we could save with this, specifically for the type constraint, which is the most expensive one out of the three where we can apply this optimization. This query, if I’m not mistaken, tells us that on Q42 (Douglas Adams),

  • “instance of human” is checked 19 times,
  • “instance of human or fictional character” and “instance of human, group of humans, fictional character, or character that may or may not be fictional” are each checked 7 times,
  • “instance of human, fictional character, or animal” is checked 4 times,
  • two other constraints are checked 3 times each,
  • and seven more constraints are checked 2 times each.

This means that a total of 44 out of 89 type checks are redundant (query). Let’s run those numbers for a few other items:

  • Q183 (Germany): 394 out of 429 type checks redundant.
  • Q23 (George Washington): 69 out of 111 type checks redundant.
  • Q9368300 (current popular item): 2 out of 13 type checks redundant.
  • Q21558717 (current item with most statements – source): 5103 out of 5107 type checks redundant.
  • Q62444 (random item): 13 out of 31 type checks redundant.
  • Q681946 (another random item): 2 out of 5 type checks redundant.
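
As a sanity check on the Q42 figure, the redundancy can be recomputed from the per-constraint counts listed earlier: a constraint checked n times contributes n − 1 redundant checks. The short Python sketch below only restates those numbers; the variable names are made up.

```python
# (times checked, number of distinct constraints checked that often) on Q42
q42_repeated = [
    (19, 1),  # "instance of human"
    (7, 2),   # the two constraints checked 7 times each
    (4, 1),   # "instance of human, fictional character, or animal"
    (3, 2),   # two other constraints
    (2, 7),   # seven more constraints
]
total_type_checks = 89

# Every check after the first one for the same constraint is redundant.
redundant = sum((times - 1) * count for times, count in q42_repeated)

# Checks that belong to a constraint that is checked more than once.
repeated_checks = sum(times * count for times, count in q42_repeated)

# Remaining checks, assuming the list above covers every repeated constraint.
checked_once = total_type_checks - repeated_checks

print(redundant, "of", total_type_checks, "type checks are redundant")
print(checked_once, "checks belong to constraints that occur only once")
```

The computed redundancy matches the 44 out of 89 reported above.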

Note: the second query, which returns the number of redundant and total type checks, is pretty crazy, and there might be some error in it. I’ve tried to comment it extensively, so I hope that someone else can look over it and check if it’s correct :)

Overall, it seems like with this optimization, we might be able to save, very roughly speaking, about half of all type constraint checks we perform. (Note, however, that this doesn’t help the “value type” constraint, which is equally expensive and more common.)
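
For what it's worth, the "about half" estimate can be compared against the per-item numbers above with a few lines of Python. Seven hand-picked items are not a representative sample, so this is only a rough plausibility check. The median is used here because a pooled ratio would be dominated by Q21558717.

```python
from statistics import median

# (redundant type checks, total type checks) per item, from the lists above
counts = {
    "Q42": (44, 89),
    "Q183": (394, 429),
    "Q23": (69, 111),
    "Q9368300": (2, 13),
    "Q21558717": (5103, 5107),
    "Q62444": (13, 31),
    "Q681946": (2, 5),
}

# Share of redundant type checks per item.
shares = {item: redundant / total for item, (redundant, total) in counts.items()}
for item, share in sorted(shares.items(), key=lambda pair: pair[1]):
    print(f"{item}: {share:.1%}")

median_share = median(shares.values())
print(f"median share of redundant type checks: {median_share:.1%}")
```

The median across these seven items is Q42's own share, 44/89, or a little under 50%.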

Addshore added a project: wikidata-tech-focus.
Addshore added a subscriber: Addshore.

Overall, it seems like with this optimization, we might be able to save, very roughly speaking, about half of all type constraint checks we perform. (Note, however, that this doesn’t help the “value type” constraint, which is equally expensive and more common.)

It sounds like this could be a very useful optimization when we start running these constraint checks more often.

@Lucas_Werkmeister_WMDE I'm guessing this optimization still needs to happen?

Yes, it hasn’t been done yet. There are some patches on Gerrit for my old ConstraintCheckPlan idea, but I’m not sure if that’s the best approach, and the patches themselves are probably stale beyond saving by now.