Page MenuHomePhabricator

Re-implement uniqueness constraint in a consistent and efficient way
Open, MediumPublic


At present, we have several kinds of uniqueness constraints:

  1. a property's label must be unique (per language, among properties)
  1. an item's combination of label and description must be unique (per language, among items, if a description is given)
  1. an item's sitelink must be unique (per target wiki).

Each of these is implemented separately; some are checked on every save, some are checked only on creation and modification of the respective part of the entity. Some checks are expensive or awkward (the label+description uniqueness requires a self-join on a big table with a very complex condition).

Proposed solution:

  • We add a "fingerprint" table, with two columns: fp_entity and fp_identity. fp_entity holds the id of the entity that fingerprint belongs to, fp_identity holds the fingerprint (as a string, which may be a hash). There's a composite unique key over both columns, and a separate index on the fb_identity column.
  • To check for conflicts, we compute all fingerprints of the candidate, and check if we find any of them in the database. If so, there is a conflict.
  • Fingerprints can be computed as sensitive or insensitive as we like (by e.g. converting to lower case or stripping whitespace before hashing)
  • as an added bonus, we can look up any entity by a fingerprint (e.g. items by sitelink or properties by label) without touching the terms table.

Example fingerprints for the three use cases mentioned above:

  1. property label: $fp_fingerprint = "label:{$language}:{$hashOfLabel}";
  1. item label+description: $fp_fingerprint = "label+desc:{$language}:{$hashOfLabelAndDescription}";

// Note: if there is no description in that language, no fingerprint is generated for that language

  1. item sitelinks: $fp_fingerprint = "sitelink:{$site}:{$hashOfPageTitle}";



Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:44 AM
bzimport set Reference to bz72430.
bzimport added a subscriber: Unknown Object (MLST).

Addendum: for selectively updating specific fingerprints of a given entity, e.g. all sitelinks, or the label+desc fingerprint in french, a prefix search on fp_identity can be used.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

On further though, the table should have three columns:

  • fp_entity -- the entity
  • fp_name -- identity name (e.g. "item-sitelink-enwiki" or "item-label-description-de")
  • fp_identity -- value or hash that is supposed to be unique over fp_name

Indxes should exist for (fp_name, fp_identity) and (fp_entity, fp_name).
We may also want a row_id field.

Calling this fingerprint is confusing, since we already use that word for a different concept. (Do point out how evil this is.)

I like the general idea proposed here.

Instead of "fingerprint", we could call it "unique feature".

Instead of "fingerprint", we could call it "unique feature".

Sounds sane.

at some point "unique" was changed to "distinct" on Wikidata constraints.