At present, we have several kinds of uniqueness constraints:
- a property's label must be unique (per language, among properties)
- an item's combination of label and description must be unique (per language, among items, if a description is given)
- an item's sitelink must be unique (per target wiki).
Each of these is implemented separately; some are checked on every save, some are checked only on creation and modification of the respective part of the entity. Some checks are expensive or awkward (the label+description uniqueness requires a self-join on a big table with a very complex condition).
Proposed solution:
- We add a "fingerprint" table, with two columns: fp_entity and fp_identity. fp_entity holds the id of the entity that fingerprint belongs to, fp_identity holds the fingerprint (as a string, which may be a hash). There's a composite unique key over both columns, and a separate index on the fb_identity column.
- To check for conflicts, we compute all fingerprints of the candidate, and check if we find any of them in the database. If so, there is a conflict.
- Fingerprints can be computed as sensitive or insensitive as we like (by e.g. converting to lower case or stripping whitespace before hashing)
- as an added bonus, we can look up any entity by a fingerprint (e.g. items by sitelink or properties by label) without touching the terms table.
Example fingerprints for the three use cases mentioned above:
- property label: $fp_fingerprint = "label:{$language}:{$hashOfLabel}";
- item label+description: $fp_fingerprint = "label+desc:{$language}:{$hashOfLabelAndDescription}";
// Note: if there is no description in that language, no fingerprint is generated for that language
- item sitelinks: $fp_fingerprint = "sitelink:{$site}:{$hashOfPageTitle}";