
possible to violate label uniqueness constraint of property labels
Closed, Resolved · Public · 1 Story Points

Description

Apparently it is possible to edit properties in a way that two different properties have the same label. This should not happen.

Two edits showing the issue:

At the same time adding a new label in another language (yi) was not possible and caused an error message saying there are uniqueness violations in 4 languages for those two properties.

Event Timeline

Lydia_Pintscher updated the task description.
Lydia_Pintscher raised the priority of this task from to High.
Restricted Application added a subscriber: Aklapper. Jun 11 2015, 5:10 PM
Bene added a comment.Jun 11 2015, 5:21 PM

Note that this issue does not happen on test.wikidata.org. You can play around with https://test.wikidata.org/wiki/Property:P155 and https://test.wikidata.org/wiki/Property:P164. When entering the same label on both properties, I get the expected error message.

Is it possibly related to the fact that we are trying to edit an already conflicting label?

Bene added a comment.EditedJun 11 2015, 5:42 PM

Is it possibly related to the fact that we are trying to edit an already conflicting label?

I don't think so because after resolving all conflicts (adding ~) I am still able to add a duplicate label (cf. https://www.wikidata.org/w/index.php?title=Property:P1476&diff=222001356&oldid=222001328).

Could someone look into our terms table in the database and check whether all labels are tracked correctly?

Edit: See http://quarry.wmflabs.org/query/3967 and http://quarry.wmflabs.org/query/3968

Do I read those queries correctly that the terms table has them differently?

Bene added a comment.Jun 11 2015, 8:56 PM

As far as I can see the terms table contains the correct terms and search keys so I wonder how my edits could pass the uniqueness filters.

daniel added a comment.EditedJun 12 2015, 9:05 AM

I can only add that I, too, am quite confused. I can't reproduce the issue locally, nor can I think of a way for this to happen, except for extreme replication lag (several minutes - which should not be possible). The uniqueness checks are run against the slave database. This is something we could change, and it has the potential to fix this. But it's a shot in the dark, really.

Bene added a comment.Jun 12 2015, 10:12 AM

See also http://quarry.wmflabs.org/query/3972: it is possible that the same label gets tracked in wb_terms twice.

@Bene: yes, there is no database-level constraint, so it's possible at that level. The question is just: how does it happen? Our business logic (the uniqueness validators) should prevent that.
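A minimal sketch of how such duplicates could be surfaced, assuming a simplified copy of the wb_terms table. The column names come from the query quoted later in this thread; the sample rows are hypothetical, and this is not the content of the Quarry queries linked above.

```python
import sqlite3

# In-memory stand-in for wb_terms (simplified; sample rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE wb_terms (
        term_entity_id INTEGER,
        term_entity_type TEXT,
        term_language TEXT,
        term_type TEXT,
        term_text TEXT,
        term_search_key TEXT
    )"""
)
rows = [
    (357, "property", "en", "label", "title", "title"),
    (1476, "property", "en", "label", "title", "title"),
    (1476, "property", "de", "label", "Titel", "titel"),
]
conn.executemany("INSERT INTO wb_terms VALUES (?, ?, ?, ?, ?, ?)", rows)

# Property labels whose (language, search key) pair is used by more than one entity.
duplicates = conn.execute(
    """SELECT term_language, term_search_key,
              COUNT(DISTINCT term_entity_id) AS entity_count
       FROM wb_terms
       WHERE term_entity_type = 'property' AND term_type = 'label'
       GROUP BY term_language, term_search_key
       HAVING COUNT(DISTINCT term_entity_id) > 1"""
).fetchall()
print(duplicates)
```

Since nothing in the table definition forbids the second "title" row, only the application-level validators stand between an edit and a duplicate.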

aude added a subscriber: aude.Jun 12 2015, 10:59 AM

this is the query that is run and it is quite fast:

select term_entity_type,term_type,term_language,term_text,term_entity_id  FROM `wb_terms`   WHERE ((term_language='en' AND term_search_key='title' AND term_type='label' AND term_entity_type='property'))  LIMIT 10 \G;
*************************** 1. row ***************************
term_entity_type: property
       term_type: label
   term_language: en
       term_text: title
  term_entity_id: 1476
1 row in set (0.01 sec)
Bene added a comment.Jun 12 2015, 11:14 AM

The query posted by @aude gives us the expected result (cf. http://quarry.wmflabs.org/query/3973)

It seems that also label-description duplicates are possible: https://www.wikidata.org/w/index.php?title=Property%3AP357&type=revision&diff=222062411&oldid=222000959

aude added a comment.Jun 12 2015, 2:52 PM

What I see is that LabelUniquenessValidator requests all conflicting labels (in all languages), including self-conflicts on the same item, then filters out self-conflicts.

TermSqlIndex is what detects conflicts, and it returns a maximum of 10 conflicts. If those 10 are all self-conflicts (the results are cut off before reaching any non-self conflicts), they all get filtered out afterwards and LabelUniquenessValidator finds no conflict.

The easiest solution is to increase the maximum number of conflicts returned, to say 500.

Also, is there any need for the $ignoreEntityId option in LabelDescriptionDuplicateDetector::detectTermConflicts? If we always end up filtering them out, could this be done as part of the query, getting rid of the option and the post-filtering?
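The failure mode described in this comment can be sketched roughly as follows. This is a Python illustration, not the actual Wikibase PHP code: `Term`, `find_matches` and both `has_conflict_*` helpers are hypothetical names, and the data is made up so that one property has more labels than the limit.

```python
from typing import NamedTuple


class Term(NamedTuple):
    entity_id: int
    language: str
    text: str


def find_matches(index, labels, limit):
    """Return at most `limit` index rows whose (language, text) matches a label."""
    hits = [t for t in index if labels.get(t.language) == t.text]
    return hits[:limit]


def has_conflict_post_filter(index, entity_id, labels, limit):
    # Limit first, then drop self-conflicts: the pattern described above.
    hits = find_matches(index, labels, limit)
    return any(t.entity_id != entity_id for t in hits)


def has_conflict_pre_filter(index, entity_id, labels, limit):
    # Drop the entity's own terms first, then apply the limit.
    others = [t for t in index if t.entity_id != entity_id]
    return bool(find_matches(others, labels, limit))


# Hypothetical data: property 357 has 12 labels; property 1476 shares the last one.
labels_357 = {f"l{i}": f"label-{i}" for i in range(12)}
index = [Term(357, lang, text) for lang, text in labels_357.items()]
index.append(Term(1476, "l11", "label-11"))

missed = has_conflict_post_filter(index, 357, labels_357, limit=10)
found_with_higher_limit = has_conflict_post_filter(index, 357, labels_357, limit=500)
found_with_pre_filter = has_conflict_pre_filter(index, 357, labels_357, limit=10)
print(missed, found_with_higher_limit, found_with_pre_filter)
```

With a limit of 10 the first ten matches all belong to the entity being edited, so the real conflict is never seen. Raising the limit only helps as long as an entity has fewer terms than the limit, whereas filtering before limiting does not depend on the limit at all.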

Change 217849 had a related patch set uploaded (by Aude):
Increase max conflicts returned for conflict detections

https://gerrit.wikimedia.org/r/217849

Bene added a comment.Jun 12 2015, 2:55 PM

What are those "self-conflicts" actually? How can a label be found more than once? Don't we filter by entity type? Also, why don't we just filter out "self-conflicts" in the where clause?
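For illustration, excluding self-conflicts in the WHERE clause might look something like the sketch below. It assumes a simplified wb_terms table in SQLite with hypothetical rows, matches on term_text rather than term_search_key for brevity, and uses a made-up `conflicts` helper; it is not the query TermSqlIndex actually builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE wb_terms (
        term_entity_id INTEGER,
        term_entity_type TEXT,
        term_type TEXT,
        term_language TEXT,
        term_text TEXT
    )"""
)
# Hypothetical data: property 357 has 12 labels; property 1476 shares the last one.
labels = {f"l{i}": f"label-{i}" for i in range(12)}
rows = [(357, "property", "label", lang, text) for lang, text in labels.items()]
rows.append((1476, "property", "label", "l11", "label-11"))
conn.executemany("INSERT INTO wb_terms VALUES (?, ?, ?, ?, ?)", rows)


def conflicts(conn, entity_id, labels, limit, exclude_self):
    """Find label rows matching any (language, text) pair, optionally skipping entity_id."""
    pair_sql = " OR ".join("(term_language = ? AND term_text = ?)" for _ in labels)
    params = [value for pair in labels.items() for value in pair]
    sql = (
        "SELECT term_entity_id, term_language, term_text FROM wb_terms "
        "WHERE term_entity_type = 'property' AND term_type = 'label' "
        f"AND ({pair_sql})"
    )
    if exclude_self:
        # The filter asked about above: skip the entity's own terms in the query itself.
        sql += " AND term_entity_id <> ?"
        params.append(entity_id)
    sql += " ORDER BY rowid LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


without_filter = conflicts(conn, 357, labels, limit=10, exclude_self=False)
with_filter = conflicts(conn, 357, labels, limit=10, exclude_self=True)
print(len(without_filter), with_filter)
```

Without the extra condition the LIMIT is used up by rows belonging to property 357 itself; with it, the only row returned is the one from property 1476.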

aude added a comment.Jun 12 2015, 3:00 PM

@Bene terms being searched for conflicts included:

array (
  'fr' => 'titre (OBSOLETE -> P1476)',
  'en' => '(OBSOLETE) title (use P1476, "title")',
  'ja' => '題名 *廃止(P1476を使用)',
  'nl' => '(VEROUDERD) titel van publicatie',
  'it' => '(obsoleta) titolo (usare P1476)',
  'pt' => '(OBSOLETO) título (usar P1476)',
  'pt-br' => 'título (OBSOLETO)',
  'es' => 'título',
  'ko' => '원제목',
  'ca' => 'títol (OBSOLET, utilitzeu P1476)',
  'cs' => '(ZASTARALÉ) titul originálu',
  'gl' => 'título~',
  'hu' => '(ELAVULT) műcím (használd a P1476-ot)',
  'fa' => 'عنوان',
  'ro' => 'titlu',
  'vi' => 'tựa đề',
  'sv' => 'titel~',
  'pl' => 'tytuł oryginalny',
  'de' => '(VERALTET) Titel',
  'el' => 'τίτλος',
  'zh-hans' => '(停用)标题字符串(请改用P1476)',
  'zh-hant' => '標題(字串)',
  'bs' => 'naslov (slovima)',
  'he' => 'שם מקורי',
  'uk' => 'назва мовою оригіналу',
  'nds' => 'Titel',
  'fi' => 'alkuperäisotsikko',
  'ka' => 'ორიგინალური დასახელება',
  'ru' => '(УСТАРЕЛО) название (используйте P1476)',
  'bn' => 'মূল শিরোনাম',
  'be' => 'назва на мове арыгінала',
  'sh' => 'originalni naslov',
  'nn' => '(FORELDA) originaltittel',
  'eo' => '(MALNOVA) originala titolo',
  'da' => '(FORÆLDET) titel',
  'sr' => 'наслов',
  'sr-ec' => 'наслов',
  'is' => 'upprunalegur titill',
  'mk' => 'изворен наслов',
  'oc' => 'títol',
  'zh' => '标题字符串(已废弃,请使用P1476)',
  'be-tarask' => 'назва',
  'ms' => 'tajuk',
  'nb' => '(FORELDET) tittel',
  'zh-tw' => '標題(字符串)',
  'lv' => 'nosaukums',
  'gu' => 'શીર્ષક~',
  'et' => '(VANANENUD) pealkiri',
  'zh-hk' => '標題(字符串)',
  'zh-cn' => '标题字符串',
  'hi' => 'शीर्षक',
  'te' => 'శీర్షిక',
  'or' => 'ନାମ',
  'sr-el' => 'naslov',
  'sco' => 'title',
  'sl' => 'naslov (string)',
  'ia' => 'titulo',
  'la' => 'titulus',
  'scn' => 'tìtulu',
  'eu' => '(ZAHARKITUA) izenburua (erabili P1476)',
  'mr' => 'शीर्षक',
  'yi' => 'טיטל (פארעלטערט)',
)

I got back results like:

array (
  0 =>
  Wikibase\Term::__set_state(array(
    'fields' =>
    array (
      'entityType' => 'property',
      'termType' => 'label',
      'termLanguage' => 'ko',
      'termText' => '원제목',
      'entityId' => 357,
    ),
  )),
  1 =>
  Wikibase\Term::__set_state(array(
    'fields' =>
    array (
      'entityType' => 'property',
      'termType' => 'label',
      'termLanguage' => 'ro',
      'termText' => 'titlu',
      'entityId' => 357,
    ),
  )),
  2 =>
  Wikibase\Term::__set_state(array(
    'fields' =>
    array (
      'entityType' => 'property',
      'termType' => 'label',
      'termLanguage' => 'vi',
      'termText' => 'tựa đề',
      'entityId' => 357,
    ),
  )),

as "conflicts". (got 10 of these, all entityId => 357)

I think those are questions for Daniel. It might be for performance reasons or something, but I think there must be a better way.

Change 217849 merged by jenkins-bot:
Increase max conflicts returned for conflict detections

https://gerrit.wikimedia.org/r/217849

@aude I think you got that exactly right! Ouch. That was my fault. I somehow got it in my head that 10 conflicting items would always be sufficient. But it was 10 conflicting terms, which of course may all be self-conflicts.

The reason I introduced the post-filtering was a) to reduce the complexity of the already complex query and b) to keep validator knowledge out of the TermIndex interface. But that interface should be broken up anyway.

Lydia_Pintscher closed this task as Resolved.Jun 14 2015, 11:36 AM
Lydia_Pintscher claimed this task.
Lydia_Pintscher reassigned this task from Lydia_Pintscher to aude.
Lydia_Pintscher set Security to None.
Lydia_Pintscher moved this task from Backlog to Done on the Wikidata-Sprint-2015-06-02 board.
Lydia_Pintscher edited a custom field.

Shall we backport this? The fix is tiny.

And for the record: the real solution for this would be T74430: Re-implement uniqueness constraint in a consistent and efficient way

aude added a comment.Jun 14 2015, 12:00 PM

@daniel can backport it tomorrow, and totally agree about the *real* fix.

Change 218308 had a related patch set uploaded (by Aude):
Increase max conflicts returned for conflict detections

https://gerrit.wikimedia.org/r/218308

Change 218308 merged by jenkins-bot:
Increase max conflicts returned for conflict detections

https://gerrit.wikimedia.org/r/218308

Change 218351 had a related patch set uploaded (by Aude):
Update Wikidata - fix property label constraint bug

https://gerrit.wikimedia.org/r/218351

Change 218351 merged by jenkins-bot:
Update Wikidata - fix property label constraint bug

https://gerrit.wikimedia.org/r/218351

aude added a comment.Jun 15 2015, 2:39 PM

deployed the fix to wikidata.