Wikidata Elastic search drops results with matches in different language label
Closed, Resolved (Public)

Description

If a search matches a label in a different language (e.g., you search in English but a Spanish label matches), the highlighter returns nothing: it checks only the English matches and the label_all matches, and for some reason label_all matches do not work with the highlighter. Because the highlighter returns nothing, we cannot display the result, and the match is dropped from the results.

If we replace label_all with label.*, the highlighter finds the label but cannot tell whether it is a label or an alias, because we fetch only English labels. We could fix that by fetching all labels, but that looks like overkill.

Event Timeline

Change 371747 had a related patch set uploaded (by DCausse; owner: DCausse):
[search/highlighter@master] Add new formatter to output offsets text snippets

https://gerrit.wikimedia.org/r/371747

I think the solution here is to return offsets alongside the text snippets.
There is an offset gap of 1 between array elements, so for instance the array [ "", "Image" ] gives a starting offset of 1 for a match on the query image.
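To make the offset arithmetic concrete, here is a minimal sketch of how the starting offset of each array element could be derived, assuming each value simply starts one position after the end of the previous one. The function name `start_offsets` is hypothetical and only illustrates the gap described above; it is not the highlighter's code.

```python
from typing import List

# Gap between consecutive array elements, as stated in the comment above.
OFFSET_GAP = 1


def start_offsets(values: List[str], gap: int = OFFSET_GAP) -> List[int]:
    """Return the assumed starting character offset of each array element."""
    offsets = []
    position = 0
    for value in values:
        offsets.append(position)
        # The next element starts after this value plus the gap.
        position += len(value) + gap
    return offsets


# The first slot is empty, so the second element starts at offset 1.
print(start_offsets(["", "Image"]))
```

With a non-empty first element the second element starts later, so a starting offset of 0 can only come from the first element.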
The hack would be to send this highlight query:

  "highlight": {
    "fields": {
      "labels.en.prefix": {
        "type": "experimental",
        "options": {
        }
      },
      // + additional preferred fallback languages, each with "skip_if_last_matched": true
      "labels.*.prefix": {
        "type": "experimental",
        "options": {
          "skip_if_last_matched": true,
          "return_snippets_and_offsets": true
        }
      }
    }
  }

The new return_snippets_and_offsets option still needs to be implemented. It would output 1:1-XX:YY|Image, blah for the first alias and 0:0-XX:YY|Image, blah for a label. In short, if the snippet string starts with 0 it is a label; anything else is an alias.
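The label-versus-alias rule could be applied on the consumer side roughly as follows. This is a sketch only: the exact output format was not final at this point, so the code relies on nothing more than the `|` separator and the leading number shown above. The function name `classify_snippet` and the concrete offsets in the sample strings are hypothetical.

```python
from typing import Tuple


def classify_snippet(raw: str) -> Tuple[str, str]:
    """Split 'offsets|text' and decide whether the match is a label or an alias."""
    header, separator, text = raw.partition("|")
    if not separator:
        raise ValueError("expected '<offsets>|<text>', got: %r" % raw)
    # Assumption: the number before the first ':' is the starting offset.
    leading = int(header.split(":", 1)[0])
    # Offset 0 means the match is in the first array element, i.e. the label.
    kind = "label" if leading == 0 else "alias"
    return kind, text


# Sample strings with made-up end offsets in place of XX:YY.
print(classify_snippet("0:0-5:5|Image"))
print(classify_snippet("1:1-6:6|Image"))
```

Parsing the leading number rather than testing the first character avoids misreading an offset that merely begins with the digit 0.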

The language chosen will be the first one found in the mapping, which is why it is preferable to explicitly send a list of fallback languages first if this matters.

skip_if_last_matched will make sure to stop early and not scan all languages.
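Putting the pieces together, the highlight section could be assembled like this: the search language first, then explicit fallback languages, then the wildcard field. This is a sketch under the assumptions of this comment. The field names and options are copied from the query above, while the function name `build_label_highlight` and the example language codes are hypothetical.

```python
import json
from typing import Any, Dict, List


def build_label_highlight(primary: str, fallbacks: List[str]) -> Dict[str, Any]:
    """Build the 'highlight' section: primary language, fallbacks, then wildcard."""
    fields: Dict[str, Any] = {}
    # The search language is always tried first, with no extra options.
    fields["labels.%s.prefix" % primary] = {"type": "experimental", "options": {}}
    # Explicit fallbacks are skipped once an earlier field has matched.
    for lang in fallbacks:
        fields["labels.%s.prefix" % lang] = {
            "type": "experimental",
            "options": {"skip_if_last_matched": True},
        }
    # The wildcard catches every other language and reports offsets.
    fields["labels.*.prefix"] = {
        "type": "experimental",
        "options": {
            "skip_if_last_matched": True,
            "return_snippets_and_offsets": True,
        },
    }
    return {"highlight": {"fields": fields}}


print(json.dumps(build_label_highlight("en", ["de", "fr"]), indent=2))
```

Python dictionaries keep insertion order and `json.dumps` preserves it, so the fields are serialized in the order the fallback chain is given.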

Change 371747 merged by Gehel:
[search/highlighter@master] Add new formatter to output offsets text snippets

https://gerrit.wikimedia.org/r/371747

Change 375087 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] [WIP] Fix display for fields which have alias in non-target language

https://gerrit.wikimedia.org/r/375087

Change 377962 had a related patch set uploaded (by DCausse; owner: DCausse):
[search/highlighter@5.3] Add new formatter to output offsets text snippets

https://gerrit.wikimedia.org/r/377962

Change 377983 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/software/elasticsearch/plugins@master] Bump highlighter version to 5.3.2.1

https://gerrit.wikimedia.org/r/377983

Change 377962 merged by jenkins-bot:
[search/highlighter@5.3] Add new formatter to output offsets text snippets

https://gerrit.wikimedia.org/r/377962

Change 375087 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix display for fields which have alias in non-target language

https://gerrit.wikimedia.org/r/375087

Change 377983 merged by Gehel:
[operations/software/elasticsearch/plugins@master] Bump highlighter version to 5.3.2.1

https://gerrit.wikimedia.org/r/377983

Mentioned in SAL (#wikimedia-operations) [2017-09-19T13:04:35Z] <gehel> upgrading elasticsearch plugins on elastic2001 - T173231

Mentioned in SAL (#wikimedia-operations) [2017-09-19T13:17:04Z] <gehel> upgrading elasticsearch plugins on elasticsearch codfw, including cold restart of the cluster - T173231

Mentioned in SAL (#wikimedia-operations) [2017-09-19T13:19:39Z] <gehel> upgrading elasticsearch plugins on relforge, including cold restart of the cluster - T173231

Mentioned in SAL (#wikimedia-operations) [2017-09-20T15:36:29Z] <gehel> upgrading elasticsearch plugins on elasticsearch eqiad, including cold restart of the cluster - T173231