Index Wikidata labels, aliases and descriptions as separate fields in ElasticSearch
Closed, Duplicate · Public

Description

As discussed in https://etherpad.wikimedia.org/p/Wikidata_Meeting_Berlin_10262015, we may want to index labels, aliases and descriptions as separate fields in ElasticSearch. This involves the following:

  • Separate languages into fallback groups (do not include English)
  • For each fallback group, create an index field with labels, aliases and descriptions. For labels and descriptions, store each text together with its language, e.g. for language group de/de-ch/de-at, we would have:
{
  labels: [{language:"de", text:"foo"}, {language:"de-at", text:"foofoo"}],
  aliases: ["fuh", "xyz"],
  descriptions: [{language:"de", text:"foozy"}, {language:"de-ch", text:"foofoo foo zy"}]
}

This structure serves two purposes: searching these fields in the languages suitable for the user, and retrieving labels & descriptions from ElasticSearch instead of making an extra trip to the SQL database.
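The grouping described above can be sketched in Python. This is only an illustration: the `FALLBACK_GROUPS` table, the `build_group_fields` helper and the shape of the input terms are assumptions made for the example, not code from Wikibase or CirrusSearch.

```python
from collections import defaultdict

# Hypothetical fallback groups: group name -> member language codes.
# English is deliberately left out, as proposed in the task description.
FALLBACK_GROUPS = {
    "de": ["de", "de-ch", "de-at"],
}


def build_group_fields(terms, groups=FALLBACK_GROUPS):
    """Build one index field per fallback group.

    `terms` is assumed to look like:
      {"labels": {"de": "foo"}, "descriptions": {...}, "aliases": {"de": ["fuh"]}}
    """
    lang_to_group = {lang: g for g, langs in groups.items() for lang in langs}
    out = defaultdict(lambda: {"labels": [], "aliases": [], "descriptions": []})

    # Labels and descriptions keep their language next to the text.
    for kind in ("labels", "descriptions"):
        for lang, text in terms.get(kind, {}).items():
            group = lang_to_group.get(lang)
            if group is not None:
                out[group][kind].append({"language": lang, "text": text})

    # Aliases are stored as a flat list of strings per group.
    for lang, aliases in terms.get("aliases", {}).items():
        group = lang_to_group.get(lang)
        if group is not None:
            out[group]["aliases"].extend(aliases)

    return dict(out)


entity_terms = {
    "labels": {"de": "foo", "de-at": "foofoo", "en": "foo (en)"},
    "aliases": {"de": ["fuh", "xyz"]},
    "descriptions": {"de": "foozy", "de-ch": "foofoo foo zy"},
}
fields = build_group_fields(entity_terms)
```

Languages that belong to no group (here `en`) are simply skipped, so `fields` contains a single `de` entry with the same shape as the example above.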

Event Timeline

Smalyshev created this task. Nov 3 2015, 1:52 AM
Smalyshev updated the task description.
Smalyshev raised the priority of this task to Needs Triage.
Smalyshev added projects: Wikidata, CirrusSearch.
Smalyshev added a subscriber: Smalyshev.
Restricted Application added a project: Discovery. Nov 3 2015, 1:52 AM
Restricted Application added a subscriber: Aklapper.
Deskana moved this task from Uncategorised to Technical on the CirrusSearch board. Dec 31 2015, 3:47 AM
Restricted Application added a subscriber: StudiesWorld. Dec 31 2015, 3:47 AM
Deskana triaged this task as Normal priority. Dec 31 2015, 3:47 AM
Deskana moved this task from Needs triage to Search on the Discovery board.
Deskana added a subscriber: Deskana.
aude added a comment. Jan 5 2016, 8:53 PM

possible ways of handling multilingual indexing of labels (and other wikibase term types):

Multilingual indexing

multiple fields by language

"page": {
  "dynamic": "false",
  "_all": {
    "enabled": false
  },
  "properties": {
    "description_de": {
      "type": "string"
    },
    "description_en": {
      "type": "string"
    },
    "description_es": {
      "type": "string"
    },
    "label_de": {
      "type": "string"
    },
    "label_en": {
      "type": "string"
    },
    "label_es": {
      "type": "string"
    }
  }
}

pros:

  • ...

cons:

  • multiple fields have the disadvantage that there would potentially be a very large number of them (one for every language * three term types)
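To make the field-count concern concrete, here is a small Python sketch that generates a mapping of the same shape as the one above. The `per_language_mapping` helper and the `alias_` field prefix are assumptions for illustration; the mapping above only shows `label_` and `description_` fields.

```python
# Term types to index; "alias" is an assumed third prefix.
TERM_TYPES = ("label", "description", "alias")


def per_language_mapping(languages, term_types=TERM_TYPES):
    """Return a mapping dict with one string field per (term type, language)."""
    properties = {
        f"{term}_{lang}": {"type": "string"}
        for term in term_types
        for lang in languages
    }
    return {
        "page": {
            "dynamic": "false",
            "_all": {"enabled": False},
            "properties": properties,
        }
    }


mapping = per_language_mapping(["de", "en", "es"])
field_count = len(mapping["page"]["properties"])  # languages * term types
```

Three languages already give nine fields; with, say, 300 language codes (an illustrative number) the same mapping would hold 900 fields.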

Nested type

"page": {
  "dynamic": "false",
  "_all": {
    "enabled": false
  },
  "properties": {
    "descriptions": {
      "type": "nested",
      "properties": {
        "de": {
          "type": "string"
        },
        "en": {
          "type": "string"
        },
        "es": {
          "type": "string"
        }
      }
    },
    "labels": {
      "type": "nested",
      "properties": {
        "de": {
          "type": "string"
        },
        "en": {
          "type": "string"
        },
        "es": {
          "type": "string"
        }
      }
    }
  }
}

pros:

  • ...

cons:

  • nested can be a problem when the nesting gets very large, which it would.
  • elastic seems to have a problem with multiple (nested) fields with the same name, such as 'en' nested under 'descriptions' and 'en' also nested under 'labels'. Unless there is a workaround, we might have to include a prefix for each language field, such as 'label_en' and 'description_en', to disambiguate them.

To start with, this is what I am experimenting with, but I am not convinced this is what we want.
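For reference, a query against the nested layout could be built roughly as follows. This is a hedged sketch: it only constructs the request body as a Python dict using the standard `nested` and `match` query clauses, it has not been run against a cluster, and the `nested_term_query` helper and its `prefix` parameter (modelling the 'label_en' workaround mentioned above) are hypothetical.

```python
def nested_term_query(term_type, language, text, prefix=None):
    """Build a nested query body for one term type and language.

    term_type: nested path, e.g. "labels" or "descriptions".
    prefix: optional disambiguating prefix, e.g. "label" -> field "labels.label_en".
    """
    leaf = f"{prefix}_{language}" if prefix else language
    return {
        "nested": {
            "path": term_type,
            "query": {"match": {f"{term_type}.{leaf}": text}},
        }
    }


# Plain layout: field "labels.de".
plain = {"query": nested_term_query("labels", "de", "foo")}

# Prefixed layout: field "descriptions.description_en".
prefixed = {"query": nested_term_query("descriptions", "en", "foozy", prefix="description")}
```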

Language-specific child documents

Language specific content (terms) could be split up and stored in child documents.

For language fallback, search / lookup could request a handful of languages and not have to retrieve all child documents.

Pros:

  • won't have the large nesting
  • if one label is updated, only one child document needs to be updated vs. the entire document / parent, but in practice with Cirrus, not sure it would work this way.

Cons:

  • somewhat slower to query
  • requires more memory to query the child documents
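The child-document idea can be illustrated in plain Python. The sketch below only models the document shapes and the fallback lookup; it does not use the Elasticsearch parent/child API, and the helper names, the `parent` and `language` keys, and the entity id `Q1` are assumptions for the example.

```python
def split_into_children(entity_id, terms):
    """Return one child document per language, holding that language's terms.

    `terms` is assumed to look like {"labels": {"de": "foo"}, "descriptions": {...}}.
    """
    languages = set()
    for by_lang in terms.values():
        languages.update(by_lang)

    children = []
    for lang in sorted(languages):
        child = {"parent": entity_id, "language": lang}
        for kind, by_lang in terms.items():
            if lang in by_lang:
                child[kind] = by_lang[lang]
        children.append(child)
    return children


def pick_for_fallback(children, chain):
    """Return only the child documents for the languages in the fallback chain, in chain order."""
    by_lang = {child["language"]: child for child in children}
    return [by_lang[lang] for lang in chain if lang in by_lang]


terms = {
    "labels": {"de": "foo", "de-at": "foofoo", "en": "foo"},
    "descriptions": {"de": "foozy"},
}
children = split_into_children("Q1", terms)
wanted = pick_for_fallback(children, ["de-ch", "de"])
```

Here the lookup for the chain de-ch, de touches a single child document (the `de` one) rather than all three.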

I think grouping by fallback group was done to enable more efficient fallback search? I.e. if we want to search something that may have labels in de, de-ch and de-at, we don't want to do three queries. OTOH, if it's a prefix/completion search I don't think we have too much play in terms of conditions, etc. - it's just a string match. So there should be a document that contains these sets I assume.

hoo added a subscriber: hoo. Jan 6 2016, 12:28 AM
aude added a comment. Feb 3 2016, 6:46 PM

i am splitting this task up into:

  1. adding just the labels to start, so we can figure out better how to structure them in elastic and experiment more with rescoring and other aspects.
  2. then add descriptions (should then be straightforward since they have the same structure as labels)
  3. add aliases (they have different structure and should figure into search differently)

with all of these, we need to keep language fallback in mind and probably boost the scoring a bit more for fallback languages.

(imho) it is still nice to be able to search in any language (e.g. search for 東京都 even though my interface language is English and Japanese may not be a fallback, and still find the item for Tokyo). this works now with Special:Search.
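One way to combine both wishes (boosting the fallback chain while still matching any language) is a `multi_match` query with per-field boosts plus a wildcard field pattern. The sketch below only builds the request body; the `label_<lang>` field names follow the per-language layout discussed earlier, and the helper name, the boost values and the `label_*` catch-all are assumptions that would need tuning and testing against a real index.

```python
def fallback_label_query(text, chain, top_boost=3.0, step=0.5, catch_all="label_*"):
    """Build a multi_match body that boosts labels along a language fallback chain.

    chain: language codes, most preferred first (e.g. the user's language, then its fallbacks).
    catch_all: optional wildcard field pattern so labels in other languages still match.
    """
    fields = []
    for position, lang in enumerate(chain):
        # Boost decreases along the chain but never drops below 1.
        boost = max(top_boost - position * step, 1.0)
        fields.append(f"label_{lang}^{boost:g}")
    if catch_all:
        fields.append(catch_all)
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Interface language English, searching for a Japanese label.
query = fallback_label_query("東京都", ["en"])
```

With the chain de-ch, de, en the generated field list is `label_de-ch^3`, `label_de^2.5`, `label_en^2`, followed by the unboosted `label_*` pattern.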

Restricted Application added a project: Discovery-Search. Oct 17 2016, 2:40 PM