
[Task] Spec out how to add extra Wikibase-specific fields in Cirrus / elastic search
Closed, Resolved · Public

Description

For Wikibase, we want to add some extra fields, including multilingual content and non-multilingual content:

  • labels
  • descriptions
  • aliases
  • entity_type

For helping with rescoring search results:

  • sitelink_count
  • label_count
  • statement_count (possibly)

(and potentially simple statements, for example to look up items by identifier; these would actually be simpler to implement, since they are not multilingual content)
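
As a rough illustration of how the count fields above might feed into rescoring, the sketch below builds an Elasticsearch request body that rescores the top hits with a function_score query using field_value_factor on sitelink_count. It is written in Python purely for illustration (Cirrus itself is PHP). The helper name, the match on a 'title' field, the window size, and the weights are all assumptions and not something decided in this task.

```python
def build_sitelink_rescore(query_text, window_size=50):
    """Return a search body that rescores the top hits by sitelink_count.

    Hypothetical helper: the 'title' match and the weights are placeholders.
    """
    return {
        "query": {"match": {"title": query_text}},
        "rescore": {
            # Only the top `window_size` hits (per shard) are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        "field_value_factor": {
                            "field": "sitelink_count",
                            # log2p dampens the effect of very large counts.
                            "modifier": "log2p",
                        }
                    }
                },
                "query_weight": 1.0,
                "rescore_query_weight": 1.0,
            },
        },
    }


body = build_sitelink_rescore("berlin")
```

Keeping the boost in a rescore phase means the more expensive scoring only touches a bounded window of results, which seems like a reasonable starting point for experiments.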

Modify the mapping in Elastic Search to add extra 'fields'

Suggest we use the CirrusSearchMappingConfig hook to add stuff to the mapping, to start with. We can introduce 'field mapping builder' objects that build the mapping data structure for elastic, and as a first step, use these more directly with the hooks. Later, we can perhaps expose an interface in the Content objects that exposes these fields for mapping, and use the 'field mapping builder' objects indirectly.
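
A minimal sketch of what 'field mapping builder' objects could look like, written in Python for illustration only (the real hook handler would be PHP). The class name, the on_mapping_config function, and the chosen field types are assumptions. The sketch only shows the idea of each builder contributing one field to the 'properties' of the page mapping.

```python
class FieldMappingBuilder:
    """Hypothetical builder: contributes the mapping for a single field."""

    def __init__(self, name, es_type, **options):
        self.name = name
        self.es_type = es_type
        self.options = options

    def build(self):
        mapping = {"type": self.es_type}
        mapping.update(self.options)
        return {self.name: mapping}


def on_mapping_config(config, builders):
    """Stand-in for a CirrusSearchMappingConfig hook handler (assumed shape)."""
    properties = config.setdefault("page", {}).setdefault("properties", {})
    for builder in builders:
        properties.update(builder.build())
    return config


config = on_mapping_config(
    {"page": {"dynamic": "false", "properties": {}}},
    [
        FieldMappingBuilder("sitelink_count", "long"),
        FieldMappingBuilder("label_count", "long"),
        FieldMappingBuilder("entity_type", "string", index="not_analyzed"),
    ],
)
```

Because the hook handler only loops over builders, the same builder objects could later be handed out by a Content object instead, without changing the builders themselves.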

Populate the extra fields during indexing

Suggest (as a start) that we use the CirrusSearchBuildDocumentParse hook to have the extra fields indexed when a page is indexed. At some point, we may want to add something to EntityContent (and Content generally) to expose these fields (T78011) and implement a way for the SearchEngine implementations to consume these.

For now, I propose we introduce objects that build these data structures for the extra fields, with a generic interface. We can directly use these objects in the hook handlers, or indirectly use them via EntityContent (or just Content). While we want better integration with EntityContent, it would also be nice to keep the Elastic Search Wikibase code clearly separated so that it is reusable.
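
The generic interface could be sketched roughly as follows. This is Python for illustration only; FieldIndexer, CountIndexer, and on_build_document are hypothetical names, and the entity is a simplified dict rather than a real Wikibase entity object.

```python
from abc import ABC, abstractmethod


class FieldIndexer(ABC):
    """Hypothetical generic interface for populating one extra field."""

    @abstractmethod
    def field_name(self):
        """Name of the field in the search document."""

    @abstractmethod
    def value(self, entity):
        """Value to index for the given (simplified) entity dict."""


class CountIndexer(FieldIndexer):
    """Indexes how many entries the entity has under a given key."""

    def __init__(self, name, entity_key):
        self._name = name
        self._entity_key = entity_key

    def field_name(self):
        return self._name

    def value(self, entity):
        return len(entity.get(self._entity_key, {}))


def on_build_document(doc, entity, indexers):
    """Stand-in for a CirrusSearchBuildDocumentParse hook handler (assumed shape)."""
    for indexer in indexers:
        doc[indexer.field_name()] = indexer.value(entity)
    return doc


entity = {
    "labels": {"en": "a", "de": "b"},
    "sitelinks": {"enwiki": "A", "dewiki": "B", "frwiki": "C"},
}
doc = on_build_document(
    {"title": "Q1"},
    entity,
    [CountIndexer("label_count", "labels"), CountIndexer("sitelink_count", "sitelinks")],
)
```

The indexers know nothing about hooks, so they could be called from a hook handler now and from EntityContent later.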

Multilingual indexing

Multiple fields by language

"page": {
  "dynamic": "false",
  "_all": {
    "enabled": false
  },
  "properties": {
    "description_de": {
      "type": "string"
    },
    "description_en": {
      "type": "string"
    },
    "description_es": {
      "type": "string"
    },
    "label_de": {
      "type": "string"
    },
    "label_en": {
      "type": "string"
    },
    "label_es": {
      "type": "string"
    }
  }
}

pros:

  • ...

cons:

  • there would potentially be a very large number of fields (one for every language × three term types).
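
To make the field count concrete, here is a small Python sketch that generates the per-language properties in the style of the mapping above. The function name is hypothetical, and the 'alias' prefix is an assumption by analogy with 'label' and 'description'.

```python
def per_language_properties(languages, term_types=("label", "description", "alias")):
    """Build one string field per (term type, language) pair."""
    return {
        f"{term_type}_{language}": {"type": "string"}
        for term_type in term_types
        for language in languages
    }


properties = per_language_properties(["de", "en", "es"])
field_count = len(properties)  # languages × term types
```

With three languages this already yields nine fields; the total grows linearly with the number of languages supported.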

Nested type

"page": {
  "dynamic": "false",
  "_all": {
    "enabled": false
  },
  "properties": {
    "descriptions": {
      "type": "nested",
      "properties": {
        "de": {
          "type": "string"
        },
        "en": {
          "type": "string"
        },
        "es": {
          "type": "string"
        }
      }
    },
    "labels": {
      "type": "nested",
      "properties": {
        "de": {
          "type": "string"
        },
        "en": {
          "type": "string"
        },
        "es": {
          "type": "string"
        }
      }
    }
  }
}

pros:

  • ...

cons:

  • nested can be a problem when the nesting gets very large, which it would.
  • Elastic seems to have a problem with multiple (nested) fields with the same name, such as 'en' nested under 'descriptions' and 'en' also nested under 'labels'. Unless there is a workaround, we might have to include a prefix for each language field, such as 'label_en' and 'description_en', to disambiguate them.

To start with, this is what I am experimenting with, but I am not convinced it is what we want.
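
The prefix workaround mentioned in the cons could look roughly like this. It is a Python sketch for illustration; the function name and the 'prefixed' switch are hypothetical.

```python
def nested_term_properties(languages, prefixed=True):
    """Build nested mappings, optionally prefixing each language subfield."""
    # Maps the nested field name to the prefix used for its language subfields.
    term_fields = {"labels": "label", "descriptions": "description"}
    properties = {}
    for field, prefix in term_fields.items():
        properties[field] = {
            "type": "nested",
            "properties": {
                (f"{prefix}_{language}" if prefixed else language): {"type": "string"}
                for language in languages
            },
        }
    return properties


props = nested_term_properties(["de", "en", "es"])
```

With prefixed=True no subfield name is shared between 'labels' and 'descriptions', at the cost of repeating the term type in every subfield name.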

Language-specific child documents

Language-specific content (terms) could be split up and stored in child documents.

For language fallback, search / lookup could request a handful of languages and not have to retrieve all child documents.

Pros:

  • won't have the large nesting
  • if one label is updated, only one child document needs to be updated rather than the entire parent document, though I am not sure it would work this way in practice with Cirrus.

Cons:

  • somewhat slower to query
  • requires more memory to query the child documents
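
A rough sketch of the child-document idea, in Python for illustration only. The functions, the 'parent_id' and 'language' keys, and the simplified terms structure are all assumptions. It shows terms split into one child per language, and a fallback lookup that only needs the children for the requested languages.

```python
def split_into_children(entity_id, terms):
    """Split {"labels": {lang: text}, ...} into one child document per language."""
    children = {}
    for term_type, by_language in terms.items():
        for language, text in by_language.items():
            child = children.setdefault(
                language, {"parent_id": entity_id, "language": language}
            )
            child[term_type] = text
    return list(children.values())


def first_label(children, fallback_chain):
    """Return (language, label) for the first language in the chain that has a label."""
    by_language = {child["language"]: child for child in children}
    for language in fallback_chain:
        child = by_language.get(language)
        if child is not None and "labels" in child:
            return language, child["labels"]
    return None


terms = {
    "labels": {"en": "universe", "de": "Universum"},
    "descriptions": {"en": "everything"},
}
children = split_into_children("Q1", terms)
label = first_label(children, ["fr", "de", "en"])
```

Here the fallback chain skips 'fr', which has no child document, and resolves to the German label.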

Searching

We should introduce an EntitySearch (or TermSearch) interface that SearchEntities and other stuff can use.

We can also introduce a TermLookup implementation based on Elastic for things that use TermLookup.
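
One possible shape for such an interface, sketched in Python for illustration. The method signature is an assumption, and InMemoryEntitySearch is a hypothetical stand-in for an Elastic-backed or SQL-backed implementation; the IDs and labels are made-up sample data.

```python
from abc import ABC, abstractmethod


class EntitySearch(ABC):
    """Hypothetical interface that SearchEntities could depend on."""

    @abstractmethod
    def search(self, text, language, entity_type, limit=10):
        """Return IDs of entities whose label in `language` starts with `text`."""


class InMemoryEntitySearch(EntitySearch):
    """Toy implementation; a real one would query Elastic or the terms table."""

    def __init__(self, documents):
        self._documents = documents

    def search(self, text, language, entity_type, limit=10):
        needle = text.casefold()
        hits = [
            doc["id"]
            for doc in self._documents
            if doc["entity_type"] == entity_type
            and doc["labels"].get(language, "").casefold().startswith(needle)
        ]
        return hits[:limit]


search = InMemoryEntitySearch([
    {"id": "Q1", "entity_type": "item", "labels": {"en": "Berlin", "de": "Berlin"}},
    {"id": "Q2", "entity_type": "item", "labels": {"en": "Bern"}},
    {"id": "P1", "entity_type": "property", "labels": {"en": "Berlin ID"}},
])
results = search.search("ber", "en", "item")
```

Callers depend only on EntitySearch, so the backing store can be swapped without touching them.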

There is some special syntax that can be used when searching with Cirrus, such as insource or incategory.

If we want special syntax for things like labels, then we might want a hook added to Cirrus for this. The existing code that handles the special syntax is very complex, and it would be good if it were factored out and split up to make it easier, nicer, and less bug-prone to hook into. If there can be a generic interface for this syntax, that would be even nicer.
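
A generic interface for keyword syntax might amount to a registry of keyword handlers. The Python sketch below is illustrative only and does not reflect how Cirrus parses queries. The 'inlabel' keyword, the handler signature, and the filter structure it returns are all hypothetical.

```python
import re

# Matches keyword:value or keyword:"quoted value".
KEYWORD_RE = re.compile(r'(\w+):("([^"]*)"|\S+)')


def parse_query(query, handlers):
    """Split a query into remaining free text and filters from known keywords."""
    filters = []

    def replace(match):
        keyword = match.group(1)
        if keyword not in handlers:
            # Unknown keywords are left in the query untouched.
            return match.group(0)
        quoted = match.group(3)
        value = quoted if quoted is not None else match.group(2)
        filters.append(handlers[keyword](value))
        return ""

    remaining = KEYWORD_RE.sub(replace, query)
    return " ".join(remaining.split()), filters


handlers = {"inlabel": lambda value: {"match": {"labels": value}}}
remaining, filters = parse_query(
    'berlin inlabel:"New York" incategory:Cities', handlers
)
```

An extension would then only register a handler for its keyword instead of hooking into the parsing code itself.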

TODO

  • We still need to figure out better how to handle display text.

Event Timeline

Tobi_WMDE_SW raised the priority of this task from to High.
Tobi_WMDE_SW updated the task description.
Tobi_WMDE_SW added a subscriber: aude.
Restricted Application added a subscriber: Aklapper. · Nov 3 2015, 2:36 PM
Tobi_WMDE_SW updated the task description.
Tobi_WMDE_SW set Security to None.
aude claimed this task. · Nov 4 2015, 7:49 PM
aude moved this task from Backlog to Doing on the Wikidata-Sprint-2015-11-03 board.
Lydia_Pintscher moved this task from incoming to in progress on the Wikidata board. · Nov 8 2015, 5:04 PM
aude added a comment. · Edited · Nov 16 2015, 3:02 PM

(moving this to description)

aude added a comment. · Edited · Nov 16 2015, 3:07 PM

Possible next steps are:

  • more detailed implementation (perhaps still exploratory) of field mapping builder objects, for exposing the fields to the mapping
  • more detailed implementation of field indexer objects for populating the fields during indexing

TBD: whether to use them more directly with the available hooks or to add some indirect methods/interface in the Content objects. I think it would take quite a bit more time to get the search code in core and Cirrus to ingest these fields in a nice way other than through the hooks that exist now.

I think the most important next step is to spec out an EntitySearch (or TermSearch) interface, as this is what SearchEntities uses (but the code now is in TermSqlIndex). In exploring this already, it becomes more obvious that we may also want to index meta info like entity type. (although we also know the namespace of the page)

We also need to figure out how to integrate better with searching in Cirrus. (e.g. defining extra search keywords, like 'intitle')

aude renamed this task from [Task] Find the best way to get labels into Elastic to [Task] Spec out how to add extra Wikibase-specific fields in Cirrus / elastic search. · Nov 17 2015, 5:56 PM
aude updated the task description.
aude added a comment. · Nov 19 2015, 1:57 PM

I am going to start implementation with adding statistics fields (e.g. sitelink count) to the index for each page, so that search results can take these into account when ranking and rescoring to help improve usefulness of search results.

With this, we will get some mechanism to add things to Cirrus via the currently available hooks. These fields are non-multilingual so I expect it to be more straightforward to do this and it helps towards resolving T110648.

I will also continue to poke at my code experiments for this task to figure out how we can better handle labels etc. and incorporate them better in search, but I welcome comments on what I have already proposed and done. :)

aude moved this task from Doing to Review on the Wikidata-Sprint-2015-11-17 board. · Nov 19 2015, 1:58 PM
daniel added a subscriber: daniel. · Nov 25 2015, 4:28 PM

@aude For feedback on your code, it would be useful if I could comment inline. I see no good way to do this currently. Do you have an idea?

aude updated the task description. · Dec 1 2015, 10:31 AM
aude updated the task description.
aude added a comment. · Dec 1 2015, 10:39 AM

@aude For feedback on your code, it would be useful if i could comment inline. I see no good way to do this currently, do you have an idea?

I will have to submit specific stuff through the normal code review process, such as https://gerrit.wikimedia.org/r/#/c/256023/

If you have comments on the more general ideas, such as how to structure the mapping or my code generally, then those comments can go here on Phabricator.

thiemowmde closed this task as Resolved. · Dec 17 2015, 10:41 AM
thiemowmde moved this task from Review to Done on the Wikidata-Sprint-2015-12-01 board.
thiemowmde added a subscriber: thiemowmde.