For Wikibase, we want to add some extra fields, including multilingual content and non-multilingual content:
- labels
- descriptions
- aliases
- entity_type
For helping with rescoring search results:
- sitelink_count
- label_count
- statement_count (possibly)
(and potentially simple statements, for example to look up items by identifier. These would actually be simpler to implement, since they are not multilingual content.)
Modify the mapping in Elastic Search to add extra 'fields'
As a start, I suggest we use the CirrusSearchMappingConfig hook to add the extra fields to the mapping. We can introduce 'field mapping builder' objects that build the mapping data structure for Elastic and, as a first step, use them directly in the hook handlers. Later, we could expose an interface on the Content objects that provides these fields for mapping, and use the 'field mapping builder' objects indirectly.
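A rough sketch of what 'field mapping builder' objects could look like, written in Python only to keep it short (the real code would live in the PHP extension). All class and function names here are made up for illustration, the "long" and "not_analyzed" settings are assumptions, and add_fields_to_mapping is just a stand-in for a CirrusSearchMappingConfig hook handler, not its real signature:

```python
from abc import ABC, abstractmethod


class FieldMappingBuilder(ABC):
    """Hypothetical interface: builds the mapping fragment for one field."""

    @abstractmethod
    def build_mapping(self) -> dict:
        """Return {field_name: field_definition} for the 'properties' section."""


class CountFieldMapping(FieldMappingBuilder):
    """Numeric, non-multilingual field such as sitelink_count."""

    def __init__(self, name):
        self.name = name

    def build_mapping(self):
        # Assumption: counts are stored as "long".
        return {self.name: {"type": "long"}}


class ExactStringFieldMapping(FieldMappingBuilder):
    """Non-multilingual string matched exactly, such as entity_type."""

    def __init__(self, name):
        self.name = name

    def build_mapping(self):
        # Assumption: the value is indexed without analysis.
        return {self.name: {"type": "string", "index": "not_analyzed"}}


def add_fields_to_mapping(config, builders):
    """Stand-in for the hook handler: add each builder's field to the mapping."""
    properties = config.setdefault("page", {}).setdefault("properties", {})
    for builder in builders:
        properties.update(builder.build_mapping())
    return config


config = add_fields_to_mapping(
    {"page": {"dynamic": "false", "properties": {}}},
    [
        ExactStringFieldMapping("entity_type"),
        CountFieldMapping("sitelink_count"),
        CountFieldMapping("label_count"),
    ],
)
```

The hook handler stays trivial this way, and the same builders could later be handed out by a Content object instead of being instantiated in the handler.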
Populate the extra fields during indexing
As a start, I suggest we use the CirrusSearchBuildDocumentParse hook to have the extra fields populated when a page is indexed. At some point, we may want to add something to EntityContent (and Content generally) to expose these fields (T78011), and implement a way for the SearchEngine implementations to consume them.
For now, I propose we introduce objects with a generic interface that build the data structures for the extra fields. We can use these objects directly in the hook handlers, or indirectly via EntityContent (or just Content). While we want better integration with EntityContent, it would also be good to keep the Elastic Search code for Wikibase clearly separated so that it is reusable.
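To illustrate the generic interface for populating fields, here is a minimal sketch (in Python for brevity, not the actual extension code). The entity is represented as a plain dict whose shape is an assumption, and DocumentFieldBuilder, EntityTypeField, CountFields and build_extra_fields are hypothetical names:

```python
from typing import Protocol


class DocumentFieldBuilder(Protocol):
    """Hypothetical generic interface for objects that produce index fields."""

    def build_fields(self, entity: dict) -> dict:
        ...


class EntityTypeField:
    def build_fields(self, entity):
        return {"entity_type": entity["type"]}


class CountFields:
    """Counts used for rescoring search results."""

    def build_fields(self, entity):
        return {
            "label_count": len(entity.get("labels", {})),
            "sitelink_count": len(entity.get("sitelinks", {})),
            "statement_count": len(entity.get("statements", [])),
        }


def build_extra_fields(entity, builders):
    """Merge the output of all builders; a hook handler would add this to the document."""
    fields = {}
    for builder in builders:
        fields.update(builder.build_fields(entity))
    return fields


# Assumed, simplified entity shape for the example.
entity = {
    "type": "item",
    "labels": {"en": "Berlin", "de": "Berlin"},
    "sitelinks": {"enwiki": "Berlin"},
    "statements": [],
}
extra_fields = build_extra_fields(entity, [EntityTypeField(), CountFields()])
```

Because the builders only see the entity and return plain data, they do not depend on Cirrus or on EntityContent, which is the separation argued for above.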
Multilingual indexing
Multiple fields by language
"page": {
    "dynamic": "false",
    "_all": {
        "enabled": false
    },
    "properties": {
        "description_de": {
            "type": "string"
        },
        "description_en": {
            "type": "string"
        },
        "description_es": {
            "type": "string"
        },
        "label_de": {
            "type": "string"
        },
        "label_en": {
            "type": "string"
        },
        "label_es": {
            "type": "string"
        }
    }
}
pros:
- ...
cons:
- with multiple fields there would potentially be a very large number of them (one for every language * three term types).
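To make the field count concrete, the per-language fields could be generated along these lines. This is only an illustrative Python sketch; per_language_fields and the "alias" prefix are assumptions, following the label_en / description_en naming used in the mapping above:

```python
TERM_TYPES = ("label", "description", "alias")


def per_language_fields(languages, term_types=TERM_TYPES):
    """Build one string field per (term type, language) pair."""
    return {
        f"{term_type}_{language}": {"type": "string"}
        for term_type in term_types
        for language in languages
    }


fields = per_language_fields(["de", "en", "es"])
field_count = len(fields)  # 3 languages * 3 term types
```

The count grows linearly with the number of languages: three languages give 9 fields, and a hypothetical set of 300 language codes would give 900.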
Nested type
"page": {
    "dynamic": "false",
    "_all": {
        "enabled": false
    },
    "properties": {
        "descriptions": {
            "type": "nested",
            "properties": {
                "de": {
                    "type": "string"
                },
                "en": {
                    "type": "string"
                },
                "es": {
                    "type": "string"
                }
            }
        },
        "labels": {
            "type": "nested",
            "properties": {
                "de": {
                    "type": "string"
                },
                "en": {
                    "type": "string"
                },
                "es": {
                    "type": "string"
                }
            }
        }
    }
}
pros:
- ...
cons:
- nested can be a problem when the nesting gets very large, which it would.
- Elastic seems to have a problem with multiple (nested) fields with the same name, such as 'en' nested under 'descriptions' and 'en' also nested under 'labels'. Unless there is a workaround, we might have to include a prefix for each language field, such as 'label_en' and 'description_en', to disambiguate them.
To start with, this is what I am experimenting with, but I am not convinced it is what we want.
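The prefix workaround for the nested variant could be generated roughly like this. It is a Python sketch for illustration only; nested_term_mapping and its parameters are hypothetical, and whether prefixing is really needed depends on the Elastic behaviour noted in the cons above:

```python
def nested_term_mapping(term_types, languages, prefix_fields=True):
    """Build nested mappings; term_types maps container name -> field prefix."""
    mapping = {}
    for container, prefix in term_types.items():
        mapping[container] = {
            "type": "nested",
            "properties": {
                # With prefixing, 'en' becomes 'label_en' or 'description_en'.
                (f"{prefix}_{language}" if prefix_fields else language): {
                    "type": "string"
                }
                for language in languages
            },
        }
    return mapping


term_types = {"labels": "label", "descriptions": "description"}
prefixed = nested_term_mapping(term_types, ["de", "en", "es"])
unprefixed = nested_term_mapping(term_types, ["de", "en", "es"], prefix_fields=False)
```

With prefix_fields=False this reproduces the mapping shown above, where both containers hold a field called 'en'. With prefixing every leaf field name is unique across containers.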
Language-specific child documents
Language-specific content (terms) could be split up and stored in child documents.
For language fallback, search / lookup could request a handful of languages and not have to retrieve all child documents.
Pros:
- won't have the large nesting
- if one label is updated, only one child document needs to be updated rather than the entire parent document. In practice with Cirrus, I am not sure it would work this way.
Cons:
- somewhat slower to query
- requires more memory to query the child documents
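A sketch of how an entity's terms might be split into one child document per language, and how a language fallback chain would then select only a handful of them. This is illustrative Python with plain dicts. The entity shape, the "Q64_en" style ids, the example fallback chain and both function names are assumptions, and no actual Elastic calls are shown:

```python
def split_into_child_documents(entity_id, entity):
    """Create one child document per language that has any term."""
    labels = entity.get("labels", {})
    descriptions = entity.get("descriptions", {})
    aliases = entity.get("aliases", {})
    languages = sorted(set(labels) | set(descriptions) | set(aliases))
    return [
        {
            "id": f"{entity_id}_{language}",  # hypothetical id scheme
            "parent": entity_id,
            "language": language,
            "label": labels.get(language),
            "description": descriptions.get(language),
            "aliases": aliases.get(language, []),
        }
        for language in languages
    ]


def select_for_fallback(children, fallback_chain):
    """Return only the children for languages in the chain, in chain order."""
    by_language = {child["language"]: child for child in children}
    return [by_language[lang] for lang in fallback_chain if lang in by_language]


entity = {
    "labels": {"en": "Berlin", "de": "Berlin", "es": "Berlín"},
    "descriptions": {"en": "capital of Germany"},
    "aliases": {"de": ["Bundeshauptstadt"]},
}
children = split_into_child_documents("Q64", entity)
wanted = select_for_fallback(children, ["de-at", "de", "en"])
```

Here only the 'de' and 'en' children are touched. The 'es' child is never retrieved, and 'de-at' is skipped because no such child exists.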
Searching
We should introduce an EntitySearch (or TermSearch) interface that SearchEntities and other code can use.
We can also introduce a TermLookup implementation based on Elastic for things that use TermLookup.
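As a rough idea of the shape such an EntitySearch interface might take, here is a Python sketch with a trivial in-memory implementation standing in for the Elastic-backed one. The method signature, the prefix-matching behaviour and the InMemoryEntitySearch class are all assumptions for illustration:

```python
from abc import ABC, abstractmethod


class EntitySearch(ABC):
    """Hypothetical interface that callers such as SearchEntities would depend on."""

    @abstractmethod
    def search(self, text, language, entity_type, limit=10):
        """Return ids of entities whose label in `language` matches `text`."""


class InMemoryEntitySearch(EntitySearch):
    """Stand-in implementation; an Elastic-based one would issue a query instead."""

    def __init__(self, entities):
        self._entities = entities  # {entity_id: entity dict}

    def search(self, text, language, entity_type, limit=10):
        needle = text.lower()
        matches = [
            entity_id
            for entity_id, entity in self._entities.items()
            if entity["type"] == entity_type
            and entity.get("labels", {}).get(language, "").lower().startswith(needle)
        ]
        return sorted(matches)[:limit]


index = InMemoryEntitySearch(
    {
        "Q64": {"type": "item", "labels": {"en": "Berlin"}},
        "Q183": {"type": "item", "labels": {"en": "Germany"}},
        "P17": {"type": "property", "labels": {"en": "country"}},
    }
)
result = index.search("ber", "en", "item")
```

Callers only see the interface, so the backend can be swapped without touching them.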
There is some special syntax that can be used when searching with Cirrus, such as insource or incategory.
If we want special syntax for things like labels, then we might want a hook added to Cirrus for this. The existing code that handles the special syntax is very complex. It would be good to factor it out and split it up so that it is easier and less bug-prone to hook into. A generic interface for this syntax would be even nicer.
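One possible shape for a generic interface to this syntax is a registry of keyword handlers, sketched below in Python. This is not how Cirrus currently parses queries. KeywordRegistry, the label: keyword and the filter dicts the handlers return are hypothetical:

```python
import re

# Matches keyword:value or keyword:"quoted value".
KEYWORD_PATTERN = re.compile(r'(\w+):("([^"]*)"|\S+)')


class KeywordRegistry:
    """Hypothetical registry that extensions could add keyword handlers to."""

    def __init__(self):
        self._handlers = {}

    def register(self, keyword, handler):
        self._handlers[keyword] = handler

    def parse(self, query):
        """Return (remaining free text, filters produced by the handlers)."""
        filters = []

        def replace(match):
            handler = self._handlers.get(match.group(1))
            if handler is None:
                return match.group(0)  # unknown keyword: leave the text alone
            quoted = match.group(3)
            value = quoted if quoted is not None else match.group(2)
            filters.append(handler(value))
            return ""

        remaining = KEYWORD_PATTERN.sub(replace, query)
        return " ".join(remaining.split()), filters


registry = KeywordRegistry()
registry.register("label", lambda value: {"field": "label", "value": value})

text, filters = registry.parse('label:"New York" city')
```

Each extension would register only its own keywords, and anything the registry does not recognise stays in the free-text part of the query.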
TODO
- We still need to work out how best to handle display text.