With the ongoing work on the EntitySchema search feature (T375641), we want to revisit some of the architectural decisions around the search domain in light of the recently agreed-upon decision to modularize Wikibase. The goal is to first get a good understanding of how the system currently works, and then create a proof of concept that allows EntitySchema to use the same search indexing mechanism that all other entity types currently use.
In particular, we should look into the following topics for this proof of concept:
- decouple search indexing from EntityDocument
- rethink the search field definitions interface, since all entity types use a subset of the same field definitions anyway
- rethink the mechanism that extracts an entity's field data for the search index. Currently, Wikibase processes each field individually through the WikibaseCirrusSearch field definitions. The resulting array is then passed through Wikibase into CirrusSearch; see e.g. EntityHandler::getContentDataForSearchIndex. This seems unintuitive and involves too many handovers between different extensions. Ideally, we'd like to simplify this, possibly so that Wikibase only needs to provide a single object to WikibaseCirrusSearch, instead of also having to know how to use the field definitions and how to assemble the data array.
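
To make the last point more concrete, here is a rough sketch of the "single object" idea. It is written in Python purely for illustration (the real code would be PHP), and every name in it (SearchableEntity, SchemaSearchData, FIELD_EXTRACTORS, build_index_document) is hypothetical rather than an existing Wikibase, WikibaseCirrusSearch or CirrusSearch API. The assumption is that the search side owns the field definitions and the assembly of the data array, while the entity side only implements a small contract that does not depend on EntityDocument.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional, Protocol


class SearchableEntity(Protocol):
    """Hypothetical contract: the only thing the entity side must provide."""

    def get_labels(self) -> dict[str, str]: ...

    def get_descriptions(self) -> dict[str, str]: ...


@dataclass
class SchemaSearchData:
    """Hypothetical stand-in for an EntitySchema; it is not an EntityDocument."""

    labels: dict[str, str] = field(default_factory=dict)
    descriptions: dict[str, str] = field(default_factory=dict)

    def get_labels(self) -> dict[str, str]:
        return self.labels

    def get_descriptions(self) -> dict[str, str]:
        return self.descriptions


# Assumption: one shared set of field definitions lives on the search side.
# Each entity type would use a subset of these.
FIELD_EXTRACTORS: dict[str, Callable[[SearchableEntity], object]] = {
    "labels": lambda entity: entity.get_labels(),
    "descriptions": lambda entity: entity.get_descriptions(),
    "label_count": lambda entity: len(entity.get_labels()),
}


def build_index_document(
    entity: SearchableEntity,
    field_names: Optional[Iterable[str]] = None,
) -> dict[str, object]:
    """Assemble the data array on the search side from a single object."""
    names = list(FIELD_EXTRACTORS) if field_names is None else list(field_names)
    return {name: FIELD_EXTRACTORS[name](entity) for name in names}
```

With something along these lines, the caller would only hand over the object, e.g. `build_index_document(SchemaSearchData(labels={"en": "human"}), ["labels", "label_count"])`, and would no longer need to know about the field definitions or how the array is assembled.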
Further reading:
- meeting notes from the search architecture discussion https://docs.google.com/document/d/1eDuEmSzYkZFpq4h6818zQ3E1M614d2QPnOKEFhXUbjc/edit
- entity schema search epic T375641
- entity schema search proof of concept patch chain https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1080044
Guardrails for investigation/prototyping: hard check-in after spending a week on this