[Story] Implement EntitySearch service on top of Elastic
Closed, Resolved, Public


The user-facing problem:

  • When a user adds a property to an item and starts typing the name of the item they want to link to, a query is run against the wb_terms table to find and rank results.
  • The current system uses in-memory sorting and very basic scoring for ranking.
  • This results in the search being suboptimal for users, sometimes displaying things in the wrong order or missing important entries
  • This also leads to high database load, with an increasing number of timeouts (and thus, no search result)

Technical issues:

  • The current implementation of the backend for this query is not performant
  • The current implementation lacks a good mechanism for updating term weights (boosts)

Proposed solution:

  • Searching for entities by label should be backed by EntitySearch (or Cirrus) for large wikis.
  • An SQL-based search should remain as a fallback/baseline.
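
A minimal sketch of the "primary backend with SQL fallback" arrangement described above. It is only an illustration in Python, not Wikibase code: `make_entity_search`, `BackendUnavailable`, `elastic_search` and `sql_search` are hypothetical names, and the two backends are stand-ins that return canned values.

```python
from typing import Callable, List

# A backend takes (typed text, language code, limit) and returns entity IDs.
SearchFn = Callable[[str, str, int], List[str]]


class BackendUnavailable(Exception):
    """Hypothetical error a backend raises when it cannot answer."""


def make_entity_search(primary: SearchFn, fallback: SearchFn) -> SearchFn:
    """Build a search function that prefers `primary` and falls back on failure."""

    def search(text: str, language: str, limit: int = 10) -> List[str]:
        try:
            return primary(text, language, limit)
        except BackendUnavailable:
            # Small wikis, or wikis without a search index, end up here.
            return fallback(text, language, limit)

    return search


# Stand-in backends, for illustration only.
def elastic_search(text: str, language: str, limit: int) -> List[str]:
    raise BackendUnavailable("no search index configured")


def sql_search(text: str, language: str, limit: int) -> List[str]:
    return [f"sql:{language}:{text}"][:limit]


search = make_entity_search(elastic_search, sql_search)
print(search("Berl", "en"))
```

Keeping both backends behind the same call signature means the caller does not need to know which one answered, so the SQL path can also serve as a baseline for comparing result quality.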

This could be implemented using the mechanism proposed in T89733: Allow ContentHandler to expose structured data to the search engine. However, if we don't want to block on that task, it may be simpler to just implement the relevant hook in Cirrus.

Implementation notes, from a brief discussion with Nik:

  • Cirrus already stores and maintains the number of incoming links for all entity pages, using the same standard mechanism it uses for wikitext pages.
  • Labels and aliases should go into new custom fields.
  • Until T89733 is implemented, we can introduce custom fields using the CirrusSearchBuildDocumentParse hook.
  • Support for per-language field values can be spoofed by putting the language code as a prefix into the field value (with a separator, perhaps a pipe or even a line break).
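
The language-prefix idea from the notes above can be sketched as follows. This is an in-memory illustration in Python, not the Cirrus hook itself: the pipe separator is one of the options mentioned, while the field names `labels` and `incoming_links`, the helper names, and the sample entity IDs are assumptions made up for the example. A real index would run a prefix query instead of scanning a list.

```python
from typing import Dict, List, Tuple

SEPARATOR = "|"  # assumed separator between language code and term


def encode_terms(terms_by_language: Dict[str, List[str]]) -> List[str]:
    """Flatten {language: [terms]} into values like 'en|Berlin'."""
    return [
        f"{lang}{SEPARATOR}{term}"
        for lang, terms in terms_by_language.items()
        for term in terms
    ]


def decode_term(value: str) -> Tuple[str, str]:
    """Split 'en|Berlin' back into ('en', 'Berlin'); only the first separator counts."""
    lang, _, term = value.partition(SEPARATOR)
    return lang, term


def prefix_search(docs: List[dict], language: str, typed: str, limit: int = 10) -> List[str]:
    """Match the typed prefix within one language, rank by incoming link count."""
    needle = f"{language}{SEPARATOR}{typed}".casefold()
    hits = [
        doc for doc in docs
        if any(value.casefold().startswith(needle) for value in doc["labels"])
    ]
    hits.sort(key=lambda doc: doc["incoming_links"], reverse=True)
    return [doc["id"] for doc in hits[:limit]]


# Made-up sample documents; IDs and link counts are placeholders.
docs = [
    {"id": "Q1", "labels": encode_terms({"en": ["Berlin"], "de": ["Berlin"]}), "incoming_links": 500},
    {"id": "Q2", "labels": encode_terms({"en": ["Berlin Wall"], "de": ["Berliner Mauer"]}), "incoming_links": 120},
    {"id": "Q3", "labels": encode_terms({"en": ["Bern"]}), "incoming_links": 900},
]

print(prefix_search(docs, "en", "berl"))
print(prefix_search(docs, "de", "berline"))
```

Because the language code is part of the stored value, a prefix match on `en|berl` never hits a German label, which is what lets a single field stand in for per-language fields.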

See also: T99899: [Story] Looking up entities by external identifiers

Event Timeline

Jonas renamed this task from Implement EntitySearch service on top of Elastic to [Story] Implement EntitySearch service on top of Elastic. Aug 13 2015, 4:43 PM

This seems to be done, isn't it?