Create result augmenting procedure for SearchResultsSet
Closed, Resolved (Public)

Description

The API should accept a SearchResultsSet as input. When it finishes, each Result in the set will have been augmented with a K/V map of additional data specific to that Result (empty by default).

This includes creating an API for augmenting each Result with a K/V map.

The API will be configured analogously to how ApiQuery configures props: there will be a map from name to class name, and for each entry the class will be instantiated and asked to process an array of Title objects and return an array of data items. Each data item will then be set as the augment for the corresponding Result.

Extensions will override or extend it by setting the appropriate element in a configuration array.
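
The configuration scheme described above could be sketched roughly as follows. This is a language-neutral illustration in Python, not MediaWiki code: `Title`, `LengthAugmenter`, `AUGMENTERS` and `augment` are hypothetical names invented for the example, and the real API may look quite different.

```python
from typing import Dict, List


class Title:
    """Hypothetical stand-in for a page title; not the MediaWiki class."""

    def __init__(self, page_id: int, text: str) -> None:
        self.page_id = page_id
        self.text = text


class LengthAugmenter:
    """Toy augmenter: returns one data item per title, in input order."""

    def process(self, titles: List[Title]) -> List[int]:
        return [len(t.text) for t in titles]


# Assumed configuration map: augment name -> class to instantiate.
# An extension would add or replace an entry here.
AUGMENTERS: Dict[str, type] = {"length": LengthAugmenter}


def augment(titles: List[Title], config: Dict[str, type]) -> List[dict]:
    # One K/V map per result, empty by default.
    augments: List[dict] = [{} for _ in titles]
    for name, cls in config.items():
        # Instantiate the class and hand it the whole array at once.
        items = cls().process(titles)
        for kv, item in zip(augments, items):
            kv[name] = item
    return augments


titles = [Title(1, "Berlin"), Title(2, "Paris")]
result = augment(titles, AUGMENTERS)
empty = augment(titles, {})
```

With no augmenters configured, every result keeps its empty map, which matches the "empty by default" wording in the description.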

See https://etherpad.wikimedia.org/p/Wikidata_Meeting_Berlin_10262015

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description.
Smalyshev added subscribers: Smalyshev, Deskana, daniel.

@Smalyshev Any thoughts on what we should do with this task?

@Deskana this probably needs some thinking about how exactly to do it. I didn't have time yet to work on it but may look into it next week.

@Deskana I've looked into it and I feel it's not defined well enough to start working on it yet. We probably need to talk again to Wikidata folks to have real functional requirements for it.

Deskana changed the task status from Open to Stalled. Mar 16 2016, 3:47 PM
Deskana added a subscriber: Lydia_Pintscher.

This task is quite underdefined at the minute, and working on a solution is a bit premature. Marking as stalled.

Some thoughts from @daniel and @Lydia_Pintscher on this would be welcome. If you don't have the time to think about it because it's not super high priority for you, that's totally fine too; we can just leave this task here until it bubbles up in priority for one of us. :-)

@Deskana It would be nice if extensions could attach additional information to SearchResults objects. SearchResult should offer a K/V interface for this, similar to get/setExtensionData in ParserOutput. Extensions should be able to operate on an entire SearchResultsSet, to enable batching.

Use cases would be augmenting search results with interwiki links, geo-coordinates, descriptions from wikidata, thumbnails, etc. MobileFrontend has some hacks for doing this, but it would be nice to have this ability built into core's search interface. Core itself doesn't need to implement any functionality, just a bit of infrastructure for attaching information. Core shouldn't have to know what kind of information is attached, where it comes from, or how it may be displayed.

Is there any more information you need to get this going?

@daniel What I am missing here is the use case, i.e. an example of how we would apply such a thing. Without at least one example of the result we want to achieve, it is a bit hard to get the full picture.

@Smalyshev Some use cases:

  1. Descriptions from wikidata in search box suggestions (already possible using API generators with the PageTerms module, but that puts knowledge about extensions into the frontend code)
  2. Thumbnails in Special:Search provided by the PageImages extension
  3. Quality badges (from wikidata or a locally deployed extension) on Special:Search and via the API.

@Smalyshev @daniel I know everyone here is only looking at wikidata-related developments, but I think it would not be the only extension that could benefit from such a formal description. Back in 2015 I did a prototype [0] that applied a post-filter in combination with some augmentation of a result set, but the missing interface makes it more of a hack than an actual implementation.

[0] https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/1233

Some notes after looking into it:

Augmenting SearchResult probably won't work, as ResultSet has no obligation to keep persistent SearchResult objects, and in fact the CirrusSearch implementation just creates them on demand. This means we have to keep the extended data somewhere outside SearchResult. SearchResultSet seems to be a good place; however, matching SearchResult objects to their augmented data and retrieving that data is still a bit challenging. The API may not be as clean as we'd want.

I would propose for now to do something like $resultSet->getExtendedData($result). But there is an issue: in general we don't have much in terms of identity for SearchResult. We have Title, which is generally translatable to a page ID, which is a kind of identity at least within a wiki (we can have multi-wiki search results, but fortunately they are kept in separate result sets). However, there is no obligation, in general, for a search result to have a valid page ID. For such results, it would be hard to match them with their extended data. I would thus propose to ignore such results for this purpose, at least for now.

@Smalyshev How about augmenting the search results on the fly? I imagine something like this:

  • SearchResult gets an addResultExtender( SearchResultExtender $extender ) method.
  • Extensions can register a SearchResultExtender hook handler that adds extenders to a SearchResult using addResultExtender.
  • SearchResultExtender is an interface that declares the method extendSearchResult( SearchResult $result ).
  • For each SearchResult that the SearchResultSet creates, it calls extendSearchResult on every SearchResultExtender that was registered via addResultExtender.

This is a bit elaborate, but has two advantages:

  • We only extend results when actually needed. This may save a lot of cpu cycles.
  • Associating extra data with the SearchResult via some ID is left entirely to the SearchResultExtender implementation.

A SearchResultExtender implementation may want to do batched pre-fetching, either based on the ResultSet or some other information. We'll have to think about which information needs to be passed to the SearchResultExtender hook to allow this.
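
A minimal sketch of the on-the-fly extender idea, written in Python purely for illustration. All class and method names here are hypothetical renderings of the proposal, and the sketch assumes the extenders are registered on the result set, which then applies them to each result it creates on demand; the proposal as worded leaves that placement open.

```python
class SearchResult:
    """Hypothetical result object carrying a K/V map of extra data."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.extra = {}


class SearchResultExtender:
    """Interface: implementations attach data to a single result."""

    def extend_search_result(self, result: SearchResult) -> None:
        raise NotImplementedError


class UppercaseExtender(SearchResultExtender):
    """Toy extender used only to show the call flow."""

    def extend_search_result(self, result: SearchResult) -> None:
        result.extra["upper"] = result.title.upper()


class ResultSet:
    """Assumed result set that creates SearchResult objects on demand."""

    def __init__(self, titles) -> None:
        self._titles = list(titles)
        self._extenders = []

    def add_result_extender(self, extender: SearchResultExtender) -> None:
        self._extenders.append(extender)

    def __iter__(self):
        for title in self._titles:
            result = SearchResult(title)  # fresh object on every pass
            for extender in self._extenders:
                extender.extend_search_result(result)
            yield result


result_set = ResultSet(["a", "b"])
result_set.add_result_extender(UppercaseExtender())
extras = [r.extra for r in result_set]
```

Results are only extended when a consumer actually iterates, and how extra data is matched to a result is left to each extender, which are the two advantages listed above.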

Alternatively, and much simpler:

Iterate over the ResultSet to fill an array with SearchResult objects. Call a hook that can modify the objects in the array.

This may seem inelegant, but we are basically building such an array anyway when generating output, especially in API modules. Doing this also when generating HTML output shouldn't hurt too much.

The extender should operate on the result set, not on individual results, or at least it should have this capability. Otherwise, for page props for example, it would be one DB query per result instead of one DB query to fetch all props. So I'm not sure how to make this work with hooks.
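
The batching concern can be illustrated with a small Python sketch. `FakePropsDb` is a hypothetical in-memory stand-in that only counts how many queries are issued; it does not model any real MediaWiki database API.

```python
class FakePropsDb:
    """Hypothetical page-props store that counts queries."""

    def __init__(self, rows: dict) -> None:
        self.rows = rows
        self.queries = 0

    def fetch(self, page_ids) -> dict:
        # Each call stands for one DB round trip.
        self.queries += 1
        return {pid: self.rows[pid] for pid in page_ids if pid in self.rows}


def per_result(db: FakePropsDb, page_ids) -> dict:
    # One query per result: what a per-result extender would do.
    return {pid: db.fetch([pid]).get(pid) for pid in page_ids}


def batched(db: FakePropsDb, page_ids) -> dict:
    # One query for the whole result set.
    found = db.fetch(page_ids)
    return {pid: found.get(pid) for pid in page_ids}


rows = {1: "badge", 2: "thumb"}
ids = [1, 2, 3]

db_single = FakePropsDb(rows)
single = per_result(db_single, ids)

db_batch = FakePropsDb(rows)
batch = batched(db_batch, ids)
```

Both approaches return the same data, but the per-result version issues one query per page ID while the batched version issues a single query.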

> Iterate over the ResultSet to fill an array with SearchResult objects. Call a hook that can modify the objects in the array.

What would I then do with these modified objects? The consumers still expect a SearchResultSet and call next() on it, and SearchResultSet::next() creates a new SearchResult object for CirrusSearch. We could of course go and rewrite all the places where we use SearchResultSet to create an array, run the hook, and only then iterate over the array, but that looks like a lot of work and kind of defeats the purpose of having an iterator API in SearchResultSet if we end up recommending not using it at all.

Yes, true.

If the extender should operate on the ResultSet, is it possible to iterate over the ResultSet non-destructively? It seems to me that we need a manifest representation of the ResultSet at some stage for this to work.

ResultSet has rewind(), which implies the iterator is re-entrant and thus can be iterated more than once. In practice, the CirrusSearch implementation allows rewinding without a problem, as far as I can see, and DB implementations that support dataSeek should be able to support it too. So I would assume iterating multiple times over ResultSet is fine (though of course we want to avoid doing it too many times).
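
To make the rewind argument concrete, here is a small Python sketch of a next()/rewind() style result set that is read once to collect titles and then handed back, rewound, to its normal consumer. The class and the `collect_titles` helper are hypothetical and only mimic the iteration contract discussed here.

```python
class ResultSet:
    """Hypothetical result set with a next()/rewind() iteration contract."""

    def __init__(self, titles) -> None:
        self._titles = list(titles)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._titles):
            return None  # end of results
        title = self._titles[self._pos]
        self._pos += 1
        return {"title": title}  # a new object on every call

    def rewind(self) -> None:
        self._pos = 0


def collect_titles(result_set: ResultSet) -> list:
    # First pass: gather titles for the augmenters.
    result_set.rewind()
    titles = []
    result = result_set.next()
    while result is not None:
        titles.append(result["title"])
        result = result_set.next()
    # Leave the set rewound so regular consumers see every result.
    result_set.rewind()
    return titles


result_set = ResultSet(["Berlin", "Paris"])
first_pass = collect_titles(result_set)
first_after_rewind = result_set.next()
```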

Since I assume Title objects are the actually interesting piece for most augmenters, I am considering this route:

  1. Get Title[] from result set
  2. Apply all augmenters to it (I haven't decided yet whether I want to give them Title[] as such or ResultSet with retrievable/cacheable Title[])
  3. Store the augmenting data in ResultSet, somehow linking them to Titles (probably by page ID, but other ideas may work too)
  4. Provide ResultSet::getExtendedData($result) for use by clients
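
The four steps above might fit together roughly as in the following Python sketch. It assumes augment data is keyed by page ID and that results without a valid page ID (modelled here as 0) are simply skipped, as suggested earlier in the thread. Every name in it (`extract_titles`, `set_augmented_data`, `get_extended_data`, `augment_all`) is hypothetical and not taken from the merged patches.

```python
class Title:
    """Hypothetical title; page_id 0 means "no valid page ID"."""

    def __init__(self, page_id: int, text: str) -> None:
        self.page_id = page_id
        self.text = text


class SearchResult:
    def __init__(self, title: Title) -> None:
        self.title = title


class ResultSet:
    def __init__(self, titles) -> None:
        self._titles = list(titles)
        self._augmented = {}  # augment name -> {page_id: data}

    def results(self):
        # Results are created on demand, so data cannot live on them.
        for title in self._titles:
            yield SearchResult(title)

    def extract_titles(self) -> dict:
        # Step 1: Title[] keyed by page ID, skipping invalid IDs.
        return {t.page_id: t for t in self._titles if t.page_id > 0}

    def set_augmented_data(self, name: str, data: dict) -> None:
        # Step 3: store augment data inside the result set.
        self._augmented[name] = data

    def get_extended_data(self, result: SearchResult) -> dict:
        # Step 4: look up data for a result by its page ID.
        pid = result.title.page_id
        return {n: d[pid] for n, d in self._augmented.items() if pid in d}


def augment_all(result_set: ResultSet, augmenters: dict) -> None:
    titles = result_set.extract_titles()
    for name, augmenter in augmenters.items():
        # Step 2: each augmenter sees all titles at once.
        result_set.set_augmented_data(name, augmenter(titles))


def title_length(titles: dict) -> dict:
    return {pid: len(t.text) for pid, t in titles.items()}


result_set = ResultSet([Title(7, "Berlin"), Title(0, "Missing")])
augment_all(result_set, {"length": title_length})
extended = [result_set.get_extended_data(r) for r in result_set.results()]
```

The result without a valid page ID gets an empty map rather than an error, so clients can call the lookup uniformly.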

Change 285544 had a related patch set uploaded (by Smalyshev):
[WIP] Infrastructure for augmenting search results

https://gerrit.wikimedia.org/r/285544

Smalyshev changed the task status from Stalled to Open. Jun 24 2016, 11:13 PM

Change 296483 had a related patch set uploaded (by Smalyshev):
Augmenting search results

https://gerrit.wikimedia.org/r/296483

Change 285544 merged by jenkins-bot:
Infrastructure for augmenting search results

https://gerrit.wikimedia.org/r/285544

Change 296483 merged by jenkins-bot:
Augmenting search results

https://gerrit.wikimedia.org/r/296483

@Smalyshev Do the above merged patches resolve this task? It appears so, but I want to be sure before closing.