
Review MediaSearch profile for integration into CirrusSearch
Closed, Resolved · Public · 8 Estimated Story Points

Description

As the maintainers of search, we want a coherent and cohesive code base to make maintenance easier in the long term.

Since the media search profile seems very cohesive with CirrusSearch, we should move that code into the CirrusSearch extension. We need to review that code more closely and see what (if anything) needs to be adapted to be more cohesive. In particular, we probably want to replace external API calls with ES queries (using new sister indices) and have a look at improving ranking (or at least ensure that the prerequisites are there to allow tuning of ranking).

Moving code to CirrusSearch should only be done after analysis.

AC:

  • analysis of media search code
  • list of things that need to be addressed created (as sub tasks to this ticket)

Links to some other MediaSearch related tickets:
T262522, T258054, T258053, T258052, T252692, T258055

Event Timeline

Restricted Application added a subscriber: Aklapper.
CBogen renamed this task from Review MediaSearch profile and integrate it into CirrusSearch to Review MediaSearch profile for integration into CirrusSearch. Aug 24 2020, 5:30 PM
CBogen updated the task description.
CBogen set the point value for this task to 8. Aug 24 2020, 5:33 PM
dcausse removed dcausse as the assignee of this task. Edited · Sep 17 2020, 9:18 AM
dcausse moved this task from In Progress to Waiting on the Discovery-Search (Current work) board.
dcausse subscribed.

There is still work being done/discussed on this query builder, so it seems too early to fully address this task.

Here are the few points identified/discussed so far.

First are the technical aspects that we will have to address anyway before using the builder more broadly:

  • wikibase dependencies: in order to use the builder on every wiki we will have to cut the dependencies to wikibase so that it can be moved to the CirrusSearch extension:
    • language fallbacks used to select the proper description fields to query
    • nit: implicit cyclic dependency via field names (WikibaseMediaInfo depends on Cirrus but Cirrus will depend on field names maintained by WikibaseMediaInfo). Can be addressed by making the builder abstract enough that the field names are owned by the configuration.
  • wikidata search API call
    • this causes extra latency that we might want to address if we want to expose this functionality more broadly (esp. if applied on every SpecialSearch request via the Multimedia side bar)
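
To illustrate the "field names are owned by the configuration" idea, here is a minimal Python sketch, not the actual builder. The names `FieldConfig` and `build_text_query`, and the field names in the usage example, are hypothetical; the language fallback chain is assumed to be resolved by the caller so that the builder itself needs nothing from wikibase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FieldConfig:
    """Hypothetical configuration object that owns all field names."""
    text_fields: Dict[str, float]   # field name -> boost
    description_template: str       # per-language field pattern, e.g. "descriptions.{lang}"


def build_text_query(term: str, languages: List[str], config: FieldConfig) -> dict:
    """Build a multi_match query using only field names found in `config`.

    `languages` is the already-resolved fallback chain (most specific first),
    so the builder does not need to know how fallbacks are computed.
    """
    fields = [f"{name}^{boost}" for name, boost in config.text_fields.items()]
    fields += [config.description_template.format(lang=lang) for lang in languages]
    return {"multi_match": {"query": term, "fields": fields}}


# Usage with made-up field names:
config = FieldConfig(text_fields={"title": 3.0}, description_template="descriptions.{lang}")
query = build_text_query("cat", ["fr", "en"], config)
```

With this shape, the extension that maintains the index fields supplies the `FieldConfig`, and the builder has no implicit knowledge of them.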

Here are other aspects related to the quality of the results:

  • real "query rewriting/expansion": this is the first time we have a concrete use case for this. The current approach is to reuse wikidata APIs to do a "concept match" and then combine its results into classic elasticsearch queries
    • The wikidata search APIs are not designed for this: one is for type-ahead lookups, the other is for high-recall fulltext searches. There is certainly room for improvement here by designing something specifically made to feed a query reformulation.
    • some research on existing work around concept based query expansion might be worth looking into
  • Balancing text and concept matches is proving to be tricky; here are a few ideas that are being tried/discussed
    • flatten the text scores using sqrt
    • quickly experiment using a naive tokenizer in PHP and see if the number of tokens can be used to normalize the concept scores and/or the text field scores
    • possibly write a new elasticsearch query that can combine and use in its scoring formula the number of tokens of its input query
    • use the token_count_router to route to a different set of queries based on the number of tokens
  • quality of the depict statements: in general, when scoring docs we want to know the importance of the searched term/concept relative to the other searched terms/concepts (the weight of the term; for a concept we approximate this using the rank and text snippet returned by the wikidata search API), but also the importance of the searched terms/concepts within the doc being scored. For terms, the frequency in the doc is used; for concepts, we could use the "prominence" (preferred rank) of the depicted concept. A normalization factor could also be used (matching one depict out of one is perhaps better than matching 1 out of 10).
  • difficulty assessing the quality of the improvements: the current tooling offered by relevance forge is not designed for image search. Better tooling and judgement lists might be required to drive the changes to this builder
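
The balancing ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring used by MediaSearch: all function names are hypothetical, the regex tokenizer stands in for a real analyzer, and the 2x bonus for preferred-rank statements is an arbitrary placeholder.

```python
import math
import re


def naive_token_count(query: str) -> int:
    """Count word-like tokens; a crude stand-in for a real analyzer."""
    return len(re.findall(r"\w+", query))


def flatten_text_score(score: float) -> float:
    """Dampen large text scores with a square root."""
    return math.sqrt(max(score, 0.0))


def depicts_weight(matched: int, total: int, preferred: bool) -> float:
    """Weight a concept match by how much of the doc's depicts it covers.

    Matching 1 depict out of 1 weighs more than 1 out of 10. The 2x bonus
    for a preferred-rank ("prominent") statement is an arbitrary choice.
    """
    if matched <= 0 or total <= 0:
        return 0.0
    coverage = matched / total
    return coverage * (2.0 if preferred else 1.0)


def combined_score(text_score: float, concept_score: float, query: str,
                   matched: int, total: int, preferred: bool) -> float:
    """Combine a length-normalized, flattened text score with a concept score."""
    tokens = max(naive_token_count(query), 1)
    text_part = flatten_text_score(text_score / tokens)
    concept_part = concept_score * depicts_weight(matched, total, preferred)
    return text_part + concept_part
```

Whether to divide by the token count before or after the square root, and how to weight the two parts against each other, are exactly the open questions listed above; the sketch picks one option only to make the moving parts concrete.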

I'm moving this task to "waiting" while the SD team is still experimenting with ideas on the existing builder.

possibly write a new elasticsearch query that can combine and use in its scoring formula the number of tokens of its input query

Actually, we can do this with the ltr explore query today.

{
  "query": {
    "match_explorer": {
      "query": { "match": { "text.plain": { "query": "example words to count" } } },
      "type": "unique_terms_count"
    }
  },
  "size": 10,
  "_source": false
}
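
For anyone who wants to try this from a script, here is a minimal Python sketch that builds the same request body. The helper name `unique_terms_count_body` is hypothetical; only the JSON structure is taken from the example above, and sending the request to a cluster is deliberately left out.

```python
import json


def unique_terms_count_body(field: str, text: str, size: int = 10) -> dict:
    """Build a search body wrapping a match query in match_explorer."""
    return {
        "query": {
            "match_explorer": {
                # The inner query whose terms are being examined.
                "query": {"match": {field: {"query": text}}},
                # Statistic to compute, as in the example above.
                "type": "unique_terms_count",
            }
        },
        "size": size,
        "_source": False,
    }


body = unique_terms_count_body("text.plain", "example words to count")
print(json.dumps(body, indent=2))
```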

Another thing to consider would be T258055: [L] [SPIKE] Investigate traversing entities tree to include more entities with more detail & feedback is very much welcome (scroll to last post for findings)

@dcausse @matthiasmullie can we identify which tickets/experiments need to be done before the search team can get back to this task? I'd like to define a point when it's safe to make the handover.

The relevant tickets are already in the ticket description; I can't think of any others not already listed.

Gehel triaged this task as High priority. Oct 28 2020, 1:28 PM

In addition to the previous recommendations:

  • T252692 (search syntax and features) would benefit from completing the refactoring started in T185108. The MediaSearch query builder will be able to use parts of this work (the cirrus query AST) but will have to implement its own query components. Depending on how both tasks advance, there will be consolidation work to be done.
  • T258055 (traversing the wikidata graph) is still being evaluated and its scope is relatively large, so I don't think it's worth spending too much time on possible solutions, but I believe this work might be integrated in a component that extracts a subset of wikidata (possibly denormalized).

Summarizing what has been said so far I see:

  • T268648: MediaSearch should use a dedicated service/query for doing its concept-lookup instead of the wikidata search API. This should encompass the work regarding rewriting (T258053 & T258052). But also the traversal of the wikidata graph (explored in T258055) even though this one is more complicated. Overall everything that requires joining wikidata or exploring the structure of the wikidata graph should be doable by constructing dedicated datasets offline.
  • Work on refactoring the cirrus parsing can happen concurrently/later and can be re-prioritized depending on the possible blocking points that might be encountered on T252692. Supporting more complex syntax in the MediaSearch query builder will probably involve writing code that would have been better placed in CirrusSearch but we agreed to consolidate this when resuming work on refactoring the query building logic (T185108).
  • If I understood correctly, there are currently no plans to deploy MediaSearch on wikis other than commons, and thus there is no longer the need to move this code out of the MediaInfo extension to break the wikibase dependencies (language fallbacks). Therefore I don't think it's worth creating a ticket for this yet.
  • Ranking: I think a lot of exploratory work has been done already by the SD team (e.g. T258054 and T262522). At this point I don't have specific recommendations to make, except that the tooling around relevancy should be adapted to the specifics of MediaSearch; I created T268653 for this.
  • Feature engineering: I think the features that the SD team wanted to explore have been captured in the tickets referenced in the description. There might be other features interesting for MediaSearch, but without a better understanding of how to assess the quality of the results I think it is too early to discuss this.

Thanks for this summary @dcausse!

  • T268648: MediaSearch should use a dedicated service/query for doing its concept-lookup instead of the wikidata search API. This should encompass the work regarding rewriting (T258053 & T258052). But also the traversal of the wikidata graph (explored in T258055) even though this one is more complicated. Overall everything that requires joining wikidata or exploring the structure of the wikidata graph should be doable by constructing dedicated datasets offline.

Does this mean that we should wait to do T258053 and T258052 until T268648 is complete? Or should we continue work on those tickets and then refactor once T268648 is complete?

  • Work on refactoring the cirrus parsing can happen concurrently/later and can be re-prioritized depending on the possible blocking points that might be encountered on T252692. Supporting more complex syntax in the MediaSearch query builder will probably involve writing code that would have been better placed in CirrusSearch but we agreed to consolidate this when resuming work on refactoring the query building logic (T185108).

Should we put a note in T185108 about this so we don't forget whenever we return to it?

  • If I understood correctly, there are currently no plans to deploy MediaSearch on wikis other than commons, and thus there is no longer the need to move this code out of the MediaInfo extension to break the wikibase dependencies (language fallbacks). Therefore I don't think it's worth creating a ticket for this yet.

This is correct. One day we might expand beyond Commons to search the fair use images stored on other wikis, but there are no plans for this yet and I think we can create a ticket if that becomes a use case we need to support.

Does this mean that we should wait to do T258053 and T258052 until T268648 is complete? Or should we continue work on those tickets and then refactor once T268648 is complete?

The features you planned to explore/implement should not be blocked on T268648. We should just coordinate while working on these tickets (i.e. reviews) so that we can discuss & adapt.

  • Work on refactoring the cirrus parsing can happen concurrently/later and can be re-prioritized depending on the possible blocking points that might be encountered on T252692. Supporting more complex syntax in the MediaSearch query builder will probably involve writing code that would have been better placed in CirrusSearch but we agreed to consolidate this when resuming work on refactoring the query building logic (T185108).

Should we put a note in T185108 about this so we don't forget whenever we return to it?

Good idea, I've added a note.