Review MediaSearch profile for integration into CirrusSearch
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To

Authored By

	Gehel
	Aug 12 2020, 3:20 PM

Description

As the maintainers of search, we want a coherent and cohesive code base to make maintenance easier in the long term.

Since the media search profile seems very cohesive with CirrusSearch, we should move that code into the CirrusSearch extension. We need to review that code more closely and see what (if anything) needs to be adapted to be more cohesive. In particular, we probably want to replace external API calls with ES queries (using new sister indices) and look at improving ranking (or at least ensure that the prerequisites are in place to allow ranking to be tuned).

Moving code to CirrusSearch should only be done after analysis.

AC:

analysis of media search code
list of things that need to be addressed created (as sub tasks to this ticket)

Links to some other MediaSearch related tickets:
T262522, T258054, T258053, T258052, T252692, T258055

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T257043 [Epic] Integrate MediaSearch into CirrusSearch and align it with the current Search best practices
		Resolved		dcausse	T260251 Review MediaSearch profile for integration into CirrusSearch

Event Timeline

Gehel created this task. · Aug 12 2020, 3:20 PM

Restricted Application added a project: Structured-Data-Backlog. · Aug 12 2020, 3:20 PM

Restricted Application added a subscriber: Aklapper.

Gehel updated the task description. · Aug 12 2020, 3:24 PM

CBogen edited projects, added SDAW-MediaSearch (MediaSearch-ReleaseCandidate); removed SDAW-MediaSearch. · Aug 12 2020, 3:25 PM

CBogen added a subscriber: Cparle.

CBogen moved this task from Triage to Tracking on the Structured-Data-Backlog board. · Aug 12 2020, 3:27 PM

Gehel updated the task description. · Aug 13 2020, 7:10 PM

CBogen moved this task from needs triage to Current work on the Discovery-Search board. · Aug 13 2020, 7:11 PM

CBogen edited projects, added Discovery-Search (Current work); removed Discovery-Search.

CBogen renamed this task from Review MediaSearch profile and integrate it into CirrusSearch to Review MediaSearch profile for integration into CirrusSearch. · Aug 24 2020, 5:30 PM

CBogen updated the task description.

CBogen set the point value for this task to 8. · Aug 24 2020, 5:33 PM

CBogen moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

CBogen added a parent task: T257043: [Epic] Integrate MediaSearch into CirrusSearch and align it with the current Search best practices. · Aug 26 2020, 10:16 PM

Gehel updated the task description. · Sep 1 2020, 11:07 AM

CBogen added a subscriber: matthiasmullie. · Sep 10 2020, 5:42 PM

Ramsey-WMF mentioned this in T262522: Strike a decent balance between fulltext matches & statement matches. · Sep 10 2020, 5:50 PM

CBogen updated the task description. · Sep 10 2020, 5:51 PM

dcausse claimed this task. · Sep 14 2020, 10:14 AM

dcausse moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

There is still work being done/discussed on this query builder, so it seems too early to fully address this task.

Here are the few points identified/discussed so far.

First are the technical aspects that we will have to address anyway before using the builder more broadly:

wikibase dependencies: in order to use the builder on every wiki we will have to cut the dependencies on wikibase so that it can be moved to the CirrusSearch extension:
- language fallbacks used to select the proper description fields to query
- nit: implicit cyclic dependency via field names (WikibaseMediaInfo depends on Cirrus but Cirrus will depend on field names maintained by WikibaseMediaInfo). This can be addressed by making the builder abstract enough that the field names are owned by the configuration.
wikidata search API call:
- this causes extra latency that we might want to address if we want to expose this functionality more broadly (esp. if applied on every SpecialSearch request via the Multimedia side bar)

Here are other aspects related to the quality of the results:

real "query rewriting/expansion": this is the first time we have a concrete use case for this. The current approach is to reuse the wikidata APIs to do a "concept match" and then combine its results into a classic elasticsearch query.
- The wikidata search APIs are not designed for this: one is for type-ahead lookups, the other is for high-recall fulltext searches. There is certainly room for improvement here by designing something specifically made to feed a query reformulation.
- some research on existing work around concept-based query expansion might be worth looking into
Balancing text and concept matches is proving to be tricky; here are a few ideas that are being tried/discussed:
- flatten the text scores using sqrt
- quickly experiment using a naive tokenizer in PHP and see if the number of tokens can be used to normalize the concept scores and/or the text field scores
- possibly write a new elasticsearch query that can combine and use in its scoring formula the number of tokens of its input query
- use the token_count_router to route to a different set of queries based on the number of tokens
quality of the depict statements: in general, when scoring docs we want to know the relative importance of each searched term/concept with respect to the other searched terms/concepts (the weight of the term; for concepts we approximate this using the rank and text snippet returned by the wikidata search API), but also the importance of the searched terms/concepts within the doc being scored. For terms, the frequency in the doc is used; for concepts we could use the "prominence" (preferred rank) of the depicted concept. A normalization factor could also be used (matching one depict out of one is perhaps better than matching 1 out of 10).
difficulty in assessing the quality of the improvements: the current tooling offered by relevance forge is not designed for image search. Better tooling and judgement lists might be required to drive the changes to this builder.
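To make the scoring ideas above more concrete, here is a minimal Python sketch of three of them: a naive tokenizer, a text score that is normalized by token count and then flattened with sqrt, and a depicts score that rewards preferred rank and penalizes files with many depicts statements. All function names, the boost value and the exact formulas are assumptions chosen for illustration; none of this is the MediaSearch builder's actual code.

```python
import math
import re


def naive_tokens(query: str) -> list[str]:
    """Split on non-word characters; a crude stand-in for a real analyzer."""
    return [t for t in re.split(r"\W+", query.lower()) if t]


def flatten_text_score(score: float) -> float:
    """Dampen large full-text scores with a square root."""
    return math.sqrt(max(score, 0.0))


def normalized_text_score(score: float, query: str) -> float:
    """Divide the text score by the token count, then flatten it.

    Assumption: full-text scores grow roughly with the number of query
    tokens, so dividing by that number makes them easier to compare
    with a per-concept score.
    """
    n_tokens = max(len(naive_tokens(query)), 1)
    return flatten_text_score(score / n_tokens)


def depicts_score(preferred: bool, total_depicts: int,
                  preferred_boost: float = 2.0) -> float:
    """Score one matched depicts statement (hypothetical formula).

    A preferred-rank statement gets a boost, and the result shrinks as
    the file carries more depicts statements, so matching 1 out of 1
    beats matching 1 out of 10.
    """
    weight = preferred_boost if preferred else 1.0
    return weight / math.sqrt(max(total_depicts, 1))


if __name__ == "__main__":
    print(naive_tokens("Example words, to count!"))
    print(normalized_text_score(16.0, "a b c d"))
    print(depicts_score(True, 1), depicts_score(True, 10))
```

Whether the token count should normalize the text score, the concept score, or both is exactly the open question in the list; the sketch picks the text score only to keep the example short.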

I'm moving this task to "waiting" while the SD team is still experimenting with ideas on the existing builder.

possibly write a new elasticsearch query that can combine and use in its scoring formula the number of tokens of its input query

Actually, we can do this with the ltr explore query today.

{
  "query": {
    "match_explorer": {
      "query": { "match": { "text.plain": { "query": "example words to count" } } },
      "type": "unique_terms_count"
    }
  },
  "size": 10,
  "_source": false
}
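For experimenting from a script, the request body above can be built programmatically. The following Python sketch only reproduces the JSON shown in this comment; the helper name and its default field are assumptions, and sending the request to a cluster is left out.

```python
import json


def unique_terms_count_query(text: str, field: str = "text.plain") -> dict:
    """Build the match_explorer request body shown above for any input text.

    The helper name and the default field are illustrative assumptions.
    """
    return {
        "query": {
            "match_explorer": {
                # Inner query whose analyzed terms are counted.
                "query": {"match": {field: {"query": text}}},
                "type": "unique_terms_count",
            }
        },
        "size": 10,
        "_source": False,
    }


body = unique_terms_count_query("example words to count")
print(json.dumps(body, indent=2))
```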

Another thing to consider would be T258055: [L] [SPIKE] Investigate traversing entities tree to include more entities. More detail and feedback are very welcome (scroll to the last post for findings).

@dcausse @matthiasmullie can we identify which tickets/experiments need to be done before the search team can get back to this task? I'd like to define a point when it's safe to make the handover.

The relevant tickets are already in the ticket description; I can't think of any others not already listed.

CBogen moved this task from Waiting to Blocked/Waiting on the Discovery-Search (Current work) board. · Sep 28 2020, 5:31 PM

CBogen moved this task from Blocked/Waiting to Ready for Dev -- SWE on the Discovery-Search (Current work) board. · Oct 26 2020, 6:22 PM

Gehel triaged this task as High priority. · Oct 28 2020, 1:28 PM

dcausse claimed this task. · Nov 23 2020, 3:43 PM

dcausse moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

In addition to the previous recommendations:

T252692 (search syntax and features) would benefit from completing the refactoring started in T185108. The MediaSearch query builder will be able to use parts of this work (the cirrus query AST) but will have to implement its own query components. Depending on how both tasks advance, there will be consolidation work to be done.
T258055 (traversing the wikidata graph) is still being evaluated and its scope is relatively large, so I don't think it's worth spending too much time on possible solutions, but I believe this work might be integrated into a component that extracts a subset of wikidata (possibly denormalized).

Summarizing what has been said so far I see:

T268648: MediaSearch should use a dedicated service/query for doing its concept-lookup instead of the wikidata search API. This should encompass the work regarding rewriting (T258053 & T258052). But also the traversal of the wikidata graph (explored in T258055) even though this one is more complicated. Overall everything that requires joining wikidata or exploring the structure of the wikidata graph should be doable by constructing dedicated datasets offline.
Work on refactoring the cirrus parsing can happen concurrently/later and can be re-prioritized depending on the possible blocking points that might be encountered on T252692. Supporting more complex syntax in the MediaSearch query builder will probably involve writing code that would have been better placed in CirrusSearch but we agreed to consolidate this when resuming work on refactoring the query building logic (T185108).
If I understood correctly, there are currently no plans to deploy MediaSearch on wikis other than Commons, so there is no longer a need to move this code out of the MediaInfo extension to break the wikibase dependencies (language fallbacks). Therefore I don't think it's worth creating a ticket for this yet.
Ranking: I think a lot of exploratory work has already been done by the SD team (e.g. T258054 and T262522). At this point I don't have specific recommendations to make, except that the tooling around relevancy should be adapted to the specifics of MediaSearch; I created T268653 for this.
Feature engineering: I think the features that the SD team wanted to explore have been captured in the tickets referenced in the description. There might be other features of interest for MediaSearch, but without a better understanding of how to assess the quality of the results I think it is too early to discuss this.
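As a rough illustration of the "dedicated datasets built offline" idea, the Python sketch below flattens a tiny, made-up entity tree into a label lookup table, so that a concept lookup at query time needs neither an API call nor a graph traversal. The entity ids, labels, the parent relation and both function names are invented for the example; a real dataset would be derived from wikidata and would need its own design.

```python
from collections import defaultdict

# Made-up entities: id -> (labels, parent ids via a "subclass of"-style link).
ENTITIES = {
    "E1": (["animal"], []),
    "E2": (["dog", "domestic dog"], ["E1"]),
    "E3": (["poodle"], ["E2"]),
    "E4": (["cat"], ["E1"]),
}


def descendants(entities):
    """Map each entity id to itself plus every entity below it in the tree."""
    children = defaultdict(set)
    for eid, (_labels, parents) in entities.items():
        for parent in parents:
            children[parent].add(eid)

    result = {}
    for eid in entities:
        seen, stack = {eid}, [eid]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        result[eid] = seen
    return result


def build_lookup(entities):
    """Denormalize: lowercase label -> sorted ids of the entity and its descendants."""
    below = descendants(entities)
    lookup = defaultdict(set)
    for eid, (labels, _parents) in entities.items():
        for label in labels:
            lookup[label.lower()] |= below[eid]
    return {label: sorted(ids) for label, ids in lookup.items()}


LOOKUP = build_lookup(ENTITIES)
print(LOOKUP["dog"])
```

The traversal cost is paid once when the table is built, which is the trade-off the summary above points at: joins and graph exploration move offline, and the query path becomes a plain lookup.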

Thanks for this summary @dcausse!

In T260251#6648172, @dcausse wrote:

T268648: MediaSearch should use a dedicated service/query for doing its concept-lookup instead of the wikidata search API. This should encompass the work regarding rewriting (T258053 & T258052). But also the traversal of the wikidata graph (explored in T258055) even though this one is more complicated. Overall everything that requires joining wikidata or exploring the structure of the wikidata graph should be doable by constructing dedicated datasets offline.

Does this mean that we should wait to do T258053 and T258052 until T268648 is complete? Or should we continue work on those tickets and then refactor once T268648 is complete?

Work on refactoring the cirrus parsing can happen concurrently/later and can be re-prioritized depending on the possible blocking points that might be encountered on T252692. Supporting more complex syntax in the MediaSearch query builder will probably involve writing code that would have been better placed in CirrusSearch but we agreed to consolidate this when resuming work on refactoring the query building logic (T185108).

Should we put a note in T185108 about this so we don't forget whenever we return to it?

If I understood correctly, there are currently no plans to deploy MediaSearch on wikis other than Commons, so there is no longer a need to move this code out of the MediaInfo extension to break the wikibase dependencies (language fallbacks). Therefore I don't think it's worth creating a ticket for this yet.

This is correct. One day we might expand beyond Commons to search the fair use images stored on other wikis, but there are no plans for this yet and I think we can create a ticket if that becomes a use case we need to support.

In T260251#6649364, @CBogen wrote:

Does this mean that we should wait to do T258053 and T258052 until T268648 is complete? Or should we continue work on those tickets and then refactor once T268648 is complete?

The features you planned to explore/implement should not be blocked on T268648. We should just coordinate while working on these tickets (i.e. reviews) so that we can discuss & adapt.

Work on refactoring the cirrus parsing can happen concurrently/later and can be re-prioritized depending on the possible blocking points that might be encountered on T252692. Supporting more complex syntax in the MediaSearch query builder will probably involve writing code that would have been better placed in CirrusSearch but we agreed to consolidate this when resuming work on refactoring the query building logic (T185108).

Should we put a note in T185108 about this so we don't forget whenever we return to it?

Good idea, I've added a note.

Gehel moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. · Dec 7 2020, 6:08 PM

Gehel closed this task as Resolved. · Dec 14 2020, 2:04 PM
