
[EPIC] Adapt relevance forge tools for MediaSearch
Open, High, Public

Description

As a search engineer I want most of the tools provided by relevance forge to be compatible with (or at least friendlier to) image search, so that I can assess and anticipate changes to MediaSearch.

MediaSearch presents its image results in a grid, but relevance forge was designed for working with text-based search results.
We should adapt some of the tooling for grid layouts:

  • determine whether it makes sense (and how) to present diffs for grid results
  • research what metrics would make more sense for evaluating grid results (a rough sketch of one option follows this list)
  • possibly start collecting and grading a set of query -> results pairs
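For illustration only, one way a grid-aware metric could look: a DCG-style score where the discount depends on the row an image lands in rather than on its linear rank. The column count and the -1/0/1 grades below are assumptions for the sketch, not decisions.

```
import math

def grid_dcg(grades, columns=5):
    """Discount each graded result by its row in a row-major grid instead of its linear rank."""
    score = 0.0
    for i, grade in enumerate(grades):
        row = i // columns  # 0-based row the image lands in
        score += grade / math.log2(row + 2)
    return score

# Example: the same graded results scored under a 5-column vs a 3-column layout.
grades = [1, 1, 0, -1, 1, 0, 0, 1]
print(grid_dcg(grades, columns=5), grid_dcg(grades, columns=3))
```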

AC:

  • Have a tool that allows us to assess the impact of a change on MediaSearch
  • Have a better understanding of what metric could be used to evaluate grid-based results
  • Decide if collecting and grading a query set is worth the effort

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse renamed this task from Adapt relance forge tools for MediaSearch to Adapt relevance forge tools for MediaSearch. Nov 24 2020, 5:32 PM

Adding @Miriam and @nettrom_WMF because we've had some prior discussions about evaluating grid results. Previously we decided not to take position in the grid into account because there was no agreed-upon way to do so, but maybe things have changed.

EBernhardson moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
CBogen raised the priority of this task from Medium to High. Dec 10 2020, 3:09 PM

Just on AC number 3 - we (structured data) are currently kinda doing this using https://media-search-signal-test.toolforge.org/

I took a list of the top 1000-odd queries on the commonswiki index (which I had got from Superset a few months ago), plus a list of 1000 random queries I got from @TJones, and ran MediaSearch for each one with the boost for every field except one set to zero (and the exception had its boost set to 1). I recorded the score and position of each image in the top 100 and saved them, and that's what supplies the images for the tool at the URL above.
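In pseudocode, the procedure was roughly the following. run_mediasearch() is a hypothetical stand-in for however the searches were actually issued (not a real API), and the field list mirrors the signals named in the next comment.

```
FIELDS = ["auxiliary_text", "caption", "category", "heading",
          "redirect.title", "suggest", "text", "title", "statement_keywords"]

def run_mediasearch(query, boosts):
    """Hypothetical stand-in for the real search call used to gather results."""
    raise NotImplementedError("plug in the real search call here")

def isolate_signal(queries, field, top_n=100):
    """Run every query with only `field` boosted; record score and position of each image."""
    boosts = {f: 0.0 for f in FIELDS}
    boosts[field] = 1.0
    rows = []
    for query in queries:
        results = run_mediasearch(query, boosts)
        for position, hit in enumerate(results[:top_n], start=1):
            rows.append({"query": query, "field": field,
                         "image": hit["title"], "score": hit["score"],
                         "position": position})
    return rows
```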

So now we have a list of images, each with the search term used to find it, the Elasticsearch score returned for an individual Elasticsearch field, and a rating of -1 (bad), 0 (neutral) or 1 (good). There are 7777 ratings so far, and we're hoping to get to 1000 ratings per search signal (auxiliary_text, caption, category, heading, redirect.title, suggest, text, title, statement_keywords; we tried file_text too, but almost all the scores returned were zero).
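The rough shape of one labelled row is sketched below; the field names are illustrative, not the actual schema of the tool's database.

```
from dataclasses import dataclass

@dataclass
class RatedResult:
    query: str    # search term used to find the image
    image: str    # file page title on Commons
    signal: str   # the single Elasticsearch field that was boosted
    score: float  # Elasticsearch score for that field alone
    rating: int   # -1 (bad), 0 (neutral) or 1 (good)

example = RatedResult(query="sunset", image="File:Example.jpg",
                      signal="caption", score=12.3, rating=1)
```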

@dcausse tbh I don't know exactly what relforge does, but as you know we've gathered a set of labeled search results and written some code to compare result sets from different search algorithms (see T271801). Will this ticket add anything to that?

@Cparle I think the idea would be to consolidate what you've built with the existing tools the search team uses. A few examples:

  • there's a runSearch.php in CirrusSearch whose purpose is close to the runSearches.php you wrote
  • there are scripts to compute some scores based on a labelled dataset (relforge_engine_score/scorers.py), but they don't implement the F-measure you seem to use (a rough sketch of that calculation is included below)
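For reference, here's a minimal sketch of the kind of precision/recall/F-measure calculation in question, assuming a list of images a configuration returned and a set of images rated good for the query. This is an illustration only, not relforge code.

```
def f_measure(returned, relevant, beta=1.0):
    """F-measure of a returned result list against the set of images rated good."""
    returned, relevant = set(returned), set(relevant)
    if not returned or not relevant:
        return 0.0
    precision = len(returned & relevant) / len(returned)
    recall = len(returned & relevant) / len(relevant)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: 3 of the 5 returned images were rated good, out of 4 good images overall.
print(f_measure(["a", "b", "c", "d", "e"], ["a", "b", "c", "x"]))  # ~0.667
```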

repo: https://gerrit.wikimedia.org/g/wikimedia/discovery/relevanceForge

MPhamWMF renamed this task from Adapt relevance forge tools for MediaSearch to [EPIC] Adapt relevance forge tools for MediaSearch. Mar 15 2021, 3:32 PM
MPhamWMF moved this task from Incoming to Epics on the Discovery-Search (Current work) board.

@matthiasmullie @Cparle The search team is trying to understand why SD chose to build their own tools instead of using the existing ones. They want to own only one set of tooling, so they need to know what was missing from the previously existing tools in order to improve them (or to decide whether to replace them, e.g. with Rated Ranking Evaluator). Can you provide some input? Thanks!

We just didn't realise it was possible to compute precision/recall/etc. scores from a labeled dataset using relforge; the readme mostly covers how to generate diffs of results from applying different config options to sets of queries. I talked about relforge a bit with the search team, but I guess I didn't really understand exactly what I needed a tool to do until I had written a new one, and so I failed to ask the right questions. FWIW we are not attached to our own tool at all and are more than happy to switch over to relforge instead (once we figure out how).