[SPIKE] Gather labeled data to re-tune MediaSearch
Open, Needs Triage, Public

Description

Re-use T293878: [L] Gather labeled data relevant to synonyms and build a dataset to update MediaSearch's index.
Implementation pointers at T293878#7825473.

Data

  • confirm sampling of search queries based on traffic, e.g., https://github.com/marfox/image-search
  • curate queries
    • look into languages
    • make sure enough of them exercise all code paths, e.g., Wikidata query expansion on labels / aliases / depicts
      • including queries that activate code paths in different ways, e.g., in "Martin Luther King Jr" all 4 words correspond to 1 Wikidata item, but "toucan in amazon rainforest" contains 2 different items within its 4 words (actually not sure how we handle this in the backend)
      • including queries that, if used on Wikipedias, would give us "fair use" results, e.g., "bruce springsteen live" might give this image which is not on Commons but is on enwiki
  • ensure both positive and negative image samples by choosing a reasonable top-K cutoff, e.g., the top 50 results are more likely to include negatives than the top 5
  • ensure at least N positive/negative samples per query
  • confirm the optimization metric, e.g., nDCG; focus on precision@K and on avoiding zero-result queries (i.e., recall)
  • what can zero-results tell us?
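
The metrics and sample-count checks listed above can be sketched roughly as follows. This is only an illustration, not MediaSearch code: the function names, the binary 0/1 label format, and the example labels are all hypothetical, and nDCG is normalized against the ideal ordering of the labeled results only (not against every relevant file on Commons).

```python
import math
from typing import Dict, List, Sequence


def precision_at_k(labels: Sequence[int], k: int) -> float:
    """Share of the top-k results labeled relevant (1). Missing ranks count as misses."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(labels[:k]) / k


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with the usual log2(rank + 1) discount (ranks start at 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG divided by the DCG of the same labels sorted best-first; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def zero_result_rate(results: Dict[str, List[int]]) -> float:
    """Share of queries that returned no results at all."""
    if not results:
        return 0.0
    return sum(1 for labels in results.values() if not labels) / len(results)


def under_labeled(results: Dict[str, List[int]], n: int) -> List[str]:
    """Queries with fewer than n positive or fewer than n negative labeled samples."""
    flagged = []
    for query, labels in results.items():
        positives = sum(labels)
        negatives = len(labels) - positives
        if positives < n or negatives < n:
            flagged.append(query)
    return flagged


# Hypothetical labels in ranked order (1 = relevant, 0 = not relevant); not real data.
EXAMPLE_LABELS: Dict[str, List[int]] = {
    "toucan in amazon rainforest": [1, 0, 1, 0, 0],
    "Martin Luther King Jr": [1, 1, 1, 1, 0],
    "bruce springsteen live": [],
}
```

For instance, `under_labeled(EXAMPLE_LABELS, 2)` flags the second query (only one negative) and the third (no results), which is the kind of signal that would suggest raising the top-K cutoff or curating more queries.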

Interface

Event Timeline

Note that for intersection-type queries we still lag well behind Google Search, e.g.

Change #1171239 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[operations/mediawiki-config@master] Add new MediaSearch config/coefficients

https://gerrit.wikimedia.org/r/1171239

Change #1171239 merged by jenkins-bot:

[operations/mediawiki-config@master] Add new MediaSearch config/coefficients

https://gerrit.wikimedia.org/r/1171239

Mentioned in SAL (#wikimedia-operations) [2025-07-30T07:55:59Z] <mlitn@deploy1003> Started scap sync-world: Backport for [[gerrit:1171239|Add new MediaSearch config/coefficients (T385286)]]

Mentioned in SAL (#wikimedia-operations) [2025-07-30T07:58:22Z] <mlitn@deploy1003> mlitn: Backport for [[gerrit:1171239|Add new MediaSearch config/coefficients (T385286)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-30T08:05:41Z] <mlitn@deploy1003> Finished scap sync-world: Backport for [[gerrit:1171239|Add new MediaSearch config/coefficients (T385286)]] (duration: 09m 42s)