Image Browsing: build the API endpoint for images from other wikis
Closed, Resolved | Public | 5 Estimated Story Points

Description

In T400823: [SPIKE] Image Browsing: Determine how to include relevant images from beyond the immediate page, we decided to fetch images from other Wikipedias and Wikidata via the image suggestions data gateway.
This task aims at scaling up the proof of concept.

NOTE: article-level image suggestions serve Wikipedia articles that are considered to be unillustrated, while section-level ones don't have this constraint.
NOTE: the API endpoint can run on a local MW Docker environment via the host.docker.internal host and an SSH tunnel to a debug box: ssh -t -N mwdebug1002.eqiad.wmnet -L 6030:mwdebug1002.eqiad.wmnet:6030. PatchDemo and production deployments need verification.

Acceptance criteria

  • Start with a check of the outputs of both APIs for validation. Compare results and decide on approach - see T403613#11196775
  • Based on approach selected, update AC if necessary and proceed with building.

Event Timeline

These are available from the image suggestions internal API endpoint and are also written to the image page's Elastic document under weighted_tags (as image.linked.from.wikipedia.lead_image/*) & searchable through the custommatch:depicts_or_linked_from=<entity id> keyword

https://commons.wikimedia.org/w/api.php?action=query&generator=search&gsrsearch=custommatch:depicts_or_linked_from=Q146&gsrnamespace=6&gsrlimit=10&prop=entityterms|globalusage&wbetterms=label&wbetlanguage=fr&gunamespace=0&gulimit=500&gufilterlocal=1

Above is an example where we can already get all data we need through existing resources.
In this example:

  • 10 (gsrlimit=10) images (gsrnamespace=6) for "cat" (gsrsearch=custommatch:depicts_or_linked_from=Q146)
  • with the French (wbetlanguage=fr) label (prop=entityterms & wbetterms=label)
  • and the main-namespace articles (gunamespace=0) they're used in (prop=globalusage)
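
As a minimal sketch, the same query can be assembled programmatically. The function name build_query_url is hypothetical, and format=json is an assumption added for machine consumption; it is not part of the example URL above. All other parameters are copied from that URL.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_query_url(entity_id: str, language: str, limit: int = 10) -> str:
    """Build a Commons search URL for files depicting or linked from an entity.

    Hypothetical helper: the parameters mirror the example URL in this task.
    """
    params = {
        "action": "query",
        "generator": "search",
        # Search keyword taken from the example; entity_id is e.g. "Q146" (cat).
        "gsrsearch": f"custommatch:depicts_or_linked_from={entity_id}",
        "gsrnamespace": 6,    # File namespace
        "gsrlimit": limit,    # number of images to return
        "prop": "entityterms|globalusage",
        "wbetterms": "label",
        "wbetlanguage": language,
        "gunamespace": 0,     # only main-namespace usages
        "gulimit": 500,
        "gufilterlocal": 1,
        "format": "json",     # assumption: not present in the example URL
    }
    return f"{COMMONS_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("Q146", "fr"))
```

Calling build_query_url("Q146", "fr") reproduces the example above, with percent-encoded values and the extra format parameter.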

Things to be aware of:

  • this is limited to 500 other-wiki-usage links overall; if result #1 already exceeds 500, the rest will no longer have that information; in such rare cases, more calls might be needed
  • this may include files that are already on the page, which should be filtered out
  • it may also include files used on other pages on the same wiki (e.g. for articles that cover pretty much the same subject), where we might also want to filter them out to avoid confusion
  • it may also include results that are not used in other wikis (custommatch:depicts_or_linked_from also matches files with depicts statements); these will usually rank lower than other results, and it wouldn't be too much work to add another keyword that excludes them, but I'd suggest just filtering them out client-side for now
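
The client-side filtering suggested above could look roughly like the sketch below. It assumes each result is a page object with a "title" and an optional "globalusage" list whose entries carry "title" and "wiki" keys; the function name, the home_wiki parameter and the sample data are hypothetical.

```python
def filter_candidates(pages, titles_on_page, home_wiki=None):
    """Drop files that should not be offered as images from other wikis.

    pages: iterable of page objects from the search response (assumed shape).
    titles_on_page: set of file titles already used on the current article.
    home_wiki: optional host of the current wiki, e.g. "en.wikipedia.org".
    """
    kept = []
    for page in pages:
        if page["title"] in titles_on_page:
            continue  # already on the page
        usages = page.get("globalusage", [])
        if not usages:
            # Either a depicts-only match, or usage info cut off by gulimit.
            continue
        if home_wiki and any(u["wiki"] == home_wiki for u in usages):
            continue  # used on another page of the same wiki
        kept.append(page)
    return kept


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"title": "File:A.jpg", "globalusage": [{"title": "Chat", "wiki": "fr.wikipedia.org"}]},
        {"title": "File:B.jpg", "globalusage": []},
        {"title": "File:C.jpg", "globalusage": [{"title": "Kitten", "wiki": "en.wikipedia.org"}]},
        {"title": "File:D.jpg", "globalusage": [{"title": "Gat", "wiki": "ca.wikipedia.org"}]},
    ]
    for page in filter_candidates(sample, {"File:D.jpg"}, home_wiki="en.wikipedia.org"):
        print(page["title"])
```

Note that a file whose usage list was truncated by the 500-link limit is indistinguishable here from one with no usage at all, so a follow-up call would be needed before discarding it with confidence.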

ovasileva set the point value for this task to 5.
mfossati changed the task status from Open to In Progress. (Sep 18 2025, 9:56 AM)
mfossati claimed this task.
mfossati moved this task from Committed to Doing on the Reader Growth Team (Sprint 6) board.

Assessment of APIs outputs

We manually checked 9 random articles from the top 100 views on English Wikipedia, August 2025.
The table below shows the number of images output by the 2 candidate APIs:

Rank  Page                   Mobile %  Data Gateway  MediaSearch
1     Weapons_(2025_film)    79.2      0             0
8     .xxx                   99.1      0             0
10    Sydney_Sweeney         81        0             3
11    KPop Demon Hunters     77.1      0             0
13    Wednesday (TV series)  79.0      0             1
16    SummerSlam (2025)      84.5      0             0
17    Google                 87.1      0             5
19    Ozzy Osbourne          83.3      0             5
20    Taylor Swift           77.9      0             2

Observations

  • All checked articles have images, so it's not surprising that the Data Gateway has no results. Article-level suggestions only cater for unillustrated articles, while section-level ones are few and mostly affect smaller articles
  • we filtered images available in the article and empty globalusage from MediaSearch results
  • sometimes MediaSearch offers additional images from equivalent articles in other languages
  • MediaSearch requirements:
    • article's Wikidata entity ID. mw.config.get( 'wgWikibaseItemId' ) can be used to look it up. It's not guaranteed that every article has one, although we expect the coverage to be high
    • filter out non-images - example
    • filter out images from other articles in other languages - example: Google > File:Consultants GAFAM EN.png from GAFAM in frwiki
    • keep images from equivalent articles in other languages - example: Google > File:Googleplex-Patio-Aug-2014.JPG from cawiki

Conclusion

  • The main (anecdotal) conclusion is that the APIs don't differ much when looking at popular articles
  • however, MediaSearch might have some more relevant images from equivalent articles, which looks like the deciding factor
  • this comes at the cost of heavy data cleaning
  • a perfect string match on page titles from other wikis sounds like the quickest way to filter out unwanted results, at the cost of losing relevant ones from titles in different scripts
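
The exact-title filter mentioned in the last point can be sketched as follows. This assumes usage entries with "title" and "wiki" keys and treats underscores and spaces as equivalent; the function names and the sample data are hypothetical.

```python
def normalize_title(title: str) -> str:
    """Treat underscores and spaces as equivalent (assumption about title form)."""
    return title.replace("_", " ").strip()


def usages_with_same_title(usages, article_title):
    """Keep only usages on pages whose title exactly matches the article title."""
    wanted = normalize_title(article_title)
    return [u for u in usages if normalize_title(u["title"]) == wanted]


if __name__ == "__main__":
    # Made-up usage entries, loosely modelled on the Google example above.
    usages = [
        {"title": "Google", "wiki": "ca.wikipedia.org"},
        {"title": "GAFAM", "wiki": "fr.wikipedia.org"},
        {"title": "جوجل", "wiki": "ar.wikipedia.org"},
    ]
    for usage in usages_with_same_title(usages, "Google"):
        print(usage["wiki"], usage["title"])
```

In this made-up sample, the GAFAM usage is dropped as intended, but so is the entry titled in Arabic script, which illustrates the cost noted above.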

Decision

Use MediaSearch.

Change #1190684 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[operations/mediawiki-config@master] Add MediaSearch custommatch:linked_from keyword

https://gerrit.wikimedia.org/r/1190684

mfossati moved this task from Doing to Signoff on the Reader Growth Team (Sprint 6) board.

Moving to signoff, as the remaining AC are being addressed in T402966: Image Browsing: Create the UI for "images from other wikis".

Change #1190684 merged by jenkins-bot:

[operations/mediawiki-config@master] Add MediaSearch custommatch:linked_from keyword

https://gerrit.wikimedia.org/r/1190684

Mentioned in SAL (#wikimedia-operations) [2025-09-24T07:18:32Z] <mlitn@deploy1003> Started scap sync-world: Backport for [[gerrit:1190684|Add MediaSearch custommatch:linked_from keyword (T403613)]]

Mentioned in SAL (#wikimedia-operations) [2025-09-24T07:25:49Z] <mlitn@deploy1003> mlitn: Backport for [[gerrit:1190684|Add MediaSearch custommatch:linked_from keyword (T403613)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-09-24T07:31:37Z] <mlitn@deploy1003> Finished scap sync-world: Backport for [[gerrit:1190684|Add MediaSearch custommatch:linked_from keyword (T403613)]] (duration: 13m 04s)