[SPIKE] Image Browsing: Determine how to include relevant images from beyond the immediate page
Closed, Resolved | Public | 3 Estimated Story Points | Spike

Description

For the Image Browsing feature, we will populate our initial entry point (the "Carousel" component) with images taken from the article (per T398992). The criteria here are similar to the ones used by MultiMediaViewer.

Once the user reaches the modal / overlay part of the Image Browser, we may want to present them with additional images beyond the ones taken immediately from the article they are viewing to further enrich the experience. This could be especially useful in situations where we have equivalent articles in two different-language wikis, where one wiki's version has only a small number of images while the other has many. It may also be useful to look at Commons or Wikidata to find relevant images.

Questions to answer
  • When is it desirable to pull in additional images beyond the ones present in the original article? Do we do this when we have fewer than X images to display, or do we always check for additional sources to enrich the experience?
  • Can we pull all images from the same article from other Wikipedias?
  • What sources can we try to pull from? Other articles on different wikis? Other projects entirely (Commons/Wikidata/etc)? What APIs do we use?
  • How similar or different is the image data we get from external sources compared to the image data we have about images on the local page the user started from? Caption/description text in particular may present a challenge, because we don't want to show users caption text in a language that they cannot understand. Wikidata may prove useful here.
  • Is there an order of preference we want to follow when pulling data from other sources (i.e. prefer other-language wikis and only go to Commons or Wikidata under certain conditions)? Or should we always query all possible sources?
Acceptance Criteria
  • Decide on a general approach and document it
  • File a task for implementation as necessary

Event Timeline

mfossati subscribed.

WP:BOLD ing this one, as I have strong opinions coming from image suggestions work.

ovasileva renamed this task from Image Browsing: Determine how to include relevant images from beyond the immediate page to [SPIKE] Image Browsing: Determine how to include relevant images from beyond the immediate page. Aug 4 2025, 4:16 PM
ovasileva added a project: Spike.
ovasileva updated the task description.

@JScherer-WMF - this would allow us to pull in images for articles which have fewer than three images (at least starting from other Wikipedias). Let's discuss what this means for the design aspect.

ovasileva set the point value for this task to 3. Aug 4 2025, 4:26 PM
ovasileva moved this task from Incoming/Inbox to Ready on the Reader Growth Team board.
ovasileva lowered the priority of this task from High to Medium. Aug 5 2025, 4:33 PM
ovasileva moved this task from Ready to Sprint 3 on the Reader Growth Team board.
mfossati changed the task status from Open to In Progress. Aug 6 2025, 10:39 AM
mfossati moved this task from Committed to Doing on the Reader Growth Team (Sprint 3) board.

Change #1176445 had a related patch set uploaded (by Marco Fossati; author: Marco Fossati):

[mediawiki/extensions/ReaderExperiments@master] [WIP] ImageBrowing: introduce image suggestions

https://gerrit.wikimedia.org/r/1176445

While @mfossati is doing the actual work, I can add some context:

Questions to answer

  • When is it desirable to pull in additional images beyond the ones present in the original article? Do we do this when we have fewer than X images to display, or do we always check for additional sources to enrich the experience?
    • IMO this is a design question (cc @JScherer-WMF), but it should not depend on the number of images we already have on the page:
      • Either these additional images serve a useful purpose (in which case we should have them on all pages, insofar as they're available), or they don't. If they are nothing more than visual filler that serves no useful purpose, they're only distracting from the content that does matter.
      • Things should be consistent/predictable; it's confusing to be expecting one thing based on prior experience, but getting another.
    • If the intent of image browsing is mostly to showcase existing imagery and/or serve as an alternative navigation tool throughout the article, then we should not add images from additional sources. If, however, we're open to the idea of enriching articles with automatically added imagery that is supposedly also relevant (which IMO is a great idea), AND it fits within the design we have in mind, then IMO we should definitely add these to all pages right now and measure engagement. A handful will probably do to start with.
  • Can we pull all images from the same article from other Wikipedias?
    • The Image Suggestions pipeline does this. It finds pairs of (Wikidata) entity ids & Commons images, and one of its sources is image use on Wikipedia articles linked to said entity id (the article's wgWikibaseItemId)
      • These are available from the image suggestions internal API endpoint and are also written to the image page's Elastic document under weighted_tags (as image.linked.from.wikipedia.lead_image/*) & searchable through the custommatch:depicts_or_linked_from=<entity id> keyword
      • Only lead images (i.e. in the article opening; not in sections) are available. One of the reasons for this is that (usually larger) wikis may have more content about certain topics and break them up into multiple distinct pages, in which case images used in another wiki's section of that same topic may not actually be relevant (because the more detailed topic of that section is in another page)
      • Note: this is only for Commons images used on Wikipedias. Local uploads are not captured (but these probably should not be considered anyway, since they're usually local for a reason (e.g. licensing) that may prevent them from being suitable candidates)
  • What sources can we try to pull from? Other articles on different wikis? Other projects entirely (Commons/Wikidata/etc)? What APIs do we use?
    • Relevant sources that are incorporated in the image suggestions pipeline, and available through search/Elastic:
      1. Commons images' "depicts" (P180) statement matching the page's entity id
      2. Commons images' "digital representation of" (P6243) statement matching the page's entity id
      3. Commons images' "main subject of" (P921) statement matching the page's entity id
      4. Commons images used under the "image" (P18) property on the Wikidata entity matching the page's entity id
      5. Commons images in a category used under the "Commons category" (P373) property on the Wikidata entity matching the page's entity id
    • Additional sources that could be considered but would have to be built out:
      • Non-lead images (i.e. within sections) of pages of the same entity on other Wikipedias
      • Expand on #5 above (categories/P373) and include images in subcategories of these categories
      • Expand on #1, #2, #3 & #4 by also including images nested deeper down in the entity hierarchy (traversing "instance of" (P31) or "subclass of" (P279) properties)
  • How similar or different is the image data we get from external sources compared to the image data we have about images on the local page the user started from? Caption/description text in particular may present a challenge, because we don't want to show users caption text in a language that they cannot understand. Wikidata may prove useful here.
    • Getting image data from pages on other wikis where additional media is used is going to be a non-trivial & intensive task that I don't think makes much sense in the first place, since, even if there is something useful there, it is usually in a different language (we *could* translate, but that's a pretty big technical undertaking and might rub the community the wrong way if automated). It could be very useful to capture additional contextual data for images based on where they're used, but doing it well is likely going to be a ton of work
    • As for what's available from Commons/through existing APIs:
      • title: unlikely to be very useful; many are not very descriptive and they may be in any language
      • wikitext or parsed content: unlikely to be very useful directly since this could contain just about anything, but cfr. "extmetadata" below
      • captions (&prop=pageterms&wbptlanguage=<lang>&wbptterms=label): potentially useful: a file can have an associated caption in potentially every language (but many will be missing)
      • metadata & commonmetadata (&prop=imageinfo&iiprop=metadata|commonmetadata): mostly image EXIF data; likely not useful
      • extmetadata (&prop=imageinfo&iiprop=extmetadata&iiextmetadatalanguage=<lang>): a pretty decent attempt at parsing the wikitext into useful data, which would allow getting a localized long description from the wikitext, assuming it follows certain established patterns/templates (as all uploads through UploadWizard and a bunch of other tools do by default) and contains that information. Some other common data (e.g. coordinates, license & attribution, among other things) is also available
        • Important caveat: this requires parsing the page and is a slow/intensive task that can take some time (and may time out if too many are requested at once) - essentially only to be considered if we need this for 1 file at a time, and can afford to wait up to a couple of seconds for files where a parsed version is not in cache
      • globalusage (&prop=globalusage&gunamespace=1): pages in other wikis where this image is used
    • A bunch more data is available, but I can't really think of anything useful anymore
  • Is there an order of preference we want to follow when pulling data from other sources (i.e. prefer other-language wikis and only go to Commons or Wikidata under certain conditions)? Or should we always query all possible sources?
    • I demonstrated a clear preference for using existing tooling that already attempted to solve the "relevant images for topic" problem, and would suggest relying on its ordering (image suggestions confidence or search scoring-based sort)
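
To make the queries mentioned above a bit more concrete, here is a rough Python sketch of how the request parameters could be assembled. It only builds parameters and URLs; it does not call anything. Assumptions on my part: that the requests go to the public Commons Action API (as opposed to the internal image suggestions endpoint), that the custommatch keyword is used through list=search, and all function names, which are made up for illustration.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Assumption: queries go to the public Commons Action API endpoint.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def search_params(entity_id: str, limit: int = 10) -> dict:
    """Hypothetical helper: find Commons files that depict, or are linked
    from Wikipedia articles about, the given Wikidata entity."""
    return {
        "action": "query",
        "format": "json",
        "list": "search",
        "srnamespace": 6,  # File: namespace
        "srlimit": limit,
        "srsearch": f"custommatch:depicts_or_linked_from={entity_id}",
    }


def caption_params(titles: list[str], lang: str) -> dict:
    """Hypothetical helper: request captions for several files in one
    language (many files will not have one)."""
    return {
        "action": "query",
        "format": "json",
        "prop": "pageterms",
        "titles": "|".join(titles),
        "wbptlanguage": lang,
        "wbptterms": "label",
    }


def extmetadata_params(title: str, lang: str) -> dict:
    """Hypothetical helper: request parsed metadata for ONE file only,
    because extmetadata requires parsing the page and can be slow."""
    return {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "titles": title,
        "iiprop": "extmetadata",
        "iiextmetadatalanguage": lang,
    }


def build_url(params: dict) -> str:
    """Serialize a parameter dict into a full request URL."""
    return f"{COMMONS_API}?{urlencode(params)}"
```

For example, build_url(search_params("Q42")) would produce the search request for entity Q42. Captions are batched across files while extmetadata takes a single title, mirroring the caveat above about slow parsing and possible timeouts.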

@mfossati Please expand if you remember anything I forgot!

Thanks @mfossati and @matthiasmullie for this exploration and accompanying write-up. I think that this information can inform how we approach T402269 and whatever task we end up filing for follow-up implementation.

@egardner to close. For the follow-up ticket, we will focus on images from Wikidata and other Wikipedias, with Wikipedia content coming first.