[REQUEST] Caption and alternative text data related to image files
Closed, Resolved (Public)

Description

Request

Name for main point of contact and contact preference
Fiona Romeo, email or slack

What teams or departments is this for?
GLAM and Culture

How will you use this data or analysis?
For next fiscal year, I'm planning a new public program with GLAMs focused on image description. It would help with the business case for that.
More immediately, it could help to unblock a community debate about alt text implementation. The debate started in 2017 (https://phabricator.wikimedia.org/T166094) and was revived in the last week by a Wikidata property proposal (https://www.wikidata.org/wiki/Wikidata:Property_proposal/alt_text). I'm also hoping to bring in the perspective of an accessibility expert and people with lived experience.


  • How many pictures on Wikipedia don't have captions?
  • How many pictures on Wikipedia don't have alt text?

They're both handled as MediaWiki markup right now, rather than being saved in a structured way.

Event Timeline

cchen renamed this task from [REQUEST] to [REQUEST] Caption and alternative text data related to image files. Mar 8 2021, 6:11 PM

@tizianopiccardi has extracted this data for January 2021.
Captions on English Wikipedia
If we exclude all GIF, TIFF and PNG images, English Wikipedia has 7'811'234 images. Among those, 3'645'913 have a caption: 46.7%.

Alt Text on English Wikipedia: TO DO
We are working on getting the alt text statistics. A first inspection showed that most "alt" fields are populated with the image file name, e.g.:

<img alt="Osaka07 D6A Betty Heidler Medal1.jpg" src="....
<img alt="Alfred king of Wessex London 880.jpg" src="//....

Tiziano will give me a list of alt text fields and I will work on filtering out these cases so that we can have a reliable estimate.
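
As a rough illustration of that filtering step, here is a minimal sketch using only Python's standard library. It is not the pipeline used for the numbers in this task: the names `ImgAltCollector`, `collect_alts`, `looks_like_filename` and the extension list `IMAGE_EXTENSIONS` are hypothetical, and treating any alt value that ends in an image extension as "just the file name" is an assumption.

```python
from html.parser import HTMLParser

# Assumption: an alt value ending in one of these extensions is a file name,
# not a description. The list itself is illustrative.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".svg")


class ImgAltCollector(HTMLParser):
    """Collect the alt attribute of every <img> tag (None if it is absent)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))


def collect_alts(html):
    """Return the alt values of all <img> tags found in an HTML string."""
    parser = ImgAltCollector()
    parser.feed(html)
    parser.close()
    return parser.alts


def looks_like_filename(alt):
    """True if the alt text appears to be an image file name."""
    return alt is not None and alt.strip().lower().endswith(IMAGE_EXTENSIONS)
```

Running `collect_alts` over a page's HTML and dropping the values for which `looks_like_filename` returns true would leave only the candidates for hand-written alt text.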

Beyond English Wikipedia
Because images and captions are added via wikitext using many different formats and templates, we rely on the pages' rendered HTML to extract this information. For now, we have HTML data available for English Wikipedia only. Unless we find a better way to extract captions and alt text, extending this analysis to other wikis will take quite some time, as we need to download the HTML for all articles in a wiki. Any help is appreciated!

I would be happy for now to just have the information for English Wikipedia. Thank you very much for your help so far.

Here are some additional stats for English Wikipedia:

  • 52.1% of the images have a non-empty alt text
  • 27.0% of the images have a non-empty alt text that does not end with .jpg, .jpeg, or .png
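
For anyone who wants to recompute these two shares from the list of alt text fields, a minimal sketch follows. It assumes the input is a plain list with one alt value per image (an empty string or `None` when there is none), and that the suffix check is case-insensitive. The function name `alt_text_shares` and that input format are hypothetical, not the format of the uploaded dataset.

```python
# Suffixes named in the stats above; matching them case-insensitively is an assumption.
FILENAME_SUFFIXES = (".jpg", ".jpeg", ".png")


def alt_text_shares(alts):
    """Return the share (in percent) of images with non-empty alt text, and
    the share with non-empty alt text that does not end in an image suffix."""
    total = len(alts)
    if total == 0:
        return {"non_empty": 0.0, "non_empty_not_filename": 0.0}

    # Treat None and whitespace-only values as empty.
    non_empty = [a.strip() for a in alts if a and a.strip()]
    not_filename = [a for a in non_empty
                    if not a.lower().endswith(FILENAME_SUFFIXES)]

    return {
        "non_empty": 100.0 * len(non_empty) / total,
        "non_empty_not_filename": 100.0 * len(not_filename) / total,
    }
```

Both shares use the total number of images as the denominator, which matches how the two percentages above are phrased.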

Even for this remaining 27%, the alt text does not seem particularly well curated or informative. Looking at 20 random non-empty labels, I noticed cases such as 'Refer to caption', or 'A younger man and an older one confer together.' for a picture of John Kennedy and Khrushchev.

I uploaded the dataset here: https://drive.google.com/file/d/1qmz-DHbFjEicwzEWDaSMGTEbW_smLbQq/view?usp=sharing

kzimmerman moved this task from Triage to Tracking on the Product-Analytics board.
kzimmerman added a project: Research.
kzimmerman added a subscriber: kzimmerman.

@Miriam assigning this to @tizianopiccardi given the work that has been done & moving to tracking on Product Analytics; please let me know if there's action my team needs to take. /cc @FRomeo_WMF