Gather data on use of image alt text on Wikimedia wikis
Open, Needs Triage, Public


Gather data on how and how much alt text is used on Wikimedia wikis.

Ideally, this would consist of a database of something like (page id, file name, is from Commons?, is from template?, caption, alt text). Maybe other parameters like size or positioning. (Note that some of those other parameters will influence whether the caption is visible; that's probably pretty important. Also, alt text probably plays a different role on images which link somewhere else.)
Less ideally, just statistics.
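
To make the proposed record shape concrete, here is a minimal sketch of one row as a Python dataclass, plus the kind of summary statistic the "less ideal" variant would produce. All field names, and the `caption_visible` and `links_elsewhere` fields, are illustrative assumptions derived from the description above, not an agreed schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ImageUsage:
    """One image occurrence on one page (hypothetical schema)."""
    page_id: int
    file_name: str
    from_commons: bool
    from_template: bool
    caption: Optional[str] = None
    alt_text: Optional[str] = None
    # Assumption: derived from size/format parameters, which decide
    # whether the caption is actually shown to readers.
    caption_visible: Optional[bool] = None
    # Assumption: True when the image links somewhere other than its file page.
    links_elsewhere: bool = False


def alt_text_share(rows: Iterable[ImageUsage]) -> float:
    """Return the fraction of rows with non-empty alt text (0.0 if no rows)."""
    rows = list(rows)
    if not rows:
        return 0.0
    with_alt = sum(1 for r in rows if r.alt_text and r.alt_text.strip())
    return with_alt / len(rows)
```

With rows like these, per-wiki statistics reduce to grouping and calling something like `alt_text_share` on each group.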

Ideally, this would cover all projects and all (content?) pages. More realistically, probably a few projects and a random sampling of pages.

Event Timeline

My original plan was to make a DB query of pages with images, and parse them with mwparserfromhell. As it turns out,

  • There isn't really a way to tell which pages have images in their wikitext markup (as opposed to images via some template). Also no easy way to filter out images which are icons or otherwise probably just a distraction. So, filtering via the imagelinks table doesn't achieve much.
  • mwparserfromhell doesn't have a concept of images. It will parse something like [[Kép:Samoća - panoramio.jpg|300px|bélyegkép|jobbra|Banovinai táj Babina Rijekánál]] as a wikilink with target Kép:Samoća - panoramio.jpg and text 300px|bélyegkép|jobbra|Banovinai táj Babina Rijekánál. It doesn't even help with identifying the namespace.
  • Parsoid HTML is better (although it has a different HTML representation of images with and without captions, both are easily identifiable by typeof="mw:Image/Thumb" and well-documented), but there is no way currently to access it at scale.
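
For illustration, a rough sketch of pulling alt text and captions out of Parsoid HTML with only the Python standard library. It assumes a thumbnail is an element carrying typeof="mw:Image/Thumb" that contains an img (with the file in its resource attribute) and, when captioned, a figcaption. That structure is an assumption to check against the Parsoid documentation, and the sketch does not handle the other caption representation mentioned above.

```python
from html.parser import HTMLParser


class ThumbExtractor(HTMLParser):
    """Collects file, alt text and caption for each mw:Image/Thumb element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.images = []
        self._current = None     # record being built, or None outside a thumb
        self._container = None   # tag name of the element carrying typeof
        self._depth = 0          # nesting depth of tags named like the container
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if "mw:Image/Thumb" in (attrs.get("typeof") or "").split():
                self._current = {"file": None, "alt": None, "caption": ""}
                self._container = tag
                self._depth = 1
            return
        if tag == self._container:
            self._depth += 1
        elif tag == "img" and self._current["file"] is None:
            self._current["file"] = attrs.get("resource")
            self._current["alt"] = attrs.get("alt")  # None if attribute is absent
        elif tag == "figcaption":
            self._in_caption = True

    def handle_data(self, data):
        if self._current is not None and self._in_caption:
            self._current["caption"] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "figcaption":
            self._in_caption = False
        elif tag == self._container:
            self._depth -= 1
            if self._depth == 0:
                self._current["caption"] = self._current["caption"].strip() or None
                self.images.append(self._current)
                self._current = None


def extract_thumbs(html):
    """Return a list of {"file", "alt", "caption"} dicts found in the HTML."""
    parser = ThumbExtractor()
    parser.feed(html)
    parser.close()
    return parser.images
```

A missing alt attribute comes back as `None` while an explicitly empty one comes back as `""`, which seems worth keeping distinct for this analysis.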

Prior art:

FYI, earlier today I wrote some code to play with captions while I was in a playful mood. I ran it on cswiki for featured articles + the 100 largest articles smaller than 5000 bytes, and it completed pretty quickly. While it doesn't directly solve your use case, I thought it might be interesting for you.

Because this is a "one evening" kind of code, it has a number of issues, mainly:

  • it more or less works only for Czech Wikipedia, because the regexes used for parsing the wikitext depend heavily on the wiki language,
  • it is not 100% reliable (RE_CAPTION doesn't cover all possible image parameters),
  • it expects each image to be inserted directly in the article on its own line.
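
To show where the language dependence comes from, here is a small sketch (not the code mentioned above) that splits an image link into file, options, alt text and caption using a per-language keyword table. The `KEYWORDS` table is a hypothetical, incomplete stand-in for the localized namespace and parameter names a wiki defines; it only covers the Hungarian example quoted earlier in this thread.

```python
import re

# Hypothetical, incomplete keyword table; real wikis define many more aliases.
KEYWORDS = {
    "hu": {
        "namespaces": {"kép", "fájl", "file", "image"},
        "format": {"bélyegkép", "thumb", "thumbnail"},
        "align": {"jobbra", "balra", "right", "left", "center"},
    },
}

# Matches sizes such as "300px", "x200px" or "300x200px".
SIZE_RE = re.compile(r"^(\d+|\d*x\d+)px$")


def parse_image_link(wikitext, lang):
    """Parse a single [[File:...|...]] link; return None if it is not an image."""
    table = KEYWORDS[lang]
    text = wikitext.strip()
    if not (text.startswith("[[") and text.endswith("]]")):
        return None
    # Naive split: breaks on captions that contain nested links or templates.
    parts = text[2:-2].split("|")
    namespace, sep, name = parts[0].partition(":")
    if not sep or namespace.strip().lower() not in table["namespaces"]:
        return None
    result = {"file": name.strip(), "options": [], "alt": None, "caption": None}
    for part in parts[1:]:
        value = part.strip()
        lowered = value.lower()
        if lowered.startswith("alt="):
            result["alt"] = value[4:].strip()
        elif (lowered in table["format"] or lowered in table["align"]
              or SIZE_RE.match(lowered)):
            result["options"].append(value)
        else:
            # Assumption: the last unrecognised parameter is the caption.
            result["caption"] = value
    return result
```

Any keyword missing from the table silently ends up as the caption, which is the kind of unreliability described in the list above.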

If the code is useful for you, feel free to use it as you please.

If you wish to get info about images used through a template as well, then Parsoid HTML is likely the only realistic choice, as using wikitext would require either knowing the template's internals or parsing wikitext (both of those scale much worse than Parsoid HTML). While AFAIK there's no dumps for the HTML, there's an API endpoint for it, and the concurrency limits enforced by that endpoint are pretty considerate ("limit your clients to no more than 200 requests/sec to this API overall"). If a few projects + sampled pages is acceptable for you, 200 reqs/s sounds like enough for that purpose.
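
Staying under a limit like that on the client side can be as simple as spacing out requests. Below is a minimal sketch of such a throttle. The `Throttle` class is a hypothetical helper, not part of any Wikimedia client library, and the clock and sleep functions are injectable only so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Spaces out calls so the long-run rate stays at or below `rate` per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        if rate <= 0:
            raise ValueError("rate must be positive")
        self.interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_slot = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self._clock()
        if self._next_slot is not None and now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        # After a long pause, `now` is past the slot, so no burst is allowed.
        self._next_slot = now + self.interval
```

In a fetch loop one would call `throttle.wait()` before each request, with `rate` set comfortably below 200 to leave headroom for other clients sharing the same limit.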

While AFAIK there's no dumps for the HTML, [...].

It looks like I was wrong, at least partially. There's Wikimedia Enterprise, which... provides HTML dumps! Wikimedia Enterprise's API can be accessed via Toolforge already (experimentally verified, although T280631 is not yet resolved). That being said, I have no idea about guarantees (such as output/server stability) offered at this moment (@LWyatt very likely would).

I quickly skimmed through the dataset, and it seems usable for the purposes of this analysis. I put a copy of the data for skwiki to /home/urbanecm/Documents/11oneTime/02_image_caption_alt_T289443 on Toolforge (but it's really just a decompressed output of wget -O skwiki_json_0.tar.gz '').
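
For anyone who wants to read such a dump without unpacking it first, a minimal sketch follows. It assumes the archive holds newline-delimited JSON files with one page per line, and that each page object keeps its HTML under `article_body` → `html`. Both are assumptions about the dump layout that should be checked against the actual data.

```python
import io
import json
import tarfile


def iter_dump_records(path):
    """Yield one decoded JSON object per non-empty line of every file in a .tar.gz."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            stream = archive.extractfile(member)
            for line in io.TextIOWrapper(stream, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


def page_html(record):
    """Return the page HTML, or None if the assumed keys are missing."""
    return (record.get("article_body") or {}).get("html")
```

Because this is a generator, pages are decoded one at a time, so a full dump never has to fit in memory.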

That's super useful, thanks for tracking it down @Urbanecm!

tgr/wikipedia-captions was my quick attempt at parsing Parsoid HTML (the tool makes web requests currently but that obviously doesn't scale well).

Miriam pointed me to which covers all wikis and contains almost all relevant information. The only thing I wish it would include is which images come from templates.