Add an image: count image suggestions without infoboxes
Closed, Resolved, Public

Description

In T272109: Assess prevalence of Wikidata infoboxes, we counted the prevalence of Wikidata infoboxes on articles that have image matches. We did that because we think those articles will need to be excluded from an initial iteration of an "add an image" task. We now want to learn a bit more by counting the articles that have any infobox at all. This is because when we place an image in an unillustrated article with an infobox, we want to place it in the infobox. But infoboxes are inconsistent across templates and languages in how they label their "image" and "image caption" slots. For our first iteration, rather than figure out how to detect the image slot in each infobox, we think we want to only suggest illustrations for articles that have no infobox at all.

Therefore, we want to calculate these numbers:

  • Total number of articles in the wiki
  • Unillustrated articles in the wiki
  • Articles with match from any source (polished): this means the count of unillustrated articles that have a match from any of the three sources, after the "polishing" steps to remove local images, etc.
  • Have no infobox: of the unillustrated articles with a match, the count of how many of them have no infobox in them.

Here is a table with a sample row showing the output that we want:

wiki | Total number of articles | Unillustrated articles | Articles with match from any source (polished) | Have no infobox
frwiki | 2,000,000 | 1,000,000 | 150,000 | 100,000
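
The four numbers form a funnel, where each count is a subset of the one before it. Here is a minimal Python sketch of that funnel. The `Article` record and its boolean fields are hypothetical stand-ins for whatever the real article, match, and infobox tables provide; this is not the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; the field names are assumptions for illustration.
    title: str
    illustrated: bool   # article already has an image
    has_match: bool     # polished match from any of the three sources
    has_infobox: bool   # article uses any infobox template


def funnel_counts(wiki, articles):
    """Return one output row: each count narrows the previous one."""
    unillustrated = [a for a in articles if not a.illustrated]
    matched = [a for a in unillustrated if a.has_match]
    no_infobox = [a for a in matched if not a.has_infobox]
    return {
        "wiki": wiki,
        "total_articles": len(articles),
        "unillustrated": len(unillustrated),
        "with_match": len(matched),
        "no_infobox": len(no_infobox),
    }
```

Calling `funnel_counts("frwiki", articles)` once per wiki in the list would give one row per wiki.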

The list of wikis for which we want these numbers is:

  • enwiki
  • arwiki
  • kowiki
  • cswiki
  • viwiki
  • frwiki
  • fawiki
  • ptwiki
  • ruwiki
  • trwiki
  • plwiki
  • hewiki
  • svwiki
  • ukwiki
  • huwiki
  • hywiki
  • srwiki
  • euwiki
  • arzwiki
  • cebwiki
  • dewiki
  • bnwiki

Event Timeline

The task is ongoing, but it may take longer than expected.

We were using the Wikidata item for Template:Infobox to filter out articles with an infobox by joining against the mediawiki_templatelinks table, but we found that some articles were not filtered out. For instance:
https://de.wikipedia.org/wiki/Amerikanische_Stachelm%C3%A4use
https://ceb.wikipedia.org/wiki/Australoheros_tavaresi
This is because they use a different type of infobox template, Template:Taxobox.

We're going to generate a list of template names that covers all types of infobox, and then filter out articles with an infobox to get the 'Have no infobox' number.
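
To illustrate why the single-template filter missed those pages, here is a small Python sketch of the same idea as an in-memory anti-join. The `(page, template)` pairs stand in for rows of the template links table, and the bare template names "Infobox" and "Taxobox" are simplified assumptions; real local template titles differ per wiki.

```python
def pages_without_infobox(candidate_pages, template_links, infobox_templates):
    """Keep only pages that link to none of the given infobox templates.

    template_links: iterable of (page_title, template_title) pairs
    (a hypothetical, simplified view of the template links table).
    """
    pages_with_infobox = {
        page for page, template in template_links
        if template in infobox_templates
    }
    return [p for p in candidate_pages if p not in pages_with_infobox]


# Illustrative data only: one page uses "Infobox", one uses "Taxobox".
links = [
    ("Page_A", "Infobox"),
    ("Amerikanische_Stachelmäuse", "Taxobox"),
]
pages = ["Page_A", "Amerikanische_Stachelmäuse", "Page_C"]

# With only the main infobox template, the taxobox page is not filtered out.
narrow = pages_without_infobox(pages, links, {"Infobox"})

# With a broader template list, only the page with no infobox remains.
broad = pages_without_infobox(pages, links, {"Infobox", "Taxobox"})
```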

Update--

The following table is the preliminary result:
https://docs.google.com/spreadsheets/d/1JGDOmZ16L3La-l82rhKAD2IhoaCQfcNICNQ90paH54U

As mentioned in the previous comment, we found that there are other infobox-like templates, such as Template:Taxobox (Q52496), which seems to appear a lot in cebwiki.

@MMiller_WMF I wonder if we also want to exclude articles with taxobox?

Thanks!

Taxoboxes are infoboxes by definition.

@Trizek-WMF thanks! Do you think there is an easy way to retrieve a list of all the major templates used to define infoboxes?

Maybe use "instance of -> Wikimedia infobox template" (Q19887878)?

Thank you, @AikoChou and @Miriam. This is a good start, but I would like to make sure that we are excluding all the infoboxes. For instance, the count says that less than 1% of unillustrated cebwiki articles have infoboxes. But almost all the pages on that wiki have infoboxes because they were bot created. So perhaps articles there use a template that isn't being included? Here's an example article: https://ceb.wikipedia.org/wiki/Tanacetipathes_longipinnula

Taxobox template used at ceb.wp is connected to its cousins at Wikidata.

I think we have a very good chance of excluding infoboxes of all kinds by filtering on Q19887878, since the community (as a whole) did a crazy amount of stocktaking work back in the day to connect these templates to Wikidata. We could have a few false positives, but they would only be due to recent, local infoboxes that haven't been properly defined. :)
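
For reference, here is a rough sketch of how such a template list could be requested from the Wikidata Query Service. The helper below only builds the SPARQL text. It assumes the standard `wdt:`, `wd:`, and `schema:` prefixes that the query service predefines, and it matches direct "instance of" (P31) statements only, so templates typed with a subclass of Q19887878 would need a broader pattern. The function name is made up for illustration.

```python
def infobox_template_query(wiki_url):
    """Build a SPARQL query listing templates on one wiki whose Wikidata
    item is an instance of Q19887878 (Wikimedia infobox template).

    wiki_url is the site root with a trailing slash,
    e.g. "https://ceb.wikipedia.org/".
    """
    return f"""
SELECT ?item ?pageTitle WHERE {{
  ?item wdt:P31 wd:Q19887878 .
  ?sitelink schema:about ?item ;
            schema:isPartOf <{wiki_url}> ;
            schema:name ?pageTitle .
}}
"""


query = infobox_template_query("https://ceb.wikipedia.org/")
```

The resulting titles could then serve as the template list when filtering articles through their template links.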

Or old, little-used infoboxes that no one bothered to connect. In general, I'm not sure minor templates are well-connected to Wikidata.
Maybe there could be a random sample of 5-10 articles per wiki, to visually check that there are indeed no infoboxes?

Great, thanks @Trizek-WMF! So I compiled a list of infoboxes here: https://w.wiki/3nRd
@MMiller_WMF yes, we used the main "infobox" template, but there are more that we should consider. We are working on that!

Hi @MMiller_WMF

https://docs.google.com/spreadsheets/d/1JGDOmZ16L3La-l82rhKAD2IhoaCQfcNICNQ90paH54U

Update -- we excluded all kinds of infobox by filtering using Q19887878. For cebwiki, the count drops from 99% to 3% of unillustrated articles that have no infobox, and the count for other wikis has also dropped.

Thank you, @AikoChou! This is the data we needed. Here's what I am seeing and concluding.

Excluding image matches with infoboxes definitely reduces how many suggestions we can offer, but I don't think it reduces it too much for us to proceed this way for Iteration 1.

In our four pilot wikis, this is how much removing infoboxes reduces the pool, and how many matches remain in the pool:

wiki | reduction by excluding infoboxes | image matches remaining
cswiki | 48% | 19,656
viwiki | 77% | 18,802
arwiki | 83% | 10,240
bnwiki | 31% | 4,968
ptwiki | 39% | 6,219
fawiki | 57% | 24,110
trwiki | 33% | 23,770
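
For clarity on how the two columns relate, here is a small Python sketch of the arithmetic, assuming the reduction is measured against the pool of polished matches before infobox articles are removed. The function names are made up for illustration, and because the published percentages are rounded, any back-calculated pool size is only approximate.

```python
def reduction_pct(matches_before, matches_after):
    """Percentage of matches lost by excluding articles with infoboxes."""
    if matches_before <= 0:
        raise ValueError("matches_before must be positive")
    return 100.0 * (matches_before - matches_after) / matches_before


def implied_pool_before(matches_remaining, reduction_percent):
    """Approximate pool size before exclusion, from a rounded percentage."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    return matches_remaining / (1 - reduction_percent / 100.0)


# The sample frwiki row from the description (150,000 matches, 100,000
# without an infobox) corresponds to a reduction of about one third.
sample = reduction_pct(150_000, 100_000)

# cswiki: 19,656 remaining after a 48% reduction implies a pool of
# roughly 37,800 matches before exclusion.
cswiki_before = implied_pool_before(19_656, 48)
```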

Bot-heavy wikis, like viwiki, are heavily impacted because the sorts of articles created by bots (towns, animals, plants) also end up with infoboxes made from Wikidata. There are small numbers of matches available in bnwiki and ptwiki, but there are likely still enough for Iteration 1 -- though the numbers may get too small once the users filter to topics of interest. This is something for us to look at once we have the data integrated into Search (via T285817) and can run those numbers.

@AikoChou @Miriam -- thanks for doing this work. I have a follow-up question.

Recently, a user in Arabic Wikipedia used a bot to illustrate about 20,000 articles with the recommendations from the algorithm. I'd like to know if the data you posted is inclusive of those changes to that wiki. When you ran these numbers, was it for a totally fresh run of the algorithm as of that date? Or did it run off an older set of image recommendations, that may have been before the actions of the bot? The bot did its work on 2021-07-31 and 2021-08-01.

Hi @MMiller_WMF,

The data I posted is not inclusive of those changes. These numbers were calculated based on an older set of image recommendations from 2021-04.

Hi @AikoChou -- thank you for the response. I have one more question: would you be able to post a sample of the image suggestions for just the articles that have no infoboxes, so that we can spot-check to make sure they are just as strong as the ones for articles that do have infoboxes? I think it would be good to have 50 random suggestions (article title, image suggestion, image source, etc.) for each of the wikis that you counted.
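
A sample like this could be drawn along the following lines. This is a sketch only: the input is a hypothetical mapping from wiki name to a list of suggestion rows for articles without infoboxes, and the fixed seed is an assumption made so the sample can be reproduced.

```python
import random


def sample_suggestions(suggestions_by_wiki, n=50, seed=0):
    """Draw up to n random suggestions per wiki, without replacement.

    suggestions_by_wiki: dict mapping a wiki name to a list of rows,
    e.g. (article title, image suggestion, image source) tuples.
    """
    rng = random.Random(seed)  # seeded so the same sample can be redrawn
    sampled = {}
    for wiki, rows in suggestions_by_wiki.items():
        rows = list(rows)
        # Wikis with fewer than n suggestions contribute all of them.
        sampled[wiki] = rng.sample(rows, min(n, len(rows)))
    return sampled
```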

Hey @MMiller_WMF,

The following files are samples of the image suggestions for articles that have no infoboxes for each of the wikis we counted. :)