A map of visual knowledge gaps
Closed, Resolved, Public

Description

Background
Wikimedia Commons contains 65 million images, yet in many Wikipedias over 50% of articles have no images. How can we quantify exactly which areas of content are missing the most images, and which have a disproportionate amount of visual content? What about the quality of the existing content? And how can we evaluate whether the missing content already exists somewhere in Wikimedia projects? The map of visual knowledge gaps can help answer these questions.

Metrics

  • Number of articles/items missing images - this reflects the amount of missing content
  • Number of existing images per article/item - this reflects the proportion of visual content
  • [Stretch] Average image quality
  • [Stretch] Proportion of articles missing images that have an easy fix (i.e., an image can be acquired through Wikidata)

Dimensions of analysis - we will break down the analysis of the metrics above by the following dimensions:

  • By Wikipedia project (languages divided by emerging/established labels from product), [stretch: Wikidata]
  • By topic of the article or class of the corresponding Wikidata item
  • By article length
  • [stretch] By image type (need to design a classifier for that)

Event Timeline

Just copying over my comment from T259365:

This may be covered by #2-A[stretch] - Wikidata, but I'd love to see how many Wikidata items are missing images. We could then see the number of articles with missing images that have a corresponding Wikidata item with no image, which might tell us whether it's worthwhile to explore tools that make adding images to Wikidata itself easier.

And @Abit's reply:

That's an interesting idea...infoboxes can pull images from WD, so adding images to WD might be a way to help under-illustrated articles gain an image in many languages. Would probably also need some rough analysis of how many articles without images have infoboxes that could pull an image.

Thanks @Carly. According to our estimation, about 95% of Wikidata items are missing images. I am happy to extend the analysis to Wikidata only.

Weekly updates:
Computed number of images (excluding icons) for each article for each edition of Wikipedia listed in the Wiki comparison spreadsheet: https://docs.google.com/spreadsheets/d/1a-UBqsYtJl6gpauJyanx0nyxuPqRvhzJRN817XpkuS8/edit?usp=sharing

Thanks @Carly. According to our estimation, about 95% of Wikidata items are missing images. I am happy to extend the analysis to Wikidata only.

That's great, thanks!

Weekly updates:
I started the analysis of the distribution of images across Wikipedia. I computed the number of non-icon images in each article across all 300 Wikipedias. For now, icons are detected as images that are present in more than 50 articles. This is likely to change, as @Swagoel is working on a better way to identify icons.
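The icon heuristic above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the input mapping `article_images`, the function name, and the choice to count each distinct image once per article are all assumptions. Only the threshold of 50 articles comes from the comment above.

```python
from collections import Counter

# From the comment above: an image present in more than 50 articles counts as an icon.
ICON_THRESHOLD = 50


def count_non_icon_images(article_images, icon_threshold=ICON_THRESHOLD):
    """Return {article_title: number of distinct non-icon images}.

    article_images is a hypothetical mapping {article_title: [image_name, ...]}.
    """
    # Count in how many distinct articles each image appears.
    usage = Counter()
    for images in article_images.values():
        usage.update(set(images))

    # Images used in more than icon_threshold articles are treated as icons.
    icons = {name for name, n_articles in usage.items() if n_articles > icon_threshold}

    # Assumption: a repeated image within one article is counted once.
    return {
        title: len({name for name in images if name not in icons})
        for title, images in article_images.items()
    }
```

An article is then "illustrated" if its count is greater than zero. Lowering or raising `icon_threshold` shows how sensitive that label is to the heuristic.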
What we know so far:

Image Distribution Across Wikis

  • The proportion of unillustrated articles varies a lot across Wikipedia editions. For English Wikipedia, the percentage of articles having an image is around 50%. By contrast, almost all articles in Venetian Wikipedia are illustrated (~86%), while Cebuano Wikipedia is largely unillustrated (only 7% of articles have an image). There is a weak correlation between Wikipedia size and the percentage of articles without images: smaller Wikipedias tend to be more illustrated.
  • The number of images per illustrated article is much more uniform across Wikipedia editions. Most Wikis have 2-4 images per article, with outliers such as Karachay and Gagauz (both Turkic languages), which have up to 7 images on average per illustrated article.
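For reference, the two per-wiki numbers discussed above (percentage of illustrated articles, and mean number of images per illustrated article) could be derived from per-article image counts along these lines. The function name and input format are hypothetical, and the sketch assumes icons have already been excluded from the counts.

```python
def wiki_image_stats(image_counts):
    """Summarize one wiki.

    image_counts is a hypothetical list holding the number of non-icon
    images of each article in the wiki.
    Returns (percentage of illustrated articles, mean images per illustrated article).
    """
    total = len(image_counts)
    # Only articles with at least one image count as illustrated.
    illustrated = [n for n in image_counts if n > 0]

    pct_illustrated = 100.0 * len(illustrated) / total if total else 0.0
    # The mean is taken over illustrated articles only, as in the second bullet.
    mean_images = sum(illustrated) / len(illustrated) if illustrated else 0.0
    return pct_illustrated, mean_images
```

For example, `wiki_image_stats([0, 0, 2, 4])` returns `(50.0, 3.0)`: half of the articles are illustrated, with 3 images on average among those.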

Below are two static plots of Wikipedia size vs % of illustrated articles, and Wikipedia size vs number of images in illustrated articles. The interactive version of these plots can be found at this link

Image Distribution by Article Length
We divided Wikis by size: very small (<1'000 articles), small (<10'000 articles), medium (<100'000 articles), large (<1M articles), very large (>1M articles).
We also partitioned articles by length, according to the number of articles across all wikis having a given length: very short (bottom 20%), short (mid-low 20%), medium (mid 20%), long (mid-top 20%), very long (top 20%).
We computed the percentage of illustrated articles and the number of images for every combination of wiki size and article length.
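The two partitions described above might be implemented as in the sketch below. It is an illustration under assumptions: the function names are hypothetical, a wiki with exactly 1M articles is placed in "very large" (the text leaves that boundary open), and the quintile cut points are taken from the sorted pooled lengths rather than from whatever method the analysis actually used.

```python
from bisect import bisect_right

LENGTH_LABELS = ["very short", "short", "medium", "long", "very long"]


def wiki_size_label(n_articles):
    """Map a wiki's article count to the size classes listed above."""
    if n_articles < 1_000:
        return "very small"
    if n_articles < 10_000:
        return "small"
    if n_articles < 100_000:
        return "medium"
    if n_articles < 1_000_000:
        return "large"
    return "very large"  # assumption: exactly 1M articles lands here


def length_cut_points(all_lengths):
    """Four cut points splitting the pooled (non-empty) article lengths into five groups."""
    ordered = sorted(all_lengths)
    n = len(ordered)
    return [ordered[(n * k) // 5] for k in range(1, 5)]


def length_label(length, cuts):
    """Label one article using cut points computed across all wikis."""
    return LENGTH_LABELS[bisect_right(cuts, length)]
```

The cut points are computed once over the articles of all wikis and then reused for every wiki, so "very short" means the same thing in a small and in a large Wikipedia.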

  • For mid to large Wikipedias, shorter articles are less likely to have an image. In smaller Wikis, the probability of having an image does not seem to depend on article length.
  • Across all wikis, the shorter the article, the smaller the number of images.

See plots below:

Image Distribution By Article Topics
I computed a topic for each article in each Wiki based on Isaac's topic classifier, then computed the distribution of illustrated articles and number of images per article across different topics.
Below you can find the resulting plots. The bars in the bottom quadrant reflect the number of articles in a given topic. Will follow up with the analysis of these results.

More analysis on Wikidata, since @CBogen asked a while ago.

I sampled 5 million Wikidata items, and extracted, for each item, the following:

  • has_image: whether the item has an image or not
  • coverage: the number of Wikipedia articles linking to the item. This is quantized into 4 values:
    • none if 0 articles link to the item
    • small if 1-10 articles link to the item
    • medium if 10-100 articles link to the item
    • large if >100 articles link to the item
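A minimal sketch of this quantization, together with the share of items with an image per coverage class, is given below. The function names and the `(has_image, n_links)` input format are hypothetical. The list above assigns exactly 10 links to both small and medium, so the sketch assumes 10 falls in medium.

```python
from collections import defaultdict


def coverage_label(n_links):
    """Quantize the (non-negative) number of linking Wikipedia articles."""
    if n_links == 0:
        return "none"
    if n_links < 10:
        return "small"
    if n_links <= 100:
        return "medium"  # assumption: exactly 10 links counts as medium
    return "large"


def illustrated_share_by_coverage(items):
    """Return {coverage label: percentage of items that have an image}.

    items is a hypothetical iterable of (has_image, n_links) pairs.
    """
    totals = defaultdict(int)
    with_image = defaultdict(int)
    for has_image, n_links in items:
        label = coverage_label(n_links)
        totals[label] += 1
        with_image[label] += int(has_image)
    return {label: 100.0 * with_image[label] / totals[label] for label in totals}
```

Moving the boundary at 10 to the small class only requires changing `n_links < 10` to `n_links <= 10`, which is a quick way to check how much that choice affects the per-class percentages.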

Here are the major insights:

  • Less than 4% of Wikidata items have an image associated with them (3.93%, to be precise).
  • Less than 1/4 of Wikidata items are linked to 10 or more Wikipedia articles; around 38% have 0 links, and around 40% have between 1 and 10 links.
  • Items with larger coverage are more likely to have an image: on average, more than 70% of items with large coverage have an image associated with them, against only around 1% of items with no coverage.

Below is the plot summarizing the insights above. Insights on topic distribution will follow in the next post.

More insights on Wikidata images vs topics. @FRomeo_WMF, this might be of interest to you.

For each Wikidata item with at least 1 link to Wikipedia (i.e. small, medium, and large coverage), I associated one or more topics using the method based on @Isaac's Wikidata topic classifier.

Results are reported below and are similar to what was seen for Wikipedia:

  • The most widely illustrated items are about Food, Transport, Fashion, Engineering, and Visual Arts: between 55 and 75 percent of items in these categories are illustrated.
  • Items about Europe and America are far more likely to be illustrated than items about Africa, Asia and South America: only about 20% of items about Africa are illustrated, vs 50% for Europe.
  • Items about Movies, Books, TV, and Videogames are rarely illustrated, probably because the material is copyrighted.
  • Items about women are more likely to have an image than the average item about a person: 53% of items about women are illustrated, vs 50% of items about people overall.
  • For some areas, such as Biology, Chemistry and others, there is a mismatch between Wikipedia and Wikidata image coverage. I will investigate this further to understand the proportion of unillustrated Wikipedia articles for which we have Wikidata images, and partition it by topic.


Reporting below the distribution of illustrated articles by topic across Wikipedias for comparison:

Miriam closed this task as Resolved. Edited Jan 11 2021, 12:30 PM

I reported and extended the analysis above on Meta:
https://meta.wikimedia.org/wiki/Research:Map_of_Visual_Knowledge_Gaps

Closing this for now - I will capture any outstanding to-dos in future Phabricator tasks.