Assess prevalence of Wikidata infoboxes
Closed, ResolvedPublic

Assigned To
Authored By
MMiller_WMF
Jan 15 2021, 12:37 AM
Referenced Files
F34105379: kowiki.txt
Feb 15 2021, 8:25 AM
F34105373: dewiki.txt
Feb 15 2021, 8:25 AM
F34105392: fawiki.txt
Feb 15 2021, 8:25 AM
F34105378: hywiki.txt
Feb 15 2021, 8:25 AM
F34105390: bnwiki.txt
Feb 15 2021, 8:25 AM
F34105337: chart.png
Feb 15 2021, 8:25 AM
F34105388: cswiki.txt
Feb 15 2021, 8:25 AM
F34105391: ptwiki.txt
Feb 15 2021, 8:25 AM

Description

In speaking with community members about the potential image recommendations task, we learned about Wikidata infoboxes (see conversation here). Wikidata infoboxes are templates that pull information directly from Wikidata into a Wikipedia article, including an image from P18, if there is one.

Here is an example on Czech Wikipedia, in which the entire infobox is created just from this one line of Wikitext for "infobox person": {{Infobox - osoba}}

It is important to understand these infoboxes, because for articles that have them, we may not want to place suggested images into the wikitext of the article. Rather, we may want to add them as the P18 for the Wikidata item so that it is automatically included in the infobox. Or, for a first version of the feature, we may want to exclude any articles that have these infoboxes.

We want to know how many of the unillustrated articles with image matches have Wikidata infoboxes, so that we know how important it is to handle these infoboxes well in any feature we make.

We're not sure of the best way to identify which templates are Wikidata infoboxes on each wiki. But if we are able to do that, below is what we want to calculate.

For a set of wikis, we want to calculate four things:

  • Total number of articles in the wiki
  • Unillustrated articles in the wiki
  • Articles with match from any source (polished): this means the count of unillustrated articles that have a match from any of the three sources, after the "polishing" steps to remove local images, etc.
  • Have Wikidata infobox: of the unillustrated articles with a match, the count of how many of them have a Wikidata infobox in them.

Here is a table with a sample row showing the output that we want:

| wiki | Total number of articles | Unillustrated articles | Articles with match from any source (polished) | Have Wikidata infobox |
| --- | --- | --- | --- | --- |
| frwiki | 2,000,000 | 1,000,000 | 150,000 | 10,000 |

The list of wikis for which we want these numbers is:

  • enwiki
  • arwiki
  • kowiki
  • cswiki
  • viwiki
  • frwiki
  • fawiki
  • ptwiki
  • ruwiki
  • trwiki
  • plwiki
  • hewiki
  • svwiki
  • ukwiki
  • huwiki
  • hywiki
  • srwiki
  • euwiki
  • arzwiki
  • cebwiki
  • dewiki
  • bnwiki

Details

Due Date
Jan 26 2021, 8:00 AM

Event Timeline

@Miriam said she would talk to @Isaac about this. I'm also tagging @Urbanecm_WMF, @Trizek-WMF, and @Tgr in case they have ideas how to figure out which templates we're looking for.

In Czech Wikipedia, all templates using data from Wikidata are in a category, so we can use that for the analysis. Note the category doesn't contain _only_ infoboxes, but infoboxes can be easily filtered out thanks to their specific page titles.

Adding to that, there's also a more specific category on certain wikis such as English that's not just Wikidata templates but Wikidata-infobox templates -- e.g., en:Category:Infobox_templates_using_Wikidata. This will get you a conservative estimate of pages that just maybe might be adding an image automatically via Wikidata.

The challenge you'll have is that there's a big difference between an infobox that uses Wikidata somehow and an infobox that would actually automatically use an image from Wikidata:

  • For example, the first template (Infobox crater) listed in that category actually is using P138 (named after) and not P18 so any page that transcludes it would be a false positive.
  • A bit further down the list is Infobox AFL biography which does use P18 and would pull in the image (yay!).
    • If you go through the articles though that use the template, you'll find it much more common that the editors still included a hard-coded link to the image in the infobox as opposed to relying on Wikidata -- ex1, ex2, ex3. This might not matter for your use case but you'll want to make sure you're not somehow double-counting these pages.
  • You'll also find a number of templates that use P18 but only to generate tracking categories -- i.e. they just check whether there's an image on Wikidata but don't actually use it. For example: Infobox book.

I haven't done it, but if you want to distinguish between these use cases, you probably need to write a script that would parse the documentation associated with each template -- e.g., for https://en.wikipedia.org/wiki/Template:Infobox_book, it's at the doc subpage: https://en.wikipedia.org/wiki/Template:Infobox_book/doc. There, usually it has a Tracks Wikidata template with the properties that are tracked and/or Uses Wikidata template with the properties that are used. I'd have to do some more digging to see how complete this would be -- i.e. would it capture all Wikidata templates using/tracking images? -- but my gut feeling after looking at many of these templates is that it's mostly complete for English. I'm not sure if other languages have these same templates / norms though so that would require some further digging too.
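A minimal sketch of what such a script could look like, assuming the doc subpage wikitext has already been fetched and that properties are declared as positional arguments of a Uses Wikidata or Tracks Wikidata template. The function names are hypothetical, and other wikis may use different template names or none at all:

```python
import re

# Assumption: declarations look like {{Uses Wikidata|P18|P569}} or
# {{Tracks Wikidata|P18}}; named arguments such as section=... are ignored.
DECLARATION = re.compile(
    r"\{\{\s*(Uses|Tracks)[ _]Wikidata\s*\|([^{}]*)\}\}", re.IGNORECASE
)
PROPERTY_ID = re.compile(r"^P\d+$")


def wikidata_properties(doc_wikitext):
    """Collect the property IDs a template doc page says it uses or tracks."""
    found = {"uses": set(), "tracks": set()}
    for kind, args in DECLARATION.findall(doc_wikitext):
        for arg in args.split("|"):
            arg = arg.strip()
            if PROPERTY_ID.match(arg):
                found[kind.lower()].add(arg)
    return found


def classify_p18(doc_wikitext):
    """Return 'uses', 'tracks', or 'none' for the image property P18."""
    props = wikidata_properties(doc_wikitext)
    if "P18" in props["uses"]:
        return "uses"
    if "P18" in props["tracks"]:
        return "tracks"
    return "none"
```

Templates classified as "tracks" or "none" would then be dropped before counting transclusions, which addresses the false positives described above but still depends on the documentation being accurate.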

There is usage tracking for which Wikipedia article relies on which Wikidata claim, so a lower estimate could be obtained by just counting how many of those pages use P18. But the usage tracking might or might not work depending on how the infobox is written (if the Lua code just fetches all claims in a single blob, the usage tracking feature can't discern which one of them gets used) so it might be very much off on some wikis.

I think a reasonably good approximation would be:

  1. check P18
  2. check whether the article uses that image
  3. check whether the image name appears in the article wikitext

...of course that gives the number of illustrated articles using Wikidata, not the number of unillustrated ones that could use it. But it still might be informative. It could also be used for identifying which infobox templates use wikidata (use Category:Infobox to enumerate the templates, then check how many of them seem to pull Wikidata images with a significant frequency) but that might be a bit overcomplicated.

> There is usage tracking for which Wikipedia article relies on which Wikidata claim, so a lower estimate could be obtained by just counting how many of those pages use P18. But the usage tracking might or might not work depending on how the infobox is written (if the Lua code just fetches all claims in a single blob, the usage tracking feature can't discern which one of them gets used) so it might be very much off on some wikis.

Yeah, good idea and super easy to calculate though hard to know how to interpret the number. To @Tgr's point about Lua, for English Wikipedia, where I'm most familiar, a substantial percentage of articles that actually do transclude P18 will have this information overwritten by usage of the templates like Authority Control that do fetch all the claims in a single blob. That would bias the count low. Conversely, an article tracking presence of a P18 property (but not using the image) would be indistinguishable from an article actually using images stored under P18. That would bias the count high if you're trying to see how many articles use images from Wikidata.

> I think a reasonably good approximation would be:
>
>   1. check P18
>   2. check whether the article uses that image
>   3. check whether the image name appears in the article wikitext

Yeah, this is more computationally-intensive but also quite doable and I think more informative. You'd have to do some manual verification, but if an image under P18 for a Wikidata item (1) matches usage via imagelinks (2) without the image name being present in the article wikitext (not 3), that might be a very good heuristic for detecting Wikidata-transcluded images. Thankfully, all of that information is loaded into Hive tables on a monthly basis so this could be calculated fairly quickly once the query is defined. This will help with understanding Wikidata-image-infobox usage, though it would be harder to extend this to identifying articles that transclude a Wikidata-image-infobox but are lacking an image without knowing the exact templates that transclude images from Wikidata.

If no image is found, can we check if there is one on Wikidata, and suggest to include it?

After discussing with @MMiller_WMF the best way forward seems to be to count the pages with an associated P18 image per wiki, and count approximately what fraction of those pages actually use that image (as above: check that the image is present in imagelinks but not present in the wikitext). That will give an idea of how popular/accepted Wikidata images are on that wiki, and it's a lot easier than identifying which infoboxes support Wikidata images.
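The per-page check described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the inputs (the P18 file name, the page's imagelinks, and its wikitext) are taken as already loaded, the function names are hypothetical, and URL-encoded or otherwise unusual spellings of the file name are not handled:

```python
def normalize(file_name):
    """Use underscores and an upper-case first letter, as in database keys."""
    name = file_name.strip().replace(" ", "_")
    return name[:1].upper() + name[1:]


def uses_wikidata_image(p18_file, imagelinks, wikitext):
    """True when the P18 file is rendered on the page but never named in its wikitext."""
    if not p18_file:
        return False
    target = normalize(p18_file)
    # (2) the image is actually displayed on the page
    rendered = target in {normalize(f) for f in imagelinks}
    # (not 3) the file name is not hard-coded; wikitext may use spaces or
    # underscores, and may write the first letter in lower case
    text = wikitext.replace(" ", "_")
    lower_first = target[:1].lower() + target[1:]
    hard_coded = target in text or lower_first in text
    return rendered and not hard_coded
```

Counting the pages where this returns True, divided by the pages linked to an item with P18, gives the approximate fraction discussed above.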

Thanks, @Tgr.

@Miriam is going to take the task from here.

@AikoChou just joined as a contractor. Her first task will be to look into this while we wait for her to get server access.

Hi all,

Here are the results showing how widely Wikidata images are used in pages, per wiki:

A summary chart-

wikidata-image.png (421×682 px, 26 KB)

Yellow bars are the number of pages linked to an illustrated Wikidata item; turquoise bars are the pages whose image is the same as the Wikidata item's. Labels on the right are the percentage of pages linked to an illustrated Wikidata item that adopt a Wikidata image.

The table lists detailed numbers per wiki-

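| wiki | Pages adopting wikidata image | Pages linked to an illustrated wikidata | Percentage |
| --- | --- | --- | --- |
| enwiki | 162335 | 1513451 | 10.73% |
| dewiki | 127403 | 813217 | 15.67% |
| frwiki | 121964 | 812802 | 15.01% |
| plwiki | 100532 | 563420 | 17.84% |
| ruwiki | 100749 | 551786 | 18.26% |
| svwiki | 112545 | 529764 | 21.24% |
| ptwiki | 73263 | 432256 | 16.95% |
| ukwiki | 90331 | 397713 | 22.71% |
| cebwiki | 63536 | 381012 | 16.68% |
| arwiki | 35772 | 347539 | 10.29% |
| arzwiki | 17982 | 336666 | 5.34% |
| fawiki | 40900 | 321659 | 12.72% |
| viwiki | 45249 | 311694 | 14.52% |
| huwiki | 72177 | 237643 | 30.37% |
| euwiki | 76499 | 230712 | 33.16% |
| cswiki | 33451 | 223784 | 14.95% |
| srwiki | 35368 | 180903 | 19.55% |
| kowiki | 37414 | 175255 | 21.35% |
| trwiki | 18691 | 153278 | 12.19% |
| hewiki | 13850 | 131003 | 10.57% |
| hywiki | 20567 | 109607 | 18.76% |
| bnwiki | 3444 | 36615 | 9.41% |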

We just chatted with @AikoChou -- since we used the globalimagelinks table to get image information both for Wikidata and Wikipedia, these counts might still include icons. She will work on removing icons and recalculate these numbers (they probably won't change much).

Hi all!
Here are the results after removing icons (.svg). Overall, these numbers drop slightly but do not change much.

Chart-

wikidata-image.png (421×682 px, 26 KB)

  • Yellow: pages linked to an illustrated wikidata (3rd column in the table below)
  • Turquoise: pages whose image is the same as the wikidata item. (2nd column in the table below)
  • Label: percentage of articles linked to an illustrated wikidata item adopting a wikidata image. (4th column in the table below)

Table-

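| wiki | Pages adopting wikidata image | Pages linked to an illustrated wikidata | Percentage |
| --- | --- | --- | --- |
| enwiki | 132148 | 1401146 | 9.43% |
| dewiki | 98186 | 749250 | 13.10% |
| frwiki | 96330 | 745932 | 12.91% |
| plwiki | 83177 | 510447 | 16.29% |
| ruwiki | 86109 | 505081 | 17.05% |
| svwiki | 99561 | 496814 | 20.04% |
| ptwiki | 57660 | 385940 | 14.94% |
| ukwiki | 72359 | 356049 | 20.32% |
| cebwiki | 61276 | 351683 | 17.42% |
| arzwiki | 14114 | 326506 | 4.32% |
| arwiki | 28951 | 317935 | 9.11% |
| fawiki | 28987 | 284560 | 10.19% |
| viwiki | 34856 | 279449 | 12.47% |
| huwiki | 56648 | 217273 | 26.07% |
| cswiki | 27651 | 208526 | 13.26% |
| euwiki | 68582 | 201352 | 34.06% |
| kowiki | 30196 | 159115 | 18.98% |
| srwiki | 29082 | 152055 | 19.13% |
| trwiki | 13522 | 138891 | 9.74% |
| hewiki | 10853 | 121645 | 8.92% |
| hywiki | 16740 | 97989 | 17.08% |
| bnwiki | 2526 | 33196 | 7.61% |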

Hi @AikoChou -- it's nice to meet you, and I'm glad you've joined us and are helping already! Thank you for posting these numbers. I want to make sure I understand them. Can you tell me if this is the right way to read it:

"On English Wikipedia, although something like 99% of articles are linked to Wikidata items, only 1,401,146 articles are linked to Wikidata items that have an image in P18. Of those, 132,148 (9.43%) of the articles actually have the P18 image in the article, which we assume is because they have a Wikidata infobox."

Here are my questions:

  • Is it possible that articles are in the turquoise group even if they don't have a Wikidata infobox, but just because the P18 image is placed manually in the article's wikitext?
  • Why does removing icons change the numbers? Wouldn't it only change the numbers if an article has an icon as its P18?
  • Would it be easy for you to export a list of some of the articles from each of the wikis, so I can click through and see their circumstances? Here's what I would be looking for: for each of the wikis, 10 random articles from the yellow group and 10 random articles from the turquoise group.

"...which we assume is because they have a Wikidata infobox."

FWIW, especially for large wikis, the inverse is not that unlikely: the image is an infobox parameter, and some bot copied it to Wikidata's P18. (Normally such a bot would also remove it from the infobox, but that needs community consensus and enwiki has often been skeptical about Wikidata which they see as a younger/smaller project with lower editorial standards.)

Thanks, @Tgr. I didn't know that happens. But given that we don't know how often it happens, is the way I stated my assumption correct, for the data that Aiko pulled?

I think it's a reasonable approximation, but also, it shouldn't be hard to get an accurate number, by checking whether the file name appears in the article.

I see. @AikoChou -- what do you think of all of the above, and what do you think about this idea of adding another column which counts how many pages that have the Wikidata image don't have that image name in the wikitext of the page? This would tell us how many are really through Wikidata infoboxes, and not just incidentally.

Hi @MMiller_WMF @Tgr -- it's very nice to meet you too. I'm really happy to have the opportunity to help :D

First, I want to apologize: something was wrong with the results I reported last time. The corrected numbers are below. I'll explain them and then answer your questions.

To get image information, I queried the databases and joined two tables: globalimagelinks and wb_items_per_site. But I didn't notice that page titles in globalimagelinks are separated by underscores (e.g. 'Georges_Florovsky'), while page titles in wb_items_per_site are separated by spaces (e.g. 'Georges Florovsky'), so the turquoise group I reported only contained articles with a single-word title (e.g. 'Universe').

I recalculated these numbers by replacing spaces in the title with underscores when querying the tables. Here is the result:
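As a small illustration of the fix, here is a sketch that joins the two title formats on a normalized key. The rows and helper names are made up for the example (the QID strings are placeholders); the real tables have more columns:

```python
def to_db_key(title):
    """Convert a display title ('Georges Florovsky') to the underscore form."""
    return title.strip().replace(" ", "_")


def join_on_title(items_per_site, imagelinks):
    """Join (title with spaces, qid) rows to (title with underscores, file) rows."""
    qid_by_key = {to_db_key(title): qid for title, qid in items_per_site}
    return [
        (title, qid_by_key[title], file_name)
        for title, file_name in imagelinks
        if title in qid_by_key
    ]


# Hypothetical sample rows; QIDs and file names are placeholders.
items = [("Georges Florovsky", "QID-1"), ("Universe", "QID-2")]
links = [("Georges_Florovsky", "A.jpg"), ("Universe", "B.jpg")]

# Joining on raw titles only matches single-word titles.
naive = [(t, f) for t, f in links if t in dict(items)]
fixed = join_on_title(items, links)
```

With these sample rows the naive join keeps only 'Universe', while the normalized join keeps both pages, which is the effect described above.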

Chart-

chart.png (443×682 px, 30 KB)

Table-

The number of articles in the turquoise group increases greatly, because last time we basically missed all the articles with multi-word titles.

About how to interpret these numbers --

"On English Wikipedia, although something like 99% of articles are linked to Wikidata items, only 1,401,146 articles are linked to Wikidata items that have an image in P18. Of those, 132,148 (9.43%) of the articles actually have the P18 image in the article, which we assume is because they have a Wikidata infobox."

Since we use the globalimagelinks table to get image info, the image is not necessarily in P18. It may be in another, more specific image property such as P242 (locator map image) or P94 (coat of arms image), as long as it is in jpg, jpeg, or png format.

Articles in the turquoise group use the same image that the Wikidata item has, but not necessarily in the infobox: the image may appear elsewhere in the article, or may have been added manually by editors.

Here are some examples:

  • ex1 (Q1453320) : In the infobox but hard-coded link to the image
  • ex2 (Q158154): In other sections of the article (not lead picture)
  • ex3 (Q1418639): Is lead picture but there is no infobox

> Is it possible that articles are in the turquoise group even if they don't have a Wikidata infobox, but just because the P18 image is placed manually in the article's wikitext?

Yes, it is possible, as in examples 2 and 3 above.

> Why does removing icons change the numbers? Wouldn't it only change the numbers if an article has an icon as its P18?

Icons may appear in different image properties, not only P18. The approach we use to remove icons is simply to ignore images in .svg format.
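A sketch of that filter, assuming it is applied to the file name alone. The helper name is hypothetical, and the allowed extensions are the ones mentioned in this thread (jpg, jpeg, png):

```python
import os

# Extensions treated as real images in this analysis; .svg files are
# dropped as likely icons.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_photo_like(file_name):
    """True for jpg/jpeg/png files, compared case-insensitively."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

This is a coarse heuristic: it also drops legitimate SVG illustrations and keeps small PNG icons.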

> Would it be easy for you to export a list of some of the articles from each of the wikis, so I can click through and see their circumstances? Here's what I would be looking for: for each of the wikis, 10 random articles from the yellow group and 10 random articles from the turquoise group.

I attach a list of random articles for each of the wikis. In the file, first column is the article title; second column is the Wikidata QID; third column is the image file name. Hope they will help.



> FWIW, especially for large wikis, the inverse is not that unlikely: the image is an infobox parameter, and some bot copied it to Wikidata's P18. (Normally such a bot would also remove it from the infobox, but that needs community consensus and enwiki has often been skeptical about Wikidata which they see as a younger/smaller project with lower editorial standards.)

@Tgr -- I don't really understand this part. Do you mean some images in Wikidata's P18 are imported from Wikipedia's articles? Like this Q1080021, there is one reference says "imported from Wikimedia project: Czech Wikipedia". Is it the case you talked about?

> What do you think about this idea of adding another column which counts how many pages that have the Wikidata image don't have that image name in the wikitext of the page? This would tell us how many are really through Wikidata infoboxes, and not just incidentally.

I agree we should do this if we want to exclude cases like examples 1-3 above, since in all of them the file name appears in the wikitext. So yeah, I'll discuss it with @Miriam today.

> @Tgr -- I don't really understand this part. Do you mean some images in Wikidata's P18 are imported from Wikipedia's articles? Like this Q1080021, there is one reference says "imported from Wikimedia project: Czech Wikipedia". Is it the case you talked about?

Yeah, in the past a lot of Wikidata content was populated by automatically copying data over from Wikipedia (which is a much older project).

Hi @MMiller_WMF @Tgr, thanks for your comments!

@AikoChou now has access to our servers and will be able to run a more accurate query incorporating all your feedback. The previous query was an approximation of your requests given the limited computational resources / access to data she had (she could use the data from mysql only, as parsing the Wikipedia / Wikidata dumps would require a lot of time and resources on toolforge). More results coming soon!

Hi all,

This is the result from querying the Hive tables and parsing article wikitext. For each wiki, the number after the dashes is the third count divided by the second:

3673923 wikidata items have P18
-----
arwiki: 370725 pages link to wikidata with P18
arwiki: 286740 pages uses that image
arwiki: 207201 pages which the image name doesn’t appear in the article wikitext
----- 0.7226093324963382
cswiki: 258110 pages link to wikidata with P18
cswiki: 161206 pages uses that image
cswiki  51262 pages which the image name doesn’t appear in the article wikitext
----- 0.31799064550947237 
kowiki: 209266 pages link to wikidata with P18
kowiki: 90118 pages uses that image
kowiki  2093 pages which the image name doesn’t appear in the article wikitext
----- 0.02322510486251359
euwiki: 229033 pages link to wikidata with P18
euwiki: 188015 pages uses that image
euwiki  170207 pages which the image name doesn’t appear in the article wikitext
----- 0.9052841528601442
viwiki: 322435 pages link to wikidata with P18
viwiki: 150325 pages uses that image
viwiki  36729 pages which the image name doesn’t appear in the article wikitext
----- 0.24433061699650757
frwiki: 883387 pages link to wikidata with P18
frwiki: 607412 pages uses that image
frwiki  148341 pages which the image name doesn’t appear in the article wikitext
----- 0.2442180924973494
fawiki: 336424 pages link to wikidata with P18
fawiki: 170821 pages uses that image
fawiki  2702 pages which the image name doesn’t appear in the article wikitext
----- 0.015817727328607138
ptwiki: 450707 pages link to wikidata with P18
ptwiki: 248824 pages uses that image
ptwiki  45148 pages which the image name doesn’t appear in the article wikitext
----- 0.1814455197247854
ruwiki: 586084 pages link to wikidata with P18
ruwiki: 418089 pages uses that image
ruwiki  178944 pages which the image name doesn’t appear in the article wikitext
----- 0.42800456362162126
trwiki: 170170 pages link to wikidata with P18
trwiki: 76243 pages uses that image
trwiki  1267 pages which the image name doesn’t appear in the article wikitext
----- 0.016617919022074157
plwiki: 609325 pages link to wikidata with P18
plwiki: 421096 pages uses that image
plwiki  147666 pages which the image name doesn’t appear in the article wikitext
----- 0.3506706309250147
hewiki: 153080 pages link to wikidata with P18
hewiki: 83692 pages uses that image
hewiki  31413 pages which the image name doesn’t appear in the article wikitext
----- 0.37534053434019976
svwiki: 584249 pages link to wikidata with P18
svwiki: 331169 pages uses that image
svwiki  84492 pages which the image name doesn’t appear in the article wikitext
----- 0.2551325758147653
ukwiki: 415714 pages link to wikidata with P18
ukwiki: 270324 pages uses that image
ukwiki  113151 pages which the image name doesn’t appear in the article wikitext
----- 0.418575487193146
huwiki: 258874 pages link to wikidata with P18
huwiki: 186396 pages uses that image
huwiki  122506 pages which the image name doesn’t appear in the article wikitext
----- 0.6572351338011545
hywiki: 118184 pages link to wikidata with P18
hywiki: 84474 pages uses that image
hywiki  65649 pages which the image name doesn’t appear in the article wikitext
----- 0.7771503657930251
srwiki: 177341 pages link to wikidata with P18
srwiki: 113600 pages uses that image
srwiki  33029 pages which the image name doesn’t appear in the article wikitext
----- 0.2907482394366197
arzwiki: 373333 pages link to wikidata with P18
arzwiki: 352816 pages uses that image
arzwiki  349541 pages which the image name doesn’t appear in the article wikitext
----- 0.9907175411545962
cebwiki: 399642 pages link to wikidata with P18
cebwiki: 310698 pages uses that image
cebwiki  166613 pages which the image name doesn’t appear in the article wikitext
----- 0.536253854225003
dewiki: 877079 pages link to wikidata with P18
dewiki: 522120 pages uses that image
dewiki  3705 pages which the image name doesn’t appear in the article wikitext
----- 0.007096069868995633
bnwiki: 42552 pages link to wikidata with P18
bnwiki: 18209 pages uses that image
bnwiki  471 pages which the image name doesn’t appear in the article wikitext
----- 0.025866329836893843
enwiki: 1645397 pages link to wikidata with P18
enwiki: 1157335 pages uses that image
enwiki: 19539 pages which the image name doesn’t appear in the article wikitext
----- 0.01688275218497669

The results show that the prevalence of Wikidata infoboxes varies greatly from wiki to wiki.
I also manually checked random articles for the wikis with very high percentages (like arzwiki) and very low percentages (like enwiki).
If you have any questions, I will be happy to answer them.
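For anyone re-deriving the ratios from the counts, here is a short sketch of the arithmetic. The function name is hypothetical; the two sample rows are copied from the output above:

```python
def shares(with_p18, uses_image, name_not_in_wikitext):
    """Return (share of P18-linked pages that show the image,
    share of those pages whose wikitext never names the file)."""
    return uses_image / with_p18, name_not_in_wikitext / uses_image


# (pages linked to an item with P18, pages using that image,
#  pages where the image name does not appear in the wikitext)
counts = {
    "arzwiki": (373333, 352816, 349541),
    "enwiki": (1645397, 1157335, 19539),
}

for wiki, row in counts.items():
    adopted, via_wikidata = shares(*row)
    print(f"{wiki}: {adopted:.1%} show the P18 image, "
          f"{via_wikidata:.1%} of those never name it in wikitext")
```

The second share is the value printed after the dashes for each wiki; the first is not in the output but follows from the same counts.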

Resolving this for now, as all experiments are done.