Image matching algorithm
Open, Needs Triage, Public

Description

The Growth and Android teams are both thinking about a new type of structured task for use in their "suggested edit" feeds. The task would involve newcomers adding images to articles that don't have any.

  • For Android, it would likely be that a user is presented with an image and asked if it is appropriate for illustrating a given article.
    • The core use case is that article X in English has a lead image, but does not have a lead image in Hindi. So we ask a user who speaks both English and Hindi to verify that it is ok to copy the lead image from English to Hindi.
    • A secondary use case is to copy an image from article X to related article Y, whether within the same language or between languages.
  • For Growth, it would likely be that a user is on an article, and receives a suggestion for an image to illustrate that article.

This task is for @Miriam to attempt a simple approach for suggesting these article/image pairs using Wikidata and interlanguage links (and whatever other methods she recommends). The output might look like:

For a set of article titles lacking any images, list one or more image suggestions from Commons, perhaps ranked in some order. Perhaps this could be done for 100 random unillustrated articles in each of these wikis: enwiki, frwiki, arwiki, kowiki, cswiki, viwiki.

We can take these suggestions and ask native speakers to evaluate whether they are good matches, and then have a sense for the accuracy of our approach.

It would also be valuable to know for how many articles in each of those wikis suggestions can be made, so we know how deep the pool of suggestions might be in a given wiki.

Event Timeline

MMiller_WMF added subscribers: Ramsey-WMF, Abit.

@Miriam -- here's the task that we talked about making so you could give this approach a try. What do you think? Does this sound doable? On what timeline would you prefer?

@Charlotte -- did I accurately represent Android's plans and needs here? Please feel free to edit the task description.

@Ramsey-WMF and @Abit -- Miriam volunteered to generate some results with this simple approach so that we could take a look. Please follow along if you want!

@Marshall thanks for putting this together. The task sounds feasible to me.
As discussed, the method can be built with different levels of complexity. The basic level involves language interlinks + structured search, with some light refinement based on computer vision.
I should be able to work on this task in the first half of Q1. @leila FYI

Thanks @MMiller_WMF - I updated the task description to reflect our use cases. Glad to hear that this looks feasible @Miriam - image tools are pencilled into the Android roadmap for Q2, so it looks like we might be the first to work with this model. Happy to talk through Android-related specifics if that's useful.

Update!
I worked on estimating the number of pages missing images for each language using the imagelinks table.

  • One issue is that the imagelinks table stores, for each page, all images, including icons, flags, and wikiproject and Wikimedia project logos.
  • These images are not informative of the article, as they can be found in most Wikipedia pages and they mostly indicate the presence of metadata. For example, the image below is present in thousands of articles in frwiki, as it indicates the result of a disambiguation.
  • We need a way to exclude these images from the image counts: we need a whitelist of allowed images, so that if an article only has these very frequent images, it is counted as not having images.
  • To generate this list, I counted the frequency of all images in a Wikipedia, and put images in the whitelist if they are present in more than 1/2000 of the pages, or in 100 or more pages. These thresholds are arbitrary and were mainly decided by eyeballing the results. We might want to have some tradeoff plots to understand the impact of these thresholds. Examples of whitelisted images for frwiki:

Results:

arwiki

total pages:1050283
pages without images:599286
% without images:0.5705947825490844
allowed images:2769

frwiki

total pages:2231963
pages without images:1036098
% without images:0.4642093081292118
allowed images:5183

kowiki

total pages:502116
pages without images:289668
% without images:0.5768945821284325
allowed images:1283

enwiki

total pages:6112529
pages without images:3124728
% without images:0.5112005194576582
allowed images:11234

viwiki

total pages:1249195
pages without images:931057
% without images:0.7453255896797537
allowed images:1586

cswiki

total pages:457829
pages without images:176739
% without images:0.3860371448728674
allowed images:629
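
As a rough illustration of the frequency-based filter described above, here is a minimal sketch. It assumes the imagelinks data has already been loaded into a plain dictionary mapping each page to its set of image names; the function names and the in-memory representation are hypothetical and are not the code used for the numbers above.

```python
from collections import Counter


def frequent_images(page_images, min_fraction=1 / 2000, min_pages=100):
    """Return images that appear on too many pages to be informative.

    page_images maps a page title to the set of image names linked from it
    (a hypothetical stand-in for rows of the imagelinks table).
    """
    total_pages = len(page_images)
    counts = Counter(img for imgs in page_images.values() for img in set(imgs))
    return {
        img
        for img, n in counts.items()
        if n >= min_pages or n / total_pages > min_fraction
    }


def unillustrated_pages(page_images, ignored):
    """Pages that have no images, or only images from the ignored set."""
    return [page for page, imgs in page_images.items() if not (set(imgs) - ignored)]
```

With the default thresholds, an image is set aside if it appears on more than 1/2000 of the pages or on 100 or more pages. A page counts as unillustrated when nothing remains after removing those images.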

Thanks for the update, @Miriam! The most important thing I'm seeing in your results is that very large swaths of these wikis are unillustrated: with these numbers, we should have plenty of articles of every topic for newcomers to add images to.

Regarding the "allowed images", I think your method of excluding them makes sense to me. In the numbers you posted, when it says like allowed images: 629, does this mean it's 629 distinct images that were allowed? Or distinct articles?

Related to the "allowed images", I think that we are generally more interested in accuracy than coverage. With so many articles that are unillustrated, I was thinking that if we naively restrict our suggestions to articles that have no images of any sort, there would still be lots of recommendations. Or is it that large percentages of the articles have those "allowed images"?

And what are your next steps in general, @Miriam?

Hi @MMiller_WMF, thanks for the feedback!

Thanks for the update, @Miriam! The most important thing I'm seeing in your results is that very large swaths of these wikis are unillustrated: with these numbers, we should have plenty of articles of every topic for newcomers to add images to.

Yes, one issue is - for how many of those will we have candidate images (see next steps)?

Regarding the "allowed images", I think your method of excluding them makes sense to me. In the numbers you posted, when it says like allowed images: 629, does this mean it's 629 distinct images that were allowed? Or distinct articles?

629 distinct images are detected as frequent (logos/flags/icons) and added to the whitelist.

Related to the "allowed images", I think that we are generally more interested in accuracy than coverage. With so many articles that are unillustrated, I was thinking that if we naively restrict our suggestions to articles that have no images of any sort, there would still be lots of recommendations. Or is it that large percentages of the articles have those "allowed images"?

I initially tried the conservative approach, however the majority of the unillustrated articles have those icons and logos which we include in the "allowed images".
If accuracy is key, we can increase the threshold on the number of pages containing an image in the whitelist, so as to make sure that those "allowed" images are very frequent. The number of articles which are completely free of images is much lower, and varies across wikis, as it depends on the language-specific policies for logo/icon usage; see the stats below:

frwiki

total pages:2231963
pages without images:68
% without images:3.046645486506721e-05

enwiki

total pages:6112529
pages without images:621083
% without images:0.10160818868916614

viwiki

total pages:1249195
pages without images:55919
% without images:0.04476402803405393

arwiki

total pages:1050283
pages without images:95
% without images:9.045181155936067e-05

kowiki

total pages:502116
pages without images:51566
% without images:0.10269738466808466

cswiki

total pages:457829
pages without images:47661
% without images:0.10410218662426364

And what are your next steps in general, @Miriam?

The next steps would be:

  1. Find information about Wikidata items of unillustrated articles.
    • Item ID and Label
    • Instance-of (which class does the article belong to?)
    • Item Image (this is to estimate the % of unillustrated articles which can be resolved using the WD image)
    • Page links (pages in other languages linked to the WD item)
    • Commons category (some items are linked to a category in Wikimedia Commons)
  2. For each Wikidata item, get image candidates:
    • Get images on pages linked to the WD item - imagelinks
    • Get images in Commons tagged with the Wikidata item (@EBernhardson what is the fastest way to do this? Do we have anything available in Hive? Or - could I directly retrieve MediaSearch results somewhere?)
    • Get images in categories linked to the WD item
  3. For each Wikidata item, refine the image candidate list:
    • Find most relevant based on string similarity
    • If the item is a person, run a face detector to make sure that the image is a portrait (we can have more of these kinds of constraints)
    • Rank images by quality based on quality classifier.
  1. For each Wikidata item, get image candidates:
    • Get images in Commons tagged with the Wikidata item (@EBernhardson what is the fastest way to do this? Do we have anything available in Hive? Or - could I directly retrieve MediaSearch results somewhere?)

Currently the only way to access them would probably be fetching the CirrusSearch dumps and loading them into Hive. I have a tiny Spark script (P11997) that massages the Cirrus dump format into something Spark can natively read. From there it can be stored to Hive and then queried as normal. The dumps also have to be copied over from the dump servers, and ideally split up into many input files to allow parallelism of the read, but it should finish eventually even when fed the initial file as one giant piece, I think.

Thanks @EBernhardson, I'll take a look! Fyi, @nettrom_WMF created T258834 to get a table of structured data entities on Commons in Hive.

  1. Find information about Wikidata items of unillustrated articles.
    • Item ID and Label
    • Instance-of (which class does the article belong to?)
    • Item Image (this is to estimate the % of unillustrated articles which can be resolved using the WD image)
    • Page links (pages in other languages linked to the WD item)
    • Commons category (some items are linked to a category in Wikimedia Commons)

Update: I just completed this first step. I am only missing pagelinks, which I will incorporate into the next step.

  • I extracted the Wikidata ID, label and properties for the top 500,000 unillustrated articles by length (I couldn't do it for all of them due to limited computational resources)
  • The most interesting finding is that a non-negligible percentage of those articles already has an image attached to the Wikidata item, or a commons category linked to the Wikidata item:

frwiki

number of wikidata items with images: 8120
number of wikidata items with commons category: 25663

cswiki

number of wikidata items with images: 7099
number of wikidata items with commons category: 20005

enwiki

number of wikidata items with images: 8833
number of wikidata items with commons category: 19974

viwiki

number of wikidata items with images: 39261
number of wikidata items with commons category: 45433

arwiki

number of wikidata items with images: 6195
number of wikidata items with commons category: 26553

kowiki

number of wikidata items with images: 20776
number of wikidata items with commons category: 31840
  • Note that to run these experiments at scale, I have to work on Hive data rather than the SQL replicas. The tables I am working with are updated once a month, at the beginning of the month. This means that at the time of the analysis, some of the articles which were unillustrated at the beginning of the month might already have been resolved. This is especially true this month, as July was the month of the #WPWP campaign.
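
For readers who want to reproduce the per-item fields listed in this step, the sketch below pulls them out of a Wikidata entity in its standard JSON form. It assumes the entity has already been loaded into a Python dict. The helper names are hypothetical. The property IDs are the standard ones: P31 (instance of), P18 (image) and P373 (Commons category). This is only an illustration, not the Hive-based extraction used for the numbers above.

```python
def claim_values(entity, prop):
    """Collect the plain values of one property from a Wikidata entity dict."""
    values = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" statements
        value = snak["datavalue"]["value"]
        # Item-valued properties (e.g. P31) wrap the QID in a dict.
        if isinstance(value, dict) and "id" in value:
            value = value["id"]
        values.append(value)
    return values


def summarize_item(entity, lang="en"):
    """Gather the fields listed in step 1 for a single item."""
    return {
        "qid": entity["id"],
        "label": entity.get("labels", {}).get(lang, {}).get("value"),
        "instance_of": claim_values(entity, "P31"),
        "images": claim_values(entity, "P18"),
        "commons_categories": claim_values(entity, "P373"),
        "sitelinks": sorted(entity.get("sitelinks", {})),
    }
```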

P.S. Top topics of articles missing images, by language (topic = value of Wikidata "instance of" property)
kowiki
"television series": 3361
"taxon": 4087
"film": 12093
"Wikimedia disambiguation page": 19231
"human": 34802
arwiki
"bilateral relation": 12588
"sports season": 12782
"village": 21848
"asteroid": 25795
"human": 196864
viwiki
"commune of France": 10791
"asteroid": 11867
"village in Turkey": 14077
"human": 27665
"taxon": 321621
frwiki
"taxon": 10012
"sports season": 12577
"film": 22613
"asteroid": 42014
"human": 166133
enwiki
"sporting event": 7191
"tennis event": 12566
"sports season": 31474
"Wikimedia list article": 39924
"human": 171394
cswiki
"sports season": 2459
"film": 4293
"album": 4603
"Wikimedia disambiguation page": 16909
"human": 41733

  1. For each Wikidata item, get image candidates:
    • Get images on pages linked to the WD item - imagelinks
    • Get images in Commons tagged with the Wikidata item
    • Get images in categories linked to the WD item

Update on step 2. After generating lists of unillustrated Wikipedia articles, and extracting some properties on the top 500k by length, I focused on finding image candidates for such unillustrated articles:

  • First, I created a global whitelist of images which are very frequent across the 6 languages such as icons and logos. The purpose of this step is to avoid collecting such whitelisted images as potential candidates for unillustrated articles. Note: This step has to be refined to include frequent images across *all* languages.
  • As proof of concept, I took 1000 random unillustrated articles for each language, and searched for image candidates across different sources:
    • Image Interlinks: For each of the unillustrated pages, e.g. 'Barack Obama' in English, I collected via Wikidata the corresponding articles in other languages, e.g. 'Barack Obama' in Spanish, and extracted the images in those articles, if any (e.g. President_Barack_Obama.jpg), excluding the whitelisted images. This was done using the mediawiki-imagelinks and mediawiki-pagelinks tables in Hive.
    • Wikidata Image: From the previous step, I had information about the image (P18) linked to the Wikidata item of each unillustrated page.
    • Wikidata Commons Category: From the previous step, I had information about the Commons category (P373) linked to the Wikidata item of each unillustrated page. I used the categorylinks table in the MySQL replicas (we don't have this table in Hive) to get the list of images in the Commons category specified for a page/QID, when present.
    • Structured Data: For each Wikidata item of unillustrated articles, I searched for Commons images tagged with the corresponding QID in the 'depicts' statement. To do so, I used the MediaWiki API on Commons (example query). This is a temporary solution while we wait to have structured data on Hive (T258834).
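
The example query linked above is not reproduced in this thread. As an assumption about what such a lookup can look like, the sketch below builds a search request against the Commons API using the `haswbstatement` search keyword with P180 (depicts), and extracts file titles from the JSON response. The function names are hypothetical. The sketch only builds the URL and parses a response, so fetching is left to the caller.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def depicts_search_url(qid, limit=50):
    """Build a search URL for Commons files whose 'depicts' (P180) is qid."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:P180={qid}",
        "srnamespace": 6,  # File namespace
        "srlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def titles_from_search(response):
    """Extract file titles from a decoded list=search JSON response."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]
```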

Below are some stats on the volume of image candidates retrieved for 1000 unillustrated articles, broken down by source type.

  • Main findings:
    • Between 37% (enwiki, 368/1000) and 79% (viwiki, 789/1000) of the unillustrated articles have at least one matching image.
    • We find, on average, a lot more candidates for smaller wikis.
    • Interlinks are the largest source of image candidates, structured data the smallest.
    • [not reported below] Articles with matching image candidates are mostly about "human", "taxon", "comune", and "year". These stats might not be 100% reliable as the large majority of articles do not have a 'P31' associated to their Wikidata item, so we don't really know their class.
  • Limitation: these numbers might be an overestimation as some matching images might still be icons or logos.

enwiki

interlinks -- found for: 366 articles; avg candidates/article:4.5
wikidata image -- found for: 1 articles; avg candidates/article:1.0
wikidata category -- found for: 7 articles; avg candidates/article:33.1
structured data -- found for: 5 articles; avg candidates/article:1.8
**tot articles with candidates: 368; avg candidates/article: 5.1**

arwiki

interlinks -- found for: 501 articles; avg candidates/article:7.2
wikidata image -- found for: 12 articles; avg candidates/article:1.0
wikidata category -- found for: 56 articles; avg candidates/article:26.7
structured data -- found for: 11 articles; avg candidates/article:14.0
**tot articles with candidates: 507; avg candidates/article: 10.4**

frwiki

interlinks -- found for: 568 articles; avg candidates/article:6.5
wikidata image -- found for: 4 articles; avg candidates/article:1.0
wikidata category -- found for: 18 articles; avg candidates/article:11.3
structured data -- found for: 3 articles; avg candidates/article:4.3
**tot articles with candidates: 573; avg candidates/article: 6.8**

viwiki

interlinks -- found for: 787 articles; avg candidates/article:6.6
wikidata image -- found for: 33 articles; avg candidates/article:1.0
wikidata category -- found for: 49 articles; avg candidates/article:25.4
structured data -- found for: 21 articles; avg candidates/article:1.8
**tot articles with candidates: 789; avg candidates/article: 8.2**

cswiki

interlinks -- found for: 600 articles; avg candidates/article:14.2
wikidata image -- found for: 23 articles; avg candidates/article:1.0
wikidata category -- found for: 95 articles; avg candidates/article:27.6
structured data -- found for: 23 articles; avg candidates/article:3.9
**tot articles with candidates: 606; avg candidates/article: 18.6**

kowiki

interlinks -- found for: 446 articles; avg candidates/article:16.8
wikidata image -- found for: 51 articles; avg candidates/article:1.0
wikidata category -- found for: 105 articles; avg candidates/article:42.4
structured data -- found for: 30 articles; avg candidates/article:18.7
**tot articles with candidates: 448; avg candidates/article: 28.0**

Thanks for the detailed update @Miriam - this looks very promising for our purposes!

Wow, @Miriam! Your detailed updates are great and really help us follow along. I agree that this is promising. Are we at the point that we can try to validate the results?

I'm thinking we could do that if you post the resulting dataset. Something like one row per article URL with its best image match URL, image source, and whether an icon/logo was suppressed? I think that would be 6,000 rows (1,000 articles x 6 languages), right? If you sort it randomly inside each of the languages, then we can give it to our ambassadors to validate. They can go down the list (maybe for the first 100 or so) and mark whether the image is a good match or not.

Thanks @MMiller_WMF!
I think I can generate what you are asking by the end of next week. I am right now working on finding what you call the "best image match" - i.e. the top among the N image candidates for a given page. To do this, I am thinking of using a mixture of existing information (e.g. whether an image is a page image or a Wikidata image), and computer vision tools to further refine the selection by quality/content.

Sounds good, @Miriam. Right now is a good time for us to give this validation task to our ambassadors, because their to-do lists are relatively light. Therefore, I think it would be good if we can give them something next week, even if it's not complete. I consider this to be a proof-of-concept to see if we are on the right track, and we'll be able to plan lots of future improvements.

@Miriam -- someone pointed out to me today that the Page Previews extension uses a "Lead Image API" to identify which of the several images in an article is the "lead image". I wanted to point this out in case it's not part of your thinking yet: perhaps something there can help determine which of the images in a cross-language article is "best".

Hi @MMiller_WMF . I did a first pass on refining the image candidates.

Method: For each unillustrated article, I gathered all image candidates and found the top image + some additional candidates:

  • If there is an image coming from Wikidata, I choose that one as the top image
  • If the page is about a person (instance of 'Q5'), I eliminate all the images without faces based on the MTCNN face detector
  • To quickly solve the problem of icons and logos, I get rid of all the SVG and PNG images. This is an extremely conservative approach, and there is massive room for improvement. However, this is very much needed if we want to increase accuracy.
  • If there is more than one image candidate left after the above refinement, I choose the top image based on photographic quality, estimated via our computer vision classifier.
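
The refinement rules above can be sketched roughly as follows. The face detector and the quality classifier are passed in as plain callables (`has_face`, `quality`), because their real interfaces are not described in this thread. The order in which the rules interact is also an assumption: here the filters run first, and a Wikidata image wins if it survives them.

```python
def pick_top_image(candidates, wikidata_images, is_person, has_face, quality):
    """Return (top_image, other_candidates, notes) for one article.

    has_face(name) -> bool and quality(name) -> float are hypothetical
    stand-ins for the face detector and the quality classifier.
    """
    notes = [f"Starting with {len(candidates)} candidates"]
    remaining = list(candidates)

    if is_person:
        with_faces = [c for c in remaining if has_face(c)]
        dropped = len(remaining) - len(with_faces)
        notes.append(f"Person: dropped {dropped} images without faces")
        remaining = with_faces

    # Conservative icon/logo filter: drop every SVG and PNG file.
    photos = [c for c in remaining if not c.lower().endswith((".svg", ".png"))]
    notes.append(f"Dropped {len(remaining) - len(photos)} SVG/PNG images")
    remaining = photos

    if not remaining:
        return None, [], notes

    from_wikidata = [c for c in remaining if c in wikidata_images]
    if from_wikidata:
        top = from_wikidata[0]
        notes.append("Wikidata image")
    else:
        top = max(remaining, key=quality)
    others = [c for c in remaining if c != top]
    return top, others, notes
```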

Results: After the above filtering, far fewer articles have at least 1 image candidate than in the previous step:

**frwiki**: 257/1000 articles with image candidates, average candidates per article: 3.6
**cswiki**: 355/1000 articles with image candidates, average candidates per article: 9.2
**kowiki**: 281/1000 articles with image candidates, average candidates per article: 16.6
**enwiki**: 87/1000 articles with image candidates, average candidates per article: 6.4
**viwiki**: 140/1000 articles with image candidates, average candidates per article: 16.8
**arwiki**: 223/1000 articles with image candidates, average candidates per article: 9.0
  • Important: this might be an overestimate of the available images. Some candidate images are not available on Commons, only on one specific Wikipedia language edition, because of licensing issues; see for example this image.

Output Lists: I saved the output image candidate lists in the following tab-separated format:
Wikidata id, Wikipedia page id, Page title, Top image, Other candidates (|-separated), Notes

  • The notes section logs the steps of refinement that the image candidates went through, for example: Starting with 128 candidates; Wikidata image; Person: dropped 102 images without faces; - remaining: 26; Dropped 0 SVG/PNG images - remaining: 26; Top image quality:0.6372377
  • When the top image quality is 0, it might mean that the image cannot be found on Commons, likely because of the licensing issues mentioned above.
  • Articles in each list are ranked by total number of image candidates found.
  • Important: as mentioned earlier, since the data about unillustrated articles is at least 1 month old, some articles might have been already illustrated in the meantime.
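
For anyone loading these lists programmatically, a row in the format described above could be parsed along these lines. This is a minimal sketch: the `Suggestion` type and the field names are hypothetical, and it assumes one article per line with exactly the six tab-separated columns listed.

```python
from typing import List, NamedTuple


class Suggestion(NamedTuple):
    wikidata_id: str
    page_id: int
    page_title: str
    top_image: str
    other_candidates: List[str]
    notes: str


def parse_row(line):
    """Parse one tab-separated row of an output list."""
    qid, page_id, title, top, others, notes = line.rstrip("\n").split("\t")
    return Suggestion(
        wikidata_id=qid,
        page_id=int(page_id),
        page_title=title,
        top_image=top,
        # The "other candidates" column is |-separated and may be empty.
        other_candidates=[c for c in others.split("|") if c],
        notes=notes,
    )
```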

Please let me know if you need further assistance on my end, happy to modify some parts, or render the lists in a more suitable format for the ambassadors.

@Charlotte @leila FYI.

Thanks @Miriam - is there a programmatic way to exclude the images with such licensing issues? This would be a great help for our purposes, so we do not suggest images that users cannot actually add.

I should point out a potentially significant technical hurdle that I'm not seeing covered here. (and apologies if I missed it!)

The hurdle is: Suppose we present the user with an article that is missing an image, and we suggest an image to add to the article; but then, where exactly is the user supposed to add it? (i.e. at which point in the article?)
The concept of a "lead image" does not really exist (the lead image is chosen programmatically from existing images in the article). The image needs to be integrated into the article, whether it's added to the infobox of the article, or put inline in a specific paragraph, or put into a "gallery" section, and so on. But it can't just be an image that sits at the top without any context. What is our thinking for the user experience of structuring the image into the article?

@Miriam -- thanks for posting this! I started looking at the data a little bit, but then I kept going because it was so interesting. I went through 40 random rows from the Czech Wikipedia file and classified them, along with notes on the classifications. I only looked at the "top image" candidate. I think this exercise, even though it was only 40 rows, surfaced a bunch of important considerations, and I wanted to send them over to you in case there were simple changes you want to make before we ask ambassadors to do a deeper evaluation. What do you think?

My main takeaways are:

  • I think this method has lots of potential, but we are clearly going to bump into many algorithmic and design challenges.
  • There are many good recommendations in the set of 40 that I looked at, but most of them are probably not good.
  • There are many recommendations that would be good as long as they have a good caption. For instance, recommending a photo of the director of a movie. The question (and challenge) is: how will the user receiving the recommendation know that person is the director of the movie and write an appropriate caption?
  • There are still some cases that I hope we can find ways of excluding. For instance, a wiki might use a particular image of a flower in its navbox that goes on every flower article. Therefore, it is recommended on a different wiki for a different flower. These situations result in misleading images, because a user might receive this recommendation, see that it is a flower, and agree to put it on an article about a flower -- but it would be the wrong flower.

I classified the cases like this:

2 = easily a good recommendation (12/40)
1 = a good recommendation, but the user would need context and to write a good caption (9/40)
0 = not a good recommendation (9/40)
-1 = looks good at first, but is actually misleading (4/40)
? = can't find this image on Commons (6/40)

Regarding @Charlotte's point -- those images that I can't find on Commons, are those the ones that you're saying have licensing issues? I agree that we should exclude those, and only use images from Commons.

My list is in this spreadsheet.

@Dbrant -- regarding where in the article to put the image, I think @RHo was thinking that we could use some rule like "if the article has an infobox then the image is added to the infobox, but if not then the image is added to the top of the article after the title." @RHo could chime in more. What do you think of that, @Dbrant?

Hi @Dbrant, yes this was my simple initial thought, esp. since Growth is only looking initially at adding images to completely unillustrated articles. We could also consider a follow on step if an image is added, which asks users to confirm or alter its placement. That might be a good way to help users level up in learning to edit, though I'm guessing we may start with simply adding to infobox/top of article initially.

Hi all, thank you for your comments.

  • There are many recommendations that would be good as long as they have a good caption. For instance, recommending a photo of the director of a movie. The question (and challenge) is: how will the user receiving the recommendation know that person is the director of the movie and write an appropriate caption?

Yes that is a very good point.
Question here: would this process be part of a normal "add an image" task, where you look for available free images and you add them to the article, together with a bit of contextual information? Also, would this problem be eased if we recommend images only for topics that are of interest to the editor? E.g., if someone is an expert in cinema, they would likely know that the director of a movie is X, or know where to find that information.

  • There are still some cases that I hope we can find ways of excluding. For instance, a wiki might use a particular image of a flower in its navbox that goes on every flower article. Therefore, it is recommended on a different wiki for a different flower. These situations result in misleading images, because a user might receive this recommendation, see that it is a flower, and agree to put it on an article about a flower -- but it would be the wrong flower.

I agree this is not good. This is a corner case of the icon/flag/svg/very frequent image problem. I am suggesting a potential way to reduce this problem below.

I classified the cases like this:

2 = easily a good recommendation (12/40)
1 = a good recommendation, but the user would need context and to write a good caption (9/40)
0 = not a good recommendation (9/40)
-1 = looks good at first, but is actually misleading (4/40)
? = can't find this image on Commons (6/40)

So this is telling me that, if we exclude the images that we are not able to find on Commons, the accuracy is 21/34, a bit more than 3/5, so roughly 62%. From a scientific perspective this is not bad, because the accuracy of a random image recommender system would be about 1 in 60 million. However, that is not the point: we need to make sure that accuracy is as close to 100% as we can. So a few things here:

  • To also address @Charlotte's point: there is no problem in excluding those images that we can't find on Commons, we just need to check that their url exists. I kept them for evaluation and coverage purposes only.
  • There is a method I haven't tried yet, due to lack of time and the fact that the page_props table is not on Hive (T258047): instead of retrieving all images from pages in other languages, I can retrieve the page image only, namely the main image of the article. This would definitely improve accuracy, because we won't have those misleading images in the candidate pool. However, the coverage might drop substantially.
  • If we would like to try the method above, it will require a bit of time for data processing, because I will have to go through the SQL replicas instead of working on Hive. But I'm happy to give it a try in the next week or so (the following 2 weeks I will be off), if you think that makes sense.
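
The accuracy figure quoted above can be reproduced from the classification counts of the 40 evaluated rows. In this small worked example, labels 2 and 1 count as good recommendations, and "?" (image not found on Commons) is optionally excluded from the denominator. The function name is hypothetical.

```python
# Counts from the 40-row cswiki evaluation, keyed by classification label.
counts = {"2": 12, "1": 9, "0": 9, "-1": 4, "?": 6}


def accuracy(counts, exclude_missing=True):
    """Share of good recommendations (labels 2 and 1) among evaluated rows."""
    good = counts["2"] + counts["1"]
    evaluated = sum(counts.values())
    if exclude_missing:
        evaluated -= counts["?"]  # images that could not be found on Commons
    return good / evaluated


print(f"{accuracy(counts):.3f}")                         # 21/34
print(f"{accuracy(counts, exclude_missing=False):.3f}")  # 21/40
```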

@Miriam -- okay, got it.

Question here: would this process be part of a normal "add an image" task, where you look for available free images and you add them to the article, together with a bit of contextual information? Also, would this problem be eased if we recommend images only for topics that are of interest to the editor? E.g., if someone is an expert in cinema, they would likely know that the director of a movie is X, or know where to find that information.

My answer to this is: I don't know yet. @RHo and I haven't really talked about how the design should work, or what to do about captions, or anything. It's pretty clear that's going to be important to figure out. Regarding topics of interest, we'll do that using ORES topic models the same way we are already allowing newcomers to choose suggested edits based on topics, so yes, that should help.

It sounds like my list of notes and findings doesn't surface any clear low-hanging fruit to change, except for excluding images that are not on Commons. Would you be able to re-export the datasets, excluding those kinds of images, and then I can give everything to the ambassadors for them to evaluate and make more notes, while you are off for two weeks?

I don't think we should try the "page image" strategy yet -- maybe we should after we see what the ambassadors come back with. How does that sound?

Hi @MMiller_WMF, please find the new lists below. I made the following changes:

  • Removing Images. I removed all images which are not available on Commons - and logged it in the notes. This reduces the number of articles with matching candidates:
**frwiki**: 180/1000 articles with image candidates, average candidates per article: 3.411111111111111
**cswiki**: 293/1000 articles with image candidates, average candidates per article: 7.7815699658703075
**kowiki**: 235/1000 articles with image candidates, average candidates per article: 15.13191489361702
**enwiki**: 68/1000 articles with image candidates, average candidates per article: 6.367647058823529
**viwiki**: 132/1000 articles with image candidates, average candidates per article: 14.136363636363637
**arwiki**: 178/1000 articles with image candidates, average candidates per article: 8.97191011235955
  • Accuracy Improvement. To help remove misleading images, I came up with a short-term solution. I calculated an image importance feature, based on the number of sources (Wikidata, Commons category, other Wikipedia articles, structured data tags) where an image candidate can be found. The number of sources is logged as "Max usage of top image" in the notes. When this number is higher than one, it is a signal that there is a strong link between the image and the unillustrated article.
  • Potential further improvement. Another way to eliminate misleading images, a sort of low-hanging fruit, is to generate a list of images occurring in more than X (100?) articles for each language (right now, we only have this data for the 6 languages we are analyzing). This would take less time than looking at the page image, but would require some iterations, which I can do next week before leaving. Please let me know if you want me to try this.
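
As a rough illustration of the two ideas in the list above (counting the sources that propose an image, and dropping images used in too many articles), here is a minimal Python sketch. The source names, the `rank_candidates` helper, the input layout, and the threshold of 100 are assumptions made for illustration; this is not the code that produced the lists.

```python
from collections import Counter

# Hypothetical source names, mirroring the four sources mentioned above.
SOURCES = (
    "wikidata",
    "commons_category",
    "other_wikipedia_articles",
    "structured_data_tags",
)


def source_count(candidates_by_source):
    """Map each image to the number of distinct sources that propose it."""
    counts = Counter()
    for source in SOURCES:
        # A set makes sure an image is counted at most once per source.
        for image in set(candidates_by_source.get(source, [])):
            counts[image] += 1
    return counts


def rank_candidates(candidates_by_source, article_usage=None, max_articles=100):
    """Rank candidates by source count, skipping images used in too many articles.

    `article_usage` (assumed input) maps an image to the number of articles
    that already use it in the given language.
    """
    article_usage = article_usage or {}
    counts = source_count(candidates_by_source)
    kept = [
        (image, n)
        for image, n in counts.items()
        if article_usage.get(image, 0) <= max_articles
    ]
    # Highest source count first; file name breaks ties for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


candidates = {
    "wikidata": ["A.jpg"],
    "commons_category": ["A.jpg", "B.jpg"],
    "other_wikipedia_articles": ["Flag.svg", "A.jpg"],
}
usage = {"Flag.svg": 5000}  # e.g. a flag icon reused across many articles
print(rank_candidates(candidates, usage))
```

In this toy input, `A.jpg` is proposed by three sources and ranks first, while `Flag.svg` is dropped because it appears in more than 100 articles.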

@Miriam -- I added a comment about the results of the evaluation here: T260857#6444769. Please check them out! Next, let's you and I and @Charlotte talk about what sorts of improvements can be made and on what timelines.

@MMiller_WMF: As discussed yesterday over Slack, I ran a test to understand the coverage of a simpler, more conservative approach for image suggestion.
I computed how many unillustrated articles we would be able to illustrate by just using the images (P18 property or Commons category) in their corresponding Wikidata item. This is a very conservative approach where we expect the accuracy to be pretty high. Please find below the raw numbers and the fraction of unillustrated articles that have images in Wikidata:

frwiki
Number of unillustrated articles with Wikidata image or Commons category: 57597
Fraction of easily solvable pages: 0.05566046123278881 (about 5.6%)
kowiki
Number of unillustrated articles with Wikidata image or Commons category: 39413
Fraction of easily solvable pages: 0.13333852076891328 (about 13.3%)
cswiki
Number of unillustrated articles with Wikidata image or Commons category: 23447
Fraction of easily solvable pages: 0.13265029022731645 (about 13.3%)
arwiki
Number of unillustrated articles with Wikidata image or Commons category: 33329
Fraction of easily solvable pages: 0.05556991298287502 (about 5.6%)
viwiki
Number of unillustrated articles with Wikidata image or Commons category: 98974
Fraction of easily solvable pages: 0.10661464516003938 (about 10.7%)
enwiki
Number of unillustrated articles with Wikidata image or Commons category: 130873
Fraction of easily solvable pages: 0.042392217657847224 (about 4.2%)
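
The coverage numbers above could be computed along these lines. This is only a sketch: it assumes each unillustrated article has already been joined to the claims of its Wikidata item, and that the Commons category is read from property P373. The `coverage` helper and the dictionary layout are hypothetical.

```python
def has_wikidata_image(claims):
    """True if the item has an image (P18) or a Commons category (P373)."""
    return bool(claims.get("P18")) or bool(claims.get("P373"))


def coverage(unillustrated):
    """Return (count, fraction) of articles solvable from Wikidata alone.

    `unillustrated` maps an article title to a dict of Wikidata claims
    (assumed layout), or to None when the article has no Wikidata item.
    """
    solvable = sum(
        1
        for claims in unillustrated.values()
        if claims and has_wikidata_image(claims)
    )
    total = len(unillustrated)
    return solvable, (solvable / total if total else 0.0)


articles = {
    "Article A": {"P18": ["Example.jpg"]},
    "Article B": {"P373": ["Example category"]},
    "Article C": {"P31": ["Q5"]},  # item exists, but has no image or category
    "Article D": None,             # no Wikidata item at all
}
count, fraction = coverage(articles)
print(count, fraction)
```

The second returned value is a fraction between 0 and 1, so it needs to be multiplied by 100 to be read as a percentage.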

Also, to give an idea of the bias in the data, I computed the main topics for the unillustrated articles which can be solved using Wikidata images. Below you can find the top-5 topics, and the number of unillustrated articles with Wikidata images in each topic.
frwiki
"human" 7285
No Label 3985
"events in a specific year or time period" 2961
"taxon" 2190
"calendar month of a given year" 1597
kowiki
"human" 6523
No Label 3560
"year" 1935
"film" 1562
"taxon" 1214
cswiki
No Label 2980
"human" 2092
"year" 1616
"family name" 807
"film" 652
arwiki
"human" 5446
No Label 3833
"events in a specific year or time period" 3774
"year" 1916
"bilateral relation" 1725
viwiki
"taxon" 38032
"commune of France" 29841
"comune of Italy" 4940
"human" 3853
"year" 1930
enwiki
"human" 26215
"family name" 19757
"taxon" 8401
"events in a specific year or time period" 7608
No Label 5500
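
For reference, a per-wiki topic tally like the one above can be produced with a simple counter. The sketch below assumes that the "topic" of an article is the English label of its item's "instance of" (P31) value, and that items without a usable label fall into "No Label"; both are assumptions about how the table was built, and `top_topics` is a hypothetical helper.

```python
from collections import Counter


def top_topics(article_to_p31, labels, k=5):
    """Count articles per topic label and return the k most common topics.

    `article_to_p31` maps an article title to a single P31 value (a QID) or
    None; `labels` maps QIDs to English labels. Both layouts are assumed.
    """
    counts = Counter()
    for qid in article_to_p31.values():
        label = labels.get(qid) if qid else None
        counts[label or "No Label"] += 1
    return counts.most_common(k)


articles = {
    "Article A": "Q5",
    "Article B": "Q5",
    "Article C": "Q16521",
    "Article D": None,
}
labels = {"Q5": "human", "Q16521": "taxon"}
for label, n in top_topics(articles, labels):
    print(label, n)
```

Items with several P31 values would need a rule for choosing one (or would be counted once per value), which this sketch leaves out.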

I ran a quick bias analysis on the results of this simple solution.
For all articles for which we can have image recommendations, I selected the biographies, and computed the distribution across genders, occupations, and continents.
Please find some plots attached.

Here is the list of image candidates using Wikidata only.
I mixed image candidates obtained by taking the Wikidata image with image candidates obtained by picking an image at random from the Commons category linked from Wikidata. The source of each candidate is listed in the notes.
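
A minimal sketch of how such a mixed candidate list could be assembled is shown below. The `wikidata_candidates` helper, the source strings, and the choice to exclude the Wikidata image from the random pick are assumptions for illustration; the comment above does not say whether both kinds of candidate were kept for the same article.

```python
import random


def wikidata_candidates(p18_image, category_images, rng):
    """Return (image, source) pairs for one article.

    `p18_image` is the file named in the item's image statement (or None);
    `category_images` is the list of files in the linked Commons category.
    """
    candidates = []
    if p18_image:
        candidates.append((p18_image, "wikidata image"))
    # Assumption: do not propose the same file twice.
    pool = [image for image in category_images if image != p18_image]
    if pool:
        candidates.append((rng.choice(pool), "commons category (random pick)"))
    return candidates


rng = random.Random(0)  # fixed seed so that the random pick is reproducible
print(wikidata_candidates("A.jpg", ["A.jpg", "B.jpg"], rng))
print(wikidata_candidates(None, [], rng))
```

Recording the source next to each candidate keeps the notes column usable when evaluators compare the two kinds of suggestion.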


Hi @Dbrant, yes, this was my simple initial thought, esp. since Growth is initially only looking at adding images to completely unillustrated articles. We could also consider a follow-on step once an image is added, which asks users to confirm or alter its placement. That might be a good way to help users level up in learning to edit, though I'm guessing we may start with simply adding the image to the infobox or top of the article.

It might also be wise in the long term to focus on articles which do not have any infobox-like template at all, if there are enough such cases to provide suggestions. In general, it may not be very productive to manually insert wikitext into articles which could achieve the same result or better simply by configuring the respective templates to use the P18 image from Wikidata: you risk adding a lot of technical debt which later needs to be handled with even more edits. (I'm talking about things like adding an image link to thousands of English Wikipedia articles about Polish villages, which could get such an image simply by changing one line in the template.)

@Nemo_bis -- thank you for checking out this task and weighing in! That's a good point; I didn't know that infobox templates could be configured to draw images from Wikidata (maybe others on this task did). I'm glad we have this note on the task now for when we are figuring out how to do this in earnest (our detailed planning is still several weeks away). I hope you keep following along and point out anything else you think of or notice. This project will eventually have a mediawiki.org page where we can keep the discussion going with community members.

MMiller_WMF renamed this task from Image suggestion proof-of-concept to Image matching algorithm. Jan 28 2021, 12:03 AM