
Image suggestion proof-of-concept
Open, Needs Triage · Public

Description

The Growth and Android teams are both thinking about a new type of structured task for use in their "suggested edit" feeds. The task would involve newcomers adding images to articles that don't have any.

  • For Android, it would likely be that a user is presented with an image and asked if it is appropriate for illustrating a given article.
    • The core use case is that article X in English has a lead image, but does not have a lead image in Hindi. So we ask a user who speaks both English and Hindi to verify that it is ok to copy the lead image from English to Hindi.
    • A secondary use case is to copy an image from article X to related article Y, whether within the same language or between languages.
  • For Growth, it would likely be that a user is on an article, and receives a suggestion for an image to illustrate that article.

This task is for @Miriam to attempt a simple approach for suggesting these article/image pairs using Wikidata and interlanguage links (and whatever other methods she recommends). The output might look like:

For a set of article titles lacking any images, list one or more image suggestions from Commons, perhaps ranked in some order. Perhaps this could be done for 100 random unillustrated articles in each of these wikis: enwiki, frwiki, arwiki, kowiki, cswiki, viwiki.

We can take these suggestions and ask native speakers to evaluate whether they are good matches, and then have a sense for the accuracy of our approach.

It would also be valuable to know for how many articles in each of those wikis suggestions can be made, so we know how deep the pool of suggestions might be in a given wiki.

Event Timeline

Restricted Application added subscribers: revi, Aklapper. · Jun 22 2020, 11:49 PM
MMiller_WMF updated the task description. · Jun 22 2020, 11:51 PM
MMiller_WMF added subscribers: Ramsey-WMF, Abit.

@Miriam -- here's the task that we talked about making so you could give this approach a try. What do you think? Does this sound doable? On what timeline would you prefer?

@Charlotte -- did I accurately represent Android's plans and needs here? Please feel free to edit the task description.

@Ramsey-WMF and @Abit -- Miriam volunteered to generate some results with this simple approach so that we could take a look. Please follow along if you want!

RHo updated the task description. · Jun 23 2020, 8:08 AM
RHo added a subscriber: mwilliams. · Jun 23 2020, 8:11 AM
> @Miriam -- here's the task that we talked about making so you could give this approach a try. What do you think? Does this sound doable? On what timeline would you prefer?

@Marshall thanks for putting this together. The task sounds feasible to me.
As discussed, the method can be built with different levels of complexity. The basic level involves language interlinks + structured search, with some light refinement based on computer vision.
I should be able to work on this task in the first half of Q1. @leila FYI

Charlotte updated the task description. · Jun 24 2020, 4:05 PM

Thanks @MMiller_WMF - I updated the task description to reflect our use cases. Glad to hear that this looks feasible @Miriam - image tools are pencilled into the Android roadmap for Q2, so it looks like we might be the first to work with this model. Happy to talk through Android-related specifics if that's useful.

JMinor added a subscriber: JMinor. · Jun 29 2020, 4:16 PM

Update!
I worked on estimating the number of pages missing images for each language using the imagelinks table.

  • One issue is that the imagelinks table stores, for each page, all images, including icons, flags, and WikiProject and Wikimedia project logos.
  • These images are not informative about the article: they appear on most Wikipedia pages and mostly indicate the presence of metadata. For example, the image below is present in thousands of articles in frwiki, as it indicates a disambiguation.
  • We need a way to exclude these images from the image counts: we need a whitelist of such images, so that an article containing only these very frequent images is counted as having no images.
  • To generate this list, I counted the frequency of every image in a given Wikipedia and put an image on the whitelist if it appears in more than 1/2000 of the pages, or in 100 or more pages. These thresholds are arbitrary and were chosen mainly by eyeballing the results. We might want some tradeoff plots to understand the impact of the thresholds. Examples of whitelisted images for frwiki:

Results:

arwiki

total pages: 1050283
pages without images: 599286
% without images: 57.06
allowed images: 2769

frwiki

total pages: 2231963
pages without images: 1036098
% without images: 46.42
allowed images: 5183

kowiki

total pages: 502116
pages without images: 289668
% without images: 57.69
allowed images: 1283

enwiki

total pages: 6112529
pages without images: 3124728
% without images: 51.12
allowed images: 11234

viwiki

total pages: 1249195
pages without images: 931057
% without images: 74.53
allowed images: 1586

cswiki

total pages: 457829
pages without images: 176739
% without images: 38.60
allowed images: 629
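
As a rough illustration of the frequency-based list described above, here is a minimal Python sketch. It assumes the imagelinks data has already been loaded into a plain dictionary mapping each page title to the image names it links; the function names and the in-memory representation are hypothetical and not the code used for the numbers above.

```python
from collections import Counter


def build_frequent_image_list(page_images, min_fraction=1 / 2000, min_pages=100):
    """Return images frequent enough to be treated as icons/logos.

    page_images: {page_title: [image_name, ...]} (assumed input format).
    An image qualifies if it appears in more than `min_fraction` of all
    pages, or in `min_pages` or more pages.
    """
    total_pages = len(page_images)
    # Count each image once per page, even if a page links it several times.
    page_counts = Counter(
        image for images in page_images.values() for image in set(images)
    )
    return {
        image
        for image, n in page_counts.items()
        if n > total_pages * min_fraction or n >= min_pages
    }


def count_unillustrated(page_images, frequent_images):
    """Count pages that have no image outside the frequent-image list."""
    return sum(
        1 for images in page_images.values() if not (set(images) - frequent_images)
    )


if __name__ == "__main__":
    # Toy data with toy thresholds, only to show the mechanics.
    pages = {
        "A": ["Flag.svg", "Photo_A.jpg"],
        "B": ["Flag.svg"],
        "C": [],
        "D": ["Flag.svg", "Disambig.svg"],
    }
    frequent = build_frequent_image_list(pages, min_fraction=0.5, min_pages=3)
    print(sorted(frequent), count_unillustrated(pages, frequent))
```

Re-running `count_unillustrated` for a range of threshold values would give the data for the tradeoff plots mentioned above.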

Thanks for the update, @Miriam! The most important thing I'm seeing in your results is that very large swaths of these wikis are unillustrated: with these numbers, we should have plenty of articles of every topic for newcomers to add images to.

Regarding the "allowed images", I think your method of excluding them makes sense to me. In the numbers you posted, when it says like allowed images: 629, does this mean it's 629 distinct images that were allowed? Or distinct articles?

Related to the "allowed images", I think that we are generally more interested in accuracy than coverage. With so many articles that are unillustrated, I was thinking that if we naively restrict our suggestions to articles that have no images of any sort, there would still be lots of recommendations. Or is it that large percentages of the articles have those "allowed images"?

And what are your next steps in general, @Miriam?

Hi @MMiller_WMF, thanks for the feedback!

> Thanks for the update, @Miriam! The most important thing I'm seeing in your results is that very large swaths of these wikis are unillustrated: with these numbers, we should have plenty of articles of every topic for newcomers to add images to.

Yes, one issue is - for how many of those will we have candidate images (see next steps)?

> Regarding the "allowed images", I think your method of excluding them makes sense to me. In the numbers you posted, when it says like allowed images: 629, does this mean it's 629 distinct images that were allowed? Or distinct articles?

629 distinct images are detected as frequent (logos/flags/icons) and added to the whitelist.

> Related to the "allowed images", I think that we are generally more interested in accuracy than coverage. With so many articles that are unillustrated, I was thinking that if we naively restrict our suggestions to articles that have no images of any sort, there would still be lots of recommendations. Or is it that large percentages of the articles have those "allowed images"?

I initially tried the conservative approach; however, the majority of the unillustrated articles contain those icons and logos, which we include in the "allowed images".
If accuracy is key, we can raise the minimum number of pages an image must appear in to enter the whitelist, so as to make sure that the "allowed" images are very frequent. The number of articles that are completely free of images is much lower and varies across wikis, as it depends on language-specific policies for logo/icon usage. See the stats below:

frwiki

total pages: 2231963
pages without any images: 68
% without any images: 0.0030

enwiki

total pages: 6112529
pages without any images: 621083
% without any images: 10.16

viwiki

total pages: 1249195
pages without any images: 55919
% without any images: 4.48

arwiki

total pages: 1050283
pages without any images: 95
% without any images: 0.0090

kowiki

total pages: 502116
pages without any images: 51566
% without any images: 10.27

cswiki

total pages: 457829
pages without any images: 47661
% without any images: 10.41

> And what are your next steps in general, @Miriam?

The next steps would be:

  1. Find information about Wikidata items of unillustrated articles.
    • Item ID and Label
    • Instance-of (which class does the article belong to?)
    • Item Image (this is to estimate the % of unillustrated articles which can be resolved using the WD image)
    • Page links (pages in other languages linked to the WD item)
    • Commons category (some items are linked to a category in Wikimedia Commons)
  2. For each Wikidata item, get image candidates:
    • Get images on pages linked to the WD item - imagelinks
    • Get images in Commons tagged with the Wikidata item (@EBernhardson what is the fastest way to do this? Do we have anything available in Hive? Or - could I directly retrieve MediaSearch results somewhere?)
    • Get images in categories linked to the WD item
  3. For each Wikidata item, refine the image candidate list:
    • Find most relevant based on string similarity
    • If the item is a person, run a face detector to make sure that the image is a portrait (we can add more constraints of this kind)
    • Rank images by quality using a quality classifier.
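
To make step 1 more concrete, here is a minimal Python sketch of pulling those fields out of one Wikidata entity in the standard Wikibase JSON layout (`labels`, `claims`, `sitelinks`). The helper names are hypothetical, the sample entity is a hand-written toy, and this is not the Hive-based extraction used in the analysis.

```python
def claim_values(entity, prop):
    """Return the raw values of all statements for one property."""
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        values.append(snak["datavalue"]["value"])
    return values


def summarize_item(entity, label_lang="en"):
    """Collect the fields listed in step 1 for a single item."""
    images = claim_values(entity, "P18")       # image
    categories = claim_values(entity, "P373")  # Commons category
    return {
        "qid": entity["id"],
        "label": entity.get("labels", {}).get(label_lang, {}).get("value"),
        # Item-valued statements store the target as {"id": "Q..."}.
        "instance_of": [v["id"] for v in claim_values(entity, "P31")],
        "image": images[0] if images else None,
        "commons_category": categories[0] if categories else None,
        "sitelinks": {
            site: link["title"] for site, link in entity.get("sitelinks", {}).items()
        },
    }


if __name__ == "__main__":
    toy_entity = {  # hand-written example, not real data
        "id": "Q1",
        "labels": {"en": {"language": "en", "value": "Example person"}},
        "claims": {
            "P31": [{"mainsnak": {"snaktype": "value",
                                  "datavalue": {"value": {"id": "Q5"}}}}],
            "P18": [{"mainsnak": {"snaktype": "value",
                                  "datavalue": {"value": "Example.jpg"}}}],
        },
        "sitelinks": {"enwiki": {"site": "enwiki", "title": "Example person"}},
    }
    print(summarize_item(toy_entity))
```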
>   2. For each Wikidata item, get image candidates:
>     • Get images in Commons tagged with the Wikidata item (@EBernhardson what is the fastest way to do this? Do we have anything available in Hive? Or - could I directly retrieve MediaSearch results somewhere?)

Currently the only way to access them would probably be to fetch the CirrusSearch dumps and load them into Hive. I have a tiny Spark script (P11997) that massages the Cirrus dump format into something Spark can read natively. From there it can be stored in Hive and then queried as normal. The dumps also have to be copied over from the dump servers, and ideally split up into many input files so the read can run in parallel, but I think it should finish eventually even when fed the initial file as one giant piece.
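
For readers unfamiliar with the dump layout, here is a small plain-Python sketch of the kind of reshaping involved. It assumes the dump uses the Elasticsearch bulk layout, where an action line such as `{"index": {"_id": ...}}` is followed by a line holding the document itself. This is not the P11997 script; the function name and the toy records are illustrative only.

```python
import json


def iter_cirrus_documents(lines):
    """Yield one flat dict per document from alternating action/document lines."""
    non_empty = (line for line in lines if line.strip())
    for action_line in non_empty:
        doc_line = next(non_empty, None)
        if doc_line is None:
            break  # trailing action line without a document
        action = json.loads(action_line)
        doc = json.loads(doc_line)
        # Carry the page id from the action line into the document itself,
        # so each output record is self-contained.
        doc["_id"] = action.get("index", {}).get("_id")
        yield doc


if __name__ == "__main__":
    toy_lines = [  # hand-written records in the assumed layout
        '{"index": {"_type": "page", "_id": "101"}}',
        '{"title": "Example_one.jpg", "namespace": 6}',
        '{"index": {"_type": "page", "_id": "102"}}',
        '{"title": "Example_two.jpg", "namespace": 6}',
    ]
    for doc in iter_cirrus_documents(toy_lines):
        print(json.dumps(doc))
```

Writing one JSON object per line, as above, is a shape that Spark can typically read directly as a table.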

Thanks @EBernhardson, I'll take a look! Fyi, @nettrom_WMF created T258834 to get a table of structured data entities on Commons in Hive.

>   1. Find information about Wikidata items of unillustrated articles.
>     • Item ID and Label
>     • Instance-of (which class does the article belong to?)
>     • Item Image (this is to estimate the % of unillustrated articles which can be resolved using the WD image)
>     • Page links (pages in other languages linked to the WD item)
>     • Commons category (some items are linked to a category in Wikimedia Commons)

Update: I just completed this first step. I am only missing pagelinks, which I will incorporate in the next step.

  • I extracted the Wikidata ID, label and properties for the top 500,000 unillustrated articles by length (I couldn't do it for all of them due to limited computational resources).
  • The most interesting finding is that a non-negligible share of those articles already have an image attached to the Wikidata item, or a Commons category linked to the Wikidata item:

frwiki

number of wikidata items with images: 8120
number of wikidata items with commons category: 25663

cswiki

number of wikidata items with images: 7099
number of wikidata items with commons category: 20005

enwiki

number of wikidata items with images: 8833
number of wikidata items with commons category: 19974

viwiki

number of wikidata items with images: 39261
number of wikidata items with commons category: 45433

arwiki

number of wikidata items with images: 6195
number of wikidata items with commons category: 26553

kowiki

number of wikidata items with images: 20776
number of wikidata items with commons category: 31840

  • Note that to run these experiments at scale, I have to work on Hive data rather than the SQL replicas. The tables I am working with are updated once a month, at the beginning of the month. This means that, at the time of the analysis, some of the articles that were unillustrated at the beginning of the month might already have been resolved. This is especially true this month, as July was the month of the #WPWP campaign.

P.S. Top topics of articles missing images, by language (topic = value of Wikidata "instance of" property)
kowiki
"television series": 3361
"taxon": 4087
"film": 12093
"Wikimedia disambiguation page": 19231
"human": 34802
arwiki
"bilateral relation": 12588
"sports season": 12782
"village": 21848
"asteroid": 25795
"human": 196864
viwiki
"commune of France": 10791
"asteroid": 11867
"village in Turkey": 14077
"human": 27665
"taxon": 321621
frwiki
"taxon": 10012
"sports season": 12577
"film": 22613
"asteroid": 42014
"human": 166133
enwiki
"sporting event": 7191
"tennis event": 12566
"sports season": 31474
"Wikimedia list article": 39924
"human": 171394
cswiki
"sports season": 2459
"film": 4293
"album": 4603
"Wikimedia disambiguation page": 16909
"human": 41733

>   2. For each Wikidata item, get image candidates:
>     • Get images on pages linked to the WD item - imagelinks
>     • Get images in Commons tagged with the Wikidata item
>     • Get images in categories linked to the WD item

Update on step 2. After generating lists of unillustrated Wikipedia articles, and extracting some properties on the top 500k by length, I focused on finding image candidates for such unillustrated articles:

  • First, I created a global whitelist of images which are very frequent across the 6 languages such as icons and logos. The purpose of this step is to avoid collecting such whitelisted images as potential candidates for unillustrated articles. Note: This step has to be refined to include frequent images across *all* languages.
  • As proof of concept, I took 1000 random unillustrated articles for each language, and searched for image candidates across different sources:
    • Image Interlinks: For each of the unillustrated pages, e.g. 'Barack Obama' in English, I collected via Wikidata the corresponding articles in other languages, e.g. 'Barack Obama' in Spanish, and extracted the images in those articles, if any (e.g. President_Barack_Obama.jpg), excluding the whitelisted images. This was done using the mediawiki-imagelinks and mediawiki-pagelinks tables in Hive.
    • Wikidata Image: From the previous step, I had information about the image (P18) linked to the Wikidata item of each unillustrated page.
    • Wikidata Commons Category: From the previous step, I had information about the Commons category (P373) linked to the Wikidata item of each unillustrated page. I used the categorylinks table in the MySQL replicas (we don't have this table in Hive) to get the list of images in the Commons category specified for a page/QID, when present.
    • Structured Data: For each Wikidata item of unillustrated articles, I searched for Commons images tagged with the corresponding QID in the 'depicts' statement. To do so, I used the MediaWiki Commons API (example query). This is a temporary solution while we wait to have structured data in Hive (T258834).
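
Two of the sources above can be sketched in a few lines of Python. The first function mirrors the interlinks idea on plain dictionaries (the real work used Hive tables; the data structures here are assumptions). The second only builds a Commons search URL using the `haswbstatement` search keyword with the 'depicts' property (P180); it is one plausible form of such a query and not necessarily the "example query" linked above. All function names are hypothetical.

```python
from urllib.parse import urlencode


def interlink_candidates(qid, sitelinks, page_images, excluded, target_wiki):
    """Collect images used on the same item's articles in other wikis.

    sitelinks:   {qid: {wiki: title}}             (assumed structure)
    page_images: {(wiki, title): [image, ...]}    (assumed structure)
    excluded:    set of frequent icon/logo images to ignore
    """
    candidates = set()
    for wiki, title in sitelinks.get(qid, {}).items():
        if wiki == target_wiki:
            continue  # only look at the other language editions
        candidates.update(page_images.get((wiki, title), []))
    return candidates - excluded


def depicts_search_url(qid, limit=10):
    """Build a Commons API search URL for files whose 'depicts' is `qid`."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:P180={qid}",
        "srnamespace": 6,  # File: namespace
        "srlimit": limit,
        "format": "json",
    }
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)


if __name__ == "__main__":
    sitelinks = {"Q76": {"enwiki": "Barack Obama", "eswiki": "Barack Obama"}}
    page_images = {
        ("eswiki", "Barack Obama"): ["President_Barack_Obama.jpg", "Commons-logo.svg"],
    }
    print(interlink_candidates("Q76", sitelinks, page_images,
                               {"Commons-logo.svg"}, "enwiki"))
    print(depicts_search_url("Q76"))
```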

Below are some stats on the volume of image candidates retrieved for 1000 unillustrated articles, broken down by source type.

  • Main findings:
    • Between about 37% (enwiki) and 79% (viwiki) of the unillustrated articles have at least one matching image.
    • We find, on average, a lot more candidates for smaller wikis.
    • Interlinks are the largest source of image candidates, structured data the smallest.
    • [not reported below] Articles with matching image candidates are mostly about "human", "taxon", "comune", and "year". These stats might not be 100% reliable, as the large majority of articles do not have a 'P31' associated with their Wikidata item, so we don't really know their class.
  • Limitation: these numbers might be an overestimate, as some matching images might still be icons or logos.

enwiki

interlinks -- found for: 366 articles; avg candidates/article:4.5
wikidata image -- found for: 1 articles; avg candidates/article:1.0
wikidata category -- found for: 7 articles; avg candidates/article:33.1
structured data -- found for: 5 articles; avg candidates/article:1.8
**tot articles with candidates: 368; avg candidates/article: 5.1**

arwiki

interlinks -- found for: 501 articles; avg candidates/article:7.2
wikidata image -- found for: 12 articles; avg candidates/article:1.0
wikidata category -- found for: 56 articles; avg candidates/article:26.7
structured data -- found for: 11 articles; avg candidates/article:14.0
**tot articles with candidates: 507; avg candidates/article: 10.4**

frwiki

interlinks -- found for: 568 articles; avg candidates/article:6.5
wikidata image -- found for: 4 articles; avg candidates/article:1.0
wikidata category -- found for: 18 articles; avg candidates/article:11.3
structured data -- found for: 3 articles; avg candidates/article:4.3
**tot articles with candidates: 573; avg candidates/article: 6.8**

viwiki

interlinks -- found for: 787 articles; avg candidates/article:6.6
wikidata image -- found for: 33 articles; avg candidates/article:1.0
wikidata category -- found for: 49 articles; avg candidates/article:25.4
structured data -- found for: 21 articles; avg candidates/article:1.8
**tot articles with candidates: 789; avg candidates/article: 8.2**

cswiki

interlinks -- found for: 600 articles; avg candidates/article:14.2
wikidata image -- found for: 23 articles; avg candidates/article:1.0
wikidata category -- found for: 95 articles; avg candidates/article:27.6
structured data -- found for: 23 articles; avg candidates/article:3.9
**tot articles with candidates: 606; avg candidates/article: 18.6**

kowiki

interlinks -- found for: 446 articles; avg candidates/article:16.8
wikidata image -- found for: 51 articles; avg candidates/article:1.0
wikidata category -- found for: 105 articles; avg candidates/article:42.4
structured data -- found for: 30 articles; avg candidates/article:18.7
**tot articles with candidates: 448; avg candidates/article: 28.0**
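
For reference, per-source figures of this shape can be computed roughly as in the Python sketch below. It assumes the average is taken only over articles that have at least one candidate from that source, and that the total row merges candidates across sources without double counting; both are readings of the numbers above rather than confirmed details, and the function name is hypothetical.

```python
def summarize_sources(candidates_by_source):
    """Summarize candidate volume per source and overall.

    candidates_by_source: {source: {article: [image, ...]}} (assumed structure).
    Returns {source: (articles_with_candidates, avg_candidates_per_such_article)}.
    """
    def stats(per_article):
        counts = [len(images) for images in per_article.values() if images]
        avg = sum(counts) / len(counts) if counts else 0.0
        return len(counts), round(avg, 1)

    merged = {}
    summary = {}
    for source, per_article in candidates_by_source.items():
        deduped = {article: set(images) for article, images in per_article.items()}
        summary[source] = stats(deduped)
        # Union across sources so an image found twice is counted once.
        for article, images in deduped.items():
            merged.setdefault(article, set()).update(images)
    summary["total"] = stats(merged)
    return summary


if __name__ == "__main__":
    toy = {  # toy data, not the real candidates
        "interlinks": {"A": ["x.jpg", "y.jpg"], "B": ["z.jpg"], "C": []},
        "wikidata image": {"A": ["x.jpg"]},
    }
    for source, (found, avg) in summarize_sources(toy).items():
        print(f"{source} -- found for: {found} articles; avg candidates/article: {avg}")
```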
>   • Main findings:
>     • Between about 37% (enwiki) and 79% (viwiki) of the unillustrated articles have at least one matching image.
>     • We find, on average, a lot more candidates for smaller wikis.

Thanks for the detailed update @Miriam - this looks very promising for our purposes!

Wow, @Miriam! Your detailed updates are great and really help us follow along. I agree that this is promising. Are we at the point that we can try to validate the results?

I'm thinking we could do that if you post the resulting dataset. Something like one row per article URL with its best image match URL, image source, and whether an icon/logo was suppressed? I think that would be 6,000 rows (1,000 articles x 6 languages), right? If you sort it randomly inside each of the languages, then we can give it to our ambassadors to validate. They can go down the list (maybe for the first 100 or so) and mark whether the image is a good match or not.
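
One possible shape for that file, as a minimal Python sketch: one CSV row per article, grouped by language and shuffled within each language. The column names, the function name, and the example URLs are all hypothetical placeholders.

```python
import csv
import io
import random

FIELDS = ["language", "article_url", "image_url", "image_source", "icon_suppressed"]


def write_validation_sheet(rows, out, seed=0):
    """Write rows grouped by language, in random order within each language."""
    rng = random.Random(seed)  # fixed seed so the sheet can be regenerated
    by_language = {}
    for row in rows:
        by_language.setdefault(row["language"], []).append(row)
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for language in sorted(by_language):
        group = by_language[language]
        rng.shuffle(group)
        writer.writerows(group)


if __name__ == "__main__":
    rows = [  # placeholder rows, not real suggestions
        {"language": "en", "article_url": "https://en.wikipedia.org/wiki/Example_A",
         "image_url": "https://commons.wikimedia.org/wiki/File:Example_A.jpg",
         "image_source": "interlinks", "icon_suppressed": False},
        {"language": "cs", "article_url": "https://cs.wikipedia.org/wiki/Example_B",
         "image_url": "https://commons.wikimedia.org/wiki/File:Example_B.jpg",
         "image_source": "wikidata image", "icon_suppressed": True},
    ]
    buffer = io.StringIO()
    write_validation_sheet(rows, buffer)
    print(buffer.getvalue())
```

Ambassadors could then work down the first 100 or so rows of their language's block and add a "good match?" column.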

Thanks @MMiller_WMF!
I think I can generate what you are asking for by the end of next week. I am currently working on finding what you call the "best image match", i.e. the top among the N image candidates for a given page. To do this, I am thinking of using a mixture of existing information (e.g. whether an image is a page image or a Wikidata image), and computer vision tools to further refine the selection by quality/content.
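
A very rough Python sketch of the "existing information" half of that idea: prefer Wikidata images, then page images, then break ties by how similar the file name is to the article title. The ordering of these signals, the field names, and the use of `difflib` for string similarity are all assumptions made for illustration; the computer vision part is left out entirely.

```python
from difflib import SequenceMatcher


def title_similarity(article_title, file_name):
    """Similarity in [0, 1] between an article title and a file name stem."""
    stem = file_name.rsplit(".", 1)[0].replace("_", " ").lower()
    return SequenceMatcher(None, article_title.lower(), stem).ratio()


def best_image(article_title, candidates):
    """Pick one candidate file name, or None if there are no candidates.

    candidates: list of dicts with the hypothetical keys
    'file', 'is_wikidata_image' and 'is_page_image'.
    """
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda c: (
            not c["is_wikidata_image"],  # Wikidata images first
            not c["is_page_image"],      # then page images
            -title_similarity(article_title, c["file"]),  # then closest name
        ),
    )
    return best["file"]


if __name__ == "__main__":
    candidates = [
        {"file": "Map_of_region.png", "is_wikidata_image": False, "is_page_image": False},
        {"file": "Barack_Obama.jpg", "is_wikidata_image": False, "is_page_image": False},
    ]
    print(best_image("Barack Obama", candidates))
```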

JMinor removed a subscriber: JMinor. · Wed, Aug 5, 5:12 PM

Sounds good, @Miriam. Right now is a good time for us to give this validation task to our ambassadors, because their to-do lists are relatively light. Therefore, I think it would be good if we can give them something next week, even if it's not complete. I consider this to be a proof-of-concept to see if we are on the right track, and we'll be able to plan lots of future improvements.