Manually test image recommendations POC results
Closed, Resolved (Public)

Description

Using the tool built for this purpose in T273062, conduct the following test of the image recommendations POC:

  • Evaluate results on Arabic, Cebuano, English, Vietnamese, Bengali and Czech wikis
  • Evaluate 500 unillustrated articles from each wiki
  • For each result for each unillustrated article, manually decide whether the match is good, okay, or bad. Evaluators also have the option to choose "unsure" if they're not confident in their selection.
  • Based on the percentage of good matches, evaluate what the likely revert rate would be for bots adding these images to articles
  • Evaluate which match source(s) provide the best matches (Wikidata, interwiki, Commons category, MediaSearch)
  • Evaluate whether MediaSearch provides valuable results where the other sources have none
  • Evaluate performance based on logged response time
  • Evaluate what percentage of matches are offensive or NSFW, to help decide whether we should put some kind of safe search filter on the recommendations

The estimated time of work is 3 hours for the 500 images. However, if the 3 hours pass without finishing the test, please leave a comment.

Event Timeline

There’s been a short delay and we now expect to be ready to start evaluation on Tuesday April 20th. Thanks for your patience all!

There's been another delay due to some bugs and now we're looking at Wednesday April 21st (hopefully) or definitely by Thursday the 22nd.

@Urbanecm_WMF @Dyolf77_WMF @PPham @Ankan_WMF We are ready to go! The tool is here:

https://image-recommendation-test.toolforge.org/

Please let us know in this task if you encounter any problems or have any questions. Thanks!

@Urbanecm_WMF @Dyolf77_WMF @PPham @Ankan_WMF I spoke too soon, there's another issue. Please pause testing. My apologies. I'll update shortly.

We're once again ready to go. Sorry for the delays!

One note: when you enter the tool, you might notice that the first article after choosing your language is in English. This is because the tool pre-fetches the first article. To avoid this, you can go directly to your language:

Arabic: https://image-recommendation-test.toolforge.org/?uselang=ar
Bengali: https://image-recommendation-test.toolforge.org/?uselang=bn
Czech: https://image-recommendation-test.toolforge.org/?uselang=cs
Vietnamese: https://image-recommendation-test.toolforge.org/?uselang=vi

Thanks all!

@CBogen Hi, I noticed that the tool is displaying many duplicates. And sometimes in image descriptions (in Wikimedia Commons) there are words that match with the article, but the image itself is not at all in the scope of the article.

Hello @CBogen! I would like to put a few notes here about the general feeling I have when testing the image recommendation system you created:

  • I mention this for almost every algorithm: you likely don't want disambigs to be considered at all. They are used for stuff like "John Smith", a name that is likely used by multiple notable people. For this article, an image of any John Smith would be unsuitable, because it does not display the topic at all :). In 99.9% of cases, disambigs should just have no image at all.
  • The tool seems to suggest some images based on word similarity. That leads to funny cases, such as "Vánek" (breeze) vs "Vaněk" (a surname). Those words are very similar, but actually mean different things. There are other word pairs as well - for example, "Vláda" (government) vs "Vláďa" (a name) vs "Vlada" (yet another name). So far, "vánek" vs "vaněk" is the only case that appeared in suggestions.

I'm not yet done with the evaluation, so it is possible I will add some more comments as I go through the images.

Hope it helps.
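
One way word pairs like these can collide (an assumption for illustration; nothing in this thread confirms how the tool compares titles) is if matching happens after diacritics are stripped. The sketch below uses only Python's standard `unicodedata` module; `fold_diacritics` is a hypothetical helper name, not part of the tool.

```python
import unicodedata


def fold_diacritics(word: str) -> str:
    """Return a lowercase copy of `word` with combining marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # unicodedata.combining() is non-zero for combining marks, so drop those.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Hypothetical check: distinct Czech words that fold to the same key.
for group in (["Vánek", "Vaněk"], ["Vláda", "Vláďa", "Vlada"]):
    keys = {fold_diacritics(word) for word in group}
    print(group, "->", keys)
```

If a matcher keyed on the folded form, each group above would look like a single term, which is consistent with the "vánek" vs "vaněk" suggestion described in the comment.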

This is helpful feedback. We are expecting to see some bad matches like that, and the results from this evaluation (for example, marking those images that are not in scope as bad matches) will help us improve.

When you say duplicates, what do you mean? Duplicate articles? Duplicate images? Are you seeing these duplicates after already marking them as good or bad matches?

Disambigs are supposed to be filtered out, so it's helpful to know this isn't working properly. We'll look into it, but for now you can mark them as bad matches.

Names are one case we're learning through this testing that we also want to filter out. But these types of bad matches are just the type of feedback we need to help improve the algorithms.

Thanks both!

Thanks, I mean the tool is showing these cases of suggestions:

  • the same image with the same article more than once (I marked the image with Okay),
  • the same image with different articles (Image marked bad),
  • the same article with different image suggestions (Images marked bad/good).

Thanks for clarifying! Same image with different articles and same article with different image suggestions are expected, since they may be appropriate in different contexts. Same image with same article more than once sounds like an error though - @Cparle can you take a look?

As a note, same image with same article more than once may pop up if you click "unsure", because that throws it back into the queue. But it shouldn't happen if you click good or bad.
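
The behaviour described here can be modelled with a simple queue. This is only a sketch under assumptions: the tool's actual implementation is not shown in this thread, and `run_session`, its arguments, and the sample pairs are hypothetical.

```python
from collections import deque


def run_session(pairs, answers):
    """Simulate a rating session.

    pairs:   list of (article, image) tuples to rate.
    answers: dict mapping each pair to the list of answers the evaluator
             gives each time that pair is shown.
    Returns the order in which pairs were shown and the final ratings.
    """
    queue = deque(pairs)
    shown = []
    final = {}
    while queue:
        pair = queue.popleft()
        shown.append(pair)
        answer = answers[pair].pop(0)
        if answer == "unsure":
            # "unsure" sends the pair to the back of the queue to be rated again.
            queue.append(pair)
        else:
            # Any definite rating removes the pair from the queue for good.
            final[pair] = answer
    return shown, final


pairs = [("Article A", "image1.jpg"), ("Article B", "image2.jpg")]
answers = {
    ("Article A", "image1.jpg"): ["unsure", "good"],
    ("Article B", "image2.jpg"): ["bad"],
}
shown, final = run_session(pairs, answers)
print(shown)
print(final)
```

In this model a pair repeats only after an "unsure" answer, so a repeat after a good/bad click would point to a bug rather than expected behaviour.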

Some articles' contents can be culturally different for the user who will be using the image recommendation tool. In that case, the user needs to click on the image, and check whether it's a match or not by reading the description or checking the articles to which the image is already linked. Some users may understand due to their background knowledge, but it will not be the usual case. For these cases, is it ok to mark them as "unsure"?

Thanks for the question. In this tool, we're only trying to understand if the image is a good match or not, not users' understanding. So if you know whether the match is good or not, please click good or bad, not "unsure". Only click "unsure" if you yourself are not sure if it's a good match.

Thanks, @CBogen for your suggestion!

I am reporting here as it has already been 3 hours. May I know how many images are remaining?

  • I have also found the images to be repeated often. I can remember getting an image at least three times despite selecting good/bad. This happened for many images.
  • A major portion of the images suggested for date/month articles are bad predictions.
  • For most of the images, I needed to visit the Commons page.

The work is repetitive, but I'm learning new things. :)
Due to network issues on my side, I don't think I could finish testing all the images on time. It took time to load the images and visit the website. If I had done the work continuously, is it possible I would have seen fewer repetitions?

The latest update, as of 4:27pm UTC:

345 | ar
504 | bn - DONE
221 | ceb
114 | cs
106 | en
96 | vi

@Ankan_WMF it sounds like you might have done a lot of your work after these metrics were collected. @Cparle, can you provide updated numbers on Monday morning? Thanks!

@Ankan_WMF date/month articles are supposed to be filtered out. If you see another one, can you send a screenshot or a link to the article so we can look into that? Thanks!

@CBogen, looks like I forgot some of the images and they appeared to me as new, so I continued testing. :P But to make it clear, I found the repetitions from the very beginning.

Some examples of date/month articles:

(i) Article: ৮ সেপ্টেম্বর (en: September 8)

Example_1.JPG (857×1 px, 238 KB)

(ii) Article: ২১ আগস্ট (en: August 21)

Example_2.JPG (814×1 px, 161 KB)

(iii) Article: ৮ ফাল্গুন (It's a Bengali calendar day)

Example_3.JPG (632×1 px, 179 KB)

I searched for approximately 20 images again and got these examples. The portion of suggested articles on date/month is significant.

Hi,

I've been at it for quite a while and it still hasn't ended, so I'll just post a comment here:

  • Lots of repeated matches, meaning the same article with the same image. Out of 10 recommended matches, probably 1 is a repeat (yes, it's that high). And it's not because I click "unsure": I definitely click good/bad, but the match still shows up again.
  • Sometimes the date articles still show up. I didn't take screenshots of those though.
  • The matches with the highest correct rate are for fauna and flora articles, I think because their scientific names are unique and so the algorithm can match them easily. Next are location articles, also because of the location names. Other than that, I don't think the rate is very high.

And so I really want to know how many are left for the Vietnamese language...

Sorry for the delay - @Cparle is out today, but we'll have the up to date numbers for you tomorrow.

Here are the updated numbers:

327 | ar
560 | bn - DONE
267 | ceb
112 | cs
181 | en
371 | vi

@PPham, @Dyolf77_WMF, and @Urbanecm_WMF, can you let me know how much time you've spent so far? We're supposed to be complete tomorrow, but we can extend if necessary.

I will add 200 more images today.
Update: done.

I think it has taken about 2.5 hours already (including time I spent on linkrecommendation this afternoon/evening). A lot of images take a long time, because I have to open the Commons page and various articles to be able to judge with confidence.

I'll give it some more time today, but I'm not sure I can finish it by tomorrow.

I also want to add that in some cases the image suggested by the algorithm has already been used in the article.

Hi all - thanks so much for your hard work on this. Based on @Cparle's comment in T281893, all ratings are complete.

The next step is for me to review the results and I'll provide an update with the learnings. @Cparle can you paste a link with the results in a CSV?

Additional feedback:

There are pages that very likely won't have an image. For instance, since fair use images are not allowed on Commons, it's unlikely that a page about an album would have an image, as such an image would likely display non-free content. Lists also rarely have an illustrative image.

Here's a csv of the results

Here is a summary of the findings:

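source | lang | total | good rating | % good | okay rating | good + okay rating | good + okay % | bad rating | bad rating %
mediasearch | ar | 361 | 47 | 13.02% | 55 | 102 | 28.35% | 259 | 71.75%
image algo | ar | 154 | 85 | 55.19% | 45 | 130 | 84.42% | 24 | 15.58%
mediasearch | bn | 364 | 84 | 23.08% | 87 | 171 | 46.98% | 193 | 53.02%
image algo | bn | 225 | 150 | 66.67% | 45 | 195 | 86.67% | 30 | 13.33%
mediasearch | ceb | 351 | 203 | 57.83% | 78 | 281 | 80.06% | 70 | 19.94%
image algo | ceb | 196 | 158 | 80.61% | 12 | 170 | 86.73% | 26 | 13.27%
mediasearch | cs | 401 | 15 | 3.74% | 27 | 42 | 10.47% | 359 | 89.53%
image algo | cs | 106 | 30 | 28.3% | 24 | 54 | 50.94% | 52 | 49.06%
mediasearch | en | 494 | 34 | 6.88% | 18 | 52 | 10.53% | 442 | 89.47%
image algo | en | 109 | 60 | 55.05% | 26 | 86 | 78.90% | 23 | 21.10%
mediasearch | vi | 513 | 235 | 45.81% | 20 | 255 | 49.70% | 258 | 50.29%
image algo | vi | 235 | 167 | 71.06% | 25 | 192 | 81.70% | 43 | 18.30%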

The average "good + okay %" for mediasearch results in all languages was: 37.61%
The average "good + okay %" for the image algo results in all languages was: 78.23%
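
For anyone who wants to re-derive these averages, here is a minimal sketch. It assumes each average is the unweighted mean of the six per-language "good + okay %" values; that interpretation is an assumption, as are the names `ROWS`, `good_plus_okay_pct`, and `average_by_source`. The counts are copied from the summary table above.

```python
from statistics import mean

# (source, lang, total, good rating, okay rating), copied from the summary table.
ROWS = [
    ("mediasearch", "ar", 361, 47, 55),
    ("image algo", "ar", 154, 85, 45),
    ("mediasearch", "bn", 364, 84, 87),
    ("image algo", "bn", 225, 150, 45),
    ("mediasearch", "ceb", 351, 203, 78),
    ("image algo", "ceb", 196, 158, 12),
    ("mediasearch", "cs", 401, 15, 27),
    ("image algo", "cs", 106, 30, 24),
    ("mediasearch", "en", 494, 34, 18),
    ("image algo", "en", 109, 60, 26),
    ("mediasearch", "vi", 513, 235, 20),
    ("image algo", "vi", 235, 167, 25),
]


def good_plus_okay_pct(total: int, good: int, okay: int) -> float:
    """Percentage of rated matches that were marked good or okay."""
    return 100 * (good + okay) / total


def average_by_source(rows):
    """Unweighted mean of the per-language good + okay % for each source."""
    by_source = {}
    for source, _lang, total, good, okay in rows:
        by_source.setdefault(source, []).append(good_plus_okay_pct(total, good, okay))
    return {source: mean(pcts) for source, pcts in by_source.items()}


averages = average_by_source(ROWS)
for source, pct in averages.items():
    print(f"{source}: {pct:.2f}%")
```

Recomputed this way from the raw counts, the image algo average comes out at about 78.23%, matching the figure above, while the MediaSearch average comes out at about 37.67%, slightly different from the 37.61% quoted. Rounding or a transcription slip in the posted figures could account for the gap.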

The image algo is performing extremely well. MediaSearch needs improvement to be ready to use.

My anecdotal experience evaluating MediaSearch results showed me that some specific improvements in determining which MediaSearch results are acceptable for the Image Recommendations API should make a big difference. We will be implementing that in the form of a confidence score in T281582, and then we will do another test.

CBogen claimed this task.

@Ankan_WMF @PPham @Dyolf77_WMF @Urbanecm_WMF -- thank you for contributing to this effort. I just want to ask if you have any other notes or thoughts or ideas to add beyond what you've already written in comments above. Anything that can help the team improve the algorithms?