
Section-Level Images task sometimes suggests images already present in the article
Open, Needs Triage · Public

Description

Problem:

  • The Section-Level Image Suggestions task sometimes suggests images that are already in the article, because the same image exists under a different file name on Wikimedia Commons.
  • If the newcomer accepts a suggestion that adds an image that was already in the article, the edit would probably be reverted, which can be frustrating for the newcomer.

This has been detected at least twice at es.wp:

Possible Solutions?

  • Is there a way to detect these repeated images so that they are not suggested?
  • Should we add to the onboarding copy a recommendation for the newcomer to check if the same image is already included in the article?
  • ...

Event Timeline

Trizek-WMF subscribed.

Should we add to the onboarding copy a recommendation for the newcomer to check if the same image is already included in the article?

I think it is the best solution unless there is a way to detect the image...? (mandatory xkcd)

If it's the exact same image, the suggestion generation code could detect that by comparing img_sha1.

If it's the exact same image, the suggestion generation code could detect that by comparing img_sha1.

It is visually the same image, but in a different file format (svg vs a png). I don't think that can be detected using img_sha1.
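To illustrate the limitation discussed above, here is a minimal Python sketch of exact-duplicate detection by content hash. It is not the suggestion generation code: `sha1_hex` and `is_exact_duplicate` are hypothetical helper names, the byte strings are made-up stand-ins for file contents, and the sketch uses hex digests rather than reproducing how MediaWiki encodes `img_sha1`.

```python
import hashlib
from typing import Iterable


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of raw file bytes."""
    return hashlib.sha1(data).hexdigest()


def is_exact_duplicate(candidate: bytes, existing: Iterable[bytes]) -> bool:
    """True if the candidate's bytes match any file already in the article."""
    existing_hashes = {sha1_hex(blob) for blob in existing}
    return sha1_hex(candidate) in existing_hashes


# Hypothetical stand-ins for file contents; not real image data.
png_bytes = b"\x89PNG fake raster payload"
svg_bytes = b"<svg>fake vector payload</svg>"

# The same bytes under a different file name are caught.
same_file = is_exact_duplicate(png_bytes, [png_bytes])

# The "same" picture saved in another format has different bytes, so it is missed.
other_format = is_exact_duplicate(svg_bytes, [png_bytes])
```

A hash comparison of this kind only helps when the two files are byte-for-byte identical, which is why it would not catch the svg vs png case.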

I think that changing the onboarding to inform users about this possibility is the best we can do here -- visually comparing the two images seems to be quite complicated (we might be able to compare the image pixel-to-pixel instead of their SHA1 hash?).

In case it helps, yesterday this image was suggested for this article, which already has the same image in the same format (jpg) but under a different file name. Also, the suggested image was cropped, so it is slightly different from the original one included in the article.

we might be able to compare the image pixel-to-pixel instead of their SHA1 hash?

There is T121797: Implement perceptual/visual image hashing/fingerprinting in MediaWiki for detection of non-exact duplicate files but I wouldn't hold my breath for it happening. So yeah, probably something to handle via onboarding.
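For readers unfamiliar with perceptual hashing, the idea can be sketched as below. This is an "average hash" toy example under stated assumptions, not what T121797 proposes and not anything implemented in MediaWiki: it assumes each image has already been decoded and shrunk to a small grayscale grid, and the function names, sample grids, and `max_distance` threshold are all hypothetical.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Build a bit fingerprint: 1 where a pixel is brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def looks_similar(a: list[list[int]], b: list[list[int]], max_distance: int = 2) -> bool:
    """Treat two grids as near-duplicates if their fingerprints differ in few bits."""
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# Hypothetical 4x4 grayscale grids (0-255); real images would be downscaled first.
original = [[10, 10, 200, 200] for _ in range(4)]
brighter = [[30, 30, 220, 220] for _ in range(4)]  # same picture, re-encoded brighter
checkerboard = [[200, 10, 200, 10], [10, 200, 10, 200]] * 2  # a different picture

near_duplicate = looks_similar(original, brighter)
unrelated = looks_similar(original, checkerboard)
```

Unlike a SHA-1 comparison, this tolerates small pixel-level changes such as re-encoding. A cropped copy shifts the content within the grid, so a simple fingerprint like this one could still miss it.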

My sense is that a copy change will have minimal impact since onboarding isn't required and I assume that even brand new editors realize they shouldn't add a duplicate image to an article. I imagine the issue is more that the new editor didn't review the full article and look at the other images before making a decision. But I'm open to changing the onboarding language if others think it will be helpful. @JFernandez-WMF - let us know what you think.

The Structured Data team also discussed this task and mentioned that in their future work (which will be focused on Commons) they can investigate image deduplication on Commons at upload time, which might subsequently help reduce such duplications.

My sense is that a copy change will have minimal impact since onboarding isn't required and I assume that even brand new editors realize they shouldn't add a duplicate image to an article. I imagine the issue is more that the new editor didn't review the full article and look at the other images before making a decision.

I think this is generally a fair assumption. However, it might not be obvious that the duplicate image might be elsewhere in the article (in a different section, at the other end of the article), and that this has to be checked manually across the whole article rather than only in the section the newcomer is currently reviewing. I think newcomers might think "the software is smart enough to not suggest duplicate images, this seems like an easy thing". That being said, I agree that the copy change probably won't be seen by a lot of our users, so I'm not saying we have to do it (just that there isn't really something we could do from our end here).

@JFernandez-WMF and I discussed. I'm still not convinced additional text will help all that much, but after reviewing the onboarding again, it seems like we could consider simply expanding the third onboarding screen to help address the two main issues we've seen newcomers struggling with:

  • Adding images when there isn’t space for an additional image
  • Adding images when there is already an identical (or nearly identical) image elsewhere in the article

Screenshot 2023-07-26 at 3.39.36 PM.png (1×1 px, 89 KB)

Current language:

Look at both the article and its section
Read over the article and its section and think about whether the suggested image will help readers understand the content. Is it appropriate to display in the section?

What do you think about this language:

Review the article and the section
Consider if the suggested image will aid reader understanding. Images should only be accepted if:

  • The image is suitable for both the section and the entire article.
  • There is available space within the section for an image.
  • No similar image has been used elsewhere in the article.

Too long? Feel free to suggest improvements or a different approach.