
[XL] Evaluate 'depicts' annotations added via CAT
Closed, Resolved (Public)

Description

Some community members are concerned that 'depicts' annotations added via CAT (including additions via the ISA tool, which uses the MachineVision extension API to get suggestions)* are doing more harm than good.

To try and measure this objectively we could select a random sample of 'depicts' annotations added via Special:SuggestedTags or ISA and make an interface to rate them as "good" or "bad", in a way similar to how we've classified image suggestions in the past. Then we could allow the community (or ambassadors) to use the interface to rate the annotations, and come up with a reasonably objective idea of whether they're good or bad overall.

Once that's done, we can report back to the community and they can decide whether to turn CAT off.

(An alternative would be to use a more complex rubric for rating "depicts", similar to https://commons.wikimedia.org/wiki/User:Rhododendrites_(WMF)/Suggested_Edits/data )


  • it looks like the part of the ISA tool that uses machine-vision suggestions has not been deployed so far, so that may be unaffected

Event Timeline

allow the community (or ambassadors) to use the interface to rate the annotations

Piling extra work such as this onto an already stretched volunteer community is pointless, bordering on harmful; as I pointed out when you raised the same suggestion on Commons, four days before you opened this ticket [1].

Community members have identified the problems and described them to you repeatedly, with ample examples, since February 2020. [2, 3, 4]

You have been asked questions - reasonable questions - about the approach time and time again since that date; these remain unanswered. [2, 3]

You now propose to take "the next few months" [5] doing nothing to actually fix the issue, while holding no meaningful discussion about the problems - which you have yet to acknowledge exist, using weaselly phrases like "Some community members are concerned" - or their solution with community members.

[1] https://commons.wikimedia.org/wiki/Commons_talk:Structured_data/Computer-aided_tagging#WMF_response

[2] https://commons.wikimedia.org/w/index.php?title=Commons_talk:Structured_data/Computer-aided_tagging/Archive_2020#Bad_tags,_nagging,_and_no_tags

[3] https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2020/02#Misplaced_invitation_to_%22tag%22_images

[4] https://commons.wikimedia.org/wiki/Commons_talk:Structured_data/Computer-aided_tagging#Large_numbers_of_trash_tags?

[5] https://commons.wikimedia.org/wiki/Commons_talk:Structured_data/Computer-aided_tagging#Update_on_Computer-aided_Tagging_-_We're_on_it!

MarkTraceur renamed this task from Evaluate 'depicts' annotations added via CAT (incl the ISA tool) to [XL] Evaluate 'depicts' annotations added via CAT (incl the ISA tool). Jul 26 2023, 5:27 PM
Cparle renamed this task from [XL] Evaluate 'depicts' annotations added via CAT (incl the ISA tool) to [XL] Evaluate 'depicts' annotations added via CAT. Aug 17 2023, 12:53 PM
Cparle updated the task description.

At Wikimania this week, Mariana Fossati and Sunshine Fionah Komusana of Whose Knowledge? talked about some of the challenges of using structured data to describe the images of women contributed through #VisibleWikiWomen (consent, privacy, biases in automated description): https://www.youtube.com/live/nSsVDaCJyZ8?feature=share&t=800

Because of @Pigsonthewing’s reservations about the WMF asking the community to evaluate depicts annotations, I took a random sample of depicts annotations added in 2023 via Special:SuggestedTags and evaluated them myself. Here are the results:

Total annotations rated: 1000
Annotations rated “bad”: 734
Annotations rated “ok”: 180
Annotations rated “good”: 86

Here’s a more detailed breakdown with reasons for all “bad” or “ok” rated images:

Rating | Reason | Count
Bad | Image is a scan of a text (or mostly-text) document, and so probably should not have a "depicts" annotation. | 218
Bad | Depicts annotation is present in image, but only as an incidental part (e.g. "road surface" for an image of a car) | 149
Bad | Depicts annotation is not present in image | 112
Bad | Depicts annotation is too general to be useful (e.g. "automotive design" for an image of a car, or "blue" for an image of the sky) | 90
Bad | Depicts annotation is abstract or invisible (e.g. "happiness" or "electricity" or "visual arts") | 60
Bad | Depicts annotation is general, when we already have a more specific annotation (e.g. "plant" when we already have "oak") | 50
Bad | Depicts annotation is a part of a pre-existing annotation (e.g. "tire" when we already have "car") | 25
Bad | Depicts annotation used in the wrong sense (e.g. the mathematical concept "slope" for an image of a hill) | 18
Bad | Other | 10
Bad | Only part of the item described in the depicts annotation is visible (e.g. "airplane" when only a wing is visible) | 2
Ok | Depicts annotation is more general than we would like, but might be useful anyway (e.g. image of a house annotated with "building" when there are no other annotations) | 79
Ok | Depicts annotation is general when we already have a more specific annotation, but might be useful anyway (e.g. "dog" when we already have "poodle") | 59
Ok | Depicts annotation only describes one aspect of the image, but might be useful anyway (e.g. image of a cemetery annotated with "tombstone" when there are no other annotations) | 38
Ok | Other | 4

What next?

Only 8.6% of the “depicts” annotations added via the tool were rated “good”, while 73.4% were rated “bad”.

The CAT tool uses a “blocklist” to reject suggested annotations from Google that contain images of people. If we extended it to reject suggested annotations that:
a) indicate the image might be a scan of a document (e.g. “text”, “document”, “line”, “calligraphy”)
b) are abstract (e.g. “happiness”, “sharing”, “color, tint and tone”)
c) are mathematical concepts (e.g. “slope”)
… then we might be able to reduce the proportion of “bad” images.
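A blocklist extension along these lines could be sketched as follows. This is a minimal illustration only: the label sets and the `filter_suggestions` helper are assumptions for this sketch, not the actual MachineVision configuration or code.

```python
# Hypothetical sketch of an extended suggestion blocklist. The label
# sets below are illustrative examples from this ticket, not the real
# MachineVision extension configuration.

DOCUMENT_SCAN_LABELS = {"text", "document", "line", "calligraphy"}
ABSTRACT_LABELS = {"happiness", "sharing", "color, tint and tone"}
MATH_CONCEPT_LABELS = {"slope"}

BLOCKLIST = DOCUMENT_SCAN_LABELS | ABSTRACT_LABELS | MATH_CONCEPT_LABELS

def filter_suggestions(labels):
    """Drop suggested labels that appear on the blocklist (case-insensitive)."""
    return [label for label in labels if label.lower() not in BLOCKLIST]

print(filter_suggestions(["Text", "oak", "slope", "house"]))  # ['oak', 'house']
```

A real implementation would presumably match against label IDs rather than display strings, but the filtering principle is the same.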

For example if we successfully detected all the document scans, and eliminated all the abstract suggestions plus eliminated “slope” as a suggestion in the test sample, we’d remove 296 “bad” images from the sample.

This leaves us with 438 “bad” images out of 704 images remaining, which is still a “bad” proportion of 62%, and a “good” proportion of only 12%.
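The arithmetic above can be checked directly; this small sketch uses only the figures already quoted in this comment.

```python
# Sanity check of the projected proportions from the sample numbers above.
total, bad, ok, good = 1000, 734, 180, 86

# Suggestions the proposed blocklist additions would have caught:
# 218 document scans + 60 abstract labels + 18 uses of "slope",
# all from the "bad" pile.
removed = 218 + 60 + 18

remaining = total - removed      # 704 annotations left in the sample
bad_remaining = bad - removed    # 438 still rated "bad"

print(f"bad:  {bad_remaining / remaining:.0%}")   # 62%
print(f"good: {good / remaining:.0%}")            # 12%
```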

So even if our mitigation measures are very successful, more than 6 out of 10 depicts annotations added by the tool are likely to be bad.

It’s possible that a UI redesign might reduce this further, but given that we have no roadmap for one, and that our “good”-rated images are outnumbered 2:1 by “ok”-rated images (images where the quality of the annotation is not objectively good but might be acceptable), turning off the tool seems to be our best option.

Comparing with the rules listed at https://commons.wikimedia.org/wiki/Commons:Depicts, I find the sample analysis above too harsh. I would consider these good:

Depicts annotation is present in image, but only as an incidental part (e.g. "road surface" for an image of a car) <-- That's why we have "Prominent"
Depicts annotation is too general to be useful (e.g. "automotive design" for an image of a car, or "blue" for an image of the sky) <-- As long as not "Depicts annotation is general, when we already have a more specific annotation" I would consider this OK
Depicts annotation is abstract or invisible (e.g. "happiness" or "electricity" or "visual arts") <-- We need search for abstract concepts to return enough illustrations too.
Only part of the item described in the depicts annotation is visible (e.g. "airplane" when only a wing is visible) <-- Unless we have an "airplane wing" item this is OK, because many other objects have wings.
Depicts annotation only describes one aspect of the image, but might be useful anyway (e.g. image of a cemetery annotated with "tombstone" when there are no other annotations)

In other words, 425 would be "good". Combined with Cparle's great improvement ideas, the extension would greatly improve searchability.

(e.g. "automotive design" for an image of a car, or "blue" for an image of the sky) <-- As long as not "Depicts annotation is general, when we already have a more specific annotation" I would consider this OK

Consensus on Commons clearly disagrees. Such statements are in the process of being removed.

We need search for abstract concepts

This is the crux of the problem. "Depicts" is meant as a structured way of saying what an image shows. It is not a tool for improving general searchability. If the latter is the use case, then a new property ("keywords", say) should be proposed.

This task is complete. Shouldn't the ticket be closed?