
Create scripts for generating lists of desired image files for labeling
Closed, Resolved · Public

Description

We need maintenance scripts to generate the lists of images that will be submitted to $provider for labeling.

  • Group 1: Featured/valued/quality images
  • Group 2: Images used on two or more non-Commons wikis
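As a rough illustration of the Group 2 rule, the sketch below counts the distinct non-Commons wikis that use each file. The `(file_name, wiki_id)` row shape, the `commonswiki` identifier, and the function name are assumptions made for this example; they are not taken from the merged maintenance script.

```python
from collections import defaultdict

# Assumed wiki ID for Commons; usage on Commons itself does not count.
COMMONS_WIKI = "commonswiki"


def files_used_on_multiple_wikis(usage_rows, min_wikis=2):
    """Return sorted file names used on at least `min_wikis` distinct non-Commons wikis.

    `usage_rows` is an iterable of (file_name, wiki_id) pairs, one per usage.
    """
    wikis_by_file = defaultdict(set)
    for file_name, wiki_id in usage_rows:
        if wiki_id != COMMONS_WIKI:
            # A set collapses repeated usage on the same wiki into one entry.
            wikis_by_file[file_name].add(wiki_id)
    return sorted(
        name for name, wikis in wikis_by_file.items() if len(wikis) >= min_wikis
    )


# Hypothetical usage rows for illustration only.
rows = [
    ("Cat.jpg", "enwiki"),
    ("Cat.jpg", "dewiki"),
    ("Dog.jpg", "enwiki"),
    ("Dog.jpg", "enwiki"),        # same wiki twice still counts as one wiki
    ("Bird.jpg", "commonswiki"),  # Commons usage is ignored
    ("Bird.jpg", "frwiki"),
]
print(files_used_on_multiple_wikis(rows))
```

Only `Cat.jpg` qualifies here, since it is the only file used on two different non-Commons wikis.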

Both of these (as well as new uploads) should adhere to the following exclusion criteria:

  • No images under 150px wide
  • No "notable artwork"
  • No NSFW images
  • No protected files
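The exclusion criteria above can be sketched as a single predicate. This is a minimal sketch, assuming each candidate already carries flags for the three non-size criteria; the `Candidate` class and all of its field names are hypothetical, and how "notable artwork" or NSFW status is actually determined is out of scope here.

```python
from dataclasses import dataclass

MIN_WIDTH_PX = 150  # "No images under 150px wide"


@dataclass
class Candidate:
    # All field names are hypothetical and exist only for this illustration.
    name: str
    width: int
    is_notable_artwork: bool = False
    is_nsfw: bool = False
    is_protected: bool = False


def is_eligible(candidate):
    """Return True if the candidate passes every exclusion criterion."""
    return (
        candidate.width >= MIN_WIDTH_PX
        and not candidate.is_notable_artwork
        and not candidate.is_nsfw
        and not candidate.is_protected
    )


# Hypothetical candidates for illustration only.
candidates = [
    Candidate("Wide.jpg", width=800),
    Candidate("Tiny.jpg", width=149),
    Candidate("Edge.jpg", width=150),
    Candidate("Painting.jpg", width=1200, is_notable_artwork=True),
    Candidate("Locked.jpg", width=640, is_protected=True),
]
eligible = [c.name for c in candidates if is_eligible(c)]
print(eligible)
```

A file exactly 150px wide passes, because the rule excludes only images narrower than 150px.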

Doc: https://docs.google.com/presentation/d/14FD1q2f86nHSYC6bgKsikURlCpXueKmrETiBipV1RNM

Details

Related Gerrit Patches:
mediawiki/extensions/MachineVision : master · Add script to create file lists based on global image usage
mediawiki/extensions/MachineVision : master · Provide for withholding "NSFW" images from being reviewed
mediawiki/extensions/MachineVision : master · Request and store SafeSearch annotations
mediawiki/extensions/MachineVision : master · Add filtering for max number of existing depicts statements
mediawiki/extensions/MachineVision : master · Add script to create file list from any combo of categories and templates

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 30 2019, 5:29 PM
Mholloway triaged this task as High priority. · Sep 30 2019, 5:29 PM

@Ramsey-WMF Where did we come down on excluding images with existing depicts statements? IIRC, we were leaning toward including them anyway.

Mholloway updated the task description. · Sep 30 2019, 5:48 PM

@Ramsey-WMF Where did we come down on excluding images with existing depicts statements? IIRC, we were leaning toward including them anyway.

Yeah. Some bots already overlap with our target base of existing images, but those bots typically add only one statement, so we'd like to give users the tool to enhance those files with additional statements.

On the other hand, some bots (like XRayBot) have done a pretty good job with extensive modeling and we don't need anything additional.

If we're looking for a rule set, I'd say

a.) If an image is a candidate for the stack, check and see if it has only 1 depicts statement. If yes, include it in the stack. If it has more than one, exclude that file from the stack.

b.) (This is the one we weren't sure how to do technically) - If a user using the CAT tool adds the same depicts statement that a file already has, ignore it.
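A minimal sketch of rules (a) and (b), under stated assumptions: files with zero existing depicts statements are treated as included (the comment above only addresses one versus more than one), statements are represented as plain item ID strings, and both function names are hypothetical. The IDs `Q1` and `Q2` are placeholders.

```python
def include_in_stack(existing_depicts, max_existing=1):
    """Rule (a): include the file only if it has at most `max_existing`
    distinct depicts statements. Treating zero as included is an assumption."""
    return len(set(existing_depicts)) <= max_existing


def new_statements(existing_depicts, submitted):
    """Rule (b): drop submitted depicts values the file already has.

    Order is preserved and repeats within the submission are dropped too.
    """
    existing = set(existing_depicts)
    seen = set()
    result = []
    for item_id in submitted:
        if item_id not in existing and item_id not in seen:
            seen.add(item_id)
            result.append(item_id)
    return result


print(include_in_stack(["Q1"]))                    # one statement: included
print(include_in_stack(["Q1", "Q2"]))              # more than one: excluded
print(new_statements(["Q1"], ["Q1", "Q2", "Q2"]))  # only the new value survives
```

Making the threshold a parameter keeps the cutoff adjustable without changing the filtering logic.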

Change 539981 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Add maintenance script to create file list from one or more categories

https://gerrit.wikimedia.org/r/539981

The script above can be used to create a list of unique files that are members of Category:Featured_pictures_on_Wikimedia_Commons, Category:Quality_images, or Category:Valued_images.
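The "list of unique files" part can be illustrated as an order-preserving merge of several membership lists. This is only a sketch of the deduplication step; the function name and sample file names are hypothetical, and it does not show how the real script queries category membership.

```python
def unique_files(*sources):
    """Merge several iterables of file names, keeping the first occurrence of each."""
    seen = set()
    result = []
    for source in sources:
        for name in source:
            if name not in seen:
                seen.add(name)
                result.append(name)
    return result


# Hypothetical membership lists for illustration only.
featured = ["A.jpg", "B.jpg"]
quality = ["B.jpg", "C.jpg"]
valued = ["C.jpg", "A.jpg", "D.jpg"]
print(unique_files(featured, quality, valued))
```

A file that appears in more than one source is listed once, in the position where it was first seen.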

Mholloway added a comment. · Edited · Sep 30 2019, 11:07 PM

After digging a bit deeper, I see that Template:Valued_image should be used rather than the corresponding category, since the vast majority of candidates are buried in subcategories (and it's easier just to use templatelinks than to traverse the category tree and aggregate results).

OTOH, Category:Quality_images appears to yield a slightly larger number of results than Template:Quality_image (223,957 vs. 218,318). I'm not sure why that would be. Perhaps it's better to stick with the category.
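One way to look into a discrepancy like this is to diff the two result sets and inspect the files that appear in only one of them. The sketch below assumes both lists have already been exported as plain file names; the function name and the sample data are hypothetical and do not reflect the actual counts.

```python
def compare_sources(category_files, template_files):
    """Return the files unique to each source and the files common to both."""
    category_set = set(category_files)
    template_set = set(template_files)
    return {
        "only_in_category": sorted(category_set - template_set),
        "only_in_template": sorted(template_set - category_set),
        "in_both": sorted(category_set & template_set),
    }


# Hypothetical exports for illustration only.
from_category = ["A.jpg", "B.jpg", "C.jpg"]
from_template = ["B.jpg", "C.jpg", "D.jpg"]
report = compare_sources(from_category, from_template)
for key, names in report.items():
    print(key, names)
```

Inspecting a handful of files from the `only_in_category` bucket would show how they ended up in the category without transcluding the template.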

We should continue to use the category approach for Commons featured images. There is no directly corresponding template (featured status is currently indicated by a parameter passed to the Assessments template).

Change 540439 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Add script to create file lists based on global image usage

https://gerrit.wikimedia.org/r/540439

b.) (This is the one we weren't sure how to do technically) - If a user using the CAT tool adds the same depicts statement that a file already has, ignore it.

Created new task T234457 to tackle that issue.

Change 540466 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Add filtering for max number of existing depicts statements

https://gerrit.wikimedia.org/r/540466

Change 540497 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Request and store SafeSearch annotations

https://gerrit.wikimedia.org/r/540497

Finishing this up depends on landing two different multi-patch branches that are awaiting review.

Change 539981 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Add script to create file list from any combo of categories and templates

https://gerrit.wikimedia.org/r/539981

Change 540439 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Add script to create file lists based on global image usage

https://gerrit.wikimedia.org/r/540439

Change 540466 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Add filtering for max number of existing depicts statements

https://gerrit.wikimedia.org/r/540466

Change 540497 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Request and store SafeSearch annotations

https://gerrit.wikimedia.org/r/540497

Mholloway moved this task from Done to In development on the Machine vision board.

Change 543477 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Provide for withholding "NSFW" images from being reviewed

https://gerrit.wikimedia.org/r/543477

Change 543477 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Provide for withholding "NSFW" images from being reviewed

https://gerrit.wikimedia.org/r/543477

Mholloway closed this task as Resolved. · Oct 16 2019, 8:11 PM