[L] Create script to add existing images on Commons from specific categories to the popular CAT queue
Closed, ResolvedPublic

Description

As a creator of campaigns, I want to be able to direct users to particular categories of images to tag with depicts statements using the ISA tool/CAT, so that I can increase the amount of structured data on images in a targeted area.

This task is to create a script that can be updated to point to any particular category on Commons and run through older images in that category to add them to the "popular" CAT queue. The first use case is for Category:Files_from_content_partnerships from the creators of the ISA tool:

As we are planning to work on content that has been uploaded by GLAM institutions: Would it be possible to include all the files in the following category (including sub-categories) to the maintenance script that triggers the generation of “depicted” suggestions? Category:Files_from_content_partnerships

Acceptance Criteria

  • A script is created that can be run on any category when needed to add all images from that category to the CAT "popular" queue
  • The script is run on Category:Files_from_content_partnerships
  • The script does not prioritize images in that category (e.g., uncategorized images will still maintain priority in the CAT queue as per T262857)

Testing instructions

tl;dr: run maintenance/createFileListFromCategoriesAndTemplates.php to generate a list of files in either a category or a template; the list can then be ingested by maintenance/fetchSuggestions.php.
Only the category code has changed with this patch, but you should probably also check that I haven't broken the template code.
Essentially, you give the script the name of a category or a template, and it writes a list of the articles with that category or template (one article title per line).
So, for example, you can create a list of article titles that use template Boink with: mwscript extensions/MachineVision/maintenance/createFileListFromCategoriesAndTemplates.php --outputFile=/path/to/your/file --template=Boink
You can create a list of article titles in category Bazoink and its subcats with: mwscript extensions/MachineVision/maintenance/createFileListFromCategoriesAndTemplates.php --outputFile=/path/to/your/file --category=Bazoink --deepcat (remove --deepcat to exclude the subcats)
You'll need to set up templates/categories locally to check this, and you should probably set up a circular category loop to make sure that --deepcat doesn't cause an infinite loop.
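
The circular-category concern can be illustrated with a small sketch. This is not the MachineVision implementation; it is a minimal Python illustration that assumes two hypothetical in-memory mappings, SUBCATS (category to subcategories) and FILES (category to file titles), and a made-up helper list_files. Tracking the categories already visited is what stops a loop such as Bazoink -> Boink -> Bazoink from running forever.

```python
from collections import deque

# Hypothetical test data: Bazoink and Boink contain each other (a circular loop).
SUBCATS = {
    "Bazoink": ["Boink"],
    "Boink": ["Bazoink", "Leaf"],
    "Leaf": [],
}
FILES = {
    "Bazoink": ["File:A.jpg"],
    "Boink": ["File:B.jpg", "File:A.jpg"],
    "Leaf": ["File:C.jpg"],
}


def list_files(category, deepcat=False):
    """Return file titles in a category, optionally including its subcategories."""
    seen_categories = {category}
    seen_files = set()
    titles = []
    queue = deque([category])
    while queue:
        current = queue.popleft()
        for title in FILES.get(current, []):
            # A file can sit in several categories; list it only once.
            if title not in seen_files:
                seen_files.add(title)
                titles.append(title)
        if not deepcat:
            break
        for sub in SUBCATS.get(current, []):
            # Skip categories already visited so that cycles terminate.
            if sub not in seen_categories:
                seen_categories.add(sub)
                queue.append(sub)
    return titles
```

With this test data, calling list_files("Bazoink", deepcat=True) visits each of the three categories exactly once even though two of them point at each other.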

Event Timeline

Our main need consists in being able to activate Google Vision on the images in specific categories in order to use them in an enhanced version of the ISA tool that includes tag suggestions from Google Vision (which is currently not the case for older uploads).

If this implies adding these images to the "popular" CAT queue that's ok; but it's not an important requirement in the context of our use case.

CBogen renamed this task from Create script to add existing images on Commons from specific categories to the popular CAT queue to [L] Create script to add existing images on Commons from specific categories to the popular CAT queue. Mar 24 2021, 4:41 PM

I have concerns about this approach. Structured Data on Commons is meant to be, well, "structured". It is not for "tags".

For example, a photograph of the White House in Washington DC might be tagged "white" and "building", but in terms of structured data it depicts Q35525, which is the item about that single specific building. The item tells us that the subject is a building, and that it is white in colour.

We have already done some tests with Google Vision on the ISA tool. The goal is indeed to add "depicts" statements. So far, color tags are not an issue; they don't seem to come up in the suggestions. However, suggestions such as "photograph" and the like can be problematic as they may apply, e.g. when a photo depicts photographs. In many cases, however, where the digital image represents a photograph the statement shouldn't be applied. The same goes for scans of postcards: At what level do we apply the "depicts" statement? - At the level of the scan that depicts a photograph with an image and often a frame and some text? Or just at the level of the image on the postcard? - These are issues that need to be addressed via community deliberations. We expect that the development and deployment of tools to assist with adding "depicts" statements will foster such deliberations, whose results can then be reflected in the tools themselves in order to nudge users in the direction of established shared practices.

These issues have been addressed via community deliberations; depicts statements should be at the most specific level possible.

Current policy is at:

https://commons.wikimedia.org/wiki/Commons:Depicts

and includes (emboldening in original): ...generic "tags" should not currently be added if more specific depicts statements already exist.

There is prior discussion, for example, here:

https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2020/02#Misplaced_invitation_to_%22tag%22_images

After some poking around I see that step 1 can be accomplished using the scripts

  • maintenance/createFileListFromCategoriesAndTemplates.php
  • ... and then maintenance/fetchSuggestions.php

Hmm, or maybe not. maintenance/createFileListFromCategoriesAndTemplates.php doesn't handle sub-categories.

Change 701432 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/MachineVision@master] Add options to allow job createFileList job to use subcategories

https://gerrit.wikimedia.org/r/701432

Change 701432 merged by jenkins-bot:

[mediawiki/extensions/MachineVision@master] Add options to allow job createFileList job to use subcategories

https://gerrit.wikimedia.org/r/701432

Patch should be on production tomorrow, so can run the script then

Update: the script ran on production for ~24hrs, then failed. I suspect there might be an infinite loop in the code someplace, will have to dig into it again

Change 724080 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/MachineVision@master] Add options to allow job createFileList job to use subcategories

https://gerrit.wikimedia.org/r/724080

Cparle updated the task description.

Change 724080 merged by jenkins-bot:

[mediawiki/extensions/MachineVision@master] Write to file list as script runs rather than at the end

https://gerrit.wikimedia.org/r/724080
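
The idea named in this patch title (writing to the file list while the script runs rather than at the end) can be sketched as follows. This is a hypothetical Python illustration, not the patch itself: write_titles_incrementally, failing_source and demo are assumed names. Writing each title as soon as it is found means that a failure after many hours still leaves a usable partial list on disk.

```python
import os
import tempfile


def write_titles_incrementally(titles, output_path):
    """Write one title per line as each is produced; return the count written."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for title in titles:
            out.write(title + "\n")
            # Flush so the lines written so far survive a later failure.
            out.flush()
            count += 1
    return count


def failing_source():
    """Hypothetical generator that fails partway through, like a long crawl."""
    yield "File:A.jpg"
    yield "File:B.jpg"
    raise RuntimeError("simulated failure")


def demo():
    """Run the writer against the failing source and return what reached disk."""
    path = os.path.join(tempfile.mkdtemp(), "files.txt")
    try:
        write_titles_incrementally(failing_source(), path)
    except RuntimeError:
        pass
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

In demo(), the two titles found before the simulated failure are still present in the output file, which is the property an end-of-run write would not give.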

Update: the script has been running since Nov 5, so far it's found ~3.8M files in the category and its subcategories

There'll be another step to run the google classifier against all the images once it's gathered them all. Do we have enough credits with google for this @CBogen ?

We have more than enough credits for the ~3.8M files. Just run the final number by me on Slack to confirm before running the classifier, thanks!

Update: @Cparle stopped the script today after it had found ~31M files. There were more files to find, but we don't have the budget to classify so many files anyway.

We're going to classify and add a random 10M files from Category:Files_from_content_partnerships. @BeatEstermann, hopefully this will meet your needs - let me know if you have any feedback, thanks!

Closing for now. After talking with @BeatEstermann via e-mail, we will open a new ticket if there's a new request for this. Running the script on a random group of files doesn't meet the use case, but running it on specific categories as needed does. The script has been written so that we can do this on-demand in the future (with about a month's notice), as long as it's on a bottom-level category without nested categories.

@CBogen May I ask you to add the following category for starters? - https://commons.wikimedia.org/wiki/Category:Photographs_by_Uwe_Gerig

If we need to provide categories at a very fine-grained level for them to be included in machine vision, the one-month notice requirement is a roadblock to any project that relies on some level of agility, which is typically the case with our students' projects that involve testing tools with external partners. Is there a way to set up a process where we can get bottom-level categories included upon request at 24h or 48h notice?

I currently do have a team of 4 students who have expressed their interest in working on improving the ISA Tool between now and the end of May 2022. Unfortunately, I will have to give them a different assignment if we cannot find a workaround for this in the course of next week.

Cheers, Beat

@BeatEstermann This is indeed a good example of a category at a fine-grained level (i.e., there are no subcategories).

However, we do still have some infrastructure work to complete before we can run this script for the first time, so we can't do it the first time without a month's notice. If we start now, we can have it for you by the end of March.

In the future, we can likely run the script on a more ad-hoc basis, with about 1-2 weeks' notice. We can't guarantee that, because it depends on staff availability, but it is definitely doable.

If we add this particular category by the end of March, is that useful to your students for this semester?

@CBogen I met the students last Friday, and we decided that they will work on a different assignment due to two issues: 1) This one. 2) The Commons API apparently not allowing retrieval of JPG files where the original file is a TIFF file.

To move forward, I think it would be useful if you implemented the script as suggested. As one of our main GLAM partners for this is the ETH Library, it would be good to also add the sub-categories of https://commons.wikimedia.org/wiki/Category:Media_contributed_by_the_ETH-Bibliothek .

Thinking further ahead, I wonder:

a) Whether there would be a way to trigger your script by an API call from the ISA Tool (we need to be able to include categories in a flexible and agile manner).
b) Whether you are planning to experiment with other machine vision algorithms. Are there any free ones that provide results of reasonable quality?

Cheers, Beat

@BeatEstermann We will continue to work on the infrastructure required to implement the script, and let you know when that's complete.

Unfortunately, https://commons.wikimedia.org/wiki/Category:Media_contributed_by_the_ETH-Bibliothek is not a bottom-level category - we would need a specific list of bottom-level categories (with no sub-categories) to run the script on at any given time.

We do not have the ability to trigger the script via an API-call at this time, but will consider it for the future. We also do not have any other machine vision algorithms in our roadmap at this time, though it is not off the table for future work.

Dear @CBogen,

Here is a first list of bottom-level categories for inclusion in Machine Vision:

Media contributed by the ETH-Bibliothek
ETH-BIB Baertschi Hans-Peter‎
ETH-BIB Tiere, Pflanzen und Biotope‎
ETH-BIB Collection of scientific instruments and teaching aids
ETH-BIB Comet Photo AG‎
ETH-BIB Comet Photo AG-Luftbilder‎
ETH-BIB Comet Photo AG-Politiker und Politikerinnen-Porträts
ETH-BIB Buildings, Institutes and Laboratories of the ETH‎
Auditorium Maximum (ETH Zürich)
Tandem-Van-de-Graaff-Beschleuniger, ETH Zürich‎
ETH-BIB Immanuel Friedlaender
Images of Mount Etna by I. Friedlaender (ETH-BIB)‎
ETH-BIB Friedli-Luftbilder
ETH-BIB Friedli-Luftbilder - Zollikon‎
ETH-BIB Max Frisch‎
Historical images of Genoa by ETH-BIB Leo Wehrli - Italy
ETH-BIB Mittelholzer-Abyssinia flight 1934‎
ETH-BIB Mittelholzer-Inland flights
ETH-BIB Mittelholzer-Inland flights - Austria‎
ETH-BIB Mittelholzer-Inland flights - Czechia‎
ETH-BIB Mittelholzer-Inland flights - Diverse‎
ETH-BIB Mittelholzer-Inland flights - England
ETH-BIB Mittelholzer-Inland flights - France‎
ETH-BIB Mittelholzer-Inland flights - Haute-Savoie
ETH-BIB Mittelholzer-Inland flights - Germany‎
ETH-BIB Mittelholzer-Inland flights - Greece‎
ETH-BIB Mittelholzer-Inland flights - Italy‎
ETH-BIB Mittelholzer-Inland flights - The Netherlands
ETH-BIB Mittelholzer-Inland flights - Flights preparation‎
ETH-BIB Mittelholzer-Inland flights - Unidentified locations‎
ETH-BIB Mittelholzer-Inland flights - Clouds
ETH-BIB Mittelholzer-Inland flights - AI‎
ETH-BIB Mittelholzer-Inland flights - AR
ETH-BIB Mittelholzer-Inland flights - BE
ETH-BIB Mittelholzer-Inland flights - BL‎
ETH-BIB Mittelholzer-Inland flights - BS‎
ETH-BIB Mittelholzer-Inland flights - FR‎
ETH-BIB Mittelholzer-Inland flights - GE‎
Aerial photographs by Walter Mittelholzer - Geneva‎
ETH-BIB Mittelholzer-Inland flights - GL
ETH-BIB Mittelholzer-Inland flights - GR‎
ETH-BIB Mittelholzer-Inland flights - JU‎
ETH-BIB Mittelholzer-Inland flights - LU‎
ETH-BIB Mittelholzer-Inland flights - NE‎
ETH-BIB Mittelholzer-Inland flights - NW
ETH-BIB Mittelholzer-Inland flights - OW‎
ETH-BIB Mittelholzer-Inland flights - SG
ETH-BIB Mittelholzer-Inland flights - SH
ETH-BIB Mittelholzer-Inland flights - SO
Aerial photographs by Walter Mittelholzer - Solothurn
ETH-BIB Mittelholzer-Inland flights - SZ‎
ETH-BIB Mittelholzer-Inland flights - TG
Aerial photographs by Walter Mittelholzer - Aadorf‎
Aerial photographs by Walter Mittelholzer - Altnau‎
Aerial photographs by Walter Mittelholzer - Amriswil
Aerial photographs by Walter Mittelholzer - Arbon
Aerial photographs by Walter Mittelholzer - Berlingen‎
Aerial photographs by Walter Mittelholzer - Bettwiesen‎
Aerial photographs by Walter Mittelholzer - Bischofszell‎
Aerial photographs by Walter Mittelholzer - Bürglen‎
Aerial photographs by Walter Mittelholzer - Diessenhofen
Aerial photographs by Walter Mittelholzer - Egnach
Aerial photographs by Walter Mittelholzer - Ermatingen‎
Aerial photographs by Walter Mittelholzer - Felben-Wellhausen‎
Aerial photographs by Walter Mittelholzer - Fischingen
Aerial photographs by Walter Mittelholzer - Frauenfeld‎
Aerial photographs by Walter Mittelholzer - Gachnang‎
Aerial photographs by Walter Mittelholzer - Güttingen TG‎
Aerial photographs by Walter Mittelholzer - Hauptwil-Gottshaus‎
Aerial photographs by Walter Mittelholzer - Horn TG‎
Aerial photographs by Walter Mittelholzer - Hüttwilen
Aerial photographs by Walter Mittelholzer - Kreuzlingen‎
Aerial photographs by Walter Mittelholzer - Mammern‎
Aerial photographs by Walter Mittelholzer - Matzingen‎
Aerial photographs by Walter Mittelholzer - Rickenbach TG‎
Aerial photographs by Walter Mittelholzer - Romanshorn‎
Aerial photographs by Walter Mittelholzer - Salenstein‎
Aerial photographs by Walter Mittelholzer - Schlatt TG‎
Aerial photographs by Walter Mittelholzer - Sirnach‎
Aerial photographs by Walter Mittelholzer - Steckborn‎
Aerial photographs by Walter Mittelholzer - Glarisegg‎
Aerial photographs by Walter Mittelholzer - Stettfurt‎
Aerial photographs by Walter Mittelholzer - Wängi‎
Aerial photographs by Walter Mittelholzer - Weinfelden‎
Aerial photographs by Walter Mittelholzer - Wigoltingen‎
ETH-BIB Mittelholzer-Inland flights - TI‎
ETH-BIB Mittelholzer-Inland flights - UR‎
ETH-BIB Mittelholzer-Inland flights - VD
ETH-BIB Mittelholzer-Inland flights - VS‎
ETH-BIB Mittelholzer-Inland flights - ZG‎
ETH-BIB Mittelholzer-Inland flights - ZH‎
ETH-BIB Mittelholzer-Inland flights - ZH - Küsnacht‎
ETH-BIB Mittelholzer-Inland flights - ZH - Zollikon‎
ETH-BIB Mittelholzer-Inland flights - ZH - Zürich‎
ETH-BIB Mittelholzer-Inland flights - Zürich
ETH-BIB Mittelholzer-Kilimanjaro flight 1929-1930
ETH-BIB Mittelholzer-Lake Chad flight 1930-1931‎
ETH-BIB Mittelholzer-Los Alcázares (Lake Chad flight 1930-1931)
ETH-BIB Mittelholzer-Mediterranean flight 1928‎
ETH-BIB Mittelholzer-North Africa flight 1932
ETH-BIB Mittelholzer-Persia flight 1924-1925‎
ETH-BIB Mittelholzer-Spitsbergen flight 1923
ETH-BIB Mittelholzer-Various flights abroad‎
Aviation accidents Swissair Lockheed Orion 28-06-1934‎
ETH-BIB Portraits
ETH-BIB Swissair
ETH-BIB Views
ETH-BIB Leo Wehrli‎
ETH-BIB Leo Wehrli - Algeria
ETH-BIB Leo Wehrli - Argentina
ETH-BIB Leo Wehrli - Austria‎
ETH-BIB Leo Wehrli - Bosnia and Herzegovina
ETH-BIB Leo Wehrli - Brazil
ETH-BIB Leo Wehrli - Bulgaria
ETH-BIB Leo Wehrli - Chile‎
ETH-BIB Leo Wehrli - Colombia‎
ETH-BIB Leo Wehrli - Croatia
ETH-BIB Leo Wehrli - Cuba
ETH-BIB Leo Wehrli - Egypt‎
ETH-BIB Leo Wehrli - France‎
ETH-BIB Leo Wehrli - Germany‎
ETH-BIB Leo Wehrli - Greece‎
ETH-BIB Leo Wehrli - Hungary
ETH-BIB Leo Wehrli - Israel
ETH-BIB Leo Wehrli - Italy‎
ETH-BIB Leo Wehrli - Luxembourg‎
ETH-BIB Leo Wehrli - Lybia‎
ETH-BIB Leo Wehrli - Malta‎
ETH-BIB Leo Wehrli - Monaco
ETH-BIB Leo Wehrli - Montenegro‎
ETH-BIB Leo Wehrli - Morocco‎
ETH-BIB Leo Wehrli - The Netherlands
ETH-BIB Leo Wehrli - Norway
ETH-BIB Leo Wehrli - Palestine‎
ETH-BIB Leo Wehrli - Poland
ETH-BIB Leo Wehrli - Portugal‎
ETH-BIB Leo Wehrli - Romania
ETH-BIB Leo Wehrli - Senegal
ETH-BIB Leo Wehrli - Serbia
ETH-BIB Leo Wehrli - Slovenia‎
ETH-BIB Leo Wehrli - Spain‎
ETH-BIB Leo Wehrli - Tunisia‎
ETH-BIB Leo Wehrli - Turkey‎
ETH-BIB Leo Wehrli - Ukraine‎
ETH-BIB Leo Wehrli - United Kingdom
ETH-BIB Leo Wehrli - United States
ETH-BIB Leo Wehrli - Uruguay
ETH-BIB Leo Wehrli - Diagrams‎
ETH-BIB Leo Wehrli - Maps‎
ETH-BIB Leo Wehrli - Nature‎
ETH-BIB Leo Wehrli - People
ETH-BIB Leo Wehrli - Reproductions‎
ETH-BIB Leo Wehrli - Ships
ETH-BIB Leo Wehrli - Unidentified locations‎
ETH-BIB Leo Wehrli - AG‎
ETH-BIB Leo Wehrli - AI
ETH-BIB Leo Wehrli - AR‎
ETH-BIB Leo Wehrli - BE‎
ETH-BIB Leo Wehrli - BL
ETH-BIB Leo Wehrli - BS‎
ETH-BIB Leo Wehrli - FR‎
ETH-BIB Leo Wehrli - GE
ETH-BIB Leo Wehrli - GL‎
ETH-BIB Leo Wehrli - GR‎
ETH-BIB Leo Wehrli - JU‎
ETH-BIB Leo Wehrli - LU‎
ETH-BIB Leo Wehrli - NE‎
ETH-BIB Leo Wehrli - NW
ETH-BIB Leo Wehrli - OW‎
ETH-BIB Leo Wehrli - SG‎
ETH-BIB Leo Wehrli - SH
ETH-BIB Leo Wehrli - SO‎
ETH-BIB Leo Wehrli - SZ‎
ETH-BIB Leo Wehrli - TG‎
ETH-BIB Leo Wehrli - TI‎
ETH-BIB Leo Wehrli - UR
ETH-BIB Leo Wehrli - VD‎
ETH-BIB Leo Wehrli - VS‎
ETH-BIB Leo Wehrli - ZH‎
ETH-BIB Leo Wehrli - ZG

Is this format ok for you? Obviously, not all of these categories are "bottom-level" categories in the strict sense, but they all contain images directly, i.e. we do not expect you to dig into further sub-categories.
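
A list in this format could be turned into one script invocation per category. The sketch below is a hypothetical Python helper, not part of MachineVision. It uses only the --outputFile and --category options quoted in the testing instructions and omits --deepcat so that subcategories are not followed. The helper name build_commands, the output directory, and the output file naming are made up for illustration, and it is an assumption that category names can be passed with spaces as written.

```python
import shlex

# Script path as quoted in the testing instructions of this task.
SCRIPT = "extensions/MachineVision/maintenance/createFileListFromCategoriesAndTemplates.php"


def build_commands(categories, output_dir):
    """Build one mwscript command line per category name (no --deepcat)."""
    commands = []
    for index, raw in enumerate(categories):
        # Pasted category lists can carry invisible left-to-right marks (U+200E).
        name = raw.replace("\u200e", "").strip()
        if not name:
            continue
        # Hypothetical naming scheme: one numbered output file per category.
        output_file = f"{output_dir}/category-{index:03d}.txt"
        commands.append(
            "mwscript " + SCRIPT
            + " --outputFile=" + shlex.quote(output_file)
            + " --category=" + shlex.quote(name)
        )
    return commands
```

The helper only builds the command strings and does not run anything, so the list can be reviewed before any of it is executed.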

Hi @CBogen, note that we are currently in the process of setting up a project from which many more of such requests are expected to arise.
@NavinoEvans and I are still very much interested in discussing possibilities to automate this process, e.g. by submitting requested categories through an API that could be called by the ISA Tool.

Cheers,
Beat

Noting that as a result of T296507 we'll have to make sure to run the script in smaller batches.
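
Running in smaller batches could look something like the sketch below. It is a generic Python illustration under stated assumptions, not what was actually run: the helper name batched and the batch size are hypothetical, and this thread does not say how the batches were formed.

```python
from itertools import islice


def batched(titles, batch_size):
    """Yield consecutive lists of at most batch_size titles."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(titles)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be written to its own file list and processed separately, so a failure only affects one batch.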

@BeatEstermann We'll put this back in our pipeline; as noted earlier we need about a month's lead time to complete it. For future similar requests, please file a new ticket, tag Structured-Data-Backlog and reference this ticket. Thank you!

FYI @Cparle

Done, processed 58495 files

I've checked the machine vision db tables for a selection of the images and they seem to have imported successfully. Can you confirm @BeatEstermann ?

@BeatEstermann checking in on this, thanks!

@CBogen Thank you for this. As far as I can tell, it worked: the JPEG files show up in the ISA Tool with tag suggestions. We are, however, still working on resolving the issue related to the processing of TIFF files by the ISA Tool (most of the files uploaded by the ETH Library are in TIFF format).

Okay, thanks. I'm going to resolve this ticket - as discussed, if you have future requests, please file a new ticket.