Analyze all currently available Commons images and compile a list of general classes, such as place.
The goal is to populate the "what does this work contain?" question in Upload Wizard's Describe step; see the prototype and T358614: UX design for improving describe step in upload wizard.
Tasks
- look for already available statistics
- use the wmf_raw.mediawiki_page data lake table for file pages and, possibly, wmf_raw.mediawiki_categorylinks for categories
- map classes/categories to Wikidata items, as a nice to have
- let a machine-learned model classify all available images
- [ON HOLD] download thumbnails of all available Commons images
- look for pre-trained models that classify images into a few generic classes, instead of the 1k ImageNet classes predicted by EfficientNet
- download the mapping of the 1k ImageNet classes to WordNet 3.0 synsets, where line 1 = n01440764 = tench = class 0
- traverse WordNet's synset tree with NLTK
- walk up the tree to a generic-enough level, e.g., Animal. See also https://observablehq.com/@mbostock/imagenet-hierarchy
- agree with the team on suitable generic synsets
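
A minimal sketch of the WordNet traversal steps above, not a settled implementation. It assumes NLTK with its default WordNet 3.0 corpus is installed, and that the team's agreed generic synsets are given as a set of NLTK synset names such as `animal.n.01`. The function names (`wnid_to_pos_offset`, `wnid_to_synset`, `generalize`) are hypothetical and only serve the illustration.

```python
from typing import Optional, Set, Tuple


def wnid_to_pos_offset(wnid: str) -> Tuple[str, int]:
    """Split an ImageNet WordNet ID such as 'n01440764' into ('n', 1440764)."""
    return wnid[0], int(wnid[1:])


def wnid_to_synset(wnid: str):
    """Look up the synset for an ImageNet wnid.

    Assumption: NLTK and its WordNet corpus are available,
    e.g. after running nltk.download("wordnet").
    """
    from nltk.corpus import wordnet as wn  # imported lazily on purpose

    pos, offset = wnid_to_pos_offset(wnid)
    return wn.synset_from_pos_and_offset(pos, offset)


def generalize(synset, generic_names: Set[str]) -> Optional[str]:
    """Return the closest ancestor (or the synset itself) listed in generic_names.

    Returns None when no hypernym path passes through a generic synset.
    """
    best_name = None
    best_distance = None
    # hypernym_paths() yields paths ordered from the root down to the synset,
    # so each path is reversed to walk upwards from the synset.
    for path in synset.hypernym_paths():
        for distance, ancestor in enumerate(reversed(path)):
            if ancestor.name() in generic_names:
                if best_distance is None or distance < best_distance:
                    best_name, best_distance = ancestor.name(), distance
                break  # the first hit on this path is its closest one
    return best_name
```

With the corpus installed, a class could be generalized with `generalize(wnid_to_synset("n01440764"), {"animal.n.01"})`. Picking the closest matching ancestor keeps the result stable if the agreed set later contains both a broad synset and a narrower one on the same path.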