A list of meaningful Commons Categories whose images can be used to train image classifiers
Closed, Resolved, Public

Description

By extending the category list of the COCO-Stuff dataset, find a meaningful list of Commons categories whose images can be used for generic image classification.

Event Timeline

Previous week updates (from T242229):

  • Downloaded the list of COCO-Stuff classes, which includes highly generic categories of people, animals, and things that exist in the visual world: https://github.com/nightrome/cocostuff
  • Downloaded the list of categories in Commons, with the number of images per category.
  • To create the initial seed of categories we want to consider for object categorization in Commons, I computed fastText vectors for both the COCO categories and the Commons categories, and I am now checking which Commons categories we can use to represent each COCO category.
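
The matching step in the last bullet can be sketched as a nearest-neighbour search by cosine similarity. This is only an illustration, not the script used for this task: `embed` is a toy stand-in that builds a sparse bag of character n-grams (loosely mimicking fastText's subword idea) so the example runs without a trained model, and the function names, the `k` cut-off, the `min_sim` threshold, and the sample category names are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Character n-grams of a lowercased, boundary-padded string."""
    padded = "<" + text.lower().replace("_", " ") + ">"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def embed(text):
    """Toy stand-in for a word vector: sparse counts of character n-grams."""
    return Counter(char_ngrams(text))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(coco_labels, commons_categories, k=3, min_sim=0.3):
    """For each COCO label, return up to k Commons categories above min_sim."""
    commons_vectors = {c: embed(c) for c in commons_categories}
    matches = {}
    for label in coco_labels:
        label_vector = embed(label)
        scored = [
            (cosine(label_vector, vector), category)
            for category, vector in commons_vectors.items()
        ]
        scored = [(s, c) for s, c in scored if s >= min_sim]
        # Highest similarity first; ties broken alphabetically for stability.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        matches[label] = [category for _, category in scored[:k]]
    return matches


if __name__ == "__main__":
    # Hypothetical category names, for illustration only.
    commons = ["Giraffes in zoos", "Bridges in France", "Giraffa camelopardalis"]
    print(top_matches(["giraffe"], commons))
```

With real fastText vectors the candidates would be dense vectors rather than n-gram counts, and with millions of Commons categories a brute-force loop like this would likely need to be replaced by a batched or approximate nearest-neighbour search.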

Weekly updates:

  • 75% of the category list has been cleaned. Will finish the clean-up this week and start the image download on stat1005.

@Miriam can you explain briefly what the challenge with the categories is?

@leila the challenge was to map a set of general categories to the very specific Commons categories.
I used a semi-automated approach: I took the list of 5M+ categories from Commons and tried to match them with the 200 COCO categories using word vectors. However, word vectors are not necessarily the best solution for this problem. While this approach helped reduce the search space, I had to do a lot of cleaning up of the resulting COCO-Commons matches, either by removing irrelevant Commons categories or by manually searching for additional ones. This is now done, although it is open for improvement.
Below you can find the list of 160 COCO categories for which we have matches in the set of Commons categories, and the corresponding total number of images expected.

This is the raw list of Commons categories associated with each COCO category:
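
As a side note on how an expected image total per COCO category could be derived from such a mapping: the sketch below sums the per-category image counts over the Commons categories kept after manual clean-up. It is a hedged illustration only; the function name, the `removed` set, and the numbers in the usage example are made up and are not the figures from this task.

```python
def expected_image_totals(matches, image_counts, removed=frozenset()):
    """Sum Commons image counts per COCO label, skipping removed categories.

    matches:      dict mapping a COCO label to a list of Commons categories
    image_counts: dict mapping a Commons category to its number of images
    removed:      Commons categories discarded during manual clean-up
    """
    totals = {}
    for coco_label, commons_categories in matches.items():
        kept = [c for c in commons_categories if c not in removed]
        if kept:  # COCO labels with no surviving match are left out
            totals[coco_label] = sum(image_counts.get(c, 0) for c in kept)
    return totals


if __name__ == "__main__":
    # Made-up data, for illustration only.
    matches = {
        "giraffe": ["Giraffes in zoos", "Giraffe sculptures"],
        "bridge": ["Bridges in France"],
        "frisbee": ["Unrelated category"],
    }
    counts = {"Giraffes in zoos": 120, "Giraffe sculptures": 30, "Bridges in France": 500}
    print(expected_image_totals(matches, counts, removed={"Unrelated category"}))
```

Note that an image belonging to several matched Commons categories is counted once per category here, so these totals are an upper bound on the number of distinct images.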

Next up: image download

Weekly update: the first round of this is done. There were a number of challenges, which I will describe in the final report on the feasibility of building our own image classifiers. I also wrote a script to download images and ran it for a few days on stat1005. 700k images have been downloaded so far!
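
For readers who want to reproduce a small version of the download step, here is a minimal sketch that lists file URLs for one Commons category through the public MediaWiki API (`generator=categorymembers` combined with `prop=imageinfo`). It is not the script that ran on stat1005, whose details are not given here; the function names, the `max_files` cap, and the User-Agent string are assumptions.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"
# Hypothetical User-Agent; replace with a descriptive one including contact info.
USER_AGENT = "commons-category-sketch/0.1 (example script)"


def build_params(category, cont=None, limit=50):
    """Query parameters listing the files of a category with their URLs."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "categorymembers",
        "gcmtitle": "Category:" + category,
        "gcmtype": "file",
        "gcmlimit": str(limit),
        "prop": "imageinfo",
        "iiprop": "url",
    }
    if cont:
        params.update(cont)  # continuation tokens returned by the previous request
    return params


def extract_urls(response):
    """Pull the original-file URL of each page out of one API response."""
    pages = response.get("query", {}).get("pages", [])
    urls = []
    for page in pages:
        info = page.get("imageinfo", [])
        if info and "url" in info[0]:
            urls.append(info[0]["url"])
    return urls


def iter_file_urls(category, max_files=100):
    """Yield up to max_files file URLs, following API continuation."""
    cont, seen = None, 0
    while True:
        query = urllib.parse.urlencode(build_params(category, cont))
        request = urllib.request.Request(
            API_URL + "?" + query, headers={"User-Agent": USER_AGENT}
        )
        with urllib.request.urlopen(request, timeout=30) as resp:
            data = json.load(resp)
        for file_url in extract_urls(data):
            yield file_url
            seen += 1
            if seen >= max_files:
                return
        cont = data.get("continue")
        if not cont:
            return
```

Calling `iter_file_urls` with a category name and passing each URL to `urllib.request.urlretrieve` would fetch the files one by one. For a download on the order of hundreds of thousands of images, rate limiting, retries, and resumable progress tracking would also be needed.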