Let's crush the categorisation backlog once and for all!

Some categorisation could already be automated.

Searching for pictures based on their metadata is called "Concept Based Image Retrieval"; searching based on the content of the image, as recognized by machine vision, is called "Content Based Image Retrieval".

As I understood Lars' request, it asks for an automated way of finding the "superfluous" concepts or metadata for pictures based on their content. Of course, recognizing an image's content is very hard (and subjective), but I think it would be possible for many of these "superfluous" categories, such as "winter landscape", "summer beach" and perhaps also "red flowers" and "bicycle".

There are many open-source "Content Based Image Retrieval" systems today. As I understand it, they basically work like this: you give them a picture, and they return the "matching" pictures, each accompanied by a score. Now suppose we query with a picture whose content is known (a picture from Commons with good metadata); we could then, with some degree of confidence, find pictures with overlapping categories. I am not sure whether this kind of automated reverse metadata labelling should be done for only one category at a time, or whether some kind of "category bundles" would work better. Adjectives and items should probably be compounded (e.g. "red flowers").
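The labelling idea above could be sketched roughly as follows. This is only an illustration under assumptions: `find_similar` is a hypothetical stand-in for whatever query call a real retrieval engine offers (it is not the API of Lire or any other system), scores are assumed to lie between 0 and 1, and the `min_score` threshold of 0.8 is an arbitrary choice. Compounded categories such as "Red flowers" are simply treated as single labels.

```python
from collections import defaultdict


def propagate_categories(labelled, find_similar, min_score=0.8):
    """Suggest categories for unlabelled pictures.

    labelled     -- dict: image id -> set of categories (good metadata)
    find_similar -- hypothetical callable: image id -> list of
                    (matching image id, score) pairs, score in [0, 1]
    min_score    -- assumed trust threshold; weaker matches are ignored

    Returns dict: unlabelled image id -> {category: best score seen}.
    """
    suggestions = defaultdict(dict)
    for image_id, categories in labelled.items():
        for match_id, score in find_similar(image_id):
            # Skip pictures that are already labelled and weak matches.
            if match_id in labelled or score < min_score:
                continue
            for category in categories:
                best = suggestions[match_id].get(category, 0.0)
                suggestions[match_id][category] = max(best, score)
    return dict(suggestions)


# Made-up data standing in for a real engine's results.
fake_index = {
    "rose.jpg": [("unknown1.jpg", 0.9), ("unknown2.jpg", 0.4)],
    "tulip.jpg": [("unknown1.jpg", 0.85), ("rose.jpg", 0.95)],
}
labelled = {
    "rose.jpg": {"Red flowers"},
    "tulip.jpg": {"Red flowers", "Gardens"},
}
result = propagate_categories(labelled, lambda i: fake_index.get(i, []))
print(result)
```

Here only "unknown1.jpg" receives suggestions; "unknown2.jpg" falls below the threshold. Keeping the score next to each suggested category would let a human reviewer, or a tool similar to the Wikidata Game, confirm or reject the suggestions rather than applying them blindly.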

Relevant articles and links from Wikipedia:

  1. w:Image_retrieval
  2. w:Content-based_image_retrieval
  3. w:List_of_CBIR_engines#CBIR_research_projects.2Fdemos.2Fopen_source_projects

Some demo links bawolff found:
  1. (Lire might even be integrated with CirrusSearch because it's based on Lucene)
  4. (not free)
  • Skills: image recognition/analysis, possibly natural language processing; the language depends on the implementation, e.g. Python for a PWB tool, JavaScript and PHP for a MediaWiki extension, JavaScript for a tool similar to the Wikidata Game.
  • Possible mentors: Kristian Kankainen (Keeleleek); WereSpielChequers?
  • Additional info: "I like the idea of automating categorisation, but I think we are a long way from being able to do much of it. So this would be a big longterm project. One of my concerns is that we are a global site, and we are trying to collect the most diverse set of images that anyone has ever assembled. Image recognition is a good way of saying that we now have another twenty images of this person, but it could be confused when we get our first images of one of the fox subspecies that we don't yet have a picture of. Or rather it would struggle to differentiate the rare and the unique from their more common cousins. There are also some spooky implications for privacy re image recognition and our pictures of people, aside from the obvious things like identifying demonstrators in a crowd or linking a series of shots of one person in such a way as to identify that this photograph of a face belongs to the same person as this photo of pubic hair because the hand is identical; We have had some dodgy things happening on Wikipedia with people wanting to categorise people ethnically and I worry that someone might use a tool such as this to try and semi accurately categorise people as say Jewish. Another major route for improved categorisation is geodata, and I think this could be a less contentious route. Not that everything has geodata, but if things have it could be a neat way to categorise a lot of images, especially if we can get boundary data so we can categorise images as being shot from within a set of boundaries rather than centroid data with all its problems that the parts of one place maybe closer to the centre of an adjacent place than the centre of the area they belong to. WereSpielChequers (talk) 09:58, 20 June 2014 (UTC)"
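The boundary-versus-centroid point in the comment above can be illustrated with a small sketch. It assumes boundaries are available as simple polygons of (longitude, latitude) vertices and treats coordinates as planar, which is only reasonable for small areas away from the antimeridian; the place names and shapes are invented for the example.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def places_containing(lon, lat, boundaries):
    """Names of all places whose boundary polygon contains the point."""
    return [name for name, polygon in boundaries.items()
            if point_in_polygon(lon, lat, polygon)]


def nearest_centroid(lon, lat, centroids):
    """Name of the place whose centroid is closest to the point."""
    return min(centroids,
               key=lambda name: (centroids[name][0] - lon) ** 2
               + (centroids[name][1] - lat) ** 2)


# Invented example: a large place A next to a small place B.
boundaries = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 4), (12, 4), (12, 6), (10, 6)],
}
centroids = {"A": (5, 5), "B": (11, 5)}

photo = (9.5, 5)  # taken inside A, close to its border with B
print(places_containing(*photo, boundaries))  # ['A']
print(nearest_centroid(*photo, centroids))    # B
```

The photo lies inside A, yet its nearest centroid belongs to B, which is the kind of misassignment that boundary data would avoid.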

