
Look into matching images of the same painting
Open, Lowest, Public, Feature


As discussed with @Halfak: For I made . This page offers pairs of files that might be of the same painting. Matching is done based on the creator and the institution. Underlying database queries are available at and

Would it be possible to use this data to offer better suggestions?

Event Timeline

We'd need someone to look into image processing strategies in order to get started on this.

Halfak triaged this task as Lowest priority. Aug 4 2016, 2:40 PM

Hi @calbon, you archived so now this open task has no active project tags associated, which means this task cannot be found by looking at any project workboard. Could you please either add an active project tag to this task, or close this task in case any of the resolution statuses apply? Thanks!

Putting this on the main AI board that is designed for interesting ideas that no one is ready to pick up yet like this one.

I think image embeddings could be a really effective strategy for this and it would also be relevant to a lot of other types of image modeling tasks. So, I figure @Miriam might be interested in it. Essentially, the idea (if I understand it) is to dedupe commons.

Essentially, the idea (if I understand it) is to dedupe commons.

That would imply deleting things. Goal is just to identify images of paintings so that they can be linked to the correct Wikidata item.

I think image embeddings could be a really effective strategy for this

If you're looking to do near duplicate detection an embedding model may be overkill and you can probably get a fair way with simpler features and some kind of fingerprinting (shingling or sketching). Certainly linking to a canonical set (the wikidata item) makes the end to end process easier since a lot of complexity with near duplicate detection is the picking of what should be the canonical example where 2+ items are near dups. Of course if you _do_ have some nice image embedding features around for another model they could definitely be useful.

Is this a good starter task for someone like me starting to poke around? ( This task looks to have been opened a while ago so not sure if it's stale... )

Yeah this is definitely a nice task to jump in, Mat.

Hello! Jumping in here. @Multichill and I worked on a quite simple near-duplicate detection algorithm based on a color-based fingerprint. It was good at detecting exact duplicates, less good for near duplicates, but at least was useful to discard pairs of very unrelated images.
@mat_kelcey we unfortunately don't have yet a simple way to generate image embeddings at scale internally, but maybe you can try this pytorch script to get alexnet or resnet-based vectors. It's very easy to use. Or, much simpler, you can use the ImageHash python package.
Hope this helps!
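
For anyone following along, here is a minimal sketch of the fingerprinting idea, assuming the image has already been decoded and downscaled to a small grayscale grid. It is an average hash written from scratch for illustration; it is not the color-based fingerprint described above and not the ImageHash implementation, and the function names `average_hash` and `hamming` are made up for this example.

```python
def average_hash(pixels):
    """Fingerprint a small grayscale grid (e.g. 8x8) as an integer.

    `pixels` is a list of rows of brightness values. Each bit of the result
    records whether a pixel is brighter than the mean of the whole grid.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Toy 2x2 grids: the second is a slightly re-exposed copy of the first,
# the third is its inverse.
original = [[0, 255], [255, 0]]
re_exposed = [[10, 250], [240, 5]]
inverted = [[255, 0], [0, 255]]

print(hamming(average_hash(original), average_hash(re_exposed)))
print(hamming(average_hash(original), average_hash(inverted)))
```

A small Hamming distance suggests a near duplicate, a large one lets you discard the pair early. Where to put the threshold would have to be tuned on real pairs of paintings.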

Thanks @Miriam for the info! On the image processing side I'm pretty comfortable with the options available and have some prior experience with near duplicate detection pipelines in general. I'll spend some time working through the various bits and pieces on the original page and see what I can get going. The original idea was from 2016, so not sure what's still applicable, but I'm happy to use it to get a sense of the moving parts. I did a fair bit of work with wikidata a number of years ago but it can't have changed too much :)

I'll look to post a proposal / design doc for a piece of work here in a week (or two) as I try to ramp up.


Hi @mat_kelcey! Do you need a sidekick on this task?

I'm new here - I applied for the Outreachy internship project Not Safe for Work (NSFW) media Classifier for Wikimedia Commons but in the end it turned out that I'm not eligible for this round because of my commitments as a student.

So now I'm looking for another way to contribute :)

I have some experience with deep learning, image processing, perceptual hashing, Python & shell scripting, as well as with research and writing in general (proposals, reports, etc.) Would any of this be useful to you?


Yeah absolutely! I've only made a slow start and so far I've spent some time poking around the way the data is structured; I've worked on knowledge graphs before, but not wikimedia's so just getting a sense of things. I feel like this piece of work, as with most (all?) data projects, is as much about the orchestration of the moving parts as the algorithmic details :) Going to spend a few hours on it tomorrow (one of my days off) and have some notes here that I'll add to tomorrow.

As a newbie to the whole Wikimedia ecosystem, what would you suggest as the best way to get up to speed on how different parts fit together? Are there APIs to access the various bits and pieces of data we would need for this project? Or do we need to create all pipelines from scratch? As you say @mat_kelcey, the algorithmic details seem like the most straightforward part of this - or maybe that's just how I perceive things from my currently limited POV :))

Hi folks, happy to see activity on this task. Current code at

Some background: At one of the hackathons I built a prototype to match items on Wikidata about paintings. For example in and gets matched. These exact matches are quite easy, if an image is in use on a painting on Wikidata I can just add the missing backlink.

It gets more interesting when you start matching on other facets. I queried both Wikidata and Commons to retrieve:

  • The painter who made the painting
  • The collection/location/institution the painting is in (slightly different definitions)
  • The inventory number of the painting

See,_institution_and_inventory_number_match and linked pages for all the matches.
Take for example this is matched with which uses so that's included for easy visual matching.
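
A minimal sketch of this kind of facet matching, assuming both sides have already been harvested into plain dicts. The field names, item ids and file titles below are made up for illustration, and this is not the code that produced the match pages.

```python
from collections import defaultdict


def match_on_facets(wikidata_items, commons_files,
                    keys=("creator", "institution", "inventory_number")):
    """Pair Wikidata items with Commons files that agree on every facet in `keys`.

    Records that miss any of the facets are skipped, since an empty value
    would otherwise match every other empty value.
    """
    index = defaultdict(list)
    for item in wikidata_items:
        facet = tuple(item.get(k) for k in keys)
        if all(facet):
            index[facet].append(item["id"])

    pairs = []
    for file in commons_files:
        facet = tuple(file.get(k) for k in keys)
        if all(facet):
            for item_id in index.get(facet, []):
                pairs.append((item_id, file["title"]))
    return pairs


# Hypothetical records, only to show the shape of the data.
wikidata_items = [
    {"id": "Q_EXAMPLE_1", "creator": "Painter A",
     "institution": "Museum X", "inventory_number": "INV-1"},
    {"id": "Q_EXAMPLE_2", "creator": "Painter B",
     "institution": "Museum X", "inventory_number": None},
]
commons_files = [
    {"title": "File:Example one.jpg", "creator": "Painter A",
     "institution": "Museum X", "inventory_number": "INV-1"},
    {"title": "File:Example two.jpg", "creator": "Painter B",
     "institution": "Museum X"},
]

strict = match_on_facets(wikidata_items, commons_files)
loose = match_on_facets(wikidata_items, commons_files,
                        keys=("creator", "institution"))
```

Passing a shorter `keys` tuple gives the looser creator and institution match, which produces more candidates but also more false positives to review.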

The code used to produce this is horrible and everything was running in memory. Memory usage slowly grew until it neared 8GB and it stopped working so I disabled it (2019-04-19 according to my log).
The concept of matching items of paintings on Wikidata with images of paintings on Commons still has a lot of potential: Wikidata has over 465.000 paintings ( ) and Commons at least 100.000 images of paintings without link to Wikidata ( ) and probably much more ( 3M files in

The right approach would probably be to set up a new platform with harvesting and processing of potential matches on one side and a nice user interface on the other side to process these matches.
I haven't taken that step yet because it's quite a lot of work and I would like to have someone involved for a nice frontend (still trying to convince @Husky ).

awesome, thanks @Multichill , my eventual question was where is the code so that gives me something to work through today :)

here's the general flow i've built in the past; just brain dumping and looking for feedback on what exists currently in the wikimedia toolset. i'd never intend for all this to be done in an MVP, some of it is marginal gain for non trivial effort.

a UI to review matches

  • would need to support not just the question does this pair of images match but
  • is this cluster of images all the same painting?
    • this is useful for the wikidata -> commons matching
    • but also useful for commons matching when there are no wikidata images (decideds) yet

i feel like annotation UIs are an exploding startup idea at the moment. even if they are paid i reckon we could wrangle something (it'd be worth giving us free access for their branding "powering wikimedia data deduplication"!!!)

an annotation queue

  • we should treat annotation as a queue since we want to always make the most valuable use of user time
    • it's also often the case that the resolution of one item changes the priority of another (or even the need to do it at all!)
  • a queue allows a mixing of items for review
    • the main items we want are duplicates we wish to resolve
    • but at the same time, to make overall progress, we sometimes want to drop in items that give information back to the system (e.g. the classic active learning loops, where uncertainty and diversity sampling queue items that are more focussed on giving information back to the system than on resolving the highest value items)
  • a queue though comes with the con that it requires some real time management; it's not just a huge list of candidates to review; i've never regretted building one though.
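
a rough sketch of what i mean by the queue, assuming a single in-memory process; the class and method names are made up for this example and nothing like this exists in the tooling yet. re-prioritising or dropping an item just marks the old heap entry as stale, and stale entries are skipped on pop.

```python
import heapq
import itertools


class AnnotationQueue:
    """Priority queue of review tasks that supports re-prioritising and dropping."""

    def __init__(self):
        self._heap = []                  # entries are (-priority, entry_no, task_id)
        self._live = {}                  # task_id -> entry_no of its current entry
        self._counter = itertools.count()

    def put(self, task_id, priority):
        """Add a task, or replace its priority if it is already queued."""
        entry_no = next(self._counter)
        self._live[task_id] = entry_no
        heapq.heappush(self._heap, (-priority, entry_no, task_id))

    def drop(self, task_id):
        """Forget a task, e.g. because resolving another item made it moot."""
        self._live.pop(task_id, None)

    def pop(self):
        """Return the highest priority live task, or None if nothing is left."""
        while self._heap:
            _, entry_no, task_id = heapq.heappop(self._heap)
            if self._live.get(task_id) == entry_no:
                del self._live[task_id]
                return task_id
        return None


queue = AnnotationQueue()
queue.put("pair-a", 0.5)
queue.put("pair-b", 0.9)
queue.put("pair-c", 0.7)
queue.put("pair-a", 0.95)   # new evidence bumps pair-a to the front
queue.drop("pair-b")        # pair-b was resolved as a side effect of another review
order = [queue.pop(), queue.pop(), queue.pop()]
```

a real version would need persistence and multiple reviewers, which is where the real time management cost comes from.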

a queue feeding system

  • feeding the queue are a number of batch systems that can enqueue entries; along the lines described above
    • highest matching candidates
    • uncertainty samples ( with respect to the matching algorithm )
    • diversity samples ( with respect to the training data )
  • the matching candidates would be fed incrementally (with a mix potentially)
    • (creator, institution, inventory number) matches
    • (creator, inventory number) matches
    • other fields
    • {phash(img_1), phash(img_2), phash(img_3), ... } matches

finally, a semantic matching system

  • finding a set of matching images {phash(img_i), , ... }
    • preference for this first version is that one of these images is wikidata, and the rest are from commons; i.e. we can bulk link commons back to wikidata
    • the longer term is sets of these where no items in the cluster are from wikidata; these form the original problem; how do we get an image for wikidata items that we can match to this cluster
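
a minimal sketch of the clustering step, assuming each image already has an integer fingerprint (a perceptual hash or similar). the function name, the `wd:` / `commons:` labels and the fingerprints are made up for illustration. it links any two images within a Hamming distance threshold and takes connected components with a small union-find.

```python
from itertools import combinations


def cluster_by_hash(hashes, max_distance=4):
    """Group image names whose fingerprints are within `max_distance` bits.

    `hashes` maps image name -> integer fingerprint. Matching is transitive:
    if a~b and b~c then a, b and c land in one cluster.
    """
    parent = {name: name for name in hashes}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(hashes, 2):
        if bin(hashes[a] ^ hashes[b]).count("1") <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for name in hashes:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Made-up fingerprints: three close variants and one unrelated image.
hashes = {
    "wd:item": 0b11110000,
    "commons:a": 0b11110001,
    "commons:b": 0b11100001,
    "commons:c": 0b00001111,
}
clusters = cluster_by_hash(hashes, max_distance=2)

# Clusters that contain a wikidata image can be bulk linked back to it.
linkable = [c for c in clusters if any(n.startswith("wd:") for n in c)]
```

the all-pairs loop is quadratic, so at commons scale this would need blocking (e.g. only comparing within the same creator) or an index over the hashes.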

( note that the pipeline pieces here relate very closely to a general annotation pipeline for gathering training data )

anyways, just brain dumping before kids are off to school. will start stepping through that code today.


@Multichill where would be a good place to ping you with low priority questions?

e.g. how you generated ""

( i get the feeling this ticket isn't the best place for an endless stream of trivial questions... happy to batch them up. )


anyone have a clue where i might be able to get the query for the paintings_without_wikidata_in_painting_category.txt mentioned in the last post? i could try to replicate it based on the result set but that doesn't always work so well.

minor update

i've been working on reproducing the original script in a PAWS notebook
been doing it in a notebook since it gives me a lot of flexibility to explore sideways

the one trouble is the notebook doesn't seem to be a good execution environment for these queries, which almost always time out. i can bumble through by retrying and caching results but it's painful. i need to work out how to either configure PAWS for longer queries or run these another way.
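
for reference, the retry and cache workaround looks roughly like this. it's a sketch: `run_query` stands in for whatever actually executes the query (it's a placeholder, not a real PAWS or Wikimedia API), and i'm assuming it raises `TimeoutError` on a timeout and returns JSON-serialisable rows.

```python
import hashlib
import json
import os
import time


def cached_query(query, run_query, cache_dir="query_cache",
                 retries=3, base_delay=1.0):
    """Run `query` via `run_query`, retrying on timeout and caching the result on disk."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".json")

    # A previous successful run wins; the query is not re-executed.
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    for attempt in range(retries):
        try:
            rows = run_query(query)
            break
        except TimeoutError:
            if attempt == retries - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** attempt)

    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f)
    return rows
```

the cache is keyed on the exact query text, so any edit to the query triggers a fresh run.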

will keep chipping away, it's good to work on something like this since it includes not just the image dedup but a lot of related moving parts

Aklapper changed the subtype of this task from "Task" to "Feature Request". Jan 3 2022, 11:11 AM