
search of related images on wikidata (for structured data on commons)
Closed, Declined · Public


We've received a request from the Structured Data team to 'create an ideal search' and to bring back 'related images' for a search.

A sample query could be "dog", with returned results such as "big dog", "small dog", "fast dog", "blue dog", etc.

Background: Google Images currently offers a carousel of "related things" to click on when doing a search for something like "dog".

The Structured Data team is wondering whether we can turn that type of query into a list of related Wikidata items and then suggest that the user click through to see all items tagged as 'related searches' (the suggested queries are structured data). This might be possible as an aggregation, but we would need to see how it affects our clusters.

A possible way to do this might be to figure out what the most popular Q items are that are tagged in the resultant wikidata set and return a certain number of 'top' results to the user. Another option to investigate is to search Commons and then get into Wikidata for filtering.
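As a rough illustration of the "most popular Q items" idea above, here is a minimal Python sketch that counts how often each Wikidata item is tagged across a result set and returns the top few. The result structure (`title`, `items`), the file names, and the function name `top_items` are assumptions made up for this example; they are not an existing API.

```python
from collections import Counter


def top_items(results, n=5):
    """Return the n most frequently tagged Wikidata item IDs.

    `results` is a hypothetical list of dicts, each carrying an
    "items" list of Q-IDs tagged on that file.
    """
    counts = Counter()
    for result in results:
        # Count each item at most once per file so that a single
        # heavily tagged file cannot dominate the ranking.
        counts.update(set(result.get("items", [])))
    return counts.most_common(n)


# Made-up sample data, for illustration only.
results = [
    {"title": "File:A.jpg", "items": ["Q38280", "Q144"]},
    {"title": "File:B.jpg", "items": ["Q38280"]},
    {"title": "File:C.jpg", "items": []},  # file without structured data
    {"title": "File:D.jpg", "items": ["Q1144318", "Q144"]},
]

for item_id, count in top_items(results, n=2):
    print(count, item_id)
```

Files without any tagged items simply contribute nothing, so a sparsely tagged result set yields a short or unrepresentative list.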

Event Timeline

debt created this task. Jul 25 2019, 5:15 PM
Restricted Application added a subscriber: Aklapper. Jul 25 2019, 5:15 PM
debt triaged this task as Medium priority. Jul 25 2019, 5:15 PM
EBernhardson added a subscriber: Ramsey-WMF. Edited Jul 29 2019, 7:16 PM

@Ramsey-WMF This is basically what we can get "for free" from Elasticsearch as related structured data. It counts and displays the top N structured data statements over the search result set.

I wrote a basic Python script to modify our existing queries and generate the suggestions, and appended a few examples of real search queries. Generally this seems to run into a bit of a data incompleteness issue: a query might have 2k results but only 3 of them have structured data, so you get whatever happens to be there. Might be acceptable though, I dunno?

Query: flowers

60  Q886167  flowering plant
30  Q130882  Allium ursinum
23  Q130201  Papaver rhoeas

Query: mimas cassini


Query: french kings portraits

1  Q10711399  education in Massachusetts
1  Q1140960   Sonic Team
1  Q12107213  King Jan III Sobieski meets emperor Leopold I near Schwechat
1  Q1313605   The Emperor Napoleon in His Study at the Tuileries
1  Q15275746  Portrait équestre de Jérôme Bonaparte
1  Q17321860  Portrait of Louis Napoleon, King of Holland
1  Q17780822  Portret van koning Willem III
1  Q18574017  Erik XIV (1533-1577)

Query: dog

29  Q206252   Citadel of Lille
16  Q247142   black-tailed prairie dog
13  Q2735815  De Dog
6   Q146066   Rosa canina
4   Q1144318  Pit bull
4   Q38280    German Shepherd dog

Query: terrier

4  Q1144318   Pit bull
4  Q221612    Marmota monax
4  Q38287     Jack Russell Terrier
3  Q247142    black-tailed prairie dog
2  Q10538885  Jack russell terrier
2  Q37550     Boston Terrier
2  Q37612     American Pit Bull Terrier
2  Q37617     Airedale Terrier
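For readers who want a feel for how counts like the ones above could be produced, here is a minimal sketch of adding an Elasticsearch `terms` aggregation to an existing search body. This is not the script mentioned in the comment. The field name `statement_keywords`, the `text` match field, the aggregation name `top_statements`, and the `P180=...` key format are assumptions for illustration, and the sample response is hand-written in the shape Elasticsearch returns for a terms aggregation, reusing counts from the "flowers" example.

```python
import json


def with_statement_counts(search_body, field="statement_keywords", top_n=10):
    """Return a copy of a search body with a terms aggregation added.

    The aggregation asks Elasticsearch for the top_n most frequent
    values of `field` among the documents matching the query.
    """
    body = dict(search_body)  # shallow copy; the caller's dict is untouched
    aggs = dict(body.get("aggs", {}))
    aggs["top_statements"] = {"terms": {"field": field, "size": top_n}}
    body["aggs"] = aggs
    return body


def read_buckets(response, name="top_statements"):
    """Extract (count, key) pairs from an aggregation response."""
    buckets = response.get("aggregations", {}).get(name, {}).get("buckets", [])
    return [(bucket["doc_count"], bucket["key"]) for bucket in buckets]


# Hypothetical full-text query; the real queries are more involved.
body = with_statement_counts({"query": {"match": {"text": "flowers"}}}, top_n=3)
print(json.dumps(body, indent=2))

# Hand-written response in the documented bucket shape (assumed keys).
sample_response = {
    "aggregations": {
        "top_statements": {
            "buckets": [
                {"key": "P180=Q886167", "doc_count": 60},
                {"key": "P180=Q130882", "doc_count": 30},
                {"key": "P180=Q130201", "doc_count": 23},
            ]
        }
    }
}
for count, key in read_buckets(sample_response):
    print(count, key)
```

Because the aggregation only sees documents that carry the field at all, it inherits the data incompleteness issue described above.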
Ramsey-WMF added a subscriber: Abit.

Hey Erik. Thanks! This is a good start. I'm interested to learn a bit more about how it works and how much we can tweak it, but there do definitely seem to be some logistical limitations that we might not even be able to solve with more data on more files. Still, something is better than nothing! 😄

Adding Amanda and some extra tags so we can track this better.

Restricted Application added a project: Wikidata. Jul 30 2019, 2:27 AM

The script used to collect the above results is P8829. Note that it needs direct access to Elasticsearch, as CirrusSearch does not yet support this query.

I suppose there is also the significant terms aggregation. It is similar to the aggregation above, but it tries to take into account the frequency in the total document collection versus the frequency in the result set. Essentially, it orders structured data statements by how much more likely they are to be found in the result set than in the overall document collection:

Query: flowers

34  Q503978    Silver-washed Fritillary
28  Q161538    Buddleja davidii
21  Q130201    Papaver rhoeas
15  Q158603    Vanessa cardui
13  Q161745    Dianthus barbatus
12  Q13465759  Mentha × piperita
12  Q19848986  Tripleurospermum inodorum
11  Q111346    Convolvulus arvensis
10  Q1130177   Centaurea jacea

Query: mimas cassini

no results

Query: french kings portraits

no results

Query: dog

29  Q206252    Citadel of Lille
17  Q247142    black-tailed prairie dog
13  Q2735815   De Dog
6   Q146066    Rosa canina
5   Q65037194  T.M. Gresham Family Vault
5   Q65037400  Cusack Monument
4   Q5570      American Bully

Query: terrier

4  Q38287    Jack Russell Terrier
4  Q221612   Marmota monax
4  Q1144318  Pit bull
3  Q247142   black-tailed prairie dog
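To make the "result set vs. overall collection" ordering more concrete, here is a small sketch. It shows the request fragment for a `significant_terms` aggregation and a hand-rolled score that, to my understanding, follows the shape of Elasticsearch's default JLH heuristic (absolute change in frequency multiplied by relative change). The field name `statement_keywords`, the aggregation name, and all counts below are assumptions made up for illustration; the actual scoring happens inside Elasticsearch.

```python
def significant_statements_agg(field="statement_keywords", top_n=10):
    """Build the aggs fragment for a significant_terms aggregation."""
    return {"significant_statements": {"significant_terms": {"field": field, "size": top_n}}}


def jlh_style_score(fg_count, fg_total, bg_count, bg_total):
    """Score a term by how over-represented it is in the result set.

    fg_* describe the search results (foreground),
    bg_* describe the whole document collection (background).
    """
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        return 0.0  # not more common in the results than overall
    return (fg - bg) * (fg / bg)


# Made-up numbers: (count in results, count in whole collection).
terms = {
    "common statement": (60, 500_000),
    "rare but relevant statement": (30, 3_000),
}
ranked = sorted(
    terms,
    key=lambda t: jlh_style_score(terms[t][0], 100, terms[t][1], 1_000_000),
    reverse=True,
)
print(ranked)
```

With these numbers the rarer statement ranks first even though its raw count is lower, which is the difference from the plain top-N count.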
Restricted Application added a project: Multimedia. Aug 6 2019, 1:55 AM
debt added a comment. Aug 8 2019, 5:08 PM

Hey @Ramsey-WMF - is this enough info for you and the team or do you need more from us? Thanks!

This is good for where we are now. As you know, we're looking at longer term systems to do this in a better way, but the script that Erik provided gives us something to work with in the meantime. Thanks!

Restricted Application added a project: Structured-Data-Backlog. Mar 16 2020, 3:36 PM
CBogen added a subscriber: CBogen. Aug 27 2020, 8:54 PM

Closing because we're tackling this with a different approach using concept chips in T256431.

CBogen closed this task as Declined. Aug 27 2020, 8:54 PM