
search of related images on wikidata (for structured data on commons)
Open, Normal, Public

Description

We've received a request from the Structured Data team to 'create an ideal search' and to bring back 'related images' for a search.

A sample query could be "dog", with returned results such as "big dog", "small dog", "fast dog", "blue dog", etc.

Background: Google Images currently offers a carousel of "related things" to click on when doing a search for something like "dog".

The Structured Data team is wondering if we can turn that type of query into a list of related Wikidata items and then suggest that the user click through to see all items that are tagged as 'related searches' (suggested queries are structured data). This might be possible as an aggregation, but we would need to see how it affects our clusters.

A possible way to do this might be to figure out which Q items are most frequently tagged in the resulting Wikidata set and return a certain number of 'top' results to the user. Another option to investigate is to search Commons first and then go to Wikidata for filtering.
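The "most popular Q items" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the result dictionaries, their `items` key, and the `top_items` helper are all hypothetical and do not reflect any existing search API.

```python
from collections import Counter


def top_items(results, n=10):
    """Return the n Q items tagged on the most files in a result set.

    `results` is a hypothetical list of dicts, each with an "items" list
    holding the Q ids tagged on that file.
    """
    counts = Counter()
    for result in results:
        # Count each item once per file so a repeated tag does not dominate.
        counts.update(set(result.get("items", [])))
    return counts.most_common(n)


# Hand-written example data, not real search output.
results = [
    {"title": "File:A.jpg", "items": ["Q144", "Q7368"]},
    {"title": "File:B.jpg", "items": ["Q144"]},
    {"title": "File:C.jpg", "items": []},  # no structured data yet
    {"title": "File:D.jpg", "items": ["Q144", "Q38280"]},
]

print(top_items(results, n=1))
```

Files without any structured data simply contribute nothing to the counts, so the output only reflects the tagged subset of the results.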

Event Timeline

debt created this task. Jul 25 2019, 5:15 PM
Restricted Application added a subscriber: Aklapper. Jul 25 2019, 5:15 PM
debt triaged this task as Normal priority. Jul 25 2019, 5:15 PM
EBernhardson added a subscriber: Ramsey-WMF. Edited Jul 29 2019, 7:16 PM

@Ramsey-WMF This is basically what we can get "for free" from Elasticsearch as related structured data. It counts and displays the top N structured data statements over the search result set.

I wrote a basic Python script to modify our existing queries and generate the suggestions, and appended a few examples of real search queries. Generally this seems to run into a bit of a data incompleteness issue. A query might have 2k results but only 3 of them have structured data, so you get whatever happens to be there. Might be acceptable though, I dunno?
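For readers unfamiliar with the mechanism, a terms aggregation over the result set looks roughly like the sketch below. This is not the script referenced in this task. The field names `text` and `statement_keywords`, the aggregation name `related_items`, and both helper functions are assumptions made for illustration, and a plain `match` query stands in for the real search query.

```python
def build_related_items_query(search_text, top_n=10,
                              text_field="text",
                              statement_field="statement_keywords"):
    """Build a request body that counts the top N statements over the hits.

    Field names are assumptions, not confirmed index mappings.
    """
    return {
        "size": 0,  # return only the aggregation, not the hits themselves
        "query": {"match": {text_field: search_text}},
        "aggs": {
            "related_items": {
                "terms": {"field": statement_field, "size": top_n}
            }
        },
    }


def extract_counts(response):
    """Pull (key, doc_count) pairs out of a terms aggregation response."""
    buckets = response["aggregations"]["related_items"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Hand-written response in the shape Elasticsearch returns for a terms
# aggregation, reusing two rows from the "dog" results in this task.
sample_response = {
    "aggregations": {
        "related_items": {
            "buckets": [
                {"key": "Q144", "doc_count": 41},
                {"key": "Q7368", "doc_count": 28},
            ]
        }
    }
}

print(build_related_items_query("dog", top_n=10))
print(extract_counts(sample_response))
```

Because the aggregation only sees documents that carry statements, the incompleteness issue described above shows up directly in the bucket counts.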

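Query: flowers

| count | qid | label |
| ----- | ----- | ----- |
| 135 | Q506 | flower |
| 70 | Q4421 | forest |
| 60 | Q886167 | flowering plant |
| 30 | Q130882 | Allium ursinum |
| 30 | Q14923 | Nordkirchen |
| 30 | Q55880821 | Ichterloh |
| 28 | Q2095 | food |
| 27 | Q1669118 | Münster-Geistviertel |
| 24 | Q756 | plant |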
| 23 | Q130201 | Papaver rhoeas |

Query: mimas cassini

| count | qid | label |
| ----- | ----- | ----- |
| 2 | Q193 | Saturn |
| 1 | Q15034 | Mimas |

Query: french kings portraits

| count | qid | label |
| ----- | ----- | ----- |
| 1 | Q10711399 | education in Massachusetts |
| 1 | Q1140960 | Sonic Team |
| 1 | Q12107213 | King Jan III Sobieski meets emperor Leopold I near Schwechat |
| 1 | Q1313605 | The Emperor Napoleon in His Study at the Tuileries |
| 1 | Q15275746 | Portrait équestre de Jérôme Bonaparte |
| 1 | Q16994327 | Goosebumps |
| 1 | Q17321860 | Portrait of Louis Napoleon, King of Holland |
| 1 | Q1758615 | Dedham |
| 1 | Q17780822 | Portret van koning Willem III |
| 1 | Q18574017 | Erik XIV (1533-1577) |

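Query: dog

| count | qid | label |
| ----- | ----- | ----- |
| 44 | Q3578789 | Écopastoralisme |
| 41 | Q144 | dog |
| 29 | Q206252 | Citadel of Lille |
| 28 | Q7368 | sheep |
| 16 | Q247142 | black-tailed prairie dog |
| 13 | Q2735815 | De Dog |
| 6 | Q146066 | Rosa canina |
| 4 | Q1144318 | Pit bull |
| 4 | Q38280 | German Shepherd dog |
| 3 | Q10884 | tree |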

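Query: terrier

| count | qid | label |
| ----- | ----- | ----- |
| 5 | Q193475 | menhir |
| 4 | Q1144318 | Pit bull |
| 4 | Q221612 | Marmota monax |
| 4 | Q38287 | Jack Russell Terrier |
| 3 | Q247142 | black-tailed prairie dog |
| 2 | Q10538885 | Jack russell terrier |
| 2 | Q37550 | Boston Terrier |
| 2 | Q37612 | American Pit Bull Terrier |
| 2 | Q37617 | Airedale Terrier |
| 2 | Q38984 | Terrier |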
Ramsey-WMF added a subscriber: Abit.

Hey Erik. Thanks! This is a good start. I'm interested to learn a bit more about how it works and how much we can tweak it, but there do definitely seem to be some logistical limitations that we might not even be able to solve with more data on more files. Still, something is better than nothing! 😄

Adding Amanda and some extra tags so we can track this better.

Restricted Application added a project: Wikidata. Jul 30 2019, 2:27 AM

Script used to collect the above results: P8829. Note that this needs direct access to Elasticsearch, as CirrusSearch does not yet support this query.

I suppose there is also the significant terms aggregation. It's similar to the aggregation above, but it tries to take into account the frequency in the total document collection vs. the frequency in the result set. Essentially, this orders structured data statements by how much more likely they are to be found in the result set than in the overall document collection:
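To make the ordering concrete, here is a simplified sketch of a foreground-vs-background score in the style of the JLH heuristic (absolute change in frequency multiplied by relative change). It is an approximation written for illustration, not Elasticsearch's implementation, and the function names, field name, and example numbers are all hypothetical.

```python
def build_significant_items_agg(statement_field="statement_keywords", top_n=10):
    """Aggregation clause for significant terms; the field name is an assumption."""
    return {
        "related_items": {
            "significant_terms": {"field": statement_field, "size": top_n}
        }
    }


def jlh_style_score(fg_count, fg_total, bg_count, bg_total):
    """Score a statement by how over-represented it is in the result set.

    fg_* describe the result set (foreground), bg_* the whole collection
    (background). Statements that are not over-represented score 0.
    """
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    if fg_rate <= bg_rate:
        return 0.0
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


# Made-up numbers: a rare statement concentrated in the results versus a
# common statement that appears almost everywhere.
rare = jlh_style_score(fg_count=40, fg_total=100, bg_count=50, bg_total=10000)
common = jlh_style_score(fg_count=60, fg_total=100, bg_count=5000, bg_total=10000)
print(rare > common)
```

Under a score like this, a statement with a lower raw count can outrank one with a higher count, which may be why the rows in the results are not strictly sorted by count.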

Query: flowers

| count | qid | label |
| ----- | ----- | ----- |
| 124 | Q506 | flower |
| 34 | Q503978 | Silver-washed Fritillary |
| 28 | Q161538 | Buddleja davidii |
| 21 | Q130201 | Papaver rhoeas |
| 15 | Q158603 | Vanessa cardui |
| 13 | Q161745 | Dianthus barbatus |
| 12 | Q13465759 | Mentha × piperita |
| 12 | Q19848986 | Tripleurospermum inodorum |
| 11 | Q111346 | Convolvulus arvensis |
| 10 | Q1130177 | Centaurea jacea |

Query: mimas cassini

no results

Query: french kings portraits

no results

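Query: dog

| count | qid | label |
| ----- | ----- | ----- |
| 46 | Q3578789 | Écopastoralisme |
| 29 | Q206252 | Citadel of Lille |
| 17 | Q247142 | black-tailed prairie dog |
| 13 | Q2735815 | De Dog |
| 37 | Q144 | dog |
| 25 | Q7368 | sheep |
| 6 | Q146066 | Rosa canina |
| 5 | Q65037194 | T.M. Gresham Family Vault |
| 5 | Q65037400 | Cusack Monument |
| 4 | Q5570 | American Bully |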

Query: terrier

| count | qid | label |
| ----- | ----- | ----- |
| 4 | Q38287 | Jack Russell Terrier |
| 5 | Q193475 | menhir |
| 4 | Q221612 | Marmota monax |
| 4 | Q1144318 | Pit bull |
| 3 | Q247142 | black-tailed prairie dog |
Restricted Application added a project: Multimedia. Aug 6 2019, 1:55 AM
debt added a comment. Aug 8 2019, 5:08 PM

Hey @Ramsey-WMF - is this enough info for you and the team or do you need more from us? Thanks!

This is good for where we are now. As you know, we're looking at longer term systems to do this in a better way, but the script that Erik provided gives us something to work with in the meantime. Thanks!