
Implement searching of 'depicts' on commons
Open, Normal, Public

Description

One of the steps for Structured Data on Commons is adding 1 or more 'depicts' statements to files

File pages need to be findable using these statements

This is the master ticket for the work

Related Objects

Status | Assigned | Task
Declined | dchen
Open | None
Open | None
Duplicate | None
Open | None
Resolved | Abit
Open | None
Duplicate | None
Open | None
Open | None
Open | None
Open | None
Open | None
Open | Cparle
Resolved | Cparle
Resolved | Cparle
Resolved | Cparle
Resolved | Cparle
Open | None
Open | Cparle
Resolved | Cparle
Resolved | Cparle
Open | None
Duplicate | Cparle
Resolved | Cparle
Resolved | Cparle

Event Timeline

Cparle created this task. Apr 6 2018, 1:42 PM
Restricted Application added a project: Wikidata. Apr 6 2018, 1:42 PM
Restricted Application added a subscriber: Aklapper.
Ramsey-WMF moved this task from Untriaged to Triaged on the Multimedia board. Apr 17 2018, 5:40 PM
Ramsey-WMF added a subscriber: Ramsey-WMF.
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Apr 23 2018, 7:18 AM
Jheald added a subscriber: Jheald. Jul 14 2018, 3:10 PM

The attached subtickets are an interesting read. They all seem to be based on taking the Q-number value of "depicts", storing it as a string in the text-search index, and then doing an indexed string-match for it. Of course first baby-steps are important, and this facility will be crucial to be able to confirm correct entry, storage, and direct retrievability of "depicts" values.

But it seems a long way short of the functionality that has generally been assumed for retrieval based on "depicts" values, and that is ultimately going to be needed.

Has the team had any initial thoughts on which strategies are likely to be available, preferred or required to make such more general retrieval achievable? And what sort of back-end requirements may be needed to make the system acceptably responsive (i.e. near-instant returns) and able to cope with full production load? Are there tickets open for these questions anywhere?

To give an idea of the sort of issue that's motivating my question, consider a user search for "man with hat".

One of the images we might hope to see included in the set returned to the user might be File:Giovanni Bellini, portrait of Doge Leonardo Loredan.jpg.

But the CommonsData for this file will most probably not include the literal statements "depicts man" or "depicts hat".

Instead, most probably, it would have the statement "depicts Q1759759", the Wikidata item for the painting.

Even if it was described in-situ, it would most probably have the statement "depicts Q250210" (Leonardo Loredan) rather than "depicts man"; and depicts (or P1354 "shown with features") Q1134210 "doge's cap".

A simple string search for depicts Q8441 "man" or depicts Q80151 "hat" is not going to match it.

It would seem that, at the very least, search is going to need to match to any items in the wikidata subclass (P279) trees of the search terms.
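To make the subclass point concrete, here is a minimal sketch of expanding a search term over a subclass (P279) tree before matching. The tiny graph and its labels are made-up placeholders (not real Wikidata data), and the function names are hypothetical; a real implementation would query the Wikidata graph rather than a Python dict.

```python
from collections import deque

# Toy subclass-of (P279) edges: child -> parents. The labels are
# illustrative placeholders, not real Wikidata items.
SUBCLASS_OF = {
    "doge's cap": ["cap"],
    "cap": ["hat"],
    "top hat": ["hat"],
    "hat": ["headgear"],
}


def descendants(term, subclass_of):
    """Return term plus every class below it in the subclass tree."""
    # Invert child -> parents into parent -> children once.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def matches(depicted, search_term, subclass_of):
    """True if the depicted class is the search term or one of its subclasses."""
    return depicted in descendants(search_term, subclass_of)


hat_like = descendants("hat", SUBCLASS_OF)
```

With this toy data, a file that depicts "doge's cap" matches a search for "hat", whereas a plain string match on the Q-number for "hat" would miss it.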

(As an aside, it may be worth remembering that Blazegraph can get into difficulties (see T116298) if queries combine multiple path expansions, as all of ours would, without careful hinting to explore such expansions in a many-to-few direction. For some reason it also sometimes traverses such paths much more quickly if they are given in reverse (i.e. ?b ^prop* ?a rather than ?a prop* ?b), even when a direction hint is given. So this might need care.)

A further complication is that male individuals on wikidata are not represented as instances of "man" (Q8441), but rather as instances of "human" (Q5), with property "sex or gender" (P21) = "male" (Q6581097).

The user interface will therefore presumably need to lead the user away from searching for "man with hat" towards "human being who is male with hat". Similarly, if a user is searching for images that depict "bald politician", wikidata does not represent individuals as instances (P31) of "politician" (Q82955); instead they are instances of "human" (Q5), with "occupation" (P106) = "politician". The faceted search UI is going to need to make it a lot easier for people to enter "human" and then "occupation", rather than "politician".
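As a rough illustration of the rewrite described above, the sketch below maps a searched class onto "instance of human plus a property constraint" and renders the result as SPARQL triple patterns. The item and property IDs are the ones quoted in this thread; the REWRITES table and both function names are hypothetical, and a real UI would presumably derive such mappings rather than hard-code them.

```python
# Hypothetical rewrite table: classes that Wikidata models as
# "human plus a property" rather than as a direct instance-of value.
REWRITES = {
    "Q8441": ("Q5", {"P21": "Q6581097"}),   # man -> human, sex or gender = male
    "Q82955": ("Q5", {"P106": "Q82955"}),   # politician -> human, occupation = politician
}


def to_facets(class_id):
    """Return (instance-of class, property constraints) for a searched class."""
    return REWRITES.get(class_id, (class_id, {}))


def to_sparql_patterns(class_id, var="?item"):
    """Render the facets as SPARQL triple patterns (a fragment, not a full query)."""
    target, constraints = to_facets(class_id)
    lines = [f"{var} wdt:P31 wd:{target} ."]
    for prop, value in constraints.items():
        lines.append(f"{var} wdt:{prop} wd:{value} .")
    return lines
```

Classes without an entry in the table, such as Q80151 "hat", pass through unchanged.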

But the big question, at least for me: are the team confident that searches like these are going to be deliverable in reasonable time; and still deliverable in reasonable time at production load?

Is it a concern, for example, that anything that requires materialising the whole set of instances of Q5 "human" is inevitably going to be slow? The number of such items on Wikidata is currently 4.375 million, and Commons will add even more. For example, just counting the number of such items with preferred images (P18) currently takes no less than 50 seconds (tinyurl.com/y7ddk8wm).

Of course, even with searches involving instances of Q5, it might not be necessary to materialise the full set if the final result set will ultimately be LIMITed to only, say, 2000 results. (Though beware that the labels SERVICE currently requires the entire result set to be materialised, even with a LIMIT directive.) It's also true that other clauses in the search, if present, may well be more restrictive and therefore, if identified by the optimiser, may open up a faster solution path. Nevertheless, some searches might still require quite deep tree explorations, materialisations of quite big intermediate sets, and quite big merges, even to produce only a few hundred results.
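The difference between stopping at the LIMIT and materialising everything first can be sketched with a toy generator. This is only an analogy in plain Python, not a model of how Blazegraph or the labels SERVICE actually execute; all names here are hypothetical.

```python
from itertools import islice


def candidate_items(total, counter):
    """Lazily yield fake item ids, counting how many were actually produced."""
    for i in range(total):
        counter["produced"] += 1
        yield f"item-{i}"


def first_n_lazy(total, limit):
    """Stop pulling candidates as soon as the limit is reached."""
    counter = {"produced": 0}
    results = list(islice(candidate_items(total, counter), limit))
    return results, counter["produced"]


def first_n_materialised(total, limit):
    """Build the full candidate list first, then cut it down to the limit."""
    counter = {"produced": 0}
    everything = list(candidate_items(total, counter))
    return everything[:limit], counter["produced"]
```

Both functions return the same 2000 results for a limit of 2000, but the second one has to produce every candidate before it can return any of them.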

As a team, do we have a target for how quickly results need to start being returned from such a search, if the system is to be considered decently responsive?

Are we confident that this is achievable, for the faceted search back-end running in the full context of the entire wikidata dataset plus a maximal level of image description in CommonsData?

And how many such searches do we anticipate the system will need to be able to field per hour, in full production use?

(A first estimate might perhaps be some multiple of the number of Commons category pages currently served per hour; though if faceted search becomes as friendly, intuitive and powerful a tool as people hope, it should soon become considerably more popular than this.)

Another example, where such image searches may depend quite sensitively on query construction or query optimisation: Category:Grade I listed buildings in Bedfordshire.

One of the aims for faceted search is to be able to match (or even replace) the capabilities of Commons categories like this.

In fact the category already has a query on it, tinyurl.com/yan83fm2, that may be quite similar to what such an image search could/should produce; except that it has

?item wdt:P18 ?file

which is possible now, rather than

?file "depicts" ?item

which should become possible as soon as CommonsData supports the "depicts" statement.

A key part of the query is the bit that generates the set of items to be depicted. This would presumably be pretty much unchanged with the move to CommonsData. But it is quite sensitive to the order in which the statements are taken.

Taken in this order (forced by the query hint), the query executes in 2.9 seconds, which should be acceptable:

hint:Query hint:optimizer "None" .
?item wdt:P131+ wd:Q23143 .
?item wdt:P1435 wd:Q15700818 .
?item wdt:P31?/wdt:P279* wd:Q41176

However, without the query hint the query is much slower, taking 50.5 seconds (for images from only 53 buildings) -- which I think people would find too long to be acceptable, and which might rapidly turn into an unsustainable server load, if very many people were trying to run queries like this at the same time.
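Why statement order matters so much can be shown with a toy model: intersecting the same three constraint sets in two different orders and counting how many candidates get scanned along the way. The set sizes are invented for illustration and this is not how Blazegraph evaluates joins internally; it only shows the general effect of starting from the most selective constraint.

```python
def evaluate_in_order(constraints):
    """Intersect constraint sets in the given order.

    Returns (result, scanned), where 'scanned' is the number of candidates
    carried into each filtering step, a rough stand-in for intermediate
    result size.
    """
    ordered = list(constraints)
    current = set(ordered[0])
    scanned = 0
    for other in ordered[1:]:
        scanned += len(current)
        current = current & other
    return current, scanned


# Invented toy sets of item ids (not real Wikidata counts).
buildings = set(range(100000))             # very broad class
in_bedfordshire = set(range(0, 100000, 100))  # 1000 items
grade_one = set(range(0, 100000, 250))        # 400 items

selective_first = evaluate_in_order([grade_one, in_bedfordshire, buildings])
broad_first = evaluate_in_order([buildings, in_bedfordshire, grade_one])
```

Both orders give the same 200 items, but the broad-first order scans 101000 candidates against 600 for the selective-first order.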

This is why I worry quite a lot as to whether or not it will actually be possible to build faceted search in such a way that we can guarantee it will return with acceptable results in acceptable time.

I have now found T198261 and T199119, which investigate how some of the drawbacks above with the simple string-matching approach might be addressed.

@Jheald thanks for the detailed comment. We have indeed worked on a number of the things you've mentioned, but for the sake of brevity I'll focus on the "man with hat" example you gave.

What we're working on now is similar to what you talk about. We aim to introduce an explicit depicts search to supplement searching for plain text strings.

You can see round 1 of designs for this here: https://wikimedia.invisionapp.com/share/B9MYIFJGVX7#/screens/308729068

As you said:

The user interface will therefore presumably need to lead the user away from searching for "man with hat" towards "human being who is male with hat".

This is exactly what we're attempting to do in search: a new interface that allows users to specify depicts statements and the qualifiers that apply to them (we're starting with the default P180 qualifiers now but encourage Commonists to talk about which others they'll need). Additionally, we'll be adding features in UploadWizard, the File page, and other areas to encourage users to add depicts statements with useful qualifiers so that this kind of search can actually work (perhaps with the assistance of suggestions from image content recognition systems).

We are also currently building alpha working versions that allow us to test performance so we know for sure what's feasible in terms of response time, server load, etc. We hope to have this all ready in an update by end of year.

Jheald added a comment. Edited Jul 16 2018, 1:45 PM

Interesting slide-show. But the fundamental problem -- as some of the attached tickets start to appreciate -- is that the key information that determines whether an image fits the specification being looked for is not going to be stored in statements just on the CommonsData item for the image in question, nor just on the Wikidata item for the thing it depicts, but is going to depend on statements distributed throughout the database.

Take for instance the example raised above of Category:Grade I listed buildings in Bedfordshire.

All that you would expect to find on a particular image would be, e.g., depicts: St Andrew's church, Ampthill (Q17528295), perhaps with a qualifier shown with features: tower or shown with features: porch.

But the data on the image is not going to tell you (or perhaps rather: should not be telling you) that the building is in Ampthill, or that the building is a church, or that it is Grade I listed. Instead, that will be stored (as a variety of properties) on the Wikidata item Q17528295.

It's not a question of just adding the right "depicts" and the right qualifiers on the CommonsData item for the image, and being able to search for them -- that's not where this information lives.

Even the statements on the Wikidata item Q17528295 will not tell you directly whether the item fits the specification. Q17528295 will tell you that the item is in Ampthill, but it will be another item that tells you that Ampthill is in Central Bedfordshire, and another item that tells you that Central Bedfordshire is in Bedfordshire. Similarly Q17528295 will say that the item depicted is a church, but it is only a whole chain of further items that establish that a church is a kind of building.

The expectation was that some kind of faceted search system would be needed for the search system to be able to match the existing capability to drill down to image-sets in and below categories like Category:Grade I listed buildings in Bedfordshire. One might first specify that one was looking for a building, and be presented with up to 2000 images of buildings; then one would be offered the choice either to refine the type of building, or to choose from a list of the most common properties that items for buildings might have, and then to input, choose or refine the value for that property (or mark it for exclusion), and so on.

In this way, given the information already on Q17528295, just the statement on the image depicts: Q17528295 should be enough to make the image findable from a search "depicts a Grade I listed building in Bedfordshire", refined through the interface.
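A minimal sketch of that idea, using a hand-written toy graph: the image carries only depicts: Q17528295, and the location chain (P131), class chain (P279) and heritage designation (P1435) are resolved from other items. Only Q17528295 and the property IDs come from this thread; the remaining identifiers are placeholder labels, the file names are invented, and all function names are hypothetical.

```python
# Toy statements keyed by (subject, property).
STATEMENTS = {
    ("Q17528295", "P31"): ["church"],
    ("Q17528295", "P131"): ["Ampthill"],
    ("Q17528295", "P1435"): ["Grade I listed building"],
    ("Ampthill", "P131"): ["Central Bedfordshire"],
    ("Central Bedfordshire", "P131"): ["Bedfordshire"],
    ("church", "P279"): ["religious building"],
    ("religious building", "P279"): ["building"],
}

# What the image data itself holds: file -> depicted items (invented files).
DEPICTS = {
    "File:A.jpg": ["Q17528295"],
    "File:B.jpg": ["Ampthill"],
}


def reachable(start, prop, statements):
    """All values reachable from start by following prop one or more times."""
    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for value in statements.get((node, prop), []):
            if value not in found:
                found.add(value)
                stack.append(value)
    return found


def is_listed_building_in(item, place, heritage_status, statements):
    """Check class, location and heritage status using statements on other items."""
    classes = set(statements.get((item, "P31"), []))
    all_classes = set(classes)
    for cls in classes:
        all_classes |= reachable(cls, "P279", statements)
    return (
        "building" in all_classes
        and place in reachable(item, "P131", statements)
        and heritage_status in statements.get((item, "P1435"), [])
    )


def find_files(depicts, predicate):
    """Files with at least one depicted item satisfying the predicate."""
    return sorted(f for f, items in depicts.items() if any(predicate(i) for i in items))
```

Note that nothing in DEPICTS mentions Bedfordshire, buildings or listing status; every hop beyond the depicts statement is a lookup on a different item, which is where the cost concern comes from.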

That was the pitch for structured data, anyway. But is it realistic? What level of demand might be anticipated? How long might such searches take? Are there particular caching, optimisation, pre-computation or indexing strategies that could help? Do they need to be designed in from the start? These are surely questions to have at least a back-of-an-envelope feel for now, well before the purely image data may start to become available in December.

Searching for 'depicts' statements using haswbstatement is already implemented.

Traversing the tree of related statements is covered by T199241, T207863 and T194401
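For reference, a direct (non-traversing) depicts search via the haswbstatement keyword mentioned above might be built like this. The sketch only constructs a URL for the standard MediaWiki search API and does not send it; the helper name is hypothetical, and it assumes that several haswbstatement terms separated by spaces are combined as AND.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def depicts_search_url(item_ids, limit=20):
    """Build a search URL for files with a depicts (P180) statement for each item."""
    terms = " ".join(f"haswbstatement:P180={qid}" for qid in item_ids)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": terms,
        "srnamespace": 6,  # File: namespace
        "srlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


url = depicts_search_url(["Q80151"])
```

This matches only files whose depicts value is exactly the given item (here Q80151 "hat"); it does not follow subclass or location chains.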

I propose that this ticket be marked as resolved. Any objections?

Nope. Not ready for closure yet. Wait until we get the search UI launched :)
