Maniphest T191633

Implement searching of 'depicts' on commons
Closed, Resolved · Public

Description

One of the steps for Structured Data on Commons is adding one or more 'depicts' statements to files.

File pages need to be findable using these statements

This is the master ticket for the work

Related Objects

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Resolved		• Ramsey-WMF	T199352 Deploy Structured Data on Commons with arbitrary Statements
Resolved		None	T215305 "Depicts and other statements on a bicycle": Qualifiers, and search by depicts statements, and other statements
Resolved		Cparle	T191633 Implement searching of 'depicts' on commons
Resolved		Cparle	T192288 Write wikibase statements data to search index in MediaInfo
Resolved		Cparle	T192345 Make keyword to match Wikibase statement data contained in the search index
Resolved		Cparle	T193012 Think about how to index and search qualifiers for 'depicts' statements
Resolved		Cparle	T193407 Store wikibase statement qualifiers in cirrus search index
Resolved		Cparle	T193175 Implement UI for constructing 'depicts' search query
Open		Cparle	T194185 Implement searching of 'depicts' on commons with the 'inscription' qualifier
Resolved		Cparle	T194245 Implement searching of 'depicts' on commons with the 'quantity' qualifier
Resolved		Cparle	T195955 New search keyword for searching for statements with a quantity
Open		None	T194255 Implement searching of 'depicts' on commons with the 'relative position within image' qualifier
Duplicate		Cparle	T194401 Investigate storing commons data in BlazeGraph
Resolved		Cparle	T197941 Implement negated searching of 'depicts' statements on commons
Resolved		Cparle	T197942 Implement negated searching of statement qualifiers on commons

Event Timeline

Cparle created this task. Apr 6 2018, 1:42 PM

Restricted Application added a project: Wikidata. Apr 6 2018, 1:42 PM

Restricted Application added a subscriber: Aklapper.

• EBjune triaged this task as Medium priority. Apr 6 2018, 6:04 PM

• EBjune added a parent task: T190315: [Epic] Provide search results for media file captions on Commons.

• EBjune removed a parent task: T190315: [Epic] Provide search results for media file captions on Commons.

• Ramsey-WMF moved this task from Backlog to Roadmap tasks on the SDC General board. Apr 13 2018, 11:27 PM

Cparle added a project: Multimedia-Team-Working-Board. Apr 16 2018, 1:54 PM

Cparle moved this task from To Do to Doing on the Multimedia-Team-Working-Board board. Apr 16 2018, 2:15 PM

• Ramsey-WMF moved this task from Untriaged to Triaged on the Multimedia board. Apr 17 2018, 5:40 PM

• Ramsey-WMF subscribed.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Apr 23 2018, 7:18 AM

• EBjune moved this task from needs triage to watching / waiting on the Discovery-Search board. Apr 26 2018, 5:25 PM

Cparle closed subtask T193012: Think about how to index and search qualifiers for 'depicts' statements as Resolved. Apr 30 2018, 3:47 PM

Smalyshev closed subtask T192345: Make keyword to match Wikibase statement data contained in the search index as Resolved. May 7 2018, 7:33 PM

Cparle closed subtask T197941: Implement negated searching of 'depicts' statements on commons as Resolved. Jun 29 2018, 4:40 PM

Cparle closed subtask T197942: Implement negated searching of statement qualifiers on commons as Resolved.

• Vvjjkkii reopened subtask T197942: Implement negated searching of statement qualifiers on commons as Open. Jul 1 2018, 1:02 AM

• Vvjjkkii reopened subtask T197941: Implement negated searching of 'depicts' statements on commons as Open.

• Vvjjkkii reopened subtask T193012: Think about how to index and search qualifiers for 'depicts' statements as Open. Jul 1 2018, 1:13 AM

Mainframe98 closed subtask T193012: Think about how to index and search qualifiers for 'depicts' statements as Resolved. Jul 1 2018, 9:00 AM

Cparle closed subtask T197942: Implement negated searching of statement qualifiers on commons as Resolved. Jul 2 2018, 8:57 AM

Cparle closed subtask T197941: Implement negated searching of 'depicts' statements on commons as Resolved. Jul 2 2018, 9:01 AM

The attached subtickets are an interesting read. They all seem to be based on taking the Q-number value of "depicts", storing it as a string in the text-search index, and then doing an indexed string-match for it. Of course first baby-steps are important, and this facility will be crucial to be able to confirm correct entry, storage, and direct retrievability of "depicts" values.

But it seems a long way short of the functionality that has generally been assumed for retrieval based on "depicts" values, and that is ultimately going to be needed.

Has the team had any initial thoughts on which strategies are likely to be available, preferred, or required to make such more general retrieval achievable, and what back-end requirements may be needed to make the system acceptably responsive (ie near-instant returns) and able to cope with full production load? Are there tickets open for these questions anywhere?

To give an idea of the sort of issue that's motivating my question, consider a user search for "man with hat".

One of the images we might hope to see included in the set returned to the user might be File:Giovanni Bellini, portrait of Doge Leonardo Loredan.jpg.

But the CommonsData for this file will most probably not include the literal statements "depicts man" or "depicts hat".

Instead, most probably, it would have the statement "depicts Q1759759", the Wikidata item for the painting.

Even if it were described in situ, it would most probably have the statement "depicts Q250210" (Leonardo Loredan) rather than "depicts man"; and "depicts" (or P1354 "shown with features") Q1134210 "doge's cap" rather than "depicts hat".

A simple string search for depicts Q8441 "man" or depicts Q80151 "hat" is not going to match it.

It would seem that, at the very least, search is going to need to match to any items in the wikidata subclass (P279) trees of the search terms.
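To make the subclass matching concrete, here is a minimal Python sketch over a toy in-memory graph. The edge table is invented for the example: `EX_HEADGEAR` is a placeholder rather than a real Wikidata item, and I have not checked the actual P279 chain between Q1134210 and Q80151. A real implementation would consult Wikidata rather than a dict.

```python
from collections import deque

# Toy "subclass of" (P279) edges, child -> list of parents.
# ASSUMPTION: illustrative data only; EX_HEADGEAR is a made-up placeholder.
SUBCLASS_OF = {
    "Q1134210": ["EX_HEADGEAR"],  # doge's cap -> (hypothetical intermediate class)
    "EX_HEADGEAR": ["Q80151"],    # (hypothetical intermediate class) -> hat
}


def matches_class(item, target, subclass_of):
    """Return True if `target` is `item` itself or reachable via P279 edges."""
    seen = set()
    queue = deque([item])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        if current in seen:
            continue  # guard against cycles in the class graph
        seen.add(current)
        queue.extend(subclass_of.get(current, []))
    return False
```

With this, a search for "hat" (Q80151) would accept a file whose statement is "depicts Q1134210", which a plain string match on the Q-number would miss.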

(As an aside, it may be worth remembering that Blazegraph can get into difficulties (T116298) if queries combine multiple path expansions (as all of ours would), without careful hinting to explore such expansions in a many-to-few direction. It also (for some reason) sometimes finds it much quicker to traverse such paths if they are given in reverse (ie ?b ^prop* ?a rather than ?a prop* ?b), even when a direction hint is given. So this might need care.)

A further complication is that male individuals on wikidata are not represented as instances of "man" (Q8441), but rather as instances of "human" (Q5), with property "sex or gender" (P21) = "male" (Q6581097).

The user interface will therefore presumably need to lead the user away from searching for "man with hat" towards "human being who is male with hat". Similarly, if a user is searching for images that depict "bald politician", wikidata does not represent individuals as instances (P31) of "politician" (Q82955); instead they are instances of "human" (Q5), with "occupation" (P106) = "politician". The faceted search UI is going to need to make it a lot easier for people to enter "human" and then "occupation", rather than "politician".
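One way to picture that redirection is as a small rewrite table applied before the query is built. The Python sketch below is only an illustration: the `REWRITES` table and the `item_pattern` helper are hypothetical names, and the emitted strings are SPARQL-style patterns using the property and item numbers quoted above.

```python
# Search terms that Wikidata models as "human plus a property" rather than
# as a class. ASSUMPTION: this table is hand-written for the example.
REWRITES = {
    "Q8441": ("P21", "Q6581097"),  # man -> human with sex or gender = male
    "Q82955": ("P106", "Q82955"),  # politician -> human with occupation = politician
}


def item_pattern(term, var="?item"):
    """Return a SPARQL-style graph pattern for one search term."""
    if term in REWRITES:
        prop, value = REWRITES[term]
        return f"{var} wdt:P31 wd:Q5 . {var} wdt:{prop} wd:{value} ."
    # Default: instance of the term or of any of its subclasses.
    return f"{var} wdt:P31/wdt:P279* wd:{term} ."
```

For example, `item_pattern("Q8441")` yields the "human who is male" pattern, while `item_pattern("Q80151")` falls through to the ordinary class match for "hat".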

But the big question, at least for me: are the team confident that searches like these are going to be deliverable in reasonable time; and still deliverable in reasonable time at production load?

Is it a concern, for example, that anything that requires materialising the whole set of instances of Q5 "human" is inevitably going to be slow? The number of such items on Wikidata is currently 4.375 million, and Commons will add even more. Thus, for example, just counting the number of such items with preferred images (P18) currently takes no less than 50 seconds (tinyurl.com/y7ddk8wm).

Of course, even with searches involving instances of Q5, it might not be necessary to materialise the full set, if the final result-set will ultimately be LIMITed to only, say, 2000 results. (Though beware that currently the labels SERVICE requires the entire results set to be materialised, even with a LIMIT directive). Also it's true that other clauses in the search (if present) may well be more restrictive, and therefore (if identified by the optimiser) may open up a faster solution path. Nevertheless, some searches might still require quite deep tree explorations, materialisations of quite big intermediate sets, and quite big merges, even to produce only a few hundred results.

As a team, do we have a target for how quickly results need to start being returned from such a search, if the system is to be considered decently responsive?

Are we confident that this should be achievable, for the faceted search back-end running in the full context of the entire wikidata dataset plus a maximal level of image description in CommonsData?

And how many such searches do we anticipate the system will need to be able to field per hour, in full production use?

(A first estimate might perhaps be some multiple of the number of Commons category pages currently served an hour; though if faceted search becomes as friendly and intuitive and powerful a tool as people hope, the hope must be that it would soon become considerably more popular than this).

Another example, where such image searches may depend quite sensitively on query construction or query optimisation: Category:Grade I listed buildings in Bedfordshire.

One of the aims for faceted search is to be able to match (or even replace) the capabilities of Commons categories like this.

In fact the category already has a query on it, tinyurl.com/yan83fm2, that may be quite similar to what such an image search could/should produce; except that it has

?item wdt:P18 ?file

which is possible now, rather than

?file "depicts" ?item

which should become possible as soon as CommonsData supports the "depicts" statement.

A key part of the query is the bit that generates the set of items to be depicted. This would presumably be pretty much unchanged with the move to CommonsData. But it is quite sensitive to the order in which the statements are taken.

Taken in this order (forced by the query hint), the query executes in 2.9 seconds, which should be acceptable:

hint:Query hint:optimizer "None" .
?item wdt:P131+ wd:Q23143 .
?item wdt:P1435 wd:Q15700818 .
?item wdt:P31?/wdt:P279* wd:Q41176

However, without the query hint the query is much slower, taking 50.5 seconds (for images from only 53 buildings) -- which I think people would find too long to be acceptable, and which might rapidly turn into an unsustainable server load, if very many people were trying to run queries like this at the same time.
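For anyone wanting to reproduce the comparison, the two variants can be generated from the same list of patterns. This Python sketch only assembles the query text; the `SELECT ?item ?file` wrapper is my assumption and is not necessarily the exact query behind the tinyurl link. The patterns themselves are the ones quoted above, in the same order.

```python
# Blazegraph hint that disables the optimizer's reordering of the patterns.
HINT = 'hint:Query hint:optimizer "None" .'

# Patterns in the order that ran quickly, plus the image lookup.
PATTERNS = [
    "?item wdt:P131+ wd:Q23143 .",
    "?item wdt:P1435 wd:Q15700818 .",
    "?item wdt:P31?/wdt:P279* wd:Q41176 .",
    "?item wdt:P18 ?file .",
]


def build_query(use_hint=True):
    """Return the query text, with or without the optimizer hint."""
    body = ([HINT] if use_hint else []) + PATTERNS
    lines = ["SELECT ?item ?file WHERE {"]
    lines += ["  " + pattern for pattern in body]
    lines.append("}")
    return "\n".join(lines)
```

Running `build_query(True)` and `build_query(False)` against the query service and timing each is then enough to see how much the statement order matters.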

This is why I worry quite a lot as to whether or not it will actually be possible to build faceted search in such a way that we can guarantee it will return with acceptable results in acceptable time.

I have now found T198261 and T199119, which investigate how some of the drawbacks above with the simple string-matching approach might be addressed.

Jheald mentioned this in T194401: Investigate storing commons data in BlazeGraph. Jul 14 2018, 10:55 PM

@Jheald thanks for the detailed comment. We have indeed worked on a number of the things you've mentioned, but for the sake of brevity I'll focus on the "man with hat" example you gave.

What we're working on now is similar to what you talk about. We aim to introduce an explicit depicts search to supplement searching for plain text strings.

You can see round 1 of designs for this here: https://wikimedia.invisionapp.com/share/B9MYIFJGVX7#/screens/308729068

As you said:

The user interface will therefore presumably need to lead the user away from searching for "man with hat" towards "human being who is male with hat".

This is exactly what we're attempting to do on search - a new interface that allows users to specify depicts statements and qualifiers that apply to it (we're starting with the default P180 qualifiers now but encourage Commonists to talk about which others they'll need). Additionally, we'll be adding features in UploadWizard, the File page, and other areas to encourage users to add depicts statements with useful qualifiers so that this kind of search can actually work (perhaps with the assistance of suggestions from image content recognition systems).

We are also currently building alpha working versions that allow us to test performance so we know for sure what's feasible in terms of response time, server load, etc. We hope to have this all ready in an update by end of year.

Interesting slide-show. But the fundamental problem -- as some of the attached tickets start to appreciate -- is that the key information that determines whether an image fits the specification being looked for is not going to be stored in statements just on the CommonsData item for the image in question, nor just on the Wikidata item for the thing it depicts, but is going to depend on statements distributed throughout the database.

Take for instance the example raised above of Category:Grade I listed buildings in Bedfordshire.

All that you would be expecting to get on a particular image would be eg: depicts : St Andrew's church, Ampthill (Q17528295), perhaps with a qualifier shown with features: tower or shown with features: porch

But the data on the image is not going to tell you (or perhaps rather: should not be telling you) that the building is in Ampthill, or that the building is a church, or that it is Grade I listed. Instead, that will be stored (as a variety of properties) on the Wikidata item Q17528295.

It's not a question of just adding the right "depicts" and the right qualifiers on the CommonsData item for the image, and being able to search for them -- that's not where this information lives.

Even the statements on the Wikidata item Q17528295 will not tell you directly whether the item fits the specification. Q17528295 will tell you that the item is in Ampthill, but it will be another item that tells you that Ampthill is in Central Bedfordshire, and another item that tells you that Central Bedfordshire is in Bedfordshire. Similarly Q17528295 will say that the item depicted is a church, but it is only a whole chain of further items that establish that a church is a kind of building.

The expectation was that some kind of faceted search system would be needed for search to match the existing capability to drill down to image-sets in and below categories like Category:Grade I listed buildings in Bedfordshire. One might first specify that one was looking for a building, and be presented with up to 2000 images of buildings; then one would be offered the choice either to refine the type of building, or to choose from a list of the most common properties that items for buildings might have; then to input, choose, or refine the value for that property (or mark it for exclusion), and so on.

In this way, given the information already on Q17528295, just the statement on the image depicts: Q17528295 should be enough to make the image findable from a search "depicts a Grade I listed building in Bedfordshire", refined through the interface.
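A toy Python sketch of what that resolution involves, with the data spread across several items, might look like the following. Only Q17528295, Q23143, Q15700818 and Q41176 are real identifiers taken from this thread; every `EX_` name, the file names, and the direct church-to-building edge are placeholders I have made up to keep the example short.

```python
# ASSUMPTION: all tables are hand-written toy data, one value per key.
LOCATED_IN = {  # P131
    "Q17528295": "EX_AMPTHILL",
    "EX_AMPTHILL": "EX_CENTRAL_BEDS",
    "EX_CENTRAL_BEDS": "Q23143",  # Bedfordshire
}
INSTANCE_OF = {"Q17528295": "EX_CHURCH"}   # P31
SUBCLASS_OF = {"EX_CHURCH": "Q41176"}      # P279, church -> building
HERITAGE = {"Q17528295": "Q15700818"}      # P1435, Grade I listed
DEPICTS = {  # statements stored on the files themselves
    "File:Example church.jpg": ["Q17528295"],
    "File:Example street.jpg": ["EX_AMPTHILL"],
}


def reaches(start, target, edges):
    """Follow single-valued edges from `start`; True if `target` is met."""
    seen = set()
    node = start
    while node is not None and node not in seen:
        if node == target:
            return True
        seen.add(node)
        node = edges.get(node)
    return False


def is_grade1_building_in_bedfordshire(item):
    return (
        reaches(LOCATED_IN.get(item), "Q23143", LOCATED_IN)
        and HERITAGE.get(item) == "Q15700818"
        and reaches(INSTANCE_OF.get(item), "Q41176", SUBCLASS_OF)
    )


def find_files():
    """Files with at least one depicted item that fits the specification."""
    return [
        name
        for name, items in DEPICTS.items()
        if any(is_grade1_building_in_bedfordshire(item) for item in items)
    ]
```

Note that nothing in `DEPICTS` mentions Bedfordshire, listing grade, or buildings: the match comes entirely from walking statements on other items, which is the part whose cost at scale I am asking about.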

That was the pitch for structured data, anyway. But is it realistic? What is the kind of level of demand that might be anticipated? How long might such searches take? Are there particular caching or optimisation or pre-computation or indexing strategies that could help? Do they need to be designed in from the start? These are surely questions to have at least a back-of-an-envelope feel for now, well before the purely image data may start to become available in December.

MarkTraceur closed subtask T194245: Implement searching of 'depicts' on commons with the 'quantity' qualifier as Resolved. Aug 3 2018, 7:37 PM

Jdforrester-WMF added a project: Structured Data Engineering. Sep 27 2018, 9:17 PM

Jdforrester-WMF moved this task from To Do to Blocked on the Structured Data Engineering board.

Jdforrester-WMF mentioned this in T205732: Get WikibaseMediaInfo to the state that we're happy to proceed with initial production deployment (captions). Nov 19 2018, 6:18 PM

Jdforrester-WMF added a parent task: T199352: Deploy Structured Data on Commons with arbitrary Statements. Jan 10 2019, 5:54 PM

Jdforrester-WMF edited parent tasks, added: T215305: "Depicts and other statements on a bicycle": Qualifiers, and search by depicts statements, and other statements; removed: T199352: Deploy Structured Data on Commons with arbitrary Statements. Feb 5 2019, 5:52 PM

Cparle closed subtask T192288: Write wikibase statements data to search index in MediaInfo as Resolved. Feb 12 2019, 4:49 PM

MarkTraceur edited projects, added Structured Data Engineering (Depicts and other statements on a bicycle); removed Structured Data Engineering. Mar 1 2019, 4:10 PM

MarkTraceur moved this task from To Do to Blocked on the Structured Data Engineering (Depicts and other statements on a bicycle) board.

Searching for 'depicts' statements using haswbstatement is already implemented.
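For reference, a search string using that keyword can be put together as in the Python sketch below. The `depicts_query` helper is a hypothetical name for illustration; it assumes P180 is the 'depicts' property and that a leading `-` negates the keyword, as in the negated-search subtasks.

```python
def depicts_query(include=(), exclude=()):
    """Build a search string from Q-ids that must / must not be depicted."""
    parts = [f"haswbstatement:P180={qid}" for qid in include]
    parts += [f"-haswbstatement:P180={qid}" for qid in exclude]
    return " ".join(parts)
```

For example, `depicts_query(["Q80151"], ["Q8441"])` produces a string that asks for files depicting Q80151 but not Q8441. It matches the stored Q-numbers exactly and does not follow subclass or other related statements.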

Traversing the tree of related statements is covered by T199241, T207863 and T194401

I propose that this ticket be marked as resolved. Any objections?

Nope. Not ready for closure yet. Wait until we get the search UI launched :)

In T191633#5061359, @Cparle wrote:

Searching for 'depicts' statements using haswbstatement is already implemented.

Traversing the tree of related statements is covered by T199241, T207863 and T194401

I propose that this ticket be marked as resolved. Any objections?