
[L] [SPIKE] Investigate traversing entities tree to include more entities with more detail
Closed, ResolvedPublic

Description

Commons structured data guidelines call for labeling depicts statements as specifically as possible, i.e. an image of a chihuahua should be depicts=chihuahua, not depicts=dog.
Still, when searching for "dog", that image is relevant. We should be able to include it, because the "chihuahua" entity is a subclass of the "dog" entity.

We already search for relevant entities that match the search term, here's what else we need to figure out:

  • Figure out which entity matches we want to consider finding subclasses of (how good a match should it be? E.g. Q28284645 may not be as relevant for "cat" as Q146 is, and not worth traversing the tree for)
  • Figure out what entities to traverse (instance of? subclass of?)
  • Figure out how many levels deep to go (in terms of when results get too noisy)
  • Figure out how to decay the boost based on how many levels deep we go (assuming that going deeper adds more noise)
  • Figure out whether there are performance concerns with querying SPARQL for this data
  • Figure out whether there are performance concerns to including and scoring a massive amount of statement-based matches in elastic (and if so, where to draw the line)
  • Given all of the above - is this worth building? Rough estimate on implementation?

Event Timeline

Note that we plan to timebox the effort on this spike to one week.

CBogen renamed this task from [SPIKE] Investigate traversing entities tree to include more entities with more detail to [L] [SPIKE] Investigate traversing entities tree to include more entities with more detail.Jul 15 2020, 4:28 PM

Quick brain dump before the weekend.

The reason this work would be important is that we have very little information to find images by. The machine doesn't know what's in a picture until someone explicitly describes it, and we will only know it by those words.
Titles & wikitext contain limited information (and usually not in many languages), as do captions. If the text describes a "german shepherd", then a search for "dog" will not reveal it (or vice versa.)
Depicts statements are very different in that they come with a lot of associated information: not only do we have multilingual labels to find them with, they also carry a lot of context.
The depicts guidelines call for "being as specific as you can", so even if an image is never described as a dog, we could figure out that it depicts a 'dog' via its 'german shepherd' statement.

This approach would be different from previous approaches we've considered.
Previously, we were considering traversing the tree upward (from most detailed entity to parent concepts: e.g. british shorthair -> cat -> mammal -> animal) and storing those in the search index.
That approach has 2 significant issues: 1/ it's a nightmare to keep indexes up to date as relationships or parent entities (and all their multilingual labels) are updated, and 2/ generic parent entities add a lot of noise (see T199119)
Neither of those problems would be an issue here: there appears to be (note - need to confirm with tests on production data) a lot less noise when the tree is traversed in the other direction, and we wouldn't need to add anything to the search index or keep it in sync. Additionally, the existing entity matching is likely better than anything we could come up with by denormalizing a set of associated parent labels into the search index.
That said, traversing entity trees for every search is quite intensive, and adding those statements to the search query will also have implications. We will have to limit the number of entities traversed, potentially to the point where this simply doesn't make sense (note - looking into)

We'd traverse entities from generic to more detailed.
E.g. a user searches "dog" & we already find that Q144 is a matching entity. Images that match P180=Q144 are already being found, but images tagged with P180=Q38280 (German Shepherd) should also be found.
I think these 3 properties make sense to traverse:

  • instance of (P31)
    • This one probably contains the most useless data: the entities are relevant, but often so specific that many of them have no images attached. This is one that we could consider dropping if the tree becomes too large.
  • parent taxon (P171)
  • subclass of (P279)

Traversal should go across all of these properties combined, not simply 1 property at a time: e.g. Golden Gate Bridge is an instance of suspension bridge, which is a subclass of bridge.
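The cross-property traversal described above can be sketched as a breadth-first walk over an inverted (parent → children) graph. The entity IDs below are real Wikidata items, but the edge set and depth cap are illustrative only:

```python
from collections import deque

# Toy inverted fragment of the Wikidata graph: parent entity -> child entities
# that link to it via P31 (instance of), P279 (subclass of) or P171 (parent
# taxon). Illustrative data only.
CHILDREN = {
    "Q12280": ["Q12570"],   # bridge <- suspension bridge (P279)
    "Q12570": ["Q44440"],   # suspension bridge <- Golden Gate Bridge (P31)
    "Q144":   ["Q38280"],   # dog <- German Shepherd (P279)
}

def descendants(children_of, root, max_depth):
    """Breadth-first traversal from a matched entity down to more specific
    ones, mixing all three properties freely at every hop.
    Returns {entity: depth at which it was first reached}."""
    found = {}
    queue = deque([(root, 0)])
    while queue:
        entity, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in children_of.get(entity, ()):
            if child not in found:
                found[child] = depth + 1
                queue.append((child, depth + 1))
    return found

# Golden Gate Bridge is found at depth 2 even though the two hops use
# different properties (P279, then P31).
print(descendants(CHILDREN, "Q12280", 3))
```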

Here's a quick example query I was using to try out a few things, for my own convenience when I pick this back up next week.

# Find all entities up to 3 levels below ?parent via P279/P171/P31,
# together with the depth at which each was found.
SELECT DISTINCT ?parent ?item ?depth ?itemLabel {
  VALUES ?parent { wd:Q7377 } .
  {
    # direct children (1 hop)
    ?item wdt:P279|wdt:P171|wdt:P31 ?parent .
    BIND(1 AS ?depth) .
  } UNION {
    # grandchildren (2 hops)
    ?item (wdt:P279|wdt:P171|wdt:P31)/(wdt:P279|wdt:P171|wdt:P31) ?parent .
    BIND(2 AS ?depth) .
  } UNION {
    # great-grandchildren (3 hops)
    ?item (wdt:P279|wdt:P171|wdt:P31)/(wdt:P279|wdt:P171|wdt:P31)/(wdt:P279|wdt:P171|wdt:P31) ?parent .
    BIND(3 AS ?depth) .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
LIMIT 1000
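The repeated UNION blocks in the query above follow a simple pattern (one property-path repetition per depth level), so the query could be generated for an arbitrary depth instead of being written out by hand. A sketch that produces the same shape as the hand-written query:

```python
def traversal_query(root_qid, max_depth, limit=1000,
                    props=("P279", "P171", "P31")):
    """Build the UNION-per-depth SPARQL query for an arbitrary depth.
    One UNION branch per depth level, each repeating the property-path
    alternation (wdt:P279|wdt:P171|wdt:P31) `depth` times."""
    step = "|".join(f"wdt:{p}" for p in props)
    blocks = []
    for depth in range(1, max_depth + 1):
        path = "/".join(f"({step})" for _ in range(depth))
        blocks.append(
            f"    ?item {path} ?parent .\n    BIND({depth} AS ?depth) ."
        )
    union = "\n  } UNION {\n".join(blocks)
    return (
        "SELECT DISTINCT ?parent ?item ?depth ?itemLabel {\n"
        f"  VALUES ?parent {{ wd:{root_qid} }} .\n"
        "  {\n" + union + "\n  }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }\n'
        "}\n"
        f"LIMIT {limit}"
    )

print(traversal_query("Q7377", 3))
```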

Next week:

  • try to run the equivalent of haswbstatement:P180=[all of the tree's children] for an awful lot of test entity/traversal property/traversal depth-combinations against production commons search index to figure out the added value (how many new results, how relevant) and performance impact.

After some quick tests, Elastic seems to respond just as fast with (up to) 50 statements (the current situation) as with 750. But there's currently an upper limit of 1024 clauses (too_many_clauses: maxClauseCount is set to 1024)
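Given that clause limit, the expanded entity list would need to be truncated before being turned into query clauses. A minimal sketch, assuming the list is ordered most relevant first and that 274 clauses are reserved for the rest of the query (which yields the 750 cutoff); the `statement_keywords` field name is illustrative:

```python
MAX_CLAUSE_COUNT = 1024   # Elastic's default; exceeding it raises too_many_clauses

def capped_statement_clauses(entity_ids, reserved=274, limit=MAX_CLAUSE_COUNT):
    """Turn traversed entity IDs into P180 statement match clauses, keeping
    room for the rest of the query. `entity_ids` is assumed ordered most
    relevant first (e.g. shallowest depth first), so truncation drops the
    least relevant entities."""
    budget = limit - reserved          # e.g. 1024 - 274 = 750
    return [
        {"match": {"statement_keywords": f"P180={qid}"}}
        for qid in entity_ids[:budget]
    ]
```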

Change 622359 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] [WIP] Test entity traversal

https://gerrit.wikimedia.org/r/622359

Based on the simplistic example query pasted earlier, it looks like it takes somewhere in the 300-600ms range to find nested statements (capped at 3 levels deep, limited to 750).
This will vary slightly based on the exact query (and we're going to need to keep it as simple as possible), but this is the ballpark.
We should probably cache that result (the entity tree) for a short time so that we won't have to do it again as users continue to load additional pages of results for the same search term.
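That short-lived cache could be as simple as a small TTL map keyed by search term. A sketch; the injectable clock is there only to make the expiry testable, and the TTL value would be an assumption to tune:

```python
import time

class TtlCache:
    """Minimal time-based cache: a sketch of caching the traversed entity
    tree per search term, so follow-up result pages reuse it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires, value = entry
        if self.clock() >= expires:   # expired: drop and miss
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
```

Usage would be the familiar "look up, else traverse and store" pattern around the SPARQL call.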

Short summary:

  • We could include many more relevant files by including entities related to some that match the search string. We *should* even do this, because the guidelines for depicts are to be as specific as possible, and more generic search terms miss out on those quality images. E.g. a search for 'cat' would find Q146 (house cat), but could also include Q7714263 (Grumpy Cat, based on P31: instance of) or Q147 (kitten, based on P279: subclass of.) P171: parent taxon would also be worth traversing.
  • P279 (subclass of) tends to lose some detail & would need to decay based on depth. E.g. an image of a German shepherd may be tagged with depicts: German Shepherd dog (Q38280), which is subclass of dog (Q144), which is subclass of pet (Q39201). However, the German shepherd in that image may be a working dog (Q1806324) rather than a pet, and it would be somewhat incorrect to include that image in a search for pet.
  • Except for the above example, there seems to be very little noise in the data when traversing these properties in reverse. There are a lot of entities that will never be depicted in any file, but that's ok. The entities that do have files are usually highly relevant.
  • The number of related entities might be massive. E.g. 'car' will result in a massive amount of entities for just about every possible car model. Ideally, we'd include them all, but 1/ traversing them all takes time, and 2/ Elastic is currently capped at 1024 clauses. 750 seems like a reasonable cutoff?
  • AFAICT from some tests, adding 750 more statement clauses to the existing Elastic query has a negligible/imperceptible impact on its execution time.
  • AFAICT from some tests, a simple-ish SPARQL query (limited to 750) would add about 300-500ms of extra latency to each request. While not excessive, it is yet another step in the process that can't happen async (and we already have an additional step of fetching entities that match the search term in the first place). Ideas to optimize this (or clever ways to execute it conditionally, when it seems most valuable) are welcome.
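For the depth-based decay raised above (deeper P279 hops losing precision, as in the German Shepherd/pet example), a simple multiplicative falloff would be the obvious starting point. The 0.5 decay factor here is an assumption, not a tuned value:

```python
def depth_boost(base_boost, depth, decay=0.5):
    """Boost for a statement match found `depth` hops below the matched
    entity: full boost at depth 0 (a direct match), multiplied by `decay`
    for every additional hop down the tree."""
    return base_boost * decay ** depth

print(depth_boost(1.0, 0))  # 1.0  - direct P180=Q144 match
print(depth_boost(1.0, 2))  # 0.25 - e.g. a depth-2 descendant
```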

See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/622359 for a very simplistic proof of concept.

Thoughts?

I like the idea of using the wikidata graph (via SPARQL) to explore possibilities of pulling interesting data to feed a query expansion engine.
Using WDQS for serving real-time search traffic, on the other hand, is not an option I think (for perf reasons), but I believe it could make sense to create a dedicated dataset using the findings you've made here. This dataset could be used for two purposes:

  • the initial concept lookup (replacing the need to use wikidata fulltext search)
  • the expansion of the concepts following certain paths of the graph, like you experimented with

It shouldn't be too hard to build such a dataset and push it to Elasticsearch (on a weekly basis):

  • we will soon have the full wikidata graph replicated to hdfs weekly (the commons graph should not be too hard to add as well, enabling interesting joins)
  • @EBernhardson has built a pipeline that can transfer data from hdfs to production elasticsearch

having an offline process might also help to do things like:

  • filter entities we really want (e.g. drop scientific papers)
  • join with other datasets available in hdfs (popularity, wikipedia dumps)

that would be almost impossible by reusing existing services.

Change 622359 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Test entity traversal

https://gerrit.wikimedia.org/r/622359

Change 641154 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Fix entity traversal POC

https://gerrit.wikimedia.org/r/641154

Change 641154 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Fix entity traversal POC

https://gerrit.wikimedia.org/r/641154

After some additional experimentation, it's worth expanding on my earlier summary with some more findings.

Reasons/benefits:

  1. After analyzing tens of thousands of image scores (across all fields), it is obvious that depicts statement matches are much, much more relevant than matches in any text field (see T271799).
  2. Depicts statements are encouraged to be as specific as possible, but typical search usage is much more generic (e.g. "flowers" or "book" or "fish")

This means that we are currently missing out on an awful lot of search signals (the depicts statements of higher specificity than the search term) that are known to be of excellent quality (statements are by far the most reliable indicator of relevance).
Traversing simple Wikidata properties would enable us to both 1/ find more images for certain search terms (especially in foreign languages, when the search term isn't likely to already be part of the content and is only found via Wikidata aliases), and 2/ improve the precision of the results, because it'll add more (high-quality) signals to the set.

Entities tree data quality:

I experimented with properties P31 (instance of), P279 (subclass of), P171 (parent taxon) and P180 (depicts) - all properties where the linked entity should be a more generic parent of the linking entity.

  1. depicts (P180) on Wikidata's side (so essentially "depicts of depicts") is likely not a good property to traverse, because of the many artworks that are described in great detail. An artwork might depict an object as an insignificant detail or in an abstract way, but those details are usually not relevant for the search term. E.g. pictures of the Mona Lisa painting are not relevant when I search for "mountain" or "bridge" (which - I had never even noticed - are depicted in the Mona Lisa, see https://www.wikidata.org/wiki/Q12418#P180). The other 3 properties, however, seem highly relevant.
  2. The Wikidata ontology is very inconsistent and incomplete. E.g. there is no obvious direct link between goldfish (Q123141) and fish (Q152)
  3. It is occasionally incorrect. E.g. German Shepherd dog (Q38280) is subclass of (P279) dog (Q144) is subclass of (P279) pet (Q39201), even though a German Shepherd dog is not always a pet (might be a working dog (Q1806324) or service dog (Q39398))
  4. Many of the downstream entities (following a more generic entity down to more specific ones) are unlikely to ever be used as a statement on a file. E.g. the many parent taxons between Macaw and bird: Macaw (Q219035) is instance of (P31) Psittacidae (Q8327) has parent taxon (P171) Psittacoidea (Q13624220) has parent taxon (P171) Psittaciformes (Q31431) has parent taxon (P171) Psittacopasserae (Q5856078) has parent taxon (P171) Eufalconimorphae (Q326483) has parent taxon (P171) Australaves (Q14635103) has parent taxon (P171) Telluraves (Q20645445) has parent taxon (P171) Inopinaves (Q22007593) has parent taxon (P171) Neoaves (Q2330918) has parent taxon (P171) Neognathae (Q19168) has parent taxon (P171) bird (Q5113)
  5. The downstream statements that do have images, however, are - almost without exception - extremely accurate, and there are many of them. E.g. for house cat (Q146): File:Gerollte_Katze_20201104_DSC5446.jpg (depicts British Shorthair (Q29273), subclass of (P279) house cat (Q146)), File:Grumpy_Cat_by_Gage_Skidmore.jpg (depicts Grumpy Cat (Q7714263), instance of (P31) house cat (Q146))

There are occasional quirks in the Wikidata ontology that could lead to finding downstream entities that are not actually relevant to the search term. They are few, though (occurring much less often than mishaps in the other search fields, where content is often matched out of context) and should be fixed in the Wikidata ontology itself. They are probably not something we should implement workarounds (e.g. decay) for.

While the Wikidata ontology is incomplete and has gaps, a simple traversal of P31, P279 and P171 does manage to produce a staggering number of highly relevant entities/images. This obviously varies per search term, and more generic terms (which seem to be the majority of searches) would benefit most.

Implementation issues:

  1. Adding parent entities to the existing search index is not an option: 1/ too much data, and 2/ too many updates when relationships change
  2. WDQS is, for performance reasons, not an option to traverse the child entities
  3. Some trees are massive (with many irrelevant entities not used in any images) and easily exceed Elastic's (current) limit of 1024 clauses

We simply don't currently have any means of traversing these entities efficiently enough to serve production search traffic.

Next steps:

  1. It looks like (T258055#6474214) Search Platform has already been thinking about building a dedicated service to allow these kinds of lookups. They must be 1/ fast (it'd have to be done for every search executed, so it must be very fast & efficient) and 2/ filtered (only entities that are actually used as depicts statements on images should be returned, for minimal impact on ES). Relevant entities would be any P31, P279 or P171 relationship.
  2. Use above lookup to extend the existing P180=Q<id> lookups already happening in Media Search.
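The filtering requirement in step 1 (only return entities actually used as depicts statements) can be sketched as a simple join between the traversal output and per-entity usage counts. Both inputs here are hypothetical stand-ins for what the dedicated service would provide:

```python
def filter_used(traversed, usage_counts, min_uses=1):
    """Keep only traversed entities that actually appear as a P180 value on
    at least `min_uses` files, so the expanded Elastic query stays small.
    `traversed` maps entity id -> traversal depth; `usage_counts` maps
    entity id -> number of files depicting it (hypothetical inputs)."""
    return {
        qid: depth
        for qid, depth in traversed.items()
        if usage_counts.get(qid, 0) >= min_uses
    }
```

E.g. of the many parent taxons below "bird", only the ones that files actually depict would survive this filter.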

Change 655888 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Remove experimental entity traversing search profile

https://gerrit.wikimedia.org/r/655888


Nice write up—thanks!

Random hare-brained idea: Do we have enough sample queries to build a map of "hotspots" in the ontology? For example, I'm sure that in the ontological chain from macaw to bird, more queries map to bird than anything else, and that macaw is in second place on that chain. If we could identify common hotspots within the ontology, we could think about indexing one or two "ancestral hotspots" for each item tagged. So macaw might also get bird, and maybe animal or something, skipping over all the intermediate levels. The question is whether such hotspots can be identified from the data we have available, and whether they account for 90% of queries—in which case this would save lots of time walking the ontology—or for 1% of queries—in which case it isn't worth it.

If it did work, it would cut down on the amount of data to be stored. For updates when relationships change, it depends on what kinds of changes are common. It's possible that relationships to ancestral hotspots are more stable—macaw is probably going to stay under bird; on the other hand, dog may not be under pet in the future (though representing contingent and/or prototypical relationships is always hard).

Change 655888 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Remove experimental entity traversing search profile

https://gerrit.wikimedia.org/r/655888

Do we have enough sample queries to build a map of "hotspots" in the ontology?

I'm not aware of any such data (and am not optimistic that it exists), but we can probably cross-reference with statement usage on Commons files to achieve essentially the same thing, where the un(der)used in-between entities are filtered out?
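That cross-referencing idea can be sketched as ranking each entity's ancestor chain by depicts-statement usage and keeping only the top few as "ancestral hotspots". The inputs, counts and helper names below are illustrative, with usage counts standing in for the query-frequency data we don't have:

```python
def hotspots(ancestor_chains, usage_counts, top_n=2):
    """For each tagged entity, pick the `top_n` most-used ancestors along
    its chain as 'ancestral hotspots' to index, skipping underused
    in-between entities. `ancestor_chains` maps an entity to its ordered
    ancestor list; `usage_counts` maps entity -> how often it is used as a
    depicts statement (hypothetical inputs)."""
    result = {}
    for entity, chain in ancestor_chains.items():
        ranked = sorted(chain, key=lambda a: usage_counts.get(a, 0), reverse=True)
        result[entity] = ranked[:top_n]
    return result

# Abbreviated macaw -> ... -> bird chain; illustrative counts.
chains = {"Q219035": ["Q8327", "Q13624220", "Q31431", "Q5113"]}
usage = {"Q5113": 9000, "Q31431": 40, "Q8327": 15}
print(hotspots(chains, usage))  # macaw keeps bird and Psittaciformes only
```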