
[L] [SPIKE] Investigate traversing entities tree to include more entities with more detail
Open, Needs Triage, Public

Description

Commons structured data guidelines call for labeling depicts statements as specifically as possible; i.e. an image of a chihuahua should be depicts=chihuahua, not depicts=dog.
Still, when searching for "dog", that image is relevant. We should be able to include it, because the "chihuahua" entity is a subclass of the "dog" entity.

We already search for relevant entities that match the search term, here's what else we need to figure out:

  • Figure out which entity matches we want to consider finding subclasses of, and how good a match has to be before we do (e.g. Q28284645 may not be as relevant for "cat" as Q146 is, and not worth traversing the tree for)
  • Figure out which properties to traverse (instance of? subclass of?)
  • Figure out how many levels deep to go (i.e. at what depth results get too noisy)
  • Figure out how to decay the boost based on how many levels deep we go, assuming that going deeper adds more noise (one possible decay shape is sketched below this list)
  • Figure out whether there are performance concerns with querying SPARQL for this data
  • Figure out whether there are performance concerns with including and scoring a massive number of statement-based matches in elastic (and if so, where to draw the line)
  • Given all of the above: is this worth building? Rough estimate of the implementation effort?
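One possible shape for that decay, as referenced above (a purely illustrative sketch; the base boost, decay factor, and function name are made up and would need tuning against real results):

# Hypothetical depth-based boost decay: entities found deeper in the
# tree contribute less to the score. Constants are placeholders.
BASE_BOOST = 1.0
DECAY = 0.5  # halve the boost per extra level of traversal

def boost_for_depth(depth):
    """Boost for a match found `depth` levels below the matched entity."""
    return BASE_BOOST * DECAY ** depth

# depth 0 (direct match) -> 1.0, depth 1 -> 0.5, depth 2 -> 0.25, ...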

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 15 2020, 1:02 PM
CBogen added a subscriber: CBogen. · Jul 15 2020, 4:27 PM

Note that we plan to timebox the effort on this spike to one week.

CBogen renamed this task from [SPIKE] Investigate traversing entities tree to include more entities with more detail to [L] [SPIKE] Investigate traversing entities tree to include more entities with more detail. · Jul 15 2020, 4:28 PM
matthiasmullie added a comment. · Edited · Jul 24 2020, 2:31 PM

Quick brain dump before the weekend.

The reason that this work would be important is that we have very little information to find images by. The machine doesn't know what's in a picture until someone has explicitly described it, and we will only be able to find it by those words.
Titles & wikitext contain limited information (and usually not in many languages), as do captions. If the text describes a "german shepherd", then a search for "dog" will not reveal it (or vice versa).
Depicts statements are very different in that they carry a lot of associated information: not only do we have multilingual labels to find them with, but they also come with a lot of context.
Depicts guidelines call for "being as specific as you can", so even if an image is never described as a dog, we could figure out that it depicts a 'dog' via its 'german shepherd' statement.

This approach would be different from previous approaches we've considered.
Previously, we were considering traversing the tree upward (from most detailed entity to parent concepts, e.g. british shorthair -> cat -> mammal -> animal) and storing those in the search index.
That approach has 2 significant issues: 1/ it's a nightmare to keep indexes up to date as relationships or parent entities (and all their multilingual labels) are updated, and 2/ generic parent entities add a lot of noise (see T199119).
Neither of those problems applies here: there appears to be (note: need to confirm with tests on production data) a lot less noise when the tree is traversed in the other direction, and we wouldn't need to add anything to the search index or keep it in sync. Additionally, the existing entity matching is likely better than anything we could come up with by denormalizing a set of associated parent labels into the search index.
That said, traversing entity trees for every search is quite intensive, and adding those statements to the search query will also have implications. We will have to limit the number of entities traversed, potentially to the point where this simply doesn't make sense anymore (note: looking into this).

We'd traverse entities from generic to more detailed.
E.g. a user searches "dog" & we already find that Q144 is a matching entity. Images that match P180=Q144 are already being found, but images tagged with P180=Q38280 (german shepherd) should also be found.
I think these 3 properties make sense to traverse:

  • instance of (P31)
    • This one probably contains the most useless data: the entities are relevant, but often so specific that many of them have no images attached. This is one that we could consider dropping if the tree becomes too large.
  • parent taxon (P171)
  • subclass of (P279)

Traversal of these entities should combine all of these properties at every level, not just follow a single property down: e.g. Golden Gate Bridge is an instance of suspension bridge, which is a subclass of bridge.

Here's a quick example query I was using to try out a few things, for my own convenience when I pick this back up next week.

# Find all entities up to 3 steps below ?parent, following
# subclass of (P279), parent taxon (P171) and instance of (P31).
SELECT DISTINCT ?parent ?item ?depth ?itemLabel {
  VALUES ?parent { wd:Q7377 } .
  {
    # entities 1 step below the parent
    ?item wdt:P279|wdt:P171|wdt:P31 ?parent .
    BIND(1 AS ?depth) .
  } UNION {
    # 2 steps below, mixing the three properties freely
    ?item (wdt:P279|wdt:P171|wdt:P31)/(wdt:P279|wdt:P171|wdt:P31) ?parent .
    BIND(2 AS ?depth) .
  } UNION {
    # 3 steps below
    ?item (wdt:P279|wdt:P171|wdt:P31)/(wdt:P279|wdt:P171|wdt:P31)/(wdt:P279|wdt:P171|wdt:P31) ?parent .
    BIND(3 AS ?depth) .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
LIMIT 1000

Next week:

  • try to run the equivalent of haswbstatement:P180=[all of the tree's children] for a large number of test entity / traversal property / traversal depth combinations against the production Commons search index, to figure out the added value (how many new results, and how relevant they are) and the performance impact.
matthiasmullie added a comment. · Edited · Tue, Aug 25, 2:09 PM

After some quick tests, elastic seems to respond just as fast with 750 statements as with the (up to) 50 we have in the current situation. But there's an upper limit of 1024 clauses ATM (too_many_clauses: maxClauseCount is set to 1024).
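A minimal sketch of how the expanded filter could be assembled while staying under that limit (the helper name, the reserved headroom, and the 750 cutoff are illustrative; haswbstatement accepts multiple values OR'ed together with |):

# Hypothetical: turn traversed entity IDs into a capped haswbstatement filter.
MAX_CLAUSE_COUNT = 1024  # elastic's current limit, per the error above
HEADROOM = 274           # clauses reserved for the rest of the query (assumption)

def build_depicts_filter(entity_ids):
    """OR together P180 clauses, truncated to stay under the clause cap."""
    cap = MAX_CLAUSE_COUNT - HEADROOM  # e.g. 750, matching the tests above
    clauses = ['P180={}'.format(qid) for qid in entity_ids[:cap]]
    return 'haswbstatement:' + '|'.join(clauses)

# build_depicts_filter(['Q144', 'Q38280'])
#   -> 'haswbstatement:P180=Q144|P180=Q38280'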

Change 622359 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] [WIP] Test entity traversal

https://gerrit.wikimedia.org/r/622359

Based on the simplistic example query pasted earlier, it looks like it takes somewhere in the 300-600ms range to find nested statements (capped at 3 levels deep, limited to 750 results).
This will vary slightly based on the exact query (and we're going to need to keep it as simple as possible), but this is the ballpark.
We should probably cache that result (the entity tree) for a short time, so that we don't have to do it again as users load additional pages of results for the same search term.
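A rough sketch of what that short-lived cache could look like, keyed on the search term (a simple in-process TTL cache for illustration only; in MediaWiki this would presumably go through the regular caching layers instead, and the TTL is a guess):

import time

_tree_cache = {}
TTL_SECONDS = 300  # long enough to cover paging through one search's results

def get_entity_tree(term, fetch_via_sparql):
    """Return the traversed entity tree for a term, caching it briefly."""
    now = time.time()
    cached = _tree_cache.get(term)
    if cached and now - cached[0] < TTL_SECONDS:
        return cached[1]
    tree = fetch_via_sparql(term)  # the 300-600ms SPARQL traversal
    _tree_cache[term] = (now, tree)
    return tree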

Short summary:

  • We could include many more relevant files by including entities related to the ones that match the search string. We *should* even do this, because the guidelines for depicts are to be as specific as possible, and more generic search terms miss out on those quality images. E.g. a search for 'cat' would find Q146 (house cat), but could also include Q7714263 (Grumpy Cat, based on P31: instance of) or Q147 (kitten, based on P279: subclass of). P171: parent taxon would also be worth traversing.
  • P279 (subclass of) tends to lose some detail & would need to decay based on depth. E.g. an image of a German shepherd may be tagged with depicts: German Shepherd dog (Q38280), which is a subclass of dog (Q144), which is a subclass of pet (Q39201). However, the German shepherd in that image may be a working dog (Q1806324) rather than a pet, and it would be somewhat incorrect to include that image in a search for pet.
  • Except for cases like the above example, there seems to be very little noise in the data when traversing down these properties in reverse. There are a lot of entities that will never be depicted in any file, but that's ok. The entities that do have files are usually highly relevant.
  • The number of related entities might be massive. E.g. 'car' will produce an entity for just about every car model ever made. Ideally we'd include them all, but 1/ traversing them all takes time, and 2/ elastic is currently capped at 1024 clauses. 750 seems like a reasonable cutoff?
  • AFAICT from some tests, adding 750 more statement clauses to the existing elastic query has a negligible/imperceptible impact on its execution time.
  • AFAICT from some tests, a simple-ish SPARQL query (limited to 750) would add about 300-500ms of extra latency to each request. While not excessive, it is yet another step in the process that can't happen async (and we already have the additional step of fetching entities that match the search term in the first place). Ideas to optimize this (or clever ideas to only execute it conditionally, when it seems most valuable) are welcome.

See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/622359 for a very simplistic proof of concept.

Thoughts?

dcausse added subscribers: EBernhardson, dcausse. · Edited · Fri, Sep 18, 12:47 PM

I like the idea of using the wikidata graph (via SPARQL) to explore possibilities for pulling interesting data to feed a query expansion engine.
Using WDQS to serve real-time search traffic, on the other hand, is not an option I think (for perf reasons), but I believe it could make sense to create a dedicated dataset using the findings you've made here. This dataset could be used for two purposes:

  • the initial concept lookup (replacing the need to use wikidata fulltext search)
  • the expansion of the concepts following certain paths of the graph, like you experimented with

It shouldn't be too hard to build such a dataset and push it to elasticsearch (on a weekly basis):

  • we will soon have the full wikidata graph replicated to hdfs weekly (the commons graph should not be too hard to add as well, enabling interesting joins)
  • @EBernhardson has built a pipeline that can transfer data from hdfs to production elasticsearch

Having an offline process might also help to do things like:

  • filter entities we really want (e.g. drop scientific papers)
  • join with other datasets available in hdfs (popularity, wikipedia dumps)

things that would be almost impossible to do by reusing existing services.
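To make the idea concrete, here's a hedged sketch of what such an offline job could look like, assuming a hypothetical wikidata_triples table in hdfs (table name, column names, and output path are all made up; the real replicated graph's schema may differ):

from pyspark.sql import SparkSession

# Hypothetical weekly job: extract P279/P171/P31 edges from the wikidata
# graph snapshot in hdfs, dropping entity types we don't want (e.g.
# scholarly articles), then export for loading into elasticsearch.
spark = SparkSession.builder.appName('entity-expansion').getOrCreate()

triples = spark.table('wikidata_triples')  # assumed columns: subject, predicate, object
edges = triples.filter(triples.predicate.isin('P279', 'P171', 'P31'))

# Filter out entities that are instances of scholarly article (Q13442814).
papers = triples.filter(
    (triples.predicate == 'P31') & (triples.object == 'Q13442814')
).select(triples.subject.alias('paper'))
edges = edges.join(papers, edges.subject == papers.paper, 'left_anti')

edges.write.parquet('/wmf/data/entity_expansion_edges')  # hypothetical output path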