
Analysis: Property usage by items' P31
Closed, ResolvedPublic


As a WDQS user, I don't want a large unrelated subgraph to affect the performance of my query when there is a shared property, so that my seemingly simple queries don't time out or take a long time to complete.

It would be useful to understand how properties are used across different content subgraphs (for instance humans, scholarly articles, etc.). This would let us better understand how the performance of a property used in one query context is affected by its usage in other contexts. For instance, queries using the main-topic property for books could suffer because the property is widely used for scholarly articles (a huge subgraph).
This analysis would use the P31 values of items to cluster items into groups (perhaps P279 could give better groupings?), and would count property usage per group for further analysis.
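The grouping and counting step could look roughly like the sketch below. It is a toy illustration only: the statements in `TRIPLES`, the function name `property_usage_by_group`, and the "(no P31)" fallback group are all assumptions made for the example, not part of any existing analysis. A real run would read statements from a Wikidata dump, most likely with Spark rather than plain Python.

```python
from collections import Counter, defaultdict

# Hypothetical toy statements as (item, property, value).
# Item IDs and non-P31 values are placeholders, not real Wikidata data.
TRIPLES = [
    ("item1", "P31", "Q5"),          # instance of: human
    ("item1", "P106", "value_a"),
    ("item2", "P31", "Q13442814"),   # instance of: scholarly article
    ("item2", "P921", "value_b"),
    ("item3", "P31", "Q13442814"),
    ("item3", "P921", "value_c"),
    ("item4", "P31", "Q571"),        # instance of: book
    ("item4", "P921", "value_b"),
]


def property_usage_by_group(triples):
    """Count how often each property is used, per P31 group of the subject item."""
    # First pass: collect the P31 values of every item.
    groups = defaultdict(set)
    for item, prop, value in triples:
        if prop == "P31":
            groups[item].add(value)

    # Second pass: attribute each non-P31 statement to every group of its item.
    usage = defaultdict(Counter)
    for item, prop, _value in triples:
        if prop == "P31":
            continue
        for group in groups.get(item, {"(no P31)"}):
            usage[prop][group] += 1
    return usage


usage = property_usage_by_group(TRIPLES)
for prop, per_group in usage.items():
    print(prop, dict(per_group))
```

Note that an item with several P31 values is counted once in each of its groups, which is one possible choice among several.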

Event Timeline

Isn't that what the property suggester does?

Some analysis was done here:

Suggested analysis:

  • Categorize usage type of properties:
    • Similar distribution of use across subgraphs
    • Have X% usage in Y subgraphs
    • Used in many small subgraphs, or used in small quantities across all subgraphs
    • Entropy of the property's (power-law) usage distribution across subgraphs (computed with a Spark UDF)
      • This will give us a single number to represent the distribution of a property
      • Will incorporate the distribution as well as the variability of property usage
  • The distribution of entropy values will tell us what kinds of properties we have on hand
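As a rough illustration of the entropy idea above, the sketch below computes the Shannon entropy (in bits) of one property's usage counts across subgraphs. The function name `usage_entropy` and the example counts are hypothetical. The ticket mentions a Spark UDF, whereas this is plain Python that such a UDF might wrap.

```python
import math


def usage_entropy(counts):
    """Shannon entropy (bits) of a property's usage counts across subgraphs.

    `counts` maps a subgraph label to the number of statements using the property.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: one property concentrated in a single subgraph,
# another spread evenly over four subgraphs.
concentrated = {"scholarly article": 1000, "book": 1}
even = {"human": 10, "book": 10, "film": 10, "painting": 10}

print(usage_entropy(concentrated))
print(usage_entropy(even))
```

A value near 0 means the property lives almost entirely in one subgraph. The maximum, log2 of the number of subgraphs, means usage is spread evenly. Dividing by that maximum would be one way to compare properties that appear in different numbers of subgraphs.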

The suggested analysis could be done through a new ticket if required later on.