Enable wbsgetsuggestions API to return recommended properties even if they already exist in an item
Closed, ResolvedPublic

Description

Currently, if wbsgetsuggestions is used on an item which has more than 15 different statements, it will not give any property suggestions. For instance, https://www.wikidata.org/w/api.php?action=wbsgetsuggestions&entity=Q42
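
For anyone reproducing this from a script, the request above can be built as follows. This is a minimal sketch using only the Python standard library; the helper name `build_suggestions_url` and the `format=json` parameter are illustrative additions and are not part of the report. It only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_suggestions_url(entity_id):
    """Build a wbsgetsuggestions request URL for one entity.

    Hypothetical helper for illustration; format=json is an added
    assumption so that the response is machine-readable.
    """
    params = {
        "action": "wbsgetsuggestions",
        "entity": entity_id,
        "format": "json",
    }
    # urlencode keeps the insertion order of the dict.
    return API_ENDPOINT + "?" + urlencode(params)


url = build_suggestions_url("Q42")
```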

I want to tweak wbsgetsuggestions to give recommended properties even if they already exist in the item, so that I know which properties are most relevant to that item.

The information about relevant properties allows me to develop a feature for revscoring (https://www.mediawiki.org/wiki/ORES) which signals item completeness.
For developing the feature, we'll use a simple weighted sum based on the probability reported via the wbs_propertypairs table to get signal for completeness. Items with all high probability statements complete should be more likely to *be* complete than items that lack high probability statements.

Regarding the result limit in the API, we plan to use "continuation" to work around the limit that is imposed by the API.

Restricted Application added a subscriber: Aklapper. May 10 2017, 9:56 PM
hoo added a subscriber: hoo.May 11 2017, 8:11 PM

We don't have a hard limit in the suggester, but we have a probability cut-off. So if an item has a lot of statements already, we might just not find any further relevant properties.

WMDE-leszek triaged this task as Normal priority.May 30 2017, 8:18 AM
WMDE-leszek added a project: Patch-For-Review.
WMDE-leszek added a subscriber: WMDE-leszek.
Glorian_WD added a comment.EditedJun 5 2017, 8:29 PM

While playing around with the code, I noticed that if an item has only one statement, whose property is neither "P31" (instance-of) nor "P279" (subclass-of), and that property also does not exist in the wbs_propertypairs table, the PropertySuggester won't suggest any properties.

I think that, in that case, it should suggest property "P31" (instance-of) or "P279" (subclass-of) to be added to the item. However, I didn't add this capability to the patch that I submitted at https://gerrit.wikimedia.org/r/#/c/356043, because you may have particular considerations in mind for this case.

Glorian_WD added a subscriber: aude.Jun 5 2017, 8:38 PM
daniel added a subscriber: daniel.Jun 12 2017, 5:02 PM

While playing around with the code, I noticed that if an item has only one statement, whose property is neither "P31" (instance-of) nor "P279" (subclass-of), and that property also does not exist in the wbs_propertypairs table, the PropertySuggester won't suggest any properties.

Yes, only properties that are in the table can be suggested. If no such properties are found, the code should ideally behave as if there was no property on the item. In that case, suggestions are based simply on which properties are used most often. We don't do an explicit check for this, but perhaps we should. If we update the table often enough, this should not be a problem, though: new properties are rarely used.

Glorian_WD updated the task description. Jun 14 2017, 5:14 PM

@daniel, @WMDE-leszek, & @aude : As requested, I have added further details on this ticket description.

@Halfak : Feel free to modify or add some details that I may miss. Thanks :)

Glorian_WD updated the task description. Jun 14 2017, 5:24 PM
Halfak updated the task description. Jun 14 2017, 8:15 PM
Glorian_WD updated the task description. Jun 14 2017, 8:27 PM

For developing the feature, we'll use a simple weighted sum based on the probability reported via the wbs_propertypairs table to get signal for completeness. Items with all high probability statements complete should be more likely to *be* complete than items that lack high probability statements.

So, let's see if I got this right.

  • Let's say Q5 has statements about P3 and P5. Based on that, we compute probabilities for P1, P2, P3, P4, P5, P6, and P7. The probability of P1, p(P1), is the scaled sum of the co-occurrence probability: p(P1) = sum( co(P1, P1), co(P1, P2), co(P1, P3)... co(P1, P9) ) / 9. It would perhaps be more semantically useful to use a maximum here, but that's not what we currently do.
  • Currently, the API would output probabilities for P1, P2, P4, P6, and P7, ordered by probability.
  • If I understand correctly, what you want are just the probabilities for P3 and P5. No need to get all! So just limit the output to those probabilities. Or even the sum of these - that's what you really want, right? If the output filtering is inverted instead of omitted, you get much smaller results, and you will probably not need paging/continuations. But there is a semantic snag here. The co-occurrence probability of anything with itself is 1. So an item that has only one property, P1, will give you p(P1) = 1, and the total completeness score would also be 1, meaning 100% perfect. That's not what you meant, is it?
  • Or maybe you want the sum of the properties missing - an incompleteness score? That seems more useful, but it's not what I gather from what you wrote. But you could get that from the current API output: just sum the probabilities of the suggestions you get! If you want a completeness score, just use 1/n.
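
The "semantic snag" above can be reproduced with a small sketch. It follows one possible reading of the scaled sum, namely averaging a candidate's co-occurrence over the properties already on the item; that reading, the function names, the symmetric lookup, and every number below are assumptions made for illustration and are not taken from the PropertySuggester code or the wbs_propertypairs table.

```python
# Made-up co-occurrence values, for illustration only.
CO_OCCURRENCE = {
    ("P1", "P3"): 0.75,
    ("P1", "P5"): 0.25,
}


def co(a, b):
    """Look up a co-occurrence probability (treated as symmetric for brevity)."""
    if a == b:
        return 1.0  # a property always co-occurs with itself
    return CO_OCCURRENCE.get((a, b), CO_OCCURRENCE.get((b, a), 0.0))


def suggestion_probability(candidate, item_properties):
    """Average co-occurrence of `candidate` with the item's properties.

    Assumes the item has at least one property.
    """
    total = sum(co(candidate, existing) for existing in item_properties)
    return total / len(item_properties)


# Item with statements about P3 and P5: (0.75 + 0.25) / 2
p1_given_p3_p5 = suggestion_probability("P1", ["P3", "P5"])

# Item whose only property is P1: the self co-occurrence dominates.
p1_alone = suggestion_probability("P1", ["P1"])
```

Under these assumptions `p1_alone` comes out as 1, which is the degenerate "100% perfect" case described in the list.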

If I understand correctly, what you want are just the probabilities for P3 and P5.

Nope. We want all. We need to know how important the statements are that exist as well as how important the statements that don't exist are.

In your example, we need to compute (p(P3) + p(P5)) / (p(P1) + p(P2) + p(P3) + p(P4) + p(P5) + p(P6) + p(P7)). This is the proportion of potential "probability" that is covered by the current statements (completeness). 1 - completeness = missingness. Remember that in a modeling context, this value doesn't need to be perfect. It needs to carry a useful *signal* of what constitutes a high quality item.
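
A minimal sketch of that ratio, assuming the API returns a probability for every property, including the ones already on the item. The function name `completeness` and the probability values are made up for illustration.

```python
def completeness(probabilities, existing):
    """Share of the total suggestion probability covered by existing properties.

    probabilities: dict mapping property ID -> suggestion probability
    existing: set of property IDs the item already has statements for
    """
    total = sum(probabilities.values())
    if total == 0:
        return 0.0  # nothing to suggest, so no signal either way
    covered = sum(p for prop, p in probabilities.items() if prop in existing)
    return covered / total


# Made-up probabilities for the P1..P7 example; the item has P3 and P5.
example = {
    "P1": 0.5, "P2": 0.25, "P3": 1.0, "P4": 0.25,
    "P5": 1.0, "P6": 0.5, "P7": 0.5,
}
score = completeness(example, {"P3", "P5"})
missingness = 1 - score
```

With these illustrative numbers the existing statements cover 2.0 of a total 4.0, so both `score` and `missingness` are 0.5.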

@Halfak: Ok, I see. This leaves us with the paging problem. There is no efficient way to do paging on ranked results. But since the number of rows is < 10k, it should be ok I guess. Needs to be added to the API module, though. Or are you OK with a cutoff at 50 (or 100)?

Yeah! I think a cutoff of 50 or 100 would be totally reasonable :)

Restricted Application added a subscriber: PokestarFan. Aug 1 2017, 10:45 PM

This seems stalled, what's the status here @Glorian_WD ?

@hoo might take a look at this. Note that the discussion above covers quite a bit of the details necessary for implementing this :)

Change 454043 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/PropertySuggester@master] Add wbsgetsuggestions include parameter

https://gerrit.wikimedia.org/r/454043

hoo claimed this task.Aug 20 2018, 3:56 PM

@hoo, any updates? Seems like this task has been stagnant for a few weeks.

hoo added a comment.Sep 24 2018, 2:51 PM

@Halfak The patch is currently waiting to be reviewed.

@Ladsgroup, could you take a look?

Change 454043 merged by jenkins-bot:
[mediawiki/extensions/PropertySuggester@master] Add wbsgetsuggestions include parameter

https://gerrit.wikimedia.org/r/454043

hoo closed this task as Resolved.Sep 26 2018, 11:47 PM
hoo removed a project: Patch-For-Review.