
Use suggested properties to get signal for completeness
Closed, Resolved · Public

Description

@Halfak and I found that the scripts in https://github.com/Wikidata-lib/PropertySuggester-Python are used to generate the data in wbs_propertypairs.

However, the current scripts can only be used for property-pairs "instance of" and "subclass of". We might need to rewrite the scripts in order to include other property-pairs.

Below are the steps proposed by @Halfak to do this:
(1) Run the set of scripts for generating propertypairs against a database dump.
(2) Write a new script that generates a proposed set of pid1, qid1 for inclusion, based on some version of optimal coverage/clustering.
(3) Review the results and iterate if necessary (e.g. we might find pairs that we want to exclude because their coverage doesn't help us).
(4) Estimate how much more space it will take to store the extended property tables and propose a switch.
(5) Review with hoo et al. and get it deployed.
(6) Build features that use this table to assess features of completeness through API calls.
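
To make step (1) concrete, here is a minimal sketch of the kind of co-occurrence statistic a property-pair table holds. It is not the PropertySuggester-Python code: the function name `property_pair_probabilities`, the input shape (one set of property IDs per item), and the definition of probability as "items with both properties divided by items with pid1" are assumptions made for illustration, and the sketch ignores the qid1 context entirely.

```python
from collections import Counter
from itertools import permutations


def property_pair_probabilities(items):
    """Return {(pid1, pid2): P(pid2 present | pid1 present)}.

    `items` is an iterable of collections of property IDs, one per item.
    Hypothetical helper for illustration only.
    """
    prop_counts = Counter()   # number of items carrying each property
    pair_counts = Counter()   # number of items carrying both pid1 and pid2
    for props in items:
        props = set(props)
        prop_counts.update(props)
        # Ordered pairs, because P(pid2 | pid1) is not symmetric.
        pair_counts.update(permutations(sorted(props), 2))
    return {
        (pid1, pid2): count / prop_counts[pid1]
        for (pid1, pid2), count in pair_counts.items()
    }


if __name__ == "__main__":
    toy_items = [{"P31", "P21"}, {"P31", "P21", "P569"}, {"P31"}]
    for pair, prob in sorted(property_pair_probabilities(toy_items).items()):
        print(pair, round(prob, 3))
```

Extending this to other property pairs, as step (2) proposes, would mean keying the counts on a (pid1, qid1) context as well, which is where the extra storage in step (4) would come from.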

Event Timeline

Halfak claimed this task. Apr 25 2017, 6:46 PM
Halfak added a comment. May 2 2017, 2:41 PM

Discussion posted here: https://www.wikidata.org/wiki/User_talk:Glorian_WD/Clustering_Result_v2

TL;DR: this doesn't look like it is going to be fruitful in the short term, so let's move forward using wbs_propertypair

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. May 5 2017, 3:07 PM
Halfak reassigned this task from Halfak to Glorian_WD. May 8 2017, 4:59 PM

For the time being, we've decided to move forward using the existing wbs_propertypair in order to get the signal. I have created a ticket for this: https://phabricator.wikimedia.org/T164994. There is already a patch associated with that ticket: https://gerrit.wikimedia.org/r/#/c/356043/7.

daniel added a subscriber: daniel. Jun 15 2017, 1:23 PM

Suggestion: don't try to assess completeness directly, assess incompleteness. The incompleteness score could be defined as the sum (or average, or max) of scores of the suggestions given by the suggestion API. The idea here is that something is more complete if there are few things that can be suggested with high confidence.
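
As a rough illustration of this suggestion, the sketch below aggregates suggestion scores into a single incompleteness value using the three options mentioned (sum, average, max). The function name `incompleteness`, the `how` parameter, and the assumption that the suggestion API's scores arrive as plain floats are all hypothetical; nothing here reflects the actual API response format.

```python
def incompleteness(suggestion_scores, how="sum"):
    """Aggregate suggestion scores into one incompleteness value.

    `suggestion_scores` is an iterable of floats, one per property
    suggested for the item. Hypothetical helper for illustration only.
    """
    scores = list(suggestion_scores)
    if not scores:
        # Nothing left to suggest: treat the item as not incomplete.
        return 0.0
    if how == "sum":
        return float(sum(scores))
    if how == "mean":
        return sum(scores) / len(scores)
    if how == "max":
        return float(max(scores))
    raise ValueError(f"unknown aggregation: {how!r}")


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.1]
    for how in ("sum", "mean", "max"):
        print(how, incompleteness(scores, how))
```

The three aggregations behave differently: the sum grows with the number of suggestions, the mean is dragged down by many low-confidence suggestions, and the max looks only at the single strongest missing property.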

@daniel see discussion in T164994: Enable wbgetsuggestions API to get recommended properties even if they have existed in an item. We need to be able to divide by *something* in order to get signal. If there are few high probability suggestions for an item, we should not punish the score. We can't assess incompleteness without assessing completeness and vice versa.
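
One possible reading of "divide by something" is sketched below; the actual equations live in T164994 and are not reproduced here, so treat this purely as an assumption. The idea is to weight each suggested property by its probability and divide the weight of the properties the item already has by the weight of everything suggested, which requires the API to return suggestions even for properties that already exist on the item. The function name `completeness` and the dict-based input are hypothetical.

```python
def completeness(suggested, present):
    """Share of suggested probability mass already covered by the item.

    `suggested` maps property ID -> suggestion probability, including
    properties the item already has. `present` is the set of property
    IDs on the item. Hypothetical helper for illustration only.
    """
    total = sum(suggested.values())
    if total == 0:
        # Nothing is suggested at all, so there is nothing to miss.
        return 1.0
    covered = sum(prob for pid, prob in suggested.items() if pid in present)
    return covered / total


if __name__ == "__main__":
    suggested = {"P21": 0.9, "P569": 0.6, "P19": 0.5}
    print(completeness(suggested, present={"P21", "P19"}))
```

Under this normalization an item with only a few high-probability suggestions is not penalized for the short list itself; only the missing share of the suggested weight lowers the score.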

> If there are few high probability suggestions for an item, we should not punish the score.

Why not? I thought that was exactly the point?

> We can't assess incompleteness without assessing completeness and vice versa.

To me, these two are simply inverse of each other...

@daniel, exactly! I'm not sure if we're understanding each other. I thought my equation would have made things 100% clear. Could you please review the equations in T164994: Enable wbgetsuggestions API to get recommended properties even if they have existed in an item and then respond again?

Halfak removed Glorian_WD as the assignee of this task. Jul 6 2017, 2:58 PM
Halfak lowered the priority of this task from Normal to Low.
Restricted Application added a subscriber: PokestarFan. Aug 1 2017, 10:44 PM
hoo added a subscriber: hoo. Jul 16 2018, 7:11 PM
hoo added a comment. Aug 1 2018, 11:30 AM
> (1) Run the set of scripts for generating propertypairs against a database dump.

Whenever we (mostly I) run the script for Wikidata we put the results up at https://github.com/wmde/wbs_propertypairs. The last version there is not very recent, so I just started the script for generating a new version (which should be there by Friday).

hoo added a comment. Oct 2 2018, 7:02 PM

I have a first shot at this up at https://github.com/mariushoch/revscoring/blob/bug/T158430/T158430.py, but it doesn't fully work yet, as the required API change will only be deployed tomorrow.

I just implemented https://github.com/wikimedia/revscoring/pull/414 which will give us a datasource on top of which to build the feature @hoo has been working on.

Halfak renamed this task from [Spike] Use suggested properties to get signal for completeness to Use suggested properties to get signal for completeness. Jan 17 2019, 9:46 PM
Halfak moved this task from Active to Pending deployment on the Scoring-platform-team (Current) board.

Change 489240 had a related patch set uploaded (by Halfak; owner: Halfak):
[mediawiki/services/ores/deploy@master] General updates.

https://gerrit.wikimedia.org/r/489240

Change 489240 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] General updates.

https://gerrit.wikimedia.org/r/489240

Ladsgroup closed this task as Resolved. Wed, Apr 17, 6:27 PM
Ladsgroup claimed this task.
Restricted Application added a project: User-Ladsgroup. Wed, Apr 17, 6:27 PM