[RfC] Property suggester suggests human properties for non-human items
Open, High, Public

Description

The property suggester keeps suggesting completely inappropriate human properties to me on items which are not humans.

For example, right now, https://www.wikidata.org/wiki/Q504582 has the properties

  • P31 (instance of)
  • P373 (Commons category)
  • P625 (coordinate location)
  • P17 (country)
  • P18 (image)
  • P935 (Commons gallery)
  • P885 (origin of the watercourse)
  • P1599 (GeoNames ID)
  • P646 (Freebase ID)
  • P214 (VIAF ID)

Of those, 6 are generic and can apply to a variety of items, 3 are specific to geographical features and 1 is fairly generic but not usually found on humans.

Despite that, if I go to add a new property, the suggested properties are:

  • P131 (located in the administrative territorial entity)
  • P21 (sex or gender)
  • P569 (date of birth)
  • P735 (given name)
  • P27 (country of citizenship)
  • P106 (occupation)
  • P19 (place of birth)

The first of these would be a good property. The other 6 are all specific to humans (and other living beings) and should definitely not be added to rivers.

I can't see why it's so biased towards human properties here. I would expect to see properties relating to rivers (e.g. P403 (mouth of the watercourse) would actually be a useful suggestion) or at least properties relating to geographical features rather than humans.


Suggested solutions from discussion (please complete):

  • allow weighting properties with classifying values (P31 / P279) relative to other properties.
  • add P106 to properties with classifying values
  • if item has P17, check if P17 on property matches (same value or no property), if not: don't suggest property

Patch-For-Review:

Event Timeline

Mentioned in SAL [2016-08-16T12:53:43Z] <hoo> Put a better workaround for T132839 in place: Only remove property pairs with context = "item". This keeps ref and qualifier pairs for ext ids intact.

Mentioned in SAL [2016-08-24T21:08:26Z] <hoo> Ran DELETE FROM wbs_propertypairs WHERE pid1 = '641' on Wikidata for T132839

hoo added a comment.Aug 28 2016, 4:51 PM

Updated the workaround further, per @Sjoerddebruin:

hoo@terbium:~$ bash T132839-Workarounds.sh 
Removing ext ids in item context
Batch 1: 0 rows

Removing P641 in item context
Batch 1: 0 rows

Removing P1344 in item context
Batch 1: 66 rows
Batch 2: 0 rows

Removing P463 in item context
Batch 1: 110 rows
Batch 2: 0 rows
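
The batch counts above suggest that the workaround deletes rows in chunks until a batch comes back empty. The script itself is not shown in this task, so the following is only a minimal sketch of that pattern, run against an in-memory SQLite stand-in. The table name and the `pid1` and `context` columns are taken from the SAL entries; the other columns, the sample rows and the function names are made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the pair table (columns partly assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wbs_propertypairs ("
        " pid1 INTEGER, pid2 INTEGER, probability REAL, context TEXT)"
    )
    rows = [
        (641, 21, 0.9, "item"),
        (641, 569, 0.8, "item"),
        (641, 106, 0.7, "item"),
        (641, 585, 0.5, "qualifier"),  # must survive: not item context
        (31, 17, 0.6, "item"),         # must survive: different pid1
    ]
    conn.executemany("INSERT INTO wbs_propertypairs VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def delete_item_pairs(conn, pid1, batch_size=1000):
    """Delete pairs for pid1 in item context, one batch at a time."""
    counts = []
    while True:
        cur = conn.execute(
            "DELETE FROM wbs_propertypairs WHERE rowid IN ("
            " SELECT rowid FROM wbs_propertypairs"
            " WHERE pid1 = ? AND context = 'item' LIMIT ?)",
            (pid1, batch_size),
        )
        conn.commit()
        counts.append(cur.rowcount)
        print(f"Batch {len(counts)}: {cur.rowcount} rows")
        if cur.rowcount == 0:  # an empty batch means nothing is left
            return counts
```

Small batches keep each transaction short, which is presumably why the real script works this way; pairs in qualifier and reference context are left alone.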

Mentioned in SAL [2016-08-28T16:51:23Z] <hoo> Ran T132839-Workarounds.sh from my home in terbium (see T132839)

Mentioned in SAL [2016-09-07T01:07:24Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2016-09-14T15:50:56Z] <hoo> Ran T132839-Workarounds.sh from my home in terbium (see T132839)

hoo added a comment.Sep 14 2016, 3:51 PM

Further updated the workaround:
It now also removes suggestions based on P18 (image) and P373 (Commons category).

Mentioned in SAL (#wikimedia-operations) [2016-10-04T13:40:34Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

hoo added a comment.Oct 7 2016, 4:11 PM

I thought about this for a bit and have the following improvement in mind. It works on the data structure we currently have, so it needs only a rather minimal change in the extension code.

The current model being used is described on a very high level in T132839#2270026.

My suggestion:
For all properties used on an Item, get the probabilities for their use together with other properties. Then average these across all used properties.

Additionally we could experiment with weighting the average. This could be done by data type (so that we could for example make external ids weigh in less). We could also try to put less weight on properties that are being used together with a lot of other properties, as that might indicate that the property is not well suited for getting topic specific suggestions.
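
A minimal sketch of this averaging, with optional per-property weights, might look like the following. It is not the extension's code: the function name, the `pair_prob` layout and all probabilities are invented for illustration.

```python
from collections import defaultdict


def suggest(item_props, pair_prob, weights=None):
    """Rank candidates by the (weighted) average co-occurrence probability.

    item_props: set of property ids already used on the item
    pair_prob:  {used property: {candidate property: probability}}
    weights:    optional {used property: weight}; missing entries count as 1.0
    """
    weights = weights or {}
    weight_sum = sum(weights.get(p, 1.0) for p in item_props)
    if weight_sum == 0:
        return []
    totals = defaultdict(float)
    for p in item_props:
        w = weights.get(p, 1.0)
        for candidate, prob in pair_prob.get(p, {}).items():
            if candidate not in item_props:  # never suggest what is already there
                totals[candidate] += w * prob
    scored = {c: t / weight_sum for c, t in totals.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Made-up numbers: an external id (P214) that mostly co-occurs with human properties.
DEMO_PAIRS = {
    "P625": {"P131": 0.8},
    "P214": {"P21": 0.9, "P569": 0.9},
    "P885": {"P403": 0.7, "P131": 0.5},
}
DEMO_ITEM = {"P625", "P214", "P885"}
```

With these sample numbers, `suggest(DEMO_ITEM, DEMO_PAIRS)` ranks P21 above P403, while `suggest(DEMO_ITEM, DEMO_PAIRS, weights={"P214": 0.1})` ranks P403 above P21, which is the effect that down-weighting external ids is meant to have.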

Mentioned in SAL (#wikimedia-operations) [2016-10-13T10:31:25Z] <hoo> Ran (updated) T132839-Workarounds.sh from my home in terbium

Mentioned in SAL (#wikimedia-operations) [2016-11-02T12:11:32Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

hoo added a comment.Nov 6 2016, 2:39 PM

I poked at this a bit on Thursday and Friday and came up with a new idea which will (hopefully) significantly improve the suggestions given.

Currently there are two types of correlations that the suggester considers:

  1. "Classifying" ones ("instance of" and "subclass of") where we take into account the Property id and the value of Statements.
  2. Non-classifying correlations, where only the fact that a Statement with a certain Property id exists on an Item is considered.

Right now these two types of correlations are treated equally when suggesting new Properties to use.

While playing around with various options, I found that the suggestions based on the "classifying" correlations are usually much better than the ones based purely on the fact that two Properties are often used together. Because of this, we decided to implement a setting which will allow us to adjust the weight given to the correlation types in question.

The pull request for this is at https://github.com/Wikidata-lib/PropertySuggester/pull/179 and the change will need a new PropertySuggester 4.0.

Once this has been deployed, we can undo the workarounds for this bug and then see what the right weight for classifying correlations should be. In my tests rather "extreme" values like 0.75 : 0.25 or even 0.8 : 0.2 worked best, so I would suggest trying these for starters.

Note: Suggestions for qualifiers and references, and suggestions for Items without instance of/subclass of, won't be affected by this at all.
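
To make the weighting idea concrete, here is a rough sketch. It does not reproduce the pull request; the function names, the data layout and the sample probabilities are assumptions. Each group of correlations is averaged on its own and the two averages are then mixed with the configured weight.

```python
def group_mean(sources):
    """Average each candidate's probability over all sources in one group."""
    if not sources:
        return {}
    candidates = set().union(*sources)
    return {c: sum(s.get(c, 0.0) for s in sources) / len(sources) for c in candidates}


def combine(classifying, presence, classifying_weight=0.75):
    """Mix classifying and presence-based correlations.

    classifying: list of {candidate: probability}, one per P31/P279 value
    presence:    list of {candidate: probability}, one per property on the item
    """
    p_mean = group_mean(presence)
    if not classifying:
        # Items without instance of/subclass of are unaffected by the weight.
        return p_mean
    c_mean = group_mean(classifying)
    w = classifying_weight
    return {
        c: w * c_mean.get(c, 0.0) + (1 - w) * p_mean.get(c, 0.0)
        for c in set(c_mean) | set(p_mean)
    }


# Made-up numbers: one classifying source (a river) and three presence sources.
DEMO_CLASSIFYING = [{"P403": 0.6, "P131": 0.7}]
DEMO_PRESENCE = [{"P21": 0.9}, {"P21": 0.8, "P131": 0.3}, {"P569": 0.9}]
```

With these sample numbers and a 0.75 : 0.25 split, P403 scores about 0.45 and P21 about 0.14. Averaging all four sources equally would instead give P21 about 0.43 and P403 0.15, so the human property would win.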

hoo claimed this task.Nov 6 2016, 2:39 PM
Esc3300 added a subscriber: Esc3300.Nov 7 2016, 4:05 PM

Maybe P106 and P17 could be "classifying" ones as well.

hoo added a comment.Nov 8 2016, 10:20 AM

Maybe P106 and P17 could be "classifying" ones as well.

P106 probably could, given it's only used on humans. I also thought about this myself before… please create a separate ticket for this.
P17 is used on so many different subjects (buildings, cities, streets, …), that boosting it might result in weird correlations.

I thought about using them in combination with P31/P279.

  • For P106, it would probably only be P31=Q5.
  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.
thiemowmde renamed this task from Property suggester suggests human properties for non-human items to [RfC] Property suggester suggests human properties for non-human items.Nov 8 2016, 2:33 PM
thiemowmde moved this task from Proposed to Review on the Wikidata-Sprint-2016-11-08 board.
thiemowmde added a project: Proposal.
hoo added a comment.Nov 9 2016, 3:58 PM

I thought about using them in combination with P31/P279.

  • For P106, it would probably only be P31=Q5.

Given P106 is probably not (widely) used in any other contexts, we can just assume this to be the case, I presume.

  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.

Such a correlation is quite far away given the data and the code we currently have. I'm trying to keep the algorithm here as simple as possible and only make it more advanced if we really need to.

Given P106 is probably not (widely) used in any other contexts, we can just assume this to be the case, I presume.

Yes, mostly Q5 per Constraint_violations/P106#Types_statistics

  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.

Such a correlation is quite far away given the data and the code we currently have. I'm trying to keep the algorithm here as simple as possible and only make it more advanced if we really need to.

Indeed: I tried to do some queries on the combination P31/P17. There seem to be too many different P31 values.

Maybe it could be checked instead if P17 on the item isn't different from P17 on the property.
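
As a sketch of what such a check could look like (purely hypothetical; nothing like this exists in the code discussed here, and the ids in the sample mapping are illustrative and not checked against live data):

```python
def filter_by_country(candidates, item_countries, property_country):
    """Drop candidates whose own P17 names a country the item does not have.

    candidates:       ranked list of suggested property ids
    item_countries:   set of P17 values on the item (empty if it has none)
    property_country: {property id: P17 value on the property entity}
    """
    if not item_countries:
        return list(candidates)  # no P17 on the item: nothing to compare against
    kept = []
    for pid in candidates:
        country = property_country.get(pid)
        # Keep properties without P17, or with a P17 matching the item.
        if country is None or country in item_countries:
            kept.append(pid)
    return kept


# Illustrative mapping: a country-specific identifier property tagged with its country.
DEMO_PROPERTY_COUNTRY = {"P442": "Q148"}
```

For an item whose P17 is Q183, `filter_by_country(["P131", "P442", "P403"], {"Q183"}, DEMO_PROPERTY_COUNTRY)` keeps P131 and P403 and drops P442.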

Esc3300 updated the task description. Nov 11 2016, 1:18 PM

Mentioned in SAL (#wikimedia-operations) [2016-11-30T02:07:35Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-01-09T17:51:54Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

thiemowmde updated the task description.
thiemowmde added subscribers: Jonas, aude.
Esc3300 added a comment.EditedJan 26 2017, 10:28 AM

For items with P31/P279 values that have "properties for this type" (P1963), the properties listed there could be suggested with priority.

Sample: for people, this would be all properties at Q5#P1963.
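
One simple way to read "with priority" is a stable re-ordering of the existing ranking, sketched below. The function name and inputs are assumptions; how the P1963 values would be fetched is left out.

```python
def prioritize(ranked, type_properties):
    """Move properties listed via P1963 to the front, keeping relative order.

    ranked:          property ids in the suggester's current order
    type_properties: set of property ids from P1963 on the item's class(es)
    """
    # sorted() is stable, so ties keep their original ranking;
    # False sorts before True, so listed properties come first.
    return sorted(ranked, key=lambda pid: pid not in type_properties)
```

For example, `prioritize(["P21", "P131", "P403", "P569"], {"P403", "P131"})` returns `["P131", "P403", "P21", "P569"]`.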

thiemowmde updated the task description. Feb 2 2017, 9:29 AM
thiemowmde moved this task from Review to Doing on the Wikidata-Former-Sprint-Board board.

Mentioned in SAL (#wikimedia-operations) [2017-02-08T09:55:00Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

thiemowmde updated the task description. Mar 13 2017, 5:12 PM

Moving back to the "review" column because of https://github.com/Wikidata-lib/PropertySuggester/pull/192. This is a quite trivial refactoring that has been waiting for review for 2 months now. Please move this back to "doing" once it is merged, because I will then rebase https://github.com/Wikidata-lib/PropertySuggester/pull/179 and hand over to @hoo, who is still planning to continue working on it.

thiemowmde moved this task from in progress to consider for next sprint on the Wikidata board.

@hoo, we talked about this ticket during today's engineering time and decided to remove it from the sprint board for the moment. Priority is still high because we want to pick this up again as soon as we can. But this is currently blocked on you. I remember you saying we should not merge your patch https://github.com/Wikidata-lib/PropertySuggester/pull/179, because you have an idea and want to improve it first. When do you think this can happen?

If you have time, please start with reviewing https://github.com/Wikidata-lib/PropertySuggester/pull/192 first. This is a minor refactoring split from your patch. I think this will make your patch easier to review (but you might disagree). I will happily do the necessary rebase, if you want.

Mentioned in SAL (#wikimedia-operations) [2017-04-04T16:03:32Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-05-11T18:39:00Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-05-31T15:58:53Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-07-26T11:02:31Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Restricted Application added a subscriber: PokestarFan. Jul 26 2017, 11:02 AM

Mentioned in SAL (#wikimedia-operations) [2017-08-30T17:27:46Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-10-04T21:07:26Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-11-02T21:38:22Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-12-15T01:22:46Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2018-03-15T12:54:53Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

hoo added a comment.Apr 11 2018, 9:03 PM

Just to give an idea of what I was working on last:

Right now we take the correlation based on all properties that are present on an Item (just by their presence!). We also take the correlation based on the instance of/subclass of values. These are then aggregated by taking the average probability (ranking the property-presence and the instance of/subclass of derived probabilities equally). This usually leads to the instance of/subclass of correlations having very little influence, as they are quickly outweighed by the sheer number of other properties used on an Item.

My suggestion now (and I already had some code for that on GitHub) was to make it possible to weight these types of correlation, so that for example the instance of/subclass of correlations have the same (0.5) weight as all property-presence-based correlations combined.

From where we could take further (still simple) steps like discrim

Mentioned in SAL (#wikimedia-operations) [2018-05-18T09:43:45Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

I'd rank primarily by P31/P279 .. otherwise everything is assimilated to a PubMed item

Mentioned in SAL (#wikimedia-operations) [2018-06-14T12:54:53Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2018-07-19T15:59:59Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2018-10-12T10:27:39Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-03-19T17:47:47Z] <Lucas_WMDE> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds (T216270)

Mentioned in SAL (#wikimedia-operations) [2019-04-16T08:40:49Z] <hoo> Updated the Wikidata property suggester with data from the 2019-04-08 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-05-24T11:45:44Z] <hoo> Updated the Wikidata property suggester with data from the 2019-05-13 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-07-10T19:45:04Z] <hoo> Updated the Wikidata property suggester with data from the 2019-07-01 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-08-20T16:24:58Z] <hoo> Updated the Wikidata property suggester with data from the 2019-08-12 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-10-10T20:13:44Z] <hoo> Updated the Wikidata property suggester with data from the 2019-09-30 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-11-11T10:52:36Z] <hoo> Updated the Wikidata property suggester with data from the 2019-11-04 JSON dump and applied the T132839 workarounds

We should solve this properly, once and for all.

@Addshore Can we add this to technical exploration?

Mentioned in SAL (#wikimedia-operations) [2020-01-22T12:57:56Z] <hoo> Updated the Wikidata property suggester with data from the 2020-01-06 JSON dump and applied the T132839 workarounds

Addshore added subscribers: darthmon_wmde, WMDE-leszek.

Is there a separate ticket for automating / fixing the update process?
If this version of the property suggester is going to continue being used then it would make sense to tackle this.
I believe over the last years there were thoughts about just replacing it though?
I'll defer to @Lydia_Pintscher @WMDE-leszek @darthmon_wmde

There is no separate ticket I'm aware of. I am not sure if tackling this or replacing it is better. I am clear however on the feature being needed.

I believe over the last years there were thoughts about just replacing it though?

Before (if) this happens, the existing solution has got to be automated. The manual approach is questionable, to say the least.

Is there a separate ticket for automating / fixing the update process?

As apparently there is none, I am going to create one.

Aklapper updated the task description. Feb 19 2020, 3:05 PM

https://github.com/Wikidata-lib/PropertySuggester/pull/179 (mentioned in the task description) has been closed and points to https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/PropertySuggester but I cannot find a related patch in Gerrit.

@hoo: And should this task still be open and still be assigned to you? If yes, do you have a link to that last patch?

Mentioned in SAL (#wikimedia-operations) [2020-03-09T14:56:41Z] <hoo> Updated the Wikidata property suggester with data from the 2020-03-02 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2020-04-21T18:28:35Z] <hoo> Updated the Wikidata property suggester with data from the 2020-04-06 JSON dump and applied the T132839 workarounds