
[RfC] Property suggester suggests human properties for non-human items
Open, Stalled, High, Public

Description

The property suggester keeps suggesting completely inappropriate human properties to me on items which are not humans.

For example, right now, https://www.wikidata.org/wiki/Q504582 has the properties

  • P31 (instance of)
  • P373 (Commons category)
  • P625 (coordinate location)
  • P17 (country)
  • P18 (image)
  • P935 (Commons gallery)
  • P885 (origin of the watercourse)
  • P1599 (GeoNames ID)
  • P646 (Freebase ID)
  • P214 (VIAF ID)

Of those, 6 are generic and can apply to a variety of items, 3 are specific to geographical features and 1 is fairly generic but not usually found on humans.

Despite that, if I go to add a new property, the suggested properties are:

  • P131 (located in the administrative territorial entity)
  • P21 (sex or gender)
  • P569 (date of birth)
  • P735 (given name)
  • P27 (country of citizenship)
  • P106 (occupation)
  • P19 (place of birth)

The first of these would be a good property. The other 6 are all specific to humans (and other living beings) and should definitely not be added to rivers.

I can't see why it's so biased towards human properties here. I would expect to see properties relating to rivers (e.g. P403 (mouth of the watercourse) would actually be a useful suggestion) or at least properties relating to geographical features rather than humans.


Suggested solutions from discussion (please complete):

  • allow weighting of properties with classifying values (P31 / P279) compared to others.
  • add P106 to properties with classifying values
  • if item has P17, check if P17 on property matches (same value or no property), if not: don't suggest property
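The P17 check in the last bullet could look roughly like the sketch below. It is only an illustration of the idea, not code from PropertySuggester: the function name `filter_by_country`, the input shapes and the sample values are all assumptions made up for this example.

```python
def filter_by_country(item_countries, candidates, property_countries):
    """Drop candidate properties whose own P17 conflicts with the item's P17.

    item_countries: set of P17 values on the item (empty if it has none).
    candidates: suggested property IDs, in ranked order.
    property_countries: hypothetical lookup {property ID: set of P17 values
        stated on the property itself}; properties without P17 are absent.
    """
    if not item_countries:
        # The item has no P17, so there is nothing to compare against.
        return list(candidates)
    kept = []
    for pid in candidates:
        countries = property_countries.get(pid, set())
        # Keep the candidate if the property has no P17 or shares a value.
        if not countries or countries & item_countries:
            kept.append(pid)
    return kept


# Illustrative data only: an item with P17 = Q183, and a property P442
# that is assumed here to carry P17 = Q148.
suggestions = filter_by_country(
    {"Q183"}, ["P131", "P442"], {"P442": {"Q148"}}
)
```

With this sample data only P131 survives. The ranking itself is untouched, so the filter could run after whatever scoring is used.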

Patch-For-Review:

Event Timeline


Maybe P106 and P17 could be "classifying" ones as well.

Maybe P106 and P17 could be "classifying" ones as well.

P106 probably could, given it's only used on humans. I also thought about this myself before… please create a separate ticket for this.
P17 is used on so many different subjects (buildings, cities, streets, …), that boosting it might result in weird correlations.

I thought about using them in combination with P31/P279.

  • For P106, it would probably only be P31=Q5.
  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.
thiemowmde renamed this task from Property suggester suggests human properties for non-human items to [RfC] Property suggester suggests human properties for non-human items. Nov 8 2016, 2:33 PM
thiemowmde moved this task from Proposed to Review on the Wikidata-Sprint-2016-11-08 board.
thiemowmde added a project: Proposal.

I thought about using them in combination with P31/P279.

  • For P106, it would probably only be P31=Q5.

Given P106 is probably not (widely) used in any other contexts, we can just assume this to be the case, I presume.

  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.

Such a correlation is quite far away given the data and the code we currently have. I'm trying to keep the algorithm here as simple as possible and only make it more advanced if we really need to.

Given P106 is probably not (widely) used in any other contexts, we can just assume this to be the case, I presume.

Yes, mostly Q5 per Constraint_violations/P106#Types_statistics

  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.

Such a correlation is quite far away given the data and the code we currently have. I'm trying to keep the algorithm here as simple as possible and only make it more advanced if we really need to.

Indeed: I tried to do some queries on the combination P31/P17. There seem to be too many different P31 values.

Maybe it could be checked instead if P17 on the item isn't different from P17 on the property.

Mentioned in SAL (#wikimedia-operations) [2016-11-30T02:07:35Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-01-09T17:51:54Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

For items with P31/P279 values that have "properties for this type" (P1963), the properties listed there could be suggested with priority.

Sample: for people, this would be all properties at Q5#P1963.
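A minimal sketch of that prioritisation, assuming the suggester already produces a ranked list and that the P1963 values of the item's classes have been collected into a set. The function name `prioritise` and the sample IDs are hypothetical and only serve the example.

```python
def prioritise(ranked, type_properties):
    """Move properties listed via P1963 to the front, keeping relative order.

    ranked: suggested property IDs, best first.
    type_properties: set of property IDs taken from "properties for this
        type" (P1963) on the item's P31/P279 values.
    """
    # sorted() is stable and False sorts before True, so listed properties
    # come first while the original order within each group is preserved.
    return sorted(ranked, key=lambda pid: pid not in type_properties)


# Made-up input: pretend the item's class lists P131 and P403 under P1963.
reordered = prioritise(["P21", "P131", "P403"], {"P131", "P403"})
```

This only reorders existing suggestions. Properties listed under P1963 that the suggester did not return at all would have to be added separately.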

Mentioned in SAL (#wikimedia-operations) [2017-02-08T09:55:00Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Moving back to the "review" column because of https://github.com/Wikidata-lib/PropertySuggester/pull/192. This is a quite trivial refactoring, and it has been waiting for a review for 2 months now. Please move back to "doing" after this got merged, because I will rebase https://github.com/Wikidata-lib/PropertySuggester/pull/179 then, and hand over to @hoo who is still planning to continue working on it.

thiemowmde moved this task from in progress to consider for next sprint on the Wikidata board.

@hoo, we talked about this ticket during today's engineering time, and decided to remove this from the sprint board for the moment. Priority is still high because we want to pick this up again as soon as we can. But this is currently blocked on you. I remember you saying we should not merge your patch https://github.com/Wikidata-lib/PropertySuggester/pull/179, because you have an idea and want to improve it first. When do you think this can happen?

If you have time, please start with reviewing https://github.com/Wikidata-lib/PropertySuggester/pull/192 first. This is a minor refactoring split from your patch. I think this will make your patch easier to review (but you might disagree). I will happily do the necessary rebase, if you want.

Mentioned in SAL (#wikimedia-operations) [2017-04-04T16:03:32Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-05-11T18:39:00Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-05-31T15:58:53Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-07-26T11:02:31Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-08-30T17:27:46Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-10-04T21:07:26Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-11-02T21:38:22Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-12-15T01:22:46Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2018-03-15T12:54:53Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Just to give an idea of what I was working on last:

Right now we take the correlation based on all properties that are present on an Item (just by their presence!). We also take the correlation based on the instance of / subclass of values. These are then aggregated by taking the average probability (ranking the property presence and the instance of / subclass of derived probabilities equally). This usually leads to the instance of / subclass of correlations having very little influence, as they are quickly outweighed by the sheer number of other properties used on an Item.

My suggestion now (and I already had some code for that on GitHub) was to make it possible to weight these types of correlation, so that for example the instance of / subclass of correlations together have the same (0.5) weight as all property presence based correlations together.

From where we could take further (still simple) steps like discrim
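To make the difference between the two aggregation schemes concrete, here is a small sketch. It is not the PropertySuggester code; the function names, the data layout (one dict of candidate probabilities per source) and all numbers are assumptions chosen for illustration.

```python
from collections import defaultdict


def mean_scores(sources):
    """Average each candidate's probability over a list of sources.

    Each source is a dict {candidate property: probability}; a candidate
    missing from a source counts as probability 0 for that source.
    """
    if not sources:
        return {}
    totals = defaultdict(float)
    for probabilities in sources:
        for candidate, probability in probabilities.items():
            totals[candidate] += probability
    return {c: total / len(sources) for c, total in totals.items()}


def flat_average(presence, classifying):
    """Behaviour as described above: every source counts equally."""
    return mean_scores(presence + classifying)


def weighted_average(presence, classifying, classifying_weight=0.5):
    """Suggested behaviour: each group of sources gets a fixed share."""
    presence_mean = mean_scores(presence)
    classifying_mean = mean_scores(classifying)
    candidates = set(presence_mean) | set(classifying_mean)
    return {
        c: (1 - classifying_weight) * presence_mean.get(c, 0.0)
        + classifying_weight * classifying_mean.get(c, 0.0)
        for c in candidates
    }


# Made-up numbers: nine properties on the item each weakly suggest P21,
# while the single P31 value (a river class) strongly suggests P403.
presence = [{"P21": 0.3}] * 9
classifying = [{"P403": 0.9}]

flat = flat_average(presence, classifying)
weighted = weighted_average(presence, classifying)
```

With these inputs the flat average ranks P21 above P403, because the one classifying source is diluted by the nine presence-based ones. The weighted version ranks P403 first, since the classifying group keeps half of the total weight no matter how many other properties the item has.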

Mentioned in SAL (#wikimedia-operations) [2018-05-18T09:43:45Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

I'd rank primarily by P31/P279; otherwise everything gets treated like a PubMed item.

Mentioned in SAL (#wikimedia-operations) [2018-06-14T12:54:53Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2018-07-19T15:59:59Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2018-10-12T10:27:39Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-03-19T17:47:47Z] <Lucas_WMDE> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds (T216270)

Mentioned in SAL (#wikimedia-operations) [2019-04-16T08:40:49Z] <hoo> Updated the Wikidata property suggester with data from the 2019-04-08 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-05-24T11:45:44Z] <hoo> Updated the Wikidata property suggester with data from the 2019-05-13 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-07-10T19:45:04Z] <hoo> Updated the Wikidata property suggester with data from the 2019-07-01 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-08-20T16:24:58Z] <hoo> Updated the Wikidata property suggester with data from the 2019-08-12 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-10-10T20:13:44Z] <hoo> Updated the Wikidata property suggester with data from the 2019-09-30 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2019-11-11T10:52:36Z] <hoo> Updated the Wikidata property suggester with data from the 2019-11-04 JSON dump and applied the T132839 workarounds

We should solve this properly, once and for all.

@Addshore Can we add this to technical exploration?

Mentioned in SAL (#wikimedia-operations) [2020-01-22T12:57:56Z] <hoo> Updated the Wikidata property suggester with data from the 2020-01-06 JSON dump and applied the T132839 workarounds

Addshore added subscribers: darthmon_wmde, WMDE-leszek.

Is there a separate ticket for automating / fixing the update process?
If this version of the property suggester is going to continue being used then it would make sense to tackle this.
I believe over the last years there were thoughts about just replacing it though?
I'll defer to @Lydia_Pintscher @WMDE-leszek @darthmon_wmde

There is no separate ticket I'm aware of. I am not sure if tackling this or replacing it is better. I am clear however on the feature being needed.

I believe over the last years there were thoughts about just replacing it though?

Before (or if) this happens, the existing solution has to be automated. The manual approach is questionable at best.

Is there a separate ticket for automating / fixing the update process?

As apparently there is none, I am going to create one.

https://github.com/Wikidata-lib/PropertySuggester/pull/179 (mentioned in the task description) has been closed and points to https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/PropertySuggester but I cannot find a related patch in Gerrit.

@hoo: And should this task still be open and still be assigned to you? If yes, do you have a link to that last patch?

Mentioned in SAL (#wikimedia-operations) [2020-03-09T14:56:41Z] <hoo> Updated the Wikidata property suggester with data from the 2020-03-02 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2020-04-21T18:28:35Z] <hoo> Updated the Wikidata property suggester with data from the 2020-04-06 JSON dump and applied the T132839 workarounds

@hoo: ping? Could you answer the previous comment, please? Should this task remain open? Thanks!

Mentioned in SAL (#wikimedia-operations) [2021-01-21T09:44:10Z] <hoo> Updated the Wikidata property suggester with data from the 2021-01-11 JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2021-02-15T16:14:34Z] <hoo> Updated the Wikidata property suggester with data from the 2021-02-01 JSON dump (with pre-applied T132839 workarounds)

Mentioned in SAL (#wikimedia-operations) [2021-03-15T18:18:00Z] <hoo> Updated the Wikidata property suggester with data from the 2021-03-08 JSON dump (with pre-applied T132839 workarounds)

Mentioned in SAL (#wikimedia-operations) [2021-04-19T16:25:03Z] <hoo> Updated the Wikidata property suggester with data from the 2021-04-12 JSON dump (with pre-applied T132839 workarounds)

Mentioned in SAL (#wikimedia-operations) [2021-06-22T14:35:00Z] <hoo> Updated the Wikidata property suggester with data from the 2021-05-31 JSON dump (with pre-applied T132839 workarounds)

Mentioned in SAL (#wikimedia-operations) [2021-07-16T00:06:22Z] <hoo> Updated the Wikidata property suggester with data from the 2021-07-12 JSON dump (with pre-applied T132839 workarounds)

Mentioned in SAL (#wikimedia-operations) [2021-10-18T23:40:20Z] <hoo> Updated the Wikidata property suggester with data from the 2021-10-04 JSON dump (with pre-applied T132839 workarounds)

Mentioned in SAL (#wikimedia-operations) [2022-02-15T22:00:29Z] <hoo> Updated the Wikidata property suggester with data from the 2022-02-07 JSON dump (with pre-applied T132839 workarounds)

Lydia_Pintscher changed the task status from Open to Stalled. Aug 16 2022, 10:59 AM

I assume this will be fixed with the new suggester being finalized in T285098. Setting to stalled for now to check back on this once we have the new suggester in place.