[RfC] Property suggester suggests human properties for non-human items
Open, High, Public

Description

The property suggester keeps suggesting completely inappropriate human properties to me on items which are not humans.

For example, right now, https://www.wikidata.org/wiki/Q504582 has the properties

  • P31 (instance of)
  • P373 (Commons category)
  • P625 (coordinate location)
  • P17 (country)
  • P18 (image)
  • P935 (Commons gallery)
  • P885 (origin of the watercourse)
  • P1599 (GeoNames ID)
  • P646 (Freebase ID)
  • P214 (VIAF ID)

Of those, 6 are generic and can apply to a variety of items, 3 are specific to geographical features and 1 is fairly generic but not usually found on humans.

Despite that, if I go to add a new property, the suggested properties are:

  • P131 (located in the administrative territorial entity)
  • P21 (sex or gender)
  • P569 (date of birth)
  • P735 (given name)
  • P27 (country of citizenship)
  • P106 (occupation)
  • P19 (place of birth)

The first of these would be a good property. The other 6 are all specific to humans (and other living beings) and should definitely not be added to rivers.

I can't see why it's so biased towards human properties here. I would expect to see properties relating to rivers (e.g. P403 (mouth of the watercourse) would actually be a useful suggestion) or at least properties relating to geographical features rather than humans.


Suggested solutions from discussion (please complete):

  • allow weighting of properties with classifying values (P31 / P279) relative to the others.
  • add P106 to properties with classifying values
  • if item has P17, check if P17 on property matches (same value or no property), if not: don't suggest property

Patch-For-Review:

Lydia_Pintscher moved this task from incoming to consider for next sprint on the Wikidata board.
Lydia_Pintscher triaged this task as High priority.

Ideas we had in today's meeting:

hoo added a comment.Apr 26 2016, 2:09 PM

Ideas we had in today's meeting:

Should not be the case, I addressed that with https://github.com/Wikidata-lib/PropertySuggester-Python/commit/6fc5610f3e0383d676c0abde20dcee7029274723 (which is applied on the host where I create the suggester data).

  • We can undo the last database update to the one from March and see if the problem is still there. If it is, the problem is code. Otherwise it's just the data we have.

Sure, we can try this for a bit.

Mentioned in SAL [2016-05-06T11:02:30Z] <hoo> Overwrote property suggester data with data from the 20160215 dump (T132839)

Mentioned in SAL [2016-05-06T11:10:01Z] <hoo> Reverted the property suggester data to data from the 20160411 dump (done testing T132839)

@hoo tried the old correlation data and the suggestions are just as bad. This indicates a problem with the code.

@hoo removed the external IDs from the correlation table. This seems to improve the situation for now. We'll still need to find a better solution though.

hoo added subscribers: mkroetzsch, daniel. Edited May 6 2016, 2:18 PM

I've just looked into this and I think the problem is that the suggester is using a very naive way to select its suggestions. It basically queries for probable matches for each property id used on an item individually, which strongly favours properties that have a few highly probable matches.

Put more mathematically (partly taken from the thesis about this):

Q is the item we want suggestions for and Properties(Q) is the set of properties used in Statements on it.

For each pair P1 ∈ Properties(Q), P2 ∉ Properties(Q) we look at the confidence that P1 => P2 (without taking any further context into account). We also look for the confidence (P31, Q) => P2 (where P31 is instance of). Later on a list of all these P2s (ordered by confidence) is returned (the ones found with the (P31, Q) pair and the ones found by just looking by a given P1 are treated equally).

We probably want to move away from selecting these P2s individually by P1 and try to get correlations for all Properties at the same time (Properties(Q) => P2) or by combining the individual probability of each P1 with the one from the (P31, Q) pair ({P1, (P31, Q)} => P2).
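
The selection described above can be sketched roughly as follows. This is a minimal illustration and not the extension's actual query: the `confidences` table, the function name `naive_suggestions` and all numbers are invented for the example, and keeping the highest confidence per P2 is an assumption about how duplicates are merged.

```python
from typing import Dict, Hashable, Iterable, List

# Hypothetical confidence table: antecedent -> {P2: confidence}.
# An antecedent is either a bare property id ("P214") or a classifying
# (property, value) pair such as ("P31", "Q4022"). All numbers are invented.
Confidences = Dict[Hashable, Dict[str, float]]


def naive_suggestions(
    item_properties: Iterable[str],
    classifying_pairs: Iterable[tuple],
    confidences: Confidences,
    limit: int = 7,
) -> List[str]:
    """Rank candidate properties by looking at each antecedent on its own."""
    used = set(item_properties)
    best: Dict[str, float] = {}
    # Plain properties and classifying pairs are treated equally.
    for antecedent in sorted(used) + list(classifying_pairs):
        for p2, confidence in confidences.get(antecedent, {}).items():
            if p2 in used:
                continue  # already on the item, nothing to suggest
            # Assumption: a P2 reached several times keeps its best confidence.
            best[p2] = max(best.get(p2, 0.0), confidence)
    return sorted(best, key=lambda p2: (-best[p2], p2))[:limit]


example_confidences: Confidences = {
    "P214": {"P21": 0.9, "P569": 0.85},            # VIAF ID is mostly on humans
    "P625": {"P131": 0.7},                         # coordinate location
    ("P31", "Q4022"): {"P403": 0.6, "P131": 0.65}, # instance of river
}
river_suggestions = naive_suggestions(
    ["P214", "P625"], [("P31", "Q4022")], example_confidences
)
print(river_suggestions)
```

With these made-up numbers a single generic identifier is enough to push two human properties ahead of everything the river classification contributes, which is the bias reported in the description.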

Nikki added a comment.May 6 2016, 4:31 PM

The suggestions right now seem to be better than before, e.g. for the example in the description I get P131, mouth of the watercourse, sex or gender, date of birth. That still includes human properties, but at least mouth of the watercourse actually shows up now.

Not quite the same, but presumably caused by how it decides which properties to select: I also often see country-specific properties for large countries show up as suggestions for items in other countries, e.g. https://www.wikidata.org/wiki/Q504582 currently suggests "China administrative division code" despite the item having the country set to the USA. If there's a change that would improve things like that too, that would be awesome. :)

@Lydia_Pintscher suggests to look into the code again and find out whether the problem comes from a change to Wikidata that's not reflected in PropertySuggester.

hoo added a comment.May 10 2016, 1:10 PM

@Lydia_Pintscher suggests to look into the code again and find out whether the problem comes from a change to Wikidata that's not reflected in PropertySuggester.

We already did that three times by now, I think… I'm not sure what the point of repeating that is.

The next step here would be (in my opinion) to get the exact query that the suggester is running and then look through the results. In the end I'm fairly sure my analysis at T132839#2270026 is correct. I find it unlikely that the rather naive current algorithm can give meaningful results for larger items.

We also came up with a possible improvement: Some properties like "instance of" and "Commons category" are not selective. The fact that this property exists on an item does not say anything. We think it's a good idea to add such properties to a "non-selective" blacklist (or to the existing blacklist). This should reduce noise.

Hm. I just realized that "Commons category" does tell you one thing: That there are pictures and the item should have an "image" property too.

FYI, I did another round of code review on https://github.com/Wikidata-lib/PropertySuggester-Python and https://github.com/Wikidata-lib/PropertySuggester and could not find more suspicious code. The Python script should produce massive amounts of warnings when a datatype is missing. Does this happen? Are these logs reviewed after the script is run?

hoo added a comment.May 10 2016, 11:26 PM

We also came up with a possible improvement: Some properties like "instance of" and "Commons category" are not selective. The fact that this property exists on an item does not say anything. We think it's a good idea to add such properties to a "non-selective" blacklist (or to the existing blacklist). This should reduce noise.

We have special handling for instance of and subclass of that avoids this behaviour (these are "classifying properties"). Excluding very generic ones like identifiers and certain string ones is probably also a good idea (or, in the long run, weighting them lower?).

FYI, I did another round of code review on https://github.com/Wikidata-lib/PropertySuggester-Python and https://github.com/Wikidata-lib/PropertySuggester and could not find more suspicious code. The Python script should produce massive amounts of warnings when a datatype is missing. Does this happen? Are these logs reviewed after the script is run?

I saw it before, but very rarely (like once or twice for a dump run at some point).

I'll probably do a new dump run tomorrow and will examine the logs after, but I don't think that's going to give us any new insights.

It does not look like PropertySuggester-Python is currently applying any filtering based on data type (or value type). If we want to add such filtering, the best place would probably be in write_row in CsvWriter.py. We could also filter while reading the input file, but we would have to do this twice, in JsonReader and in XmlReader.

Note however, if we filter out properties with specific data types completely, such properties will never be suggested.

I believe this is wrong. There are "external-identifier" properties that really should be suggested the moment it becomes clear what kind of item you are editing. For example, something with "instance of: book" should get an ISBN as soon as possible, and the other way around. Or: an item that happens to be a Rijksmonument in the Netherlands must get a Rijksmonument ID.

I suggest to:

  1. Add unspecific identifiers that apply to all kinds of items to the $wgPropertySuggesterClassifyingPropertyIds setting.
  2. Add properties to $wgPropertySuggesterDeprecatedIds that should never be suggested unless you search for them explicitly.
hoo added a comment.May 24 2016, 3:39 PM

[…]
I suggest to:

  1. Add unspecific identifiers that apply to all kinds of items to the $wgPropertySuggesterClassifyingPropertyIds setting.

Why? I don't see how an identifier value could ever be classifying ($wgPropertySuggesterClassifyingPropertyIds is about classifying based on values not just the properties used, although that might not be correctly implemented right now for strings/ external ids).

Thanks for the clarification, this is indeed an important difference. Properties like ISBN are not classifying by value but by the pure fact that they exist.

Mentioned in SAL [2016-05-28T19:47:45Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and removed the external identifiers as a workaround for T132839

Mentioned in SAL [2016-07-07T08:33:55Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and removed the external identifiers as a workaround for T132839

Mentioned in SAL [2016-08-08T21:55:14Z] <hoo> Updated Wikidata's property suggester with data from today's json dump and removed the external identifiers as a workaround for T132839

Mentioned in SAL [2016-08-16T12:53:43Z] <hoo> Put a better workaround for T132839 in place: Only remove property pairs with context = "item". This keeps ref and qualifier pairs for ext ids intact.

Mentioned in SAL [2016-08-24T21:08:26Z] <hoo> Ran DELETE FROM wbs_propertypairs WHERE pid1 = '641' on Wikidata for T132839

hoo added a comment.Aug 28 2016, 4:51 PM

Updated the workaround further, per @Sjoerddebruin:

hoo@terbium:~$ bash T132839-Workarounds.sh 
Removing ext ids in item context
Batch 1: 0 rows

Removing P641 in item context
Batch 1: 0 rows

Removing P1344 in item context
Batch 1: 66 rows
Batch 2: 0 rows

Removing P463 in item context
Batch 1: 110 rows
Batch 2: 0 rows
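
The contents of T132839-Workarounds.sh are not shown in this task, so the following is only a guess at the kind of batched deletion that could produce output like the above. It is a sketch against an in-memory SQLite database with a simplified stand-in for wbs_propertypairs (only pid1, pid2, probability and context); the function name `remove_in_batches` and the batch size are assumptions.

```python
import sqlite3
from typing import List


def remove_in_batches(conn: sqlite3.Connection, pid1: int, batch_size: int = 1000) -> List[int]:
    """Delete item-context pairs for one pid1 in batches; return rows per batch."""
    rows_per_batch: List[int] = []
    while True:
        cursor = conn.execute(
            "DELETE FROM wbs_propertypairs WHERE rowid IN ("
            " SELECT rowid FROM wbs_propertypairs"
            " WHERE pid1 = ? AND context = 'item' LIMIT ?)",
            (pid1, batch_size),
        )
        conn.commit()
        rows_per_batch.append(cursor.rowcount)
        print(f"Batch {len(rows_per_batch)}: {cursor.rowcount} rows")
        if cursor.rowcount == 0:
            return rows_per_batch


# Simplified, hypothetical stand-in for the real table.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE wbs_propertypairs "
    "(pid1 INTEGER, pid2 INTEGER, probability REAL, context TEXT)"
)
demo.executemany(
    "INSERT INTO wbs_propertypairs VALUES (?, ?, ?, ?)",
    [
        (1344, 21, 0.9, "item"),
        (1344, 569, 0.8, "item"),
        (1344, 585, 0.4, "qualifier"),  # must survive the workaround
        (463, 21, 0.7, "item"),
    ],
)
batches = remove_in_batches(demo, 1344)
remaining = demo.execute("SELECT COUNT(*) FROM wbs_propertypairs").fetchone()[0]
```

Restricting the delete to context = 'item' leaves qualifier and reference pairs in place, in line with the "better workaround" logged on 2016-08-16.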

Mentioned in SAL [2016-08-28T16:51:23Z] <hoo> Ran T132839-Workarounds.sh from my home in terbium (see T132839)

Mentioned in SAL [2016-09-07T01:07:24Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2016-09-14T15:50:56Z] <hoo> Ran T132839-Workarounds.sh from my home in terbium (see T132839)

hoo added a comment.Sep 14 2016, 3:51 PM

Further updated the workaround:
It now also removes suggestions based on P18 (image) and P373 (Commons category).

Mentioned in SAL (#wikimedia-operations) [2016-10-04T13:40:34Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

hoo added a comment.Oct 7 2016, 4:11 PM

I thought about this for a bit and have the following improvement in mind. It works on the data structure we currently have, so we only need a rather minimal change in the extension code to achieve it.

The current model being used is described on a very high level in T132839#2270026.

My suggestion:
For all properties used on an Item, get the probabilities for their use together with other properties. Then average these across all used properties.

Additionally we could experiment with weighting the average. This could be done by data type (so that we could for example make external ids weigh in less). We could also try to put less weight on properties that are being used together with a lot of other properties, as that might indicate that the property is not well suited for getting topic specific suggestions.
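
A minimal sketch of the averaging idea, assuming a lookup `pair_probability[p1][p2]` built from the existing pair data. The function name, the per-property `weights` argument and all numbers are hypothetical; a missing pair is counted as probability 0.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def averaged_suggestions(
    item_properties: Iterable[str],
    pair_probability: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Score each candidate by the weighted mean over all properties on the item."""
    used = list(item_properties)
    weights = weights or {}
    total_weight = sum(weights.get(p1, 1.0) for p1 in used)
    if total_weight == 0:
        return []
    totals: Dict[str, float] = {}
    for p1 in used:
        weight = weights.get(p1, 1.0)  # default: every property counts the same
        for p2, probability in pair_probability.get(p1, {}).items():
            if p2 not in used:
                totals[p2] = totals.get(p2, 0.0) + weight * probability
    scores = {p2: total / total_weight for p2, total in totals.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Invented probabilities for an item with coordinates, a country and a VIAF ID.
example_pairs = {
    "P625": {"P131": 0.8, "P403": 0.3},
    "P17": {"P131": 0.9, "P403": 0.2},
    "P214": {"P21": 0.9, "P131": 0.1},
}
on_item = ["P625", "P17", "P214"]

plain_order = [p for p, _ in averaged_suggestions(on_item, example_pairs)]
# Let the external id weigh in less, as suggested above.
weighted_order = [
    p for p, _ in averaged_suggestions(on_item, example_pairs, weights={"P214": 0.25})
]
```

In this toy data the unweighted average already moves P131 to the top, and lowering the weight of the external id additionally drops P21 below P403.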

Mentioned in SAL (#wikimedia-operations) [2016-10-13T10:31:25Z] <hoo> Ran (updated) T132839-Workarounds.sh from my home in terbium

Mentioned in SAL (#wikimedia-operations) [2016-11-02T12:11:32Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

hoo added a comment.Nov 6 2016, 2:39 PM

I poked at this a bit on Thursday and Friday and came up with a new idea which will (hopefully) significantly improve the suggestions given.

Currently there are two types of correlations that the suggester considers:

  1. "Classifying" ones ("instance of" and "subclass of") where we take into account the Property id and the value of Statements.
  2. Non-classifying correlations, where only the fact that a Statement with a certain Property id exists on an Item is considered.

Right now these two types of correlations are treated equally when suggesting new Properties to use.

While playing around with various options, I found that the suggestions based on the "classifying" correlations are usually far better than the ones based purely on the fact that two Properties are often used together. Because of this, we decided to implement a setting which will allow us to adjust the weight given to the correlation types in question.

The pull request for this is at https://github.com/Wikidata-lib/PropertySuggester/pull/179 and the change will need a new PropertySuggester 4.0.

Once this has been deployed, we can undo the workarounds for this bug and then see what the right weight for classifying correlations should be. In my tests rather "extreme" values like 0.75 : 0.25 or even 0.8 : 0.2 worked best, so I would suggest trying these for starters.

Note: Suggestions for qualifiers and references, and suggestions for Items without instance of / subclass of, won't be affected by this at all.
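
To illustrate what such a weight does, here is a small sketch. It is not the code from the pull request: the function names, the linear blend and the probabilities are assumptions made for the example.

```python
from typing import Dict, List


def blend(
    classifying: Dict[str, float],
    non_classifying: Dict[str, float],
    classifying_weight: float = 0.75,
) -> Dict[str, float]:
    """Mix per-property scores from the two correlation types."""
    if not classifying:
        # No instance of / subclass of on the item: nothing to re-weight.
        return dict(non_classifying)
    w = classifying_weight
    return {
        p2: w * classifying.get(p2, 0.0) + (1.0 - w) * non_classifying.get(p2, 0.0)
        for p2 in set(classifying) | set(non_classifying)
    }


def rank(scores: Dict[str, float]) -> List[str]:
    return sorted(scores, key=lambda p2: (-scores[p2], p2))


# Invented scores for a river item that also carries a generic identifier.
from_classifying = {"P403": 0.6, "P131": 0.7}
from_property_pairs = {"P21": 0.9, "P569": 0.85, "P131": 0.5}

equal_treatment = rank(blend(from_classifying, from_property_pairs, 0.5))
favour_classifying = rank(blend(from_classifying, from_property_pairs, 0.8))
```

With equal treatment the human properties still rank above P403; at 0.8 : 0.2 the river property moves ahead of them.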

hoo claimed this task.Nov 6 2016, 2:39 PM
Esc3300 added a subscriber: Esc3300.Nov 7 2016, 4:05 PM

Maybe P106 and P17 could be "classifying" ones as well.

hoo added a comment.Nov 8 2016, 10:20 AM

Maybe P106 and P17 could be "classifying" ones as well.

P106 probably could, given it's only used on humans. I also thought about this myself before… please create a separate ticket for this.
P17 is used on so many different subjects (buildings, cities, streets, …), that boosting it might result in weird correlations.

I thought about using them in combination with P31/P279.

  • For P106, it would probably only be P31=Q5.
  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.
thiemowmde renamed this task from Property suggester suggests human properties for non-human items to [RfC] Property suggester suggests human properties for non-human items.
thiemowmde added a project: RfC.
hoo added a comment.Nov 9 2016, 3:58 PM

I thought about using them in combination with P31/P279.

  • For P106, it would probably only be P31=Q5.

Given P106 is probably not (widely) used in any other contexts, we can just assume this to be the case, I presume.

  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.

Such a correlation is quite far away given the data and the code we currently have. I'm trying to keep the algorithm here as simple as possible and only make it more advanced if we really need to.

Given P106 is probably not (widely) used in any other contexts, we can just assume this to be the case, I presume.

Yes, mostly Q5 per Constraint_violations/P106#Types_statistics

  • P17 in combination with P31 should solve the China administrative identifier problem that used to pop up.

Such a correlation is quite far away given the data and the code we currently have. I'm trying to keep the algorithm here as simple as possible and only make it more advanced if we really need to.

Indeed: I tried to do some queries on the combination P31/P17. There seem to be too many different P31 values.

Maybe it could be checked instead if P17 on the item isn't different from P17 on the property.
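
A sketch of that check, under the assumption that the country of each property is available as a simple lookup. The function name and the hand-written `property_countries` mapping are hypothetical, and the ids in the example are only illustrative.

```python
from typing import Dict, Iterable, List, Set


def filter_by_country(
    suggestions: Iterable[str],
    item_countries: Set[str],
    property_countries: Dict[str, Set[str]],
) -> List[str]:
    """Drop suggestions whose property is tied to a different country (P17)."""
    if not item_countries:
        return list(suggestions)  # item has no P17, so nothing to compare
    kept = []
    for pid in suggestions:
        countries = property_countries.get(pid, set())
        # Keep properties without a country, or with a matching one.
        if not countries or countries & item_countries:
            kept.append(pid)
    return kept


# Hand-written mapping for the sketch: P442 stands in for the
# "China administrative division code" property, Q148 for China, Q30 for the USA.
property_countries = {"P442": {"Q148"}}
suggested = ["P131", "P442", "P403"]

for_us_item = filter_by_country(suggested, {"Q30"}, property_countries)
for_chinese_item = filter_by_country(suggested, {"Q148"}, property_countries)
```

This runs as a post-filter on whatever list the suggester produces, so it would not require changes to the correlation data.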

Esc3300 updated the task description. Nov 11 2016, 1:18 PM

Mentioned in SAL (#wikimedia-operations) [2016-11-30T02:07:35Z] <hoo> Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-01-09T17:51:54Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

thiemowmde updated the task description.
thiemowmde added subscribers: Jonas, aude.
Esc3300 added a comment. Edited Jan 26 2017, 10:28 AM

For items whose P31/P279 values have "properties for this type" (P1963), the properties listed there could be suggested with priority.

Sample: for people, this would be all properties at Q5#P1963.
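
One way this could look, as a sketch only: the `properties_for_type` lookup stands in for the P1963 statements on the class items, and its contents here are an invented subset rather than the real list at Q5#P1963.

```python
from typing import Dict, Iterable, List


def prioritise_type_properties(
    suggestions: Iterable[str],
    item_types: Iterable[str],
    properties_for_type: Dict[str, List[str]],
    used: Iterable[str] = (),
) -> List[str]:
    """Put properties listed for the item's types (P1963) ahead of the rest."""
    already_used = set(used)
    priority: List[str] = []
    for type_id in item_types:
        for pid in properties_for_type.get(type_id, []):
            if pid not in already_used and pid not in priority:
                priority.append(pid)
    rest = [
        pid for pid in suggestions
        if pid not in priority and pid not in already_used
    ]
    return priority + rest


# Invented subset of "properties for this type" for human (Q5).
properties_for_type = {"Q5": ["P21", "P569", "P19"]}

for_human = prioritise_type_properties(
    ["P131", "P569", "P18"], ["Q5"], properties_for_type, used=["P21"]
)
```

Items whose types list nothing under P1963 keep the suggester's original order.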

thiemowmde updated the task description. Feb 2 2017, 9:29 AM
thiemowmde moved this task from Review to Doing on the Wikidata-Former-Sprint-Board board.

Mentioned in SAL (#wikimedia-operations) [2017-02-08T09:55:00Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

thiemowmde updated the task description. Mar 13 2017, 5:12 PM

Moving back to the "review" column because of https://github.com/Wikidata-lib/PropertySuggester/pull/192. This is a quite trivial refactoring that has been waiting for review for 2 months now. Please move back to "doing" after this is merged, because I will then rebase https://github.com/Wikidata-lib/PropertySuggester/pull/179 and hand over to @hoo, who is still planning to continue working on it.

thiemowmde moved this task from in current sprint to consider for next sprint on the Wikidata board.

@hoo, we talked about this ticket during today's engineering time, and decided to remove this from the sprint board for the moment. Priority is still high because we want to pick this up again as soon as we can. But this is currently blocked on you. I remember you saying we should not merge your patch https://github.com/Wikidata-lib/PropertySuggester/pull/179, because you have an idea and want to improve it first. When do you think this can happen?

If you have time, please start with reviewing https://github.com/Wikidata-lib/PropertySuggester/pull/192 first. This is a minor refactoring split from your patch. I think this will make your patch easier to review (but you might disagree). I will happily do the necessary rebase, if you want.

Mentioned in SAL (#wikimedia-operations) [2017-04-04T16:03:32Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-05-11T18:39:00Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-05-31T15:58:53Z] <hoo> Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-07-26T11:02:31Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Restricted Application added a subscriber: PokestarFan. Jul 26 2017, 11:02 AM

Mentioned in SAL (#wikimedia-operations) [2017-08-30T17:27:46Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-10-04T21:07:26Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds

Mentioned in SAL (#wikimedia-operations) [2017-11-02T21:38:22Z] <hoo> Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds