
Research relationships between items in Wikidata
Closed, Resolved (Public)


Research relationships between items in Wikidata in order to come up with sensible strategies for traversing statement graphs and populating the search index with meaningful statements related to the original statement.

Event Timeline

Cparle triaged this task as High priority. Jul 9 2018, 3:36 PM
Cparle created this task.

It is difficult to traverse related statements without creating a great deal of noise, and difficult to find what you want even when you do

Example - traversing the 'subclass of' chain for 'clarinet' gives 16 results, 3 of which are actually useful

Also, traversing the subclass tree doesn't necessarily give you the data you want. Traversing the 'subclass of' chain for 'poodle' gives 'dog' and 'pet' but not 'mammal' (I'd consider 4 of the 18 results to be useful)

To find out that an ant is an insect you have to traverse the 'parent taxon' rather than the 'subclass of' chain ... and 'insect' is only 1 of 25 results

... and this won't work for a wasp, because Wasp (Q9458574) is a common name for a subclass of the taxon Hymenoptera (Q22651) and does not itself have a parent taxon.

And then there's the woman problem ... traversing the 'subclass of' chain for 'woman' doesn't get you 'human'.

It once did, but the 'subclass of' link between 'woman' and 'human' (Q5) was removed by this edit

In conclusion, adding statements based on traversing relationships between statements at index time seems likely to

  1. add a lot of unwanted statements
  2. fail to add statements we actually want

We could mitigate number 1 by using whitelists/blacklists, but either implies a large amount of config data in code (which would be difficult for the community to modify)

We could mitigate number 2 by using a set of relationship chains (rather than just one) that we want to traverse (e.g. subclass of, parent taxon, part of, has part, etc), but the above suggests that the set expands very quickly (and creates its own noise)
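To make the trade-off concrete, here is a minimal sketch of traversing a set of relationship chains rather than just one. The graph below is a toy invented purely for illustration (it is not real Wikidata content), and the `ancestors` helper is a hypothetical name, not part of any existing tool.

```python
from collections import deque

# Toy edges keyed by (item, property). Invented for illustration only;
# these are NOT the actual statements on Wikidata.
TOY_EDGES = {
    ("poodle", "subclass of"): ["dog"],
    ("dog", "subclass of"): ["pet"],
    ("dog", "parent taxon"): ["mammal"],
    ("ant", "parent taxon"): ["insect"],
    ("insect", "subclass of"): ["animal"],
    ("mammal", "subclass of"): ["animal"],
}


def ancestors(start, properties, edges=TOY_EDGES):
    """Return every item reachable from `start` by following any of `properties`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for prop in properties:
            for target in edges.get((node, prop), []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return seen


# One chain misses 'mammal'; two chains find it but also pull in more items.
print(sorted(ancestors("poodle", ["subclass of"])))
print(sorted(ancestors("poodle", ["subclass of", "parent taxon"])))
```

Each property added to the set can only grow the result, which is the expansion (and the extra noise) described above.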

In conclusion - this is probably not a good approach. Perhaps the answer is bots?

I'd say don't give up too easily. This is probably as good an approach as any. If the issues are structural, bots will fall prey to them in just the same way, just more slowly and more haphazardly.

But it probably is going to need a fair number of iterations to slowly get nearer to being "right" -- this isn't something that's going to be got right first time, not even nearly.

As regards (2) I am not convinced that you really are going to see that serious a blow-up.

Here are variants of your poodle and your wasp queries, with

wd:Q38904 wdt:P31?/(wdt:P279|(p:P31/pq:P642)|wdt:P171)* ?item

as their path statement, using p:P31/pq:P642 as a workaround to include the (horrible) "common name ... of" link, and allowing each step to traverse either via that or up a subclass link or up a taxon link. The queries still return only 72 and 43 items respectively, so that's not much worse (i.e. bigger) than what you were generating.
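For anyone who wants to try the path statement on other items, here is a rough sketch that wraps it in a complete query string. Only the path expression comes from the comment above; the `SELECT DISTINCT` wrapper and the `build_path_query` helper are assumptions made for illustration, and the original queries may well have selected labels or other columns.

```python
import re

# Path expression quoted from the comment above: optional instance-of step,
# then any number of subclass / "common name ... of" / parent-taxon steps.
PATH = "wdt:P31?/(wdt:P279|(p:P31/pq:P642)|wdt:P171)*"


def build_path_query(qid):
    """Return a SPARQL query string listing items reachable from `qid` via PATH.

    Hypothetical helper: the SELECT wrapper is an assumption, not the
    query that was actually run.
    """
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        f"  wd:{qid} {PATH} ?item .\n"
        "}"
    )


print(build_path_query("Q38904"))
```

The resulting string can be pasted into a SPARQL endpoint that predefines the wd:, wdt:, p: and pq: prefixes.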

One of the specific key problems in this case, adding confusion to the list, is the specification of "dog" or "canis familiaris" (via "taxon") as a "name", leading to a whole slew of abstract items. This frankly is a nonsense, and the community needs to be told to get its act in order -- this modelling is having serious consequences for item interpretation. Across the rest of Wikidata, items represent things, not names of things. Yes, a taxon in many respects is a name, but we're using it to refer to a thing, and that needs to be the clear priority for the items. The fact that a particular taxon item also has the quality of representing a name needs to be represented in a different way, not by making "taxon" subclass of "name". I suspect it may take quite harsh pressure to actually impose this, but I think this is the kind of area where it might be quite useful for the tech and community liaison team to strongly suggest to the community that the current modelling is causing significant difficulties.

As for "instance of" + "common name" + "of", that's a nonsense, and the sooner we have a new specific property to express that relation specifically, the better.

But ... even fixing the "taxon" subclass of "name" issue is not going to solve the question of finally ending up with weird stuff from the top of the chain. A couple of months ago I wrote a query to pull out some items that were descended from both "physical object" and "abstract object" (discussion). Our ontology at Wikidata is a mess, in so many places, and will likely take years to slowly resolve (if ever).

Besides, in many of these cases, you probably want to have items in the left-hand column discoverable for searches in the right-hand ("abstract") column -- eg you probably do want examples of African masks to come up in a search for "African art", even if Wikidata considers the former a concrete thing and the latter an abstract thing. So if you follow the subclass tree further up, you will get to all those bonkers very broad abstract concepts, which African masks are definitely not examples of.

But the good news (contra your conclusion #1) is that those are likely to be the same items every time that you will want to exclude via your stop-list, and you can probably define them by saying "everything in the subclass (P279) tree above this item", for quite a short list of items. One could even write that into the query fairly easily with a MINUS clause, giving something like this, though that might not be the most efficient way to do it for production use.
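As a hedged illustration of the MINUS idea (not the linked query itself, which is not reproduced in this thread), the sketch below adds one MINUS clause per stop-list item, each removing that item and everything above it in the subclass (P279) tree. The helper name `build_filtered_query` and the stop-list IDs in the usage line are placeholders, not a recommended stop-list.

```python
import re

# Same path expression as quoted earlier in this thread.
PATH = "wdt:P31?/(wdt:P279|(p:P31/pq:P642)|wdt:P171)*"
QID = re.compile(r"Q[1-9][0-9]*")


def build_filtered_query(start_qid, stop_qids):
    """Build a SPARQL query for `start_qid` that drops each stop item
    and all of its P279 ancestors via MINUS clauses (hypothetical helper)."""
    for qid in [start_qid, *stop_qids]:
        if not QID.fullmatch(qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    lines = [
        "SELECT DISTINCT ?item WHERE {",
        f"  wd:{start_qid} {PATH} ?item .",
    ]
    for stop in stop_qids:
        # wdt:P279* matches the stop item itself plus everything above it.
        lines.append(f"  MINUS {{ wd:{stop} wdt:P279* ?item . }}")
    lines.append("}")
    return "\n".join(lines)


# Placeholder stop-list IDs, chosen only to show the shape of the output.
print(build_filtered_query("Q38904", ["Q111", "Q222"]))
```

Keeping the stop-list as plain data passed into the function means it could live outside the code, which may make it easier for the community to maintain.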

And, yes, you will probably need a whitelist too -- for example, adding "woman" as a search term for every human that is female (if you're okay with that including depicted females that happen to be children). Also, it seems you're going to need to add "human" as a search term for "female". (btw I have no idea why Infovarius made this edit. It would seem to be something well worth bringing up at Wikidata's "Project Chat" discussion page, including a ping to Infovarius to see whether he would state his comment).

One thing I would suggest, though, is setting up a Wikidata WikiProject page for the project, advertising it on Project Chat, and then discussing or reviewing your thoughts for particular parts of the subject tree there. Yes, involving the community in discussion will add a huge time overhead; but with luck you may get people coming forward that really know their ways around particular parts of the project tree and may make some excellent suggestions -- or that may realise that some particular bits of ontology are causing some real problems, that would benefit from a root-and-branch rethink.

Wow, thanks for your very detailed comment @Jheald. You've given me a lot to think about.

Aklapper changed the edit policy from "Custom Policy" to "All Users". Sep 17 2018, 5:50 PM
Aklapper changed Risk Rating from N/A to default.

Even without reducing the unwanted results returned by the queries above, you can easily improve the results. If no one links "poodle" to Canis lupus familiaris (Q26972265) somewhere in the chain on Wikidata, then it is quite normal that you do not get the results inside a taxon chain. Likewise, if no one links "woman" with "human" somewhere in the chain, then you will never get "human" as a result.

If the queries above do not give satisfactory results, it is because the elements that need to be linked inside Wikidata currently are not.
If someone wants "woman" to sit somewhere under "human", then they should add it with the right statement. That's all.

And to facilitate research, one, two or more "preferred statements" should be defined on each item. For example, "taxon name" and "parent taxon" should be preferred...

And a very effective thing would be to combine the possibility of defining "preferred statements" on each item with the possibility of defining "preferred depict tags" on the Wikimedia Commons files...

On Wikidata this should be done via the properties: it is likely much easier to define "taxon name" as a preferred statement everywhere it is used than to go to each item where it is used and choose whether or not it is a preferred statement. A list of properties should be defined as "preferred statements" in Wikidata.

See here for an indication of how difficult it is to have a sensible discussion about, much less resolve, these kinds of issues:

Folks, if nobody has any objection I think this ticket can be closed - as far as I can see we've got as far as we can with this for the purposes of Structured Data on Commons

Oops, didn't mean to actually close it just now ... but if nobody objects I will in maybe a week or so

Research phase on this is fairly well covered now. Will address the steps to improve the data quality situation separately.