Page MenuHomePhabricator

Find a better solution for dewiki's Modul:Wikidata isParent
Closed, ResolvedPublic

Description

We recently saw a spike of X usages coming from dewiki's Modul:Wikidata isParent.

isParent basically tries to determine whether a certain entity is a (possibly indirect) (subclass of/instance of) of a certain item. It does that by basically walking up the (subclass of/instance of) graph by recursively loading all entities it encounters and looking at their (subclass of/instance of), up to a max depth of (per default) 5.

We should find a better solution for this, before the function sees widespread usage… otherwise this might cause severe problems.

Event Timeline

thiemowmde added a project: patch-welcome.
thiemowmde moved this task from incoming to needs discussion or investigation on the Wikidata board.
thiemowmde subscribed.

I also saw this code being the main reason for super-useless spam on users watchlists, because whenever a very high-level item gets edited on Wikidata, all child-items get a notification. In the specific example I have seen the user got hundreds of notifications about articles of artists he is watching, because someone edited something on the item "human" (yes, literally).

A short-term solution is to just remove this bad behaving maintenance code from the communities module. It really is nothing but this: maintenance. It does not influence the article (except for a hidden category).

Providing an alternative that allows the community to rewrite this code in a much more efficient way is an other thing we can do. It could be a literal isParent method that hard-codes the knowledge about what a "parent" is. Or it can be a getter that only returns the statements that are really needed (which I believe is exactly what we discuss in T166056 and https://gerrit.wikimedia.org/r/383990).

Fine grained usage tracking as discussed in T172914 could also solve this.

So I looked into this a bit, to evaluate what is currently done.

I was able to find the following variants of such a transitive lookup:

  1. dewiki-version: function p.isSubclass(frame)
  2. frwiki-versions (1, 2): function p.isSubclass(class, item, maxdepth)/function p.isInstance(targetclass, item, maxdepth) (definitions are the same in both modules, I suppose one is deprecated?)
  3. dewikivoyage-version: function fw.typeSearch( p31, list, limit )

While the dewikivoyage-version is only used on dewikivoyage, all the other scripts are used on several wikis.

In depth look:

dewiki-version (1)

See @thiemowmde's comment (T179155#3715222).

I checked (using mwgrep --module '["'"']maxDepth") that this is never used with a maxDepth other than 5.

frwiki-version (2)

The two modules have very similar definitions (see above). Module:Wikidata/Analyse transitive is no longer used on frwiki, but might be used on other wikis still. Given that both implement these functions in a very similar way, I don't think that needs further checking.
isSubclass checks whether a given item is a(n) (in)direct subclass of a given class by walking up the "subclass of" tree until a given maxdepth.
isInstance checks whether a given item is a(n) (in)direct instance of a given class. If the given item is no direct instance of this class, isSubclass is invoked with all the instance of values found on the item.

One notable thing here is that AFAICT the full list of parents is always materialized, if isSubclass is invoked. This is done until either the maxdepth limit, or a maxnodes limit (which is 100 in all cases I could find) are reached. In case a limit is hit, the result is quietly returned and the subclass relationship is decided based on this (possibly incomplete) data.

isSubclass is only used as part of isInstance, it has no direct usages (on any wiki). I checked this by running: mwgrep --module "\.isSubclass\([^\)]*,[^\)]*,[^\)]*\)".
isInstance is used in several places. I found usages with a maxdepth of 0, 1, 2 and 3 (checked on frwiki only).

dewikivoyage-version (3)

The function in question is used to map an entity's instance of values to a "POIType". To do this, each of the "instance of" values of an entity is taken, and then checked against the POI-types list. This is repeated with all super-classes of the "instance of" values (up to 4 levels high). This is immediately aborted when a POI-types is found. (Used in Modul:VCard; Modul:Marker, checked with mwgrep --module "typeSearch\(")

Just as a note:

I looked into (ab)using the pagelinks table for this purpose. If we want to link an entity to its parent, the pages in question must also be (indirectly) page-linked together. My idea was to get the (indirect) pagelinks from a to b first, and then check the entities in question to find out whether this actually is the kind of relationship we're looking for.

Such a query would look like this (find a link between an entity and a target entity with N hops in between):

SELECT pl0.pl_from, pl1.pl_from, pl2.pl_from,  plN.pl_from FROM pagelinks AS pl0

	INNER JOIN page AS p1 ON pl0.pl_namespace = p1.page_namespace AND pl0.pl_title = p1.page_title
	INNER JOIN pagelinks AS pl1 ON pl1.pl_from = p1.page_id

	INNER JOIN page AS p2 ON pl1.pl_namespace = p2.page_namespace AND pl1.pl_title = p2.page_title
	INNER JOIN pagelinks AS pl2 ON pl2.pl_from = p2.page_id

	INNER JOIN page AS pN ON pl(N-1).pl_namespace = pN.page_namespace AND pl(N-1).pl_title = pN.page_title
	INNER JOIN pagelinks AS plN ON plN.pl_from = pN.page_id
	
WHERE pl0.pl_from = PAGE_ID_OF_THE_ENTITY AND
	pl0.pl_namespace = 0 AND
	pl1.pl_namespace = 0 AND
	pl2.pl_namespace = 0 AND

	plN.pl_namespace = 0 AND plN.pl_title = 'TARGET_ENTITY_PAGE_TITLE';

I briefly looked into this, but the resulting query clearly didn't cut it, so I'm not going to look into this further.

Given what I saw, I suggest the following interfaces:

  • mw.wikibase.isInHierarchy( fromId, toId, propertyId ) -> bool or nil (returns nil in case the maximum recursion depth or some other limit was exhausted)

    Say we want to find out whether "Kreuzberg" (Q308928) is a "administrative territorial entity" (Q56061). For this, we fetch the "instance of" value from "Kreuzberg" (Q308928) which is " locality of Berlin" (Q35034452). With that we can mw.wikibase.isInHierarchy( 'Q35034452', 'Q56061', 'P279' ) -> true. This function could optionally allow restricting/ changing the access/ recursion limits further via additional (optional) parameters.
  • mw.wikibase.getClosestInHierarchy( fromId, toIds, propertyId ) -> string or false or nil (returns false if no entity from toIds could be found; nil in case the maximum recursion depth or some other limit was exhausted)

    Finds the closest entity from toIds that is linked to fromId via a propertyId chain. With the example above: mw.wikibase.getClosestInHierarchy( 'Q35034452', {'Q56061', 'Q12345'}, 'P279' ) -> 'Q56061'.

Both of these would need to be at least marked as expensive, if not also specifically restricted further (like the number of entities a single user can access). Also we need to find sane default limits… for that we probably better start low so that we can increase the numbers later on.
Furthermore we need to decide what usage we want to track here (the full path, like it is with the current functionality?).

Given we don't have anything better at hands, I think for the initial version we should just get full entities (in PHP) and manually traverse the "tree" (much like the modules do it, but with less serialization overhead). For this we might also want to look into T191341: Introduce a service for looking up specific statements from entities.

Moved this to review to gather comments about my proposal (T179155#4102708).

What happens when Kreuzberg has more than one “instance of” statements? Should the surrounding Lua code then call mw.wikibase.isInHierarchy / mw.wikibase.getClosestInHierarchy in a loop? I wonder if it’s necessary to support a list of fromIds instead of a single fromId for both functions…

(Side note, the example for getClosestInHierarchy should probably use getClosestInhierarchy and not isInHierarchy.)

What happens when Kreuzberg has more than one “instance of” statements? Should the surrounding Lua code then call mw.wikibase.isInHierarchy / mw.wikibase.getClosestInHierarchy in a loop? I wonder if it’s necessary to support a list of fromIds instead of a single fromId for both functions…

Hm, we could do this… if we search with both fromId values at once (compared to exhausting one first and then look for the other), this might also gain us some performance?!

(Side note, the example for getClosestInHierarchy should probably use getClosestInhierarchy and not isInHierarchy.)

Doh, fixed.

My personal remarks:

  • Please find variable names that are more specific than fromId, toId(s), and propertyId.
    • fromId could be named childId.
    • toIds could be named parentIds.
    • propertyId could be named hasParentPropertyId or relationshipPropertyId.
    • The method itself could be named getClosestParentId then.
  • I'm not a fan of making to many parameters tables. Sure, this feels kind of natural. My investigation in T182147#3815415 also hinted at this possibility. However, the function will become complex and hard to use.
    • Multiple fromIds can be emulated by calling the function in a loop. Allowing fromIds to be a table does not save us much, I believe. We would need to traverse multiple trees upwards instead of one. Well, these trees might overlap, which allows us to cache and reuse intermediate results. But the same kind of caching can be done when the Lua function is called multiple times in a loop.
    • toIds should be a table, I believe. This allows to specify more than one possible end-condition. For example, I want to check if a biography is properly categorized as either "fictional" or "living person". "Fictional" and "living person" are exclusive and not subclasses of each other.
    • propertyIds can be a table. The algorithm that traverses the tree upwards would not become more complex because of this. Even with one propertyId given, each entity can have multiple. Allowing more than one propertyId would not change the fact that entities in this tree can have any number of parents. However: Because of the complexity that is often not needed I would not make this parameter a table.

My conclusion: I like the concept that returns an entity ID more.

My personal remarks:

  • Please find variable names that are more specific than fromId, toId(s), and propertyId.
    • fromId could be named childId.
    • toIds could be named parentIds.
    • propertyId could be named hasParentPropertyId or relationshipPropertyId.
    • The method itself could be named getClosestParentId then.

So this would look like mw.wikibase.getClosestParentId( childId, parentIds, relationshipPropertyId ) -> string (entity id) or false (nothing found) or nil (limit exhausted).

  • I'm not a fan of making to many parameters tables. Sure, this feels kind of natural. My investigation in T182147#3815415 also hinted at this possibility. However, the function will become complex and hard to use.
    • Multiple fromIds can be emulated by calling the function in a loop. Allowing fromIds to be a table does not save us much, I believe. We would need to traverse multiple trees upwards instead of one. Well, these trees might overlap, which allows us to cache and reuse intermediate results. But the same kind of caching can be done when the Lua function is called multiple times in a loop.

It still makes a difference to search one exhaustive first and then another… but probably not that much. We can go with a single from/ child id, and expand later on, if actually needed.

  • toIds should be a table, I believe. This allows to specify more than one possible end-condition. For example, I want to check if a biography is properly categorized as either "fictional" or "living person". "Fictional" and "living person" are exclusive and not subclasses of each other.

We can just make it a table in all cases, we need to support this anyway.

  • propertyIds can be a table. The algorithm that traverses the tree upwards would not become more complex because of this. Even with one propertyId given, each entity can have multiple. Allowing more than one propertyId would not change the fact that entities in this tree can have any number of parents. However: Because of the complexity that is often not needed I would not make this parameter a table.

Yeah, we can expand this later on, if there really is a need.

My conclusion: I like the concept that returns an entity ID more.

I didn't think of them as alternatives, but rather thought about having both. Do you think we should have the simple boolean version as well? Would be a convenience shortcut mostly.

I just spend ages trying to find an accurate name and class+function documentation, but at https://github.com/wmde/WikibaseDataModelServices/pull/193 I have a proposal for an interface for a lookup service. I tried hard to catch the use case accurately here, so that it covers all kinds of relationships (not only parents, via "subclass of", but also things like "has part" and probably many more).

Next step from here would be to implement that service based on an EntityLookup (like TypeCheckerHelper::isSubclassOf (in WikibaseQualityConstraints)).

I suspect that, when implementing this, you will end up supporting a search from multiple source IDs to the target IDs anyways, because you’ll get into that situation as soon as there’s more than one link to follow anywhere in the chain. And in that case, I feel like it might make sense to also support starting with a set of source IDs in the first place? But I’m not sure.

Do you think we should have the simple boolean version as well? Would be a convenience shortcut mostly.

Yes, I also think an additional boolean version is worth it.

[…] you will end up supporting a search from multiple source IDs to the target IDs anyways, because you’ll get into that situation as soon as there’s more than one link to follow anywhere in the chain.

Yes, I think this will be the case. The moment the code starts traversing the tree upwards, the one original source node from level 0 will already be a list on level 1. However, I follow Marius' argument: let's avoid premature optimization and make this a list when the need arises.

[…] you will end up supporting a search from multiple source IDs to the target IDs anyways, because you’ll get into that situation as soon as there’s more than one link to follow anywhere in the chain.

Yes, I think this will be the case. The moment the code starts traversing the tree upwards, the one original source node from level 0 will already be a list on level 1. However, I follow Marius' argument: let's avoid premature optimization and make this a list when the need arises.

Should this be part of the services interface?

Should this be part of the services interface?

As written in https://github.com/wmde/WikibaseDataModelServices/pull/193#issuecomment-379189512:

In my opinion an array does not make much sense. The method would need to return two entity IDs then: the closest end point that was found, as well as the starting point that lead to this result.
Let's please stick to a single root node for this tree.

While looking into this, I found T195982.

This needs https://github.com/wmde/WikibaseDataModelServices/pull/196 (and then the actual release). Afterwards the dependency in Wikibase needs to be bumped to this and mediawiki/vendor needs updating.

Change 437629 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] WIP Introduce mw.wikibase.getReferencedEntityId

https://gerrit.wikimedia.org/r/437629

Change 438117 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/vendor@master] Update "wikibase/data-model-services" to 3.10.0

https://gerrit.wikimedia.org/r/438117

Change 438117 merged by jenkins-bot:
[mediawiki/vendor@master] Update "wikibase/data-model-services" to 3.10.0

https://gerrit.wikimedia.org/r/438117

Change 437629 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Introduce mw.wikibase.getReferencedEntityId

https://gerrit.wikimedia.org/r/437629

Once this is deployed (Thursday this week), we should gradually ask people to use this.

Has it been added to the Lua documentation page?

Has it been added to the Lua documentation page?

I updated it now: [[https://www.mediawiki.org/wiki/Extension:Wikibase_Client/Lua#mw.wikibase.getReferencedEntityId|mw.wikibase.getReferencedEntityId]].