SDoC {{#statements}} parser function gives bad data in some situations
Open, Needs Triage, Public

Description

On Commons, this works as expected: {{#statements:P195|from=M96461067}}. It produces the value of M96461067's P195 statement (National Archives at College Park - Still Pictures).

This one does not work as expected: {{#statements:P195|from=M89709639}}. Instead, it seems to produce [[Category:Media contributed by Toledo-Lucas County Public Library]], which is not the label of M89709639's P195 value, but rather the value of the P373 statement on that value's Wikidata item, and so not part of M89709639's structured data at all.

This can be reproduced with other statements, but I'm not yet sure what the cause is (so please edit the title of this task as necessary!). For example, {{#statements:P9126|from=M96461067}} uses a property with three values. Two of these work as expected, so the visible output is ", National Archives and Records Administration, National Archives at College Park - Still Pictures" (note the leading comma where the first value should be), but the first value is again giving a category found in P373 of the property value's Wikidata item.

I have put various examples here of unexpected results: https://commons.wikimedia.org/wiki/User:Dominic/tests

Event Timeline

I am not sure if this is actually unexpected. {{#statements:P195|from=M89709639}} yields <span><span>[[Category:Media contributed by Toledo-Lucas County Public Library|Toledo-Lucas County Public Library]]</span></span> because the P195 claim on M89709639 points to Q7814140, which in turn has a commons sitelink that points to Category:Media contributed by Toledo-Lucas County Public Library. I doubt that sitelink is really correct, and it should probably be fixed (note that {{#property:P373|from=Q7814140}} also yields Media contributed by Toledo-Lucas County Public Library).

Q59661040 has no such commons sitelink so {{#statements:P195|from=M96461067}} just yields its label.

It seems the {{#statements:}} parser function is trying to return a local link when available and a label when not available (and likely the raw entity id when the label is not available). You will notice {{#property:}} does not attempt to do this as {{#property:P195|from=M89709639}} yields Toledo-Lucas County Public Library (also notice {{#statements:}} adds the extra span elements that {{#property:}} does not).
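The fallback order described above can be sketched roughly as follows. This is only an illustration of the observed behavior, not the actual Wikibase code: the function name `format_entity_id` and the two lookup tables are hypothetical, the raw-ID fallback is the guess made above rather than something confirmed, and the wrapping span elements are left out.

```python
from typing import Mapping

# Hypothetical lookup tables standing in for Wikidata; only the values
# mentioned in this task are included.
SITELINKS = {
    "Q7814140": "Category:Media contributed by Toledo-Lucas County Public Library",
}
LABELS = {
    "Q7814140": "Toledo-Lucas County Public Library",
    "Q59661040": "National Archives at College Park - Still Pictures",
}


def format_entity_id(entity_id: str,
                     sitelinks: Mapping[str, str],
                     labels: Mapping[str, str]) -> str:
    """Return wikitext for an entity: local link, else label, else raw ID."""
    page_name = sitelinks.get(entity_id)
    label = labels.get(entity_id, "")
    if page_name is not None:
        # A local sitelink exists, so emit a wikilink, piped to the label
        # when one is available.
        optional_label = "" if label == "" else "|" + label
        return "[[" + page_name + optional_label + "]]"
    if label != "":
        # No local sitelink: plain label.
        return label
    # Assumed last resort: the raw entity ID.
    return entity_id
```

With these tables, `format_entity_id("Q7814140", SITELINKS, LABELS)` produces the category-style link seen for M89709639, while `format_entity_id("Q59661040", SITELINKS, LABELS)` produces only the label, matching the two results reported in the description.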

Look at this example more closely: {{#statements:P9126|from=M96461067}} yields <span><span>[[Category:Digital Public Library of America|Digital Public Library of America]]</span>, <span>[[National Archives and Records Administration|National Archives and Records Administration]]</span>, <span>National Archives at College Park - Still Pictures</span></span>. There are three such claims this time and the first two items have commons sitelinks and the last does not. The first also appears to erroneously link to the category yielding markup that puts the page into the category rather than actually linking to it (in a fashion similar to what you discovered).

Perhaps it is needless to reiterate that the above only works on Commons, since that is the only place MediaInfo entity records can be accessed (and the only place {{#statements:}} will by default expand commons sitelinks stored in Wikidata).

I took the liberty of modifying the tests you linked to, in order to better show the output.

@Uzume Thank you for all of this investigation! I really could not figure out what logic was causing this, and it makes more sense now that I see it comes from the sitelink, and not from a random other property (that just happens to have the same value as the sitelink).

On the one hand, you are right that this is not really the parser function misbehaving after all. On the other hand, maybe this logic is still problematic? I think the issue we have discovered is it is common on Wikidata that folks are (erroneously?) linking concept entities to related categories on Wikimedia Commons using the sitelink. This makes sense for humans to do, in a way, because Commons does not have content pages, per se, and I guess no one really thought about this side effect.

I tried in WDQS to determine how widespread this practice really is. I wasn't able to get the counting query to complete without timing out (https://w.wiki/4SeW if you have a better approach), but there are clearly hundreds of thousands if not millions: https://w.wiki/4SeZ. Which leads me to wonder what is actually the solution. I'm not clear why this is happening; is this actually the normal practice on Wikidata? That's a lot of Wikidata sitelinks to fix if we don't want all those items to return categories instead of their intended labels, and also makes me wonder if this is just how Wikidata works.

It should probably be noted that there are Wikidata items that state they represent (P31) Wikimedia categories (Q4167836). Some of those have category sitelinks at Commons (e.g., Q9013822 sitelinks to Category:Text logos). These should probably not be considered in error despite also having P373 "Commons category" statements claiming the same value. Having a MediaInfo entity's statements linking to such Wikidata items might be considered erroneous (depending on the claims).

One should expect a Wikidata item that represents a "Wikimedia category" to have sitelinks that link to related categories amongst the sister sites, however, I believe there is a high likelihood that most sitelinks to categories are incorrect.

It should also be noted that there are likely Wikidata items that are missing P31 claims, or whose P31 claim points to something that is a P279 "subclass of" Q4167836, e.g., Q15647814, Q59541917, Q56428020, etc. So although attempting some sort of Wikidata cleanup of sitelinks prefixed by "Category:" could be a good thing, actually searching for such issues is not trivial.

Wikidata properties have many types of constraints (Q21502402, claimed via P2302; see Help:Property constraints portal). We even have one for "Commons link constraint" (Q21510852). As far as I know, none of them could help in flagging a claim's value that matches a sitelink value, but perhaps such a constraint could be suggested, created, and added to the relevant properties (e.g., flagging "Commons category" P373 claims with errors suggesting the commons sitelink is wrong when the claim value matches the corresponding commons sitelink value). Then one of the two can be fixed (I believe having P373 claims on items representing Wikimedia categories is wrong, and in those cases the link should be in the sitelink and not the claim).
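To make the suggested check concrete, here is a minimal sketch of the comparison such a constraint might perform. It is not an existing Wikidata constraint type; the function name and its plain-string inputs are assumptions made for illustration.

```python
from typing import Iterable, Optional

CATEGORY_PREFIX = "Category:"


def p373_duplicates_commons_sitelink(p373_values: Iterable[str],
                                     commons_sitelink: Optional[str]) -> bool:
    """True when a P373 value names the same category as the commons sitelink.

    P373 values are stored without the "Category:" prefix, so the prefix is
    stripped from the sitelink before comparing.
    """
    if commons_sitelink is None:
        return False
    if not commons_sitelink.startswith(CATEGORY_PREFIX):
        return False
    category_name = commons_sitelink[len(CATEGORY_PREFIX):]
    return category_name in set(p373_values)
```

For Q7814140 as described earlier in this task, the P373 value "Media contributed by Toledo-Lucas County Public Library" and the sitelink "Category:Media contributed by Toledo-Lucas County Public Library" would match, so the item would be flagged for a human to decide which of the two to keep.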

We might also be able to look for all Wikidata items that have sitelinks prefixed by "Category:" and flag those if they do not also claim they are Wikimedia categories or a subclass of such. However, I do not think that could be accomplished by a property constraint.
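As a rough sketch of that check, assuming each item has already been reduced to its list of P31 values and its commons sitelink title (the function name and the data shapes are hypothetical):

```python
from typing import FrozenSet, Iterable, Optional

WIKIMEDIA_CATEGORY = "Q4167836"


def is_suspicious_category_sitelink(
    instance_of: Iterable[str],
    commons_sitelink: Optional[str],
    category_classes: FrozenSet[str] = frozenset({WIKIMEDIA_CATEGORY}),
) -> bool:
    """Flag a "Category:" sitelink on an item that is not a category item.

    category_classes should contain Q4167836 plus any of its subclasses;
    gathering those subclasses is left to the caller.
    """
    if commons_sitelink is None:
        return False
    if not commons_sitelink.startswith("Category:"):
        return False
    # Flag only if none of the P31 values is a known category class.
    return not any(cls in category_classes for cls in instance_of)
```

Q9013822, which is an instance of Q4167836 and sitelinks to Category:Text logos, would not be flagged. An item with no P31 claims at all would be flagged, so, given the items with missing or indirect P31 claims mentioned above, the results would still need manual review.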

One should expect a Wikidata item that represents a "Wikimedia category" to have sitelinks that link to related categories amongst the sister sites, however, I believe there is a high likelihood that most sitelinks to categories are incorrect.

I think this is the true issue. The "bad data" is not an issue of software returning incorrect values, but that there are so many values which have been input incorrectly. The solution is either to correct these or to change our expectations of the outputs, if that is truly what the community wants. It strikes me that correcting them is probably not all that difficult, and maybe could be automated. For example, in all cases where P31 is not Q4167836, if the sitelink begins with Category:, then remove it and instead add the category back as a P910 statement.

Hm, this also makes a difference if an item is sitelinked to a file. With the current behavior, {{#statements:}} will generate [[File:…|label]] as wikitext, which will then display as an actual file.

Technically, it’s fairly simple to change this:

diff --git a/lib/includes/Formatters/EntityIdSiteLinkFormatter.php b/lib/includes/Formatters/EntityIdSiteLinkFormatter.php
index 434a5bdc08..a2e576393f 100644
--- a/lib/includes/Formatters/EntityIdSiteLinkFormatter.php
+++ b/lib/includes/Formatters/EntityIdSiteLinkFormatter.php
@@ -62,7 +62,7 @@ public function formatEntityId( EntityId $entityId ) {
 				$pageName = $title->getFullText();
 				$optionalLabel = $label === '' ? '' : '|' . $label;
 
-				return '[[' . $pageName . $optionalLabel . ']]';
+				return '[[:' . $pageName . $optionalLabel . ']]';
 			}
 		}

But I have no idea how many places are relying on the current behavior of formatting these links without the leading colon, to trigger whatever special behavior the namespace has.
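To show what the one-character change does to the generated wikitext, here is a small Python sketch that mirrors the string concatenation in the diff. It is not the PHP formatter itself; the function name, the `leading_colon` switch, and the file name in the usage note are made up for illustration.

```python
def format_site_link(page_name: str, label: str, leading_colon: bool) -> str:
    """Build a wikilink the way the diff does, with or without the colon."""
    optional_label = "" if label == "" else "|" + label
    opener = "[[:" if leading_colon else "[["
    return opener + page_name + optional_label + "]]"


category = "Category:Digital Public Library of America"
label = "Digital Public Library of America"

# Current behavior: MediaWiki reads this as a category assignment, so the
# page is placed in the category and nothing is displayed inline.
current = format_site_link(category, label, leading_colon=False)

# With the patch: the leading colon makes it an ordinary visible link.
patched = format_site_link(category, label, leading_colon=True)
```

The same applies to files: an illustrative `format_site_link("File:Example.jpg", "label", leading_colon=True)` gives `[[:File:Example.jpg|label]]`, which links to the file page instead of embedding the image.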

I assumed the problem was with bad input, but it does appear from the way the Wikidata discussion is developing that this may be a case where the community's preferred way of doing things is not in alignment with the way this functionality was originally designed. So perhaps changing the behavior of the parser function's formatted output is the solution. But it's not a documented policy one way or the other on Wikidata, and it seems like the community really needs to codify it so developers and downstream users alike know what data is expected in these scenarios (and what needs to be fixed).