
Wikidata allows invalid URIs to be entered as units
Open, High, Public

Description

Wikidata allows URIs that do not identify valid entities to be entered as units. E.g., in this change:
https://www.wikidata.org/w/index.php?title=Q420481&diff=prev&oldid=494773425

http://www.wikidata.org/entity/1 is entered as unit.

Event Timeline

Got another series of the same, e.g. https://www.wikidata.org/w/index.php?title=Q961&diff=501476468&oldid=492032927 and many other edits by ShinePhantom. Looks like some broken script may be out there. @Ladsgroup, does pywikibot have checks for this? If not, maybe we need to add some.

thiemowmde added a subscriber: Lydia_Pintscher.

The relevant validation currently done in ValidatorBuilders.php is a substring match for http://www.wikidata.org/entity/. This already disallows all …/wiki/ URLs. Namespace, entity type, and entity ID are currently not validated.
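
To illustrate why a substring match lets the reported values through, here is a minimal Python sketch. It only approximates the behaviour described above and is not the actual code in ValidatorBuilders.php; the function name and the Q11573 example URL are made up for illustration.

```python
# Concept base URI as quoted in this task.
CONCEPT_BASE_URI = "http://www.wikidata.org/entity/"


def passes_substring_check(unit_uri: str) -> bool:
    """Approximation of the described check: only looks for the base URI.

    Hypothetical helper, not the real Wikibase validator.
    """
    return CONCEPT_BASE_URI in unit_uri


if __name__ == "__main__":
    for uri in (
        "http://www.wikidata.org/entity/1",          # reported in this task
        "http://www.wikidata.org/entity/undefined",  # reported in this task
        "https://www.wikidata.org/wiki/Q11573",      # illustrative /wiki/ URL
    ):
        print(uri, passes_substring_check(uri))
```

Both reported values contain the base URI and therefore pass, while the …/wiki/ URL is rejected.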

It should not be that hard to create a validator that only accepts a single entity type (or a set of entity types), checks the namespace (note that items can be in the main namespace, or in an "Item:" namespace), parses the entity ID, and makes sure it matches the entity type. Service classes for all these individual checks should already exist (probably EntityNamespaceLookup and an EntityIdParser).
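
A rough Python sketch of such a validator, under stated assumptions: the ID patterns, the type names, and all function names are hypothetical, and the namespace check is left out because it needs wiki configuration. In Wikibase itself this would presumably be built on the existing service classes rather than on regular expressions.

```python
import re

# Concept base URI as quoted in this task.
CONCEPT_BASE_URI = "http://www.wikidata.org/entity/"

# Assumed ID patterns per entity type (illustration only).
ID_PATTERNS = {
    "item": re.compile(r"Q[1-9]\d*"),
    "property": re.compile(r"P[1-9]\d*"),
}


def entity_type_of(entity_id: str):
    """Return the entity type whose pattern matches the ID, or None."""
    for entity_type, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(entity_id):
            return entity_type
    return None


def is_valid_entity_uri(uri: str, allowed_types=("item",)) -> bool:
    """Accept only URIs that address an entity ID of an allowed type."""
    if not uri.startswith(CONCEPT_BASE_URI):
        return False
    entity_id = uri[len(CONCEPT_BASE_URI):]
    return entity_type_of(entity_id) in allowed_types


if __name__ == "__main__":
    print(is_valid_entity_uri("http://www.wikidata.org/entity/Q11573"))
    print(is_valid_entity_uri("http://www.wikidata.org/entity/1"))
    print(is_valid_entity_uri("http://www.wikidata.org/entity/undefined"))
```

Passing `allowed_types` as a parameter keeps the same check reusable for units, calendar models, and globes.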

Note that calendar model and globe URIs are missing the exact same validation. It's probably a good idea to have a single ticket for all three.

(I missed this ping for some reason)

@Ladsgroup, does pywikibot have checks for this? If not, maybe we need to add some.

I don't think so. We might be able to add such checks, but IMO there should be validators inside Wikibase too.

Another example here:

https://www.wikidata.org/w/index.php?title=Q4679732&diff=552439126&oldid=552439107

http://www.wikidata.org/entity/undefined is entered as unit.

Looks like this one was imported with HarvestTemplates (run by @Pasleim). Maybe he or @matej_suchanek can see how to prevent this kind of invalid import in the future.

Mentioned again in this discussion, it appears to be the same issue.

Note that per the data model specification, the unit can be any URI (or rather, any IRI):

The unit specifies a physical quantity that the number refers to. It is represented as an IRI rather than as a String, since a string like "m" might represent different units in different contexts. The value should be meaningful independently of the declaration information for its Property (from which more details about units could possibly be obtained), hence the unit is a full IRI. In practice, this IRI might be the IRI referring to an ItemDescription representing the desired unit, or be taken from a standard vocabulary for units, like QUDT.

So any validation that restricts this to Wikidata entities should not be hard coded, since other Wikibase installations may choose to rely on a different vocabulary for units.

@Lydia_Pintscher this has been marked as high for 2 years.
Is that still the case?
Also, it's on the campsite, so should I move it to the pickup columns?

Yes let's. And as Thiemo suggested let's do it for calendar model and globe as well.

Yes let's. And as Thiemo suggested let's do it for calendar model and globe as well.

Technically speaking, I recommend making it a constraint and part of WBQC. Otherwise we have to maintain a list of units, globes, etc. inside the code, which is something I would love to avoid.

Making sure pywikibot and other tools use the correct value also sounds like a good investment for us.

[…] we have to maintain list of units, globes, etc inside the code

As suggested a while ago, I don't think it is necessary to make the code know what valid units, globes, etc. are. All I'm asking for is a basic check for http://www.wikidata.org/entity/Q\d+. In other words: If the URL points to the wikidata.org domain, make sure it not only uses the proper concept base URI (e.g. http://www.wikidata.org/entity/ or whatever it is), but also make sure it addresses a valid Q-ID.

The latter is currently not done. All it currently does is a prefix check. See ValidatorBuilders.php.

I'm not sure if any of the use cases (units, calendar models, …) currently allows domains other than wikidata.org. If this is the case, the validator should first check if the domain is wikidata.org, and only then apply the other checks explained above.
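
The domain-first logic could look roughly like the following Python sketch. The function name, the host set, and the QUDT example IRI are assumptions made for illustration; only the `http://www.wikidata.org/entity/Q\d+` pattern is taken from the comment above.

```python
import re
from urllib.parse import urlsplit

# Hosts treated as "the wikidata.org domain" (assumption for this sketch).
WIKIDATA_HOSTS = {"wikidata.org", "www.wikidata.org"}

# Basic check proposed above: concept base URI followed by a Q-ID.
ITEM_URI = re.compile(r"http://www\.wikidata\.org/entity/Q\d+")


def is_acceptable_unit_iri(iri: str) -> bool:
    """Apply the strict Q-ID check only to IRIs on the wikidata.org domain."""
    parts = urlsplit(iri)
    if not parts.scheme:
        return False  # a bare string like "m" is not an IRI
    if parts.hostname in WIKIDATA_HOSTS:
        return ITEM_URI.fullmatch(iri) is not None
    # Other domains are left alone, since other vocabularies may be in use.
    return True


if __name__ == "__main__":
    for iri in (
        "http://www.wikidata.org/entity/Q11573",     # illustrative item URI
        "http://www.wikidata.org/entity/1",          # reported in this task
        "http://www.wikidata.org/entity/undefined",  # reported in this task
        "http://qudt.org/vocab/unit/M",              # illustrative external IRI
    ):
        print(iri, is_acceptable_unit_iri(iri))
```

With this ordering, IRIs from other vocabularies stay valid, while anything on wikidata.org has to address a Q-ID under the concept base URI.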