
Wikidata allows invalid URIs to be entered as units
Open, High · Public


Wikidata allows URIs that do not identify valid entities to be entered as units. E.g., in this change: is entered as unit.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jun 10 2017, 6:24 AM
Lydia_Pintscher triaged this task as High priority. Jun 11 2017, 4:28 PM
Lydia_Pintscher added subscribers: daniel, thiemowmde.
Smalyshev updated the task description. Jun 11 2017, 7:39 PM

Got another series of the same, e.g. and many other edits by ShinePhantom. Looks like some broken script may be out there. @Ladsgroup, does pywikibot have checks for this? If not, maybe we need to add some.

thiemowmde added a subscriber: Lydia_Pintscher.

The relevant validation currently done in ValidatorBuilders.php is a substring match for This already disallows all …/wiki/ URLs. Namespace, entity type, and entity ID are currently not validated.

It should not be that hard to create a validator that only accepts a single entity type (or a set of entity types), checks the namespace (note that items can be in the main namespace, or in an "Item:" namespace), parses the entity ID, and makes sure it matches the entity type. Service classes for all these individual checks should already exist (probably EntityNamespaceLookup and an EntityIdParser).
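The checks described above could be sketched roughly as follows. This is a Python illustration only, not the Wikibase (PHP) implementation: `ENTITY_ID_PATTERNS`, `parse_entity_type` and `is_valid_entity_uri` are hypothetical names, the ID patterns are assumptions, and the concept base URI is passed in by the caller rather than hard coded. The namespace check is left out for brevity.

```python
import re
from typing import FrozenSet, Optional

# Assumed ID shapes per entity type (illustrative; a real installation
# would delegate this to its own entity ID parser).
ENTITY_ID_PATTERNS = {
    "item": re.compile(r"Q[1-9]\d*"),
    "property": re.compile(r"P[1-9]\d*"),
}


def parse_entity_type(entity_id: str) -> Optional[str]:
    """Return the entity type the ID belongs to, or None if it is malformed."""
    for entity_type, pattern in ENTITY_ID_PATTERNS.items():
        if pattern.fullmatch(entity_id):
            return entity_type
    return None


def is_valid_entity_uri(
    uri: str,
    concept_base_uri: str,
    allowed_types: FrozenSet[str] = frozenset({"item"}),
) -> bool:
    """Accept only URIs of the form <concept_base_uri><ID of an allowed type>."""
    if not uri.startswith(concept_base_uri):
        return False
    entity_id = uri[len(concept_base_uri):]
    # A malformed ID yields None, which is never in allowed_types.
    return parse_entity_type(entity_id) in allowed_types
```

With a placeholder base URI such as `http://example.org/entity/`, this accepts `http://example.org/entity/Q42` and rejects both `http://example.org/wiki/Q42` and `http://example.org/entity/P31`, since only items are allowed by default.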

Note that calendar model and globe URIs are missing the exact same validation. It's probably a good idea to have a single ticket for all three.

Restricted Application added a subscriber: PokestarFan. Jul 24 2017, 2:21 PM

(I missed this ping for some reason)

@Ladsgroup, does pywikibot have checks for this? If not, maybe we need to add some.

I don't think so. We might be able to add such checks, but IMO there should be validators inside Wikibase too.

Another example here: is entered as unit.

Another example here: is entered as unit.

Looks like this one was imported with HarvestTemplates (run by @Pasleim), maybe he or @matej_suchanek can see how to prevent this kind of invalid import in the future.

abian added a subscriber: abian. Sep 23 2018, 11:30 AM

Mentioned again in this discussion, it appears to be the same issue.

Note that per the data model specification, the unit can be any URI (or rather, any IRI):

The unit specifies a physical quantity that the number refers to. It is represented as an IRI rather than as a String, since a string like "m" might represent different units in different contexts. The value should be meaningful independently of the declaration information for its Property (from which more details about units could possibly be obtained), hence the unit is a full IRI. In practice, this IRI might be the IRI referring to an ItemDescription representing the desired unit, or be taken from a standard vocabulary for units, like QUDT.

So any validation that restricts this to Wikidata entities should not be hard coded, since other Wikibase installations may choose to rely on a different vocabulary for units.
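To illustrate what "not hard coded" could mean, here is a minimal Python sketch in which the restriction is a per-installation setting. `UnitValidationConfig`, `required_prefix` and `is_acceptable_unit` are hypothetical names and not actual Wikibase settings. The sketch also assumes unit IRIs have a scheme and an authority part, which is stricter than the data model requires.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class UnitValidationConfig:
    # Hypothetical setting: None means "any absolute IRI is fine", so
    # installations relying on an external unit vocabulary are not blocked.
    required_prefix: Optional[str] = None


def is_acceptable_unit(unit_iri: str, config: UnitValidationConfig) -> bool:
    """Check a unit IRI against the installation's configured restriction."""
    try:
        parts = urlsplit(unit_iri)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 hosts.
        return False
    if not parts.scheme or not parts.netloc:
        return False  # a bare string like "m" is not an IRI
    if config.required_prefix is None:
        return True
    return unit_iri.startswith(config.required_prefix)
```

An installation that only wants its own entities as units would set `required_prefix` to its concept base URI. One that relies on a vocabulary like QUDT would leave it unset.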

@Lydia_Pintscher this has been marked as high for 2 years.
Is that still the case?
Also, it's on the campsite, so should I move it to the pickup column?

Yes let's. And as Thiemo suggested let's do it for calendar model and globe as well.

Yes let's. And as Thiemo suggested let's do it for calendar model and globe as well.

Technically speaking, I recommend making it a constraint and part of WBQC. Otherwise we have to maintain a list of units, globes, etc. inside the code, which is something I would love to avoid.

Making sure pywikibot and other tools use the correct value also sounds like a good investment for us.

[…] we have to maintain list of units, globes, etc inside the code

As suggested a while ago, I don't think it is necessary to make the code know what valid units, globes, etc. are. All I'm asking for is a basic check for\d+. In other words: If the URL points to the domain, make sure it not only uses the proper concept base URI (e.g. or whatever it is), but also make sure it addresses a valid Q-ID.

The latter is currently not done. All it currently does is a prefix check. See ValidatorBuilders.php.

I'm not sure if any of the use cases (units, calendar models, …) currently allows to use other domains that are not If this is the case, the validator should first check if the domain is, and only then apply the other checks explained above.
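The "check the domain first, then apply the stricter checks" order could look something like the Python sketch below. It is not the code in ValidatorBuilders.php. `validate_unit_uri`, `local_host` and `concept_path_prefix` are hypothetical names, and both the `/entity/` default and the `Q[1-9]\d*` ID shape are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

ITEM_ID = re.compile(r"Q[1-9]\d*")  # assumed shape of an item ID


def validate_unit_uri(
    uri: str,
    local_host: str,
    concept_path_prefix: str = "/entity/",  # assumed path layout
) -> bool:
    """Apply the strict item-ID check only to URIs on the local domain."""
    try:
        parts = urlsplit(uri)
    except ValueError:
        return False
    if not parts.scheme or not parts.hostname:
        return False
    if parts.hostname != local_host.lower():
        # Foreign vocabulary: nothing more we can check here.
        return True
    if parts.query or parts.fragment:
        return False
    if not parts.path.startswith(concept_path_prefix):
        return False
    # Unlike a plain prefix check, the remainder must be a complete item ID.
    return ITEM_ID.fullmatch(parts.path[len(concept_path_prefix):]) is not None
```

With `local_host="example.org"`, a URI like `http://example.org/wiki/Q42` is rejected, while a URI on any other host passes through unchecked.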

Waiting for the sub task to be completed.