Support for non-integer Wikidata IDs (or alternative)
Closed, Declined · Public

Description

Wikibase could be highly useful for other projects, such as organizing OpenStreetMap tags (see discussion). The main difference from the Wikidata use case is that OSM has community-set string IDs, not integers. Storing string IDs as regular properties would cause a number of problems: they are not unique, they are mutable, they can be easily deleted, and there could be more than one defined per entry.

Would it be possible to either have a user-provided entry key, or a special immutable single string value property? It might be even OK to allow such entry creation via API only, and not to have a UI.

Yurik created this task. Apr 16 2018, 12:49 AM
Restricted Application added a subscriber: Aklapper. Apr 16 2018, 12:49 AM
Yurik updated the task description. Apr 16 2018, 12:58 AM
Micru added a subscriber: Micru. Apr 16 2018, 7:09 AM

So if I understand correctly, you would like to have custom Entity IDs instead of the usual Q followed by an integer (Qxxxx). On Wikidata the Entity IDs are normally not visible; what the user sees is the label of the entity. So I assume that in the OSM use case you could have an entity whose labels have the same string value in all languages, which could be enforced with a bot. That way you can maintain the value of the tag or change it when necessary.

Lydia_Pintscher closed this task as Declined. Apr 16 2018, 7:31 AM
Lydia_Pintscher added a subscriber: Lydia_Pintscher.

Sorry but the fact that we assign the IDs the way we do is baked deeply into the whole system and I don't think it is worth it to change it.

For a 3rd party site, it would not be terribly hard to implement a new entity type that extends the Item type to have an additional immutable string ID. This would be done by a custom extension on top of Wikibase. I don't think we'd deploy such a thing on Wikidata, though - we'd end up with hundreds of such custom types that only differ by a handful of fields, each with its own slightly different logic.

Yurik added a comment. Apr 16 2018, 8:17 AM

thx, clearly this wouldn't be used for wikidata. @daniel could you point me to the relevant code plz, or perhaps something similar, or maybe sketch the implementation approach? I was thinking that this would be the only real path to solve this, possibly with a custom database table to prevent accidental duplicates.
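To make the "custom database table" idea concrete, here is a minimal sketch using Python's built-in sqlite3 rather than MediaWiki's database layer. The table name `string_ids`, its columns, and the `register` helper are all hypothetical and only illustrate the constraint: each string ID maps to exactly one entity, each entity has at most one string ID, and there is deliberately no update path, so an assigned ID stays immutable.

```python
import sqlite3


def make_registry() -> sqlite3.Connection:
    """Create an in-memory lookup table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE string_ids ("
        " string_id TEXT PRIMARY KEY,"      # one row per community-set string ID
        " entity_id TEXT NOT NULL UNIQUE"   # at most one string ID per entity
        ")"
    )
    return conn


def register(conn: sqlite3.Connection, string_id: str, entity_id: str) -> bool:
    """Try to claim string_id for entity_id; return False on any duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO string_ids (string_id, entity_id) VALUES (?, ?)",
                (string_id, entity_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Either the string ID is already taken or the entity already has one.
        return False
```

The database rejects the duplicate atomically, so two concurrent creation requests cannot both succeed, which is the part a bot-based cleanup could not guarantee.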

A good example of how to add a custom entity type is https://www.mediawiki.org/wiki/Extension:WikibaseMediaInfo. The entry point for defining an entity type is the wiring file at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikibaseMediaInfo/+/master/WikibaseMediaInfo.entitytypes.php. For creating entities of a new type, especially with extra requirements, a new API module should be implemented, similar to the one we introduced for Lexeme Forms: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikibaseLexeme/+/master/src/Api/AddForm.php.

A generic mechanism for preventing duplicates is in planning, see T74430: Re-implement uniqueness constraint in a consistent and efficient way.

daniel added a comment. Edited Apr 16 2018, 10:26 AM

Oh, I just realized. You could fake this using a fake sitelink. On Wikidata, an item's sitelinks point to articles in sister projects like Wikipedia. It should not be hard to allow pages on non-MediaWiki sites to be referenced in the same way - or even just pretend to reference a page. Sitelinks are unique, and can even be used to address items in the API.

Not going to happen on Wikidata, but seems ok for a 3rd party site.
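As a rough illustration of addressing an item by sitelink: the `wbgetentities` API module accepts `sites` and `titles` parameters in place of entity IDs. The sketch below only builds the request URL. The API base `https://wiki.example.org/w/api.php` and the site ID `osmkeys` are made-up placeholders, and a real third-party wiki would first need that site registered as a sitelink target.

```python
from urllib.parse import urlencode


def entity_lookup_url(api_base: str, site_id: str, title: str) -> str:
    """Build a wbgetentities URL that looks an item up by sitelink."""
    params = {
        "action": "wbgetentities",
        "sites": site_id,   # sitelink site ID (hypothetical: "osmkeys")
        "titles": title,    # the "page title", here the OSM tag key
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"


# Hypothetical wiki and site ID, for illustration only.
url = entity_lookup_url("https://wiki.example.org/w/api.php", "osmkeys", "religion")
print(url)
```

With this approach the tag key never has to be the entity ID; the item keeps its ordinary Q number and the string key just acts as a unique lookup handle.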

Yurik added a comment. Apr 16 2018, 3:29 PM

@daniel this is awesome, thanks! The sitelink hack would probably be the best approach for the tag keys (e.g. name, address etc.). I am not yet sure of the best approach for the "enum-like" values -- some tags, e.g. religion, tend to have a well-known list of values. It would be good to have a separate namespace for all the possible values for them. Enforcing uniqueness for them does not make sense -- the same value could be used for more than one tag, and could have very different meaning. One option to store tag values would be to allow duplicates (use a regular string property to store the actual value), to reference all tag entities, and to use a bot to ensure that tags are referenced only once per each unique value. Or it could be some on-db-update check...
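A sketch of what the bot check could look like, assuming the value entities have already been fetched into plain tuples of (entity ID, tag key, value string). The data shape and the function name `find_duplicate_values` are assumptions for illustration, not anything Wikibase provides. The same value under different tags is allowed; only a repeated (tag, value) pair is reported.

```python
from collections import defaultdict


def find_duplicate_values(value_entities):
    """Return {(tag_key, value): [entity_ids]} for pairs defined more than once.

    value_entities: iterable of (entity_id, tag_key, value) tuples
    (hypothetical shape; a real bot would build this from the wiki).
    """
    seen = defaultdict(list)
    for entity_id, tag_key, value in value_entities:
        seen[(tag_key, value)].append(entity_id)
    # Keep only the pairs that more than one entity claims.
    return {pair: ids for pair, ids in seen.items() if len(ids) > 1}


entities = [
    ("Q10", "religion", "pagan"),
    ("Q11", "religion", "pagan"),      # duplicate of Q10 for the same tag
    ("Q12", "denomination", "pagan"),  # same value, different tag: fine
]
print(find_duplicate_values(entities))
```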

@Yurik now you lost me. "religion" isn't a unique ID. And it should be editable. Why not use a regular statement? That's what we do on Wikidata. But this seems to be entirely unrelated to this ticket.

Also, address may be unique (if you include enough detail), but "name" certainly isn't. I'm not sure I understand what problem you are trying to solve.

Yurik added a comment. Apr 17 2018, 8:30 PM

@daniel sorry, let me clarify.

The goal is to use Wikibase as secondary documentation (metadata) for the tags, not as a tag storage replacement. We need to describe each tag, describe how they relate to each other, provide localized documentation, etc.

OSM has multiple key-value tags (strings) for each object. The string keys could be represented by Wikibase entities (with some hacking to use strings instead of integers as IDs, per above). Some of these tags use "free form" values - like address - there is no need to store/document these in Wikibase, but we may store some validation information in the tag itself (e.g. store some regex in the "zip code" tag). Some other tags like religion imply a predefined list of values (enum). As you can see, there is currently a big list on that page, and that list could be stored in Wikibase.
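For the "store some regex in the tag" idea, a minimal sketch of how a consumer might apply such a pattern. The `TAG_VALIDATORS` mapping stands in for whatever statement on the tag entity would hold the regex, and the US-style ZIP pattern attached to `addr:postcode` is purely illustrative, not an actual OSM rule.

```python
import re

# Hypothetical: tag key -> validation regex, as it might be read from
# a statement on the tag's Wikibase entity.
TAG_VALIDATORS = {
    "addr:postcode": r"\d{5}(-\d{4})?",  # illustrative US ZIP pattern only
}


def is_valid(tag_key: str, value: str) -> bool:
    """Check a free-form value against the tag's regex, if it has one."""
    pattern = TAG_VALIDATORS.get(tag_key)
    if pattern is None:
        return True  # no validation info stored for this tag
    # fullmatch: the whole value must match, not just a prefix.
    return re.fullmatch(pattern, value) is not None
```

Tags without a stored pattern pass unconditionally, which matches the point above that free-form values themselves are not stored or documented in Wikibase.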

So the question is - should that list of values be stored as statements on the "religion" tag, or use a more flexible approach to store each as a separate entity, possibly in a separate namespace. Storing them as statements may make that entry too big (hundreds), and not offer enough flexibility. For example, for religion=pagan -- pagan should link to Wikidata's entry, and also may link to some other key to specify that it should/shouldn't be used with that, or it may point to another value as a redirect.

So if values are stored in a separate namespace, what would be the good way to identify them? One way could be "religion=pagan" form -- include both the tag key and value. But this is bad because renaming is hard... so unsure of the best approach...

Sounds like you are re-inventing what wikidata calls "constraints" - have a look at https://www.wikidata.org/wiki/Help:Property_constraints_portal