Customize Item Identifier prefix (currently: Q)
Closed, Declined · Public

Description

For non-Wikidata installations, it would be beneficial to use letters other than Q for Wikibase items. Custom installations still tend to rely on Wikidata as an external reference source, so having two types of Q numbers (those in Wikidata and those in the custom Wikibase) tends to be highly confusing, especially for a large community-driven project like OpenStreetMap.

For example, if we have OSM-specific items, they should be easily distinguishable from Wikidata items - e.g. Y123 vs Q123.

I have looked in depth at the code, and it seems Q has been hardcoded in a large number of places. Some of these places, like WikibaseDataModel, do not appear to have a "settings" context, yet they have ItemId::newFromNumber() methods that use a hardcoded Q.

I will try to work on introducing this, but I may need some guidance on how best to do it. Thanks!

Event Timeline

Change 455073 had a related patch set uploaded (by Yurik; owner: Yuri Astrakhan):
[mediawiki/extensions/Wikibase@master] [WIP] Allow custom Item & Prop prefixes

https://gerrit.wikimedia.org/r/455073

This is much more an issue of WikibaseDataModel: see https://github.com/wmde/WikibaseDataModel/commit/1e3f1730abb03cdaf4a1ae4644672b025608f743

In my opinion, what should be done instead is to phase out unprefixed entity IDs except in places that really need to specify the prefix (e.g. when making new IDs, or in the dump format for backward compatibility).

To phase out unprefixed entity ID:

  • no breaking change of RDF mapping (and thus query service) needed
  • breaking change of JSON mapping needed; we probably need to first export JSON with both numeric-id and id, then deprecate and remove numeric-id. Input will still support numeric-id
  • for API, no other change than JSON mapping
  • Lua: currently return both id and numeric-id, should numeric-id be eventually removed?

On the other hand, keeping numeric-id may have some benefits (e.g. sorting), and we can choose not to remove numeric-id but generate it from id (this is independent of what the prefix is as long as prefix is one letter), which will make no breaking change (in input/output interface).
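
A minimal sketch of generating numeric-id from id, as suggested above. It is written in Python purely for illustration and is not Wikibase code: the function name `numeric_id_from_id` is hypothetical, and the ID shape (one letter followed by a positive integer) is an assumption taken from the "prefix is one letter" condition.

```python
import re

# Assumption: an ID is exactly one ASCII letter followed by a positive
# integer without leading zeros, e.g. "Q123" or "Y123".
_ONE_LETTER_ID = re.compile(r"[A-Za-z]([1-9][0-9]*)")


def numeric_id_from_id(entity_id: str) -> int:
    """Derive the numeric part of an ID, whatever the prefix letter is."""
    match = _ONE_LETTER_ID.fullmatch(entity_id)
    if match is None:
        raise ValueError(f"not a one-letter-prefixed ID: {entity_id!r}")
    return int(match.group(1))


# Sorting on the derived number orders Q2 before Q10, which plain string
# sorting would not do.
sorted_ids = sorted(["Q10", "Q2", "Q1"], key=numeric_id_from_id)
```

A prefix with more than one letter would be rejected by this sketch, which is the limitation the "one letter" condition refers to.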

I believe this is going the wrong direction, because numeric entity IDs are almost completely phased out already. A big chunk of that was done in 2016 via T56085, and a whole lot of other tickets, most notably the work the team put into migrating the wb_terms table. The newFromNumber methods that are still in the code are barely used any more.

What this ticket asks for should be resolved via a feature we call "federation": The possibility to reference entities that are stored on other Wikibase instances. This is solved via prefixes. See T76007 and T149580 and the linked tickets.

Fiddling with somehow unique letters would not work in the long run, as we would need a global registry for these Q, P, Y, and so on to guarantee they can never clash. But there are only 26 of them. See T73996 for an older discussion about exactly this idea. Prefixes as defined for the "federation" feature don't have this problem. These prefixes are not globally unique, but locally defined, and map to URIs, which are guaranteed to be unique.
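
The idea of locally defined prefixes that map to URIs can be sketched as follows. This is an illustrative Python sketch, not the federation implementation: the configuration dictionaries, the `to_uri` function and the example.com / example.org base URIs are all assumptions made for the example.

```python
# Hypothetical per-site configuration: prefix -> base URI. The empty prefix
# stands for the local repository. The example.com URI is a placeholder.
SITE_A = {
    "": "https://www.example.com/entity/",
    "wikidata": "http://www.wikidata.org/entity/",
}

# Another site is free to pick a different prefix name for the same repository.
SITE_B = {
    "": "https://other.example.org/entity/",
    "wd": "http://www.wikidata.org/entity/",
}


def to_uri(prefixed_id: str, repositories: dict) -> str:
    """Resolve 'prefix:ID' (or a bare local 'ID') to a full URI."""
    prefix, _, local_id = prefixed_id.rpartition(":")
    if prefix not in repositories:
        raise KeyError(f"unknown repository prefix: {prefix!r}")
    return repositories[prefix] + local_id
```

Here `wikidata:Q123` on one site and `wd:Q123` on the other resolve to the same URI, while a bare `Q123` resolves to a different URI on each site. The prefix names never need a global registry because only the URIs have to be unique.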

Yes, unique entity IDs should be prefixed IDs or full URLs.

However, consider (e.g.) two wikis that both use Q IDs, one at https://www.wikidata.org/wiki/Q123 and the other at https://www.example.com/wiki/Q123.

You can differentiate them via (e.g.) wikidata:Q123 and (local) Q123, or via the full URL, but end users (who do not necessarily know how federation works) may be confused.

Therefore, making the prefix customizable is more user-friendly, even if the prefix is not unique. That is, you cannot force foreign repositories to use different prefixes, but you may (though of course need not) make your local prefix different from all foreign ones, probably via prefixes with more than one letter (e.g. OSM1234 or ON123 (node)/OW456 (way)/OR789 (relation)).

@Bugreporter and @thiemowmde, thank you for the thorough responses!

Thiemo, just as Bugreporter mentioned, changing Q to [anything] is a big usability request. Here is part of the UI showing the OpenStreetMap entry for Salzburg:

[Screenshot: image.png]

A link to Wikidata is shown as a Q number, and the tag name ("wikidata") indicates the context. There are many other Wikidata tags like brand:wikidata and subject:wikidata. By now, most OSM users are very familiar with the Q numbers, so whenever someone is looking at the data, or talks on the forums, Q123 instantly gives everyone the context of what is being discussed without any additional information. Note that this is not a programming concept as one would do in the WDQS federated query, or a prefixed wd:Q123 value. Q123 is enough as it is.

The goal of our new project is to add a structured Wikibase store to the OSM wiki itself. This will allow the OSM community to organize the keys (left side of the above table). For example, take a look at Key:place -- it's a big wiki page describing the place key, with a summary "info card" on the right side -- very similar to how Wikipedia organizes a lot of data. That info card should be stored in a Wikibase-backed store, e.g. OSM wiki's Item:Q123, so that external tools can use it. But introducing another set of Q-numbers will create confusion in discussions, especially for novice users. Seeing Q123 in a forum thread would no longer be unambiguous.

If, instead, we can set up OSM wiki to use another prefix, e.g. Y123, there would not be any ambiguity. It is OK for other sites to also use the same prefix -- it is highly unlikely that OSM will decide to use another site as a massive external reference site to cause significant confusion. Wikidata is uniquely positioned to be the central hub of the reference data, so we really only have two types -- the "local" and the "wikidata". Of course for the WDQS services, the OSM items will clearly have to be uniquely-prefixed.

I agree that the Q prefix should only be added during item creation, and that an opaque string identifier should be used everywhere else. What steps can we take to achieve that?

Update: I have implemented a custom extension, https://github.com/nyurik/OsmWikibase, that overrides the default item definition using a hook:

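// Hook handler: replaces the ID handling registered for the 'item' entity type.
public static function onWikibaseRepoEntityTypes( &$entityTypeDefinitionsArray ) {

  // Pattern used to recognize serialized item IDs.
  $entityTypeDefinitionsArray['item']['entity-id-pattern'] = OsmItemId::PATTERN;

  // Builds an ID object from its string serialization.
  $entityTypeDefinitionsArray['item']['entity-id-builder'] = function ( $serialization ) {
    return new OsmItemId( $serialization );
  };

  // Composes an ID from a repository name and the unique part of the ID.
  $entityTypeDefinitionsArray['item']['entity-id-composer-callback'] = function ( $repositoryName, $uniquePart ) {
    return OsmItemId::newFromRepositoryAndNumber( $repositoryName, $uniquePart );
  };

  return true;
}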

OsmItemId extends the ItemId class, changing the 'Q' prefix to 'Y'.

There are a few other places with 'Q' as part of the Wikibase extension, not in the model. I'm still trying to determine what they affect.

The database was updated using these SQL statements (there were no talk pages):

UPDATE abuse_filter_log SET afl_title=CONCAT('Y',substring(afl_title, 2)) where afl_title like 'Q%' and afl_namespace = 120;
UPDATE archive SET ar_title=CONCAT('Y',substring(ar_title, 2)) where ar_title like 'Q%' and ar_namespace = 120;
UPDATE cu_changes SET cuc_title=CONCAT('Y',substring(cuc_title, 2)) where cuc_title like 'Q%' and cuc_namespace = 120;
UPDATE cur SET cur_title=CONCAT('Y',substring(cur_title, 2)) where cur_title like 'Q%' and cur_namespace = 120;
UPDATE job SET job_title=CONCAT('Y',substring(job_title, 2)) where job_title like 'Q%' and job_namespace = 120;
UPDATE logging SET log_title=CONCAT('Y',substring(log_title, 2)) where log_title like 'Q%' and log_namespace = 120;
UPDATE page SET page_title=CONCAT('Y',substring(page_title, 2)) where page_title like 'Q%' and page_namespace = 120;
UPDATE pagelinks SET pl_title=CONCAT('Y',substring(pl_title, 2)) where pl_title like 'Q%' and pl_namespace = 120;
UPDATE protected_titles SET pt_title=CONCAT('Y',substring(pt_title, 2)) where pt_title like 'Q%' and pt_namespace = 120;
UPDATE querycache SET qc_title=CONCAT('Y',substring(qc_title, 2)) where qc_title like 'Q%' and qc_namespace = 120;
UPDATE querycachetwo SET qcc_title=CONCAT('Y',substring(qcc_title, 2)) where qcc_title like 'Q%' and qcc_namespace = 120;
UPDATE recentchanges SET rc_title=CONCAT('Y',substring(rc_title, 2)) where rc_title like 'Q%' and rc_namespace = 120;
UPDATE redirect SET rd_title=CONCAT('Y',substring(rd_title, 2)) where rd_title like 'Q%' and rd_namespace = 120;
UPDATE templatelinks SET tl_title=CONCAT('Y',substring(tl_title, 2)) where tl_title like 'Q%' and tl_namespace = 120;
UPDATE text SET old_title=CONCAT('Y',substring(old_title, 2)) where old_title like 'Q%' and old_namespace = 120;
UPDATE watchlist SET wl_title=CONCAT('Y',substring(wl_title, 2)) where wl_title like 'Q%' and wl_namespace = 120;

UPDATE wb_terms SET term_full_entity_id=CONCAT('Y',substring(term_full_entity_id, 2)) where term_full_entity_id like 'Q%';
UPDATE wb_changes SET change_object_id=CONCAT('Y',substring(change_object_id, 2)) where change_object_id like 'Q%';
UPDATE revision SET rev_comment=replace(rev_comment, "Item:Q", "Item:Y") where rev_comment like '%Item:Q%';
UPDATE text SET old_text=replace(old_text, '"id":"Q', '"id":"Y') where old_text like '%"id":"Q%';
UPDATE page_props SET pp_value=CONCAT('Y',substring(pp_value, 2)) where pp_propname='wikibase_item' AND pp_value like 'Q%';

Spoke too soon: overriding entityTypeDefinitionsArray in a hook works well for creating/editing Wikibase items, but it does not work in some cases like sitelink lookups (used by wiki pages to find corresponding wikibase items).

I think that to make custom prefixes possible, the ItemId::* methods should never be called directly, only via the entity types table. One example is the ItemId::newFromNumber() call in SiteLinkUsageLookup.php.

How can entityTypeDefinitionsArray be made available to that code?

Change 455480 had a related patch set uploaded (by Yurik; owner: Yuri Astrakhan):
[mediawiki/extensions/Wikibase@master] Delegate ItemID creation/parsing to an interface

https://gerrit.wikimedia.org/r/455480

I think I figured out most of the code needed for item ID parsing/creation without calling new ItemID() directly. Submitted as a new patch ^, let me know if this approach

TL;DR: I, personally, believe what appears to be a cosmetic issue in OSM UI should not be solved by changing the Wikibase data model.

  • "Q" does not mean "wikidata.org". It means "item" and is used by all Wikibase installations so far.
  • Retroactively "reserving" the letter "Q" to be exclusively used by wikidata.org can't work. It was never meant to be like this, and there is no mechanism for this.
  • "Q" only means "wikidata.org" to users who know about wikidata.org. These users should not have a problem understanding that the moment an OSM Wikibase installation exists, "osm:Q1" refers to this installation.
  • Most end-users don't care much about the letter "Q". They click a link, and if that link points to an OSM subdomain, they will understand they are not looking at wikidata.org.
  • If the problem is in the OSM UI, it should be solved in the OSM UI. One can add icons in front of the ambiguous "Q1" links, or clearly label them as "Wikidata:Q1" and "OSM:Q1".
  • Having other entity types with other letters is already possible. However, this is significantly more expensive than just reusing items. The question one should ask is the question of cost-benefit, as well as compare other possible ways of solving the ambiguity issue.
  • All that said, in an ideal world it should indeed be possible to change the letter "Q" via configuration. Technically, it should only be used in two places: when a new item ID is generated, and when a parser needs to detect the entity type from a full ID. However, this "ideal world" does not exist (yet). In particular, a swarm of 3rd-party tools, maintenance scripts, gadgets and the like hard-code the letter "Q". The moment a project decides to ditch the letter "Q", it will not be able to reuse most of this software any more. I, personally, could not estimate the costs this comes with. The need to validate every bit of software I might want to use in the future, and possibly change it, sounds very expensive to me.
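
The "two places" mentioned in the last point can be illustrated with a small Python sketch. This is not how Wikibase is structured: the class `EntityIdScheme` and its methods are hypothetical names, and the ID shape (one prefix letter followed by digits) is an assumption.

```python
class EntityIdScheme:
    """Hypothetical helper that is the only code aware of the prefix letters."""

    def __init__(self, prefix_by_type: dict):
        self._prefix_by_type = dict(prefix_by_type)
        self._type_by_prefix = {p: t for t, p in self._prefix_by_type.items()}
        if len(self._type_by_prefix) != len(self._prefix_by_type):
            raise ValueError("each entity type needs its own prefix")

    def new_id(self, entity_type: str, number: int) -> str:
        """Place 1: compose the ID when a new entity is created."""
        return f"{self._prefix_by_type[entity_type]}{number}"

    def entity_type_of(self, entity_id: str) -> str:
        """Place 2: detect the entity type from a full ID."""
        prefix, digits = entity_id[:1], entity_id[1:]
        if prefix not in self._type_by_prefix or not (digits.isascii() and digits.isdigit()):
            raise ValueError(f"unrecognized entity ID: {entity_id!r}")
        return self._type_by_prefix[prefix]


default_scheme = EntityIdScheme({"item": "Q", "property": "P"})
custom_scheme = EntityIdScheme({"item": "Y", "property": "P"})
```

With `custom_scheme`, calling `entity_type_of("Q42")` raises an error. That mirrors the concern above: any tool that hard-codes "Q" would stop interoperating with an installation configured this way.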

The patch https://gerrit.wikimedia.org/r/455480 goes a bit off-track, as it touches mostly the wb_items_per_site table, which still needs migration away from numeric entity IDs. This is a known issue and tracked in T114904: Migrate wb_items_per_site to using prefixed entity IDs instead of numeric IDs.

Note from the past: we initially intended the prefixes to be configurable, but this turned out to be rather painful to implement consistently. So we dropped this possibility in favor of a guarantee that the prefix identifies the entity type. Breaking that guarantee is likely to cause issues in obscure places. I agree with Thiemo that a UI issue should not be fixed with a data model change.

@thiemowmde @daniel my last patch only refactors the Wikibase extension a bit, without touching the data model. It fixes the last few places I found that parsed or composed item IDs directly, so that they use the same common interface as the other code. Are there any concerns with that? For example, the Lua function getSiteLink() used the TermId constructor, whereas all other code in that same file used an EntityParser instance.

Merging this code should be a noop for Wikidata, and it may even simplify T114904 -- all related code may be in one place. Or it certainly shouldn't get in the way of the db migration.

So if this patch has no impact on Wikidata, but helps us, could you merge it to help us?

I do understand your point about the UI, but sadly that's the problem -- it is not just a UI issue in OSM. The OSM ecosystem is much less centralized than Wikidata. There is no single editor to make it consistent. There is no common validation system. There is really only one "convention" that all tools understand -- that OSM data is stored as a simple key-value string table. Tools do not assume anything else about the data, such as the semantic meaning of the keys. So it is almost always up to the users to type in or copy/paste all values.

One of the conventions is how Wikidata Q items are entered, e.g. wikidata = Q42. Any user in any tool can instantly tell that Q42 is a Wikidata item. Some tools even turn it into a link using heuristics (e.g. if the key contains the word wikidata and the value matches a Qnnn regex, it is shown as a link), while other tools simply show it as text. Most OSM users by now are fairly well versed in this.
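
A minimal sketch of the kind of linking heuristic described above, in Python. No specific OSM tool is known to work exactly this way; the function name `wikidata_link` and the exact regex are assumptions for illustration.

```python
import re

# Assumption: "Qnnn" means the letter Q followed by a positive integer.
_QID = re.compile(r"Q[1-9][0-9]*")


def wikidata_link(key: str, value: str):
    """Return a Wikidata URL if the tag looks like a Wikidata reference, else None."""
    if "wikidata" in key and _QID.fullmatch(value):
        return f"https://www.wikidata.org/wiki/{value}"
    return None


# Tags in the style used by OSM: key -> value, both plain strings.
tags = {"wikidata": "Q42", "brand:wikidata": "Q123", "name": "Q42"}
links = {key: wikidata_link(key, value) for key, value in tags.items()}
```

The heuristic relies only on the key name and the shape of the value, so a Q number stored under a key that does not contain "wikidata" is left as plain text.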

If we introduce an OSM Wikibase, its items will be stored in the same way: as a text string under some key. For example, if the community decides to introduce a few items in the OSM Wikibase, e.g. "type of OSM feature" and possibly one more "something else", the feature key-value table would look like this:

| key | value |
| feature_type | Q123 |
| wikidata | Q456 |
| something_else | Q789 |

As you can see, this is very confusing. We cannot suddenly change how we store Wikidata items -- that would break all of the existing tools and data consumers. Requiring an `osm:Q123` style for OSM-stored items would also lead to problems -- many users will simply forget to type the prefix, and the existence of multiple OSM editing tools makes it impossible to enforce effectively. Having no way to visually distinguish the source of an item could create chaos.

I have already extensively tested my patch, and it seems to solve what we need: an effective way to store "OSM items" under a different prefix. I'm sure I can work with other tools to adapt them to this need as well.

Change 455073 abandoned by Yurik:
[WIP] Allow custom Item & Prop prefixes

Reason:
See https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/455480/ instead

https://gerrit.wikimedia.org/r/455073

I strongly suggest going with one of these solutions:

| key | value |
| feature_type | osm:Q123 |
| wikidata | Q456 |
| something_else | osm:Q789 |

| key | value |
| osm_feature_type | Q123 |
| wikidata | Q456 |
| osm_something_else | Q789 |

| key | value |
| osmbase_feature_type | Q123 |
| wikidata | Q456 |
| osmbase_something_else | Q789 |

Yurik renamed this task from Support non-Q and non-P items in custom Wikibase installations to Support non-Q items in custom Wikibase installations. (Sep 20 2018, 7:51 PM)
Jan_Dittrich renamed this task from Support non-Q items in custom Wikibase installations to Customize Item Identifier prefix (currently: Q). (Oct 1 2018, 10:48 AM)

I've been thinking a lot about this. The prefixes Q, P, L, F, S, E and M are there to represent the concepts of Items, Properties, Lexemes, Forms, Senses, Entity Schemas and MediaInfo respectively. They are not intended as an indication of a specific Wikibase instance. Each entity type should be addressed with the same letter on every Wikibase instance. The distinction between individual Wikibase instances needs to happen by other means, for example with prefixes.

Hello, we have installed a Wikibase instance and we would like to know how to differentiate between Wikidata items and our newly created items. Thanks

@Topway.it: Hi, Phabricator is not a support forum. Please see "Contact" on https://www.wikidata.org/ - thanks.

Change 455480 abandoned by Addshore:

[mediawiki/extensions/Wikibase@master] Consolidate ItemID creation and parsing

Reason:

Abandoning as lots of this has changed recently

https://gerrit.wikimedia.org/r/455480

Hello! Do you know if there have been any updates on this matter?

I saw https://github.com/nyurik/OsmWikibase was slightly modified by @Yurik this year, so maybe this is still at least desired (even with all caveats)?

I am studying applications that rely on Wikibase IDs to render pages (e.g. "https://scholia.toolforge.org/organization/Q23048689") that could access 2 or more Wikibases simultaneously.

Of course, one can always use a prefix in the URL (/localwb:Q42).

There is no change in how I am thinking about this, no. The Q is not there to identify Wikidata but to identify the entity type (Item). It should not be different on different instances.
Applications need to specify where they get their data from with a prefix.