
Allow entering Wikidata sitelinks to sites that have an ID containing hyphens (not matching the database name)
Closed, Declined · Public · 5 Estimated Story Points

Description

Currently, Wikibase accepts only one form of site identifier for sitelinks: the canonical database name (the site's ID in the MediaWiki sites table).
It is a convention on Wikimedia wikis that site IDs use hyphens as a delimiter, whereas canonical database names use underscores.
Wikibase currently does not take this convention into account.

Example: zh-classicalwiki ("Chinese Classical Wikipedia") is used to refer to the wiki whose canonical site ID is zh_classicalwiki (note - vs _).

Change requested: Site IDs containing hyphens are accepted and mapped to Site IDs with underscores for storage (and mapped back to hyphens for display)
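The requested mapping can be sketched as below. This is only an illustration in Python, not Wikibase code: the function names are hypothetical, and it assumes that a plain character swap is enough, i.e. that no canonical site ID legitimately contains a hyphen.

```python
def to_canonical_site_id(site_id: str) -> str:
    """Map a hyphenated site ID to the underscore form used for storage."""
    return site_id.replace("-", "_")


def to_display_site_id(canonical_id: str) -> str:
    """Map a stored (underscore) site ID back to the hyphenated display form."""
    return canonical_id.replace("_", "-")


# Both spellings normalise to the same stored value.
stored = to_canonical_site_id("zh-classicalwiki")
shown = to_display_site_id(stored)
```

An ID that is already canonical passes through to_canonical_site_id unchanged, so existing input keeps working under this assumption.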

Acceptance criteria

  • The canonical site ids are still used for storage in JSON and other indexes.
  • zh-classical, zh-min-nan are allowed to be used in the input in the sitelink editing UI
  • zh-classical, zh-min-nan are the IDs presented to the user in the sitelink editing UI (not the case currently)
  • zh-classical, zh-min-nan are the IDs presented in the JSON output provided by Wikibase APIs (including Special:EntityData) (not the case currently)
    • The change in the JSON output should be configurable, so it can be enabled on demand, to allow WMDE to follow the Stable Interface Policy
  • zh-classical, zh-min-nan are allowed to be used in the input to Wikibase APIs (eg. wbeditentity)
  • Special:GoToLinkedPage, Special:ItemByTitle accept all alias Site IDs, as well as the canonical site ID
  • Wikibase (including but not limited to WMF production) should work as follows: identifiers containing hyphens, e.g. zh-classicalwiki, are accepted as site identifiers and add sitelinks to Wikipedias whose canonical site ID contains underscores instead of hyphens, e.g. the "Chinese classical" Wikipedia (canonical site ID zh_classicalwiki), both in the UI and the API
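
Taken together, the criteria describe two separate steps: resolving either spelling on input, and optionally hyphenating IDs on output. A minimal Python sketch follows, under stated assumptions: KNOWN_SITES stands in for the sites table, the hyphenated_output flag stands in for the configuration switch, and none of these names exist in Wikibase.

```python
from typing import Dict, Set

# Hypothetical stand-in for the canonical IDs in the MediaWiki sites table.
KNOWN_SITES: Set[str] = {"enwiki", "zh_classicalwiki", "zh_min_nanwiki"}


def resolve_input_site_id(site_id: str) -> str:
    """Accept either spelling on input and return the canonical stored ID."""
    canonical = site_id.replace("-", "_")
    if canonical not in KNOWN_SITES:
        raise ValueError(f"unknown site ID: {site_id!r}")
    return canonical


def serialize_sitelinks(
    stored: Dict[str, str], hyphenated_output: bool = False
) -> Dict[str, str]:
    """Build API-style output from stored sitelinks without touching storage."""
    if not hyphenated_output:
        # Default: output keys are identical to the stored canonical IDs.
        return dict(stored)
    return {site.replace("_", "-"): title for site, title in stored.items()}
```

With the flag off, the output matches the stored form, so the change in the JSON output could be switched on separately once it has been announced.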

Originally part of T114772

Event Timeline

Restricted Application added a subscriber: Aklapper.

Other questions that I didn’t have time to ask in the story time (the first isn’t really a question but I’d like to have it confirmed anyways):

  • It’s understood that the changes made to the JSON output are breaking changes and will be announced according to the Stable Interface Policy, yes?
  • Will APIs like wbeditentity and wbsetsitelink still accept the underscore version as inputs? (Otherwise that’s another breaking change, though we could combine the announcement with the last one.)

(Edit: both of these apply to T267793: Allow entering Wikidata sitelinks to sites using an alternative (alias) site ID (not matching the database name) as well, I suppose.)

AFAIK, the json storage and json representation should not change, mostly because history will be completely messed up. The json should store the dbname, nothing more and nothing less.

@Ladsgroup note that the question was not about the JSON blob that is stored. It is defined that it should not change.
The question is about the JSON representation in the APIs ("action" API, special entity page, etc.). My interpretation of the requirements has been that this JSON representation should contain the favored form (i.e. the hyphenated one). It would be great to have this interpretation confirmed by someone who knows what the right thing is. Thanks.

Estimated based on T267793 being completed first.

WMDE-leszek set the point value for this task to 5. · Nov 17 2020, 2:45 PM

bump @Ladsgroup @Addshore

In my honest opinion, we don't need to change that JSON either. The more the represented JSON drifts from the stored JSON, the more confusion and issues it causes (like ORES not being able to handle data types), because the two are very similar yet different in unexpected ways, especially given that the stored JSON is publicly accessible through the API (query revisions).

I don't know the context where those drifts would happen and be bad, but in principle having different presentation and persistence models is fine, I'd think?

I have to admit I am slightly confused about the purpose of the change, then.
We allow, and encourage, hyphenated site IDs for user input (via UI and APIs), but we present underscored ones? And you say it would not lead to confusion if done this way?

My (apparently wrong) understanding of the reasoning behind this change was that site IDs should use a similar "style" (hyphenated) to the one used, for input and output, for language codes, e.g. in item labels.
Was it not? What is the problem this change intends to solve, then?

I don't know the context where those drifts would happen and be bad, but in principle having different presentation and persistence models is fine, I'd think?

That would be fine if we didn't expose the persistence, but we do expose it. And not just exposing it, tools and services use that as well.

I have to admit I am slightly confused about the purpose of the change, then.
We allow, and encourage, hyphenated site IDs for user input (via UI and APIs), but we present underscored ones? And you say it would not lead to confusion if done this way?

Yes, but there is a big difference between the UI, which is used by non-tech-savvy people, and the API, which is used by people who already know the difference.

My (apparently wrong) understanding of the reasoning behind this change was that site IDs should use a similar "style" (hyphenated) to the one used, for input and output, for language codes, e.g. in item labels.
Was it not? What is the problem this change intends to solve, then?

The difference is between input and output. Currently, for articles on https://be-tarask.wikipedia.org, the user can't input "be-tarask" (or "be_tarask"). It gets even worse: in the UI, "be-x-old" works, but not in the API, which is pretty bad, and we should bring some clarity to it. But there will be some level of inconsistency somewhere anyway, and we shouldn't write acceptance criteria to fix every inconsistency in the system, because clients of Wikidata can handle representation inconsistencies in the API. For example, pywikibot itself converts the underscore to a hyphen when building the Site() object from the sitelinks of Wikidata.
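
A rough sketch of that kind of client-side handling is shown below. It is not pywikibot's actual code; the helper name is hypothetical, and it assumes the sitelink key is a Wikipedia-style database name ending in "wiki" (keys such as "commonswiki" would need separate handling).

```python
def wikipedia_code_from_dbname(dbname: str) -> str:
    """Derive a hyphenated site code from an underscore database name."""
    suffix = "wiki"
    if not dbname.endswith(suffix):
        raise ValueError(f"not a Wikipedia-style database name: {dbname!r}")
    # Drop the "wiki" suffix, then swap underscores for hyphens.
    return dbname[: -len(suffix)].replace("_", "-")


# Sitelink keys as they appear in the underscore (stored) form.
codes = [wikipedia_code_from_dbname(key) for key in ("zh_classicalwiki", "enwiki")]
```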

Per discussion with @Addshore and @Ladsgroup, this part of T114772 would only have been a convenience in the UI; it is therefore not critical and will not be worked on.