Allow entering Wikidata sitelinks to wikis that have non-typical wiki ID (not matching the database name)
Open, High, Public

Description

Currently, Wikibase accepts only one form of site identifier for sitelinks: the wiki's canonical database name (its ID in the sites table). This causes numerous problems:

  • They don't always match the language code of the wiki, e.g. zh-classical.wikipedia.org has its canonical ID set to zh_classicalwiki (we expect people to know that the dash turns into an underscore).
    • It gets even worse: we have one renamed wiki, be-tarask.wikipedia.org, whose canonical site ID is be_x_oldwiki. As a result, we don't recognize the wiki's actual language code, we force people to use the deadname of the wiki, and we show the deadname as well.
      • As a result, this is blocking further wiki renames that have been in the queue for years.
  • We only accept one way to enter a sitelink, although users could plausibly enter it in different ways (how that would work is not defined, and is not in scope of this task).
  • We are exposing internals of our system (the dbname of wikis) to users. This is not an issue for transparency or security, but it is not nice for the user experience.

Changes intended:

  • Site IDs containing hyphens are accepted and mapped to site IDs with underscores for storage (and mapped back to hyphens for display)
  • There will be configurable site ID aliases that can be used in place of "canonical" site IDs

Examples:

  • zh-classicalwiki is used to refer to the wiki whose canonical site ID is zh_classicalwiki (note - vs _)
  • be_taraskwiki and be-taraskwiki are used to refer to the wiki whose canonical site ID is be_x_oldwiki
  • be-tarask is the expected way of communicating the wiki ID of the "Belarusian Taraškievica" Wikipedia
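
The intended input handling could be sketched roughly as follows. This is only an illustration in Python, not Wikibase code: the names `SITE_ID_ALIASES`, `KNOWN_SITE_IDS` and `resolve_site_id` are hypothetical, and the order of the two steps (generic hyphen rule first, configured aliases second) is an assumption that the task does not specify.

```python
# Hypothetical alias configuration: alias -> canonical site ID.
# Only the be-tarask example comes from the task; the structure is assumed.
SITE_ID_ALIASES = {
    "be_taraskwiki": "be_x_oldwiki",
}

# Canonical site IDs known to this installation (illustrative subset).
KNOWN_SITE_IDS = {"enwiki", "zh_classicalwiki", "zh_min_nanwiki", "be_x_oldwiki"}


def resolve_site_id(user_input, aliases=SITE_ID_ALIASES, known=KNOWN_SITE_IDS):
    """Map a user-supplied site identifier to its canonical site ID, or None."""
    # Generic rule: hyphens in the input are treated as underscores.
    candidate = user_input.replace("-", "_")
    # Configured aliases are looked up after the generic rule, so both
    # be-taraskwiki and be_taraskwiki reach the same alias entry.
    candidate = aliases.get(candidate, candidate)
    # Only identifiers that end up as a known canonical ID are accepted.
    return candidate if candidate in known else None
```

With this sketch, zh-classicalwiki, be-taraskwiki and be_taraskwiki all resolve to a canonical ID, and the canonical IDs themselves are still accepted unchanged.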

Acceptance criteria

  • Configuration is open for more aliases to be added for other siteids
  • The canonical site ids are still used for storage in JSON and other indexes.
  • Configuration is documented in options.md
  • Default Wikibase configuration does not have any aliases
  • Any canonical site ID that contains _ (like zh_min_nanwiki) has an alias with the underscores replaced by dashes (-> zh-min-nan) as a generic rule.
  • zh-classical, zh-min-nan and be-tarask are the IDs presented to the user in the sitelink editing UI
  • zh-classical, zh-min-nan and be-tarask are the IDs presented in the JSON output provided by Wikibase APIs (including Special:EntityData)
  • WMF production config is adjusted so that
    • be_taraskwiki and be-taraskwiki are accepted as site identifiers when adding a sitelink to the "Belarusian Taraškievica" Wikipedia (canonical site ID be_x_oldwiki)
    • identifiers containing underscores, as well as be_x_oldwiki, are still accepted as site identifiers when adding a sitelink to the respective wiki
  • Wikibase (including but not limited to WMF production) should work like this without any config change:
    • identifiers containing hyphens, e.g. zh-classicalwiki, are accepted as site identifiers and add a sitelink to the Wikipedia whose canonical site ID contains underscores instead of hyphens, e.g. the "Chinese classical" Wikipedia (canonical site ID zh_classicalwiki), both in the UI and the API
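
The display side of these criteria (zh-classical, zh-min-nan and be-tarask shown in the sitelink editing UI) might look like the sketch below. It is written in Python purely for illustration. `PREFERRED_ALIASES`, `GROUP_SUFFIXES` and `display_id` are hypothetical names, and the idea that the UI strips a per-group suffix such as "wiki" is an assumption.

```python
# Hypothetical preferred-alias configuration: canonical site ID -> preferred ID.
PREFERRED_ALIASES = {"be_x_oldwiki": "be_taraskwiki"}

# Assumed suffix that the UI strips for each sitelink group (illustrative).
GROUP_SUFFIXES = {"wikipedia": "wiki", "wikiquote": "wikiquote"}


def display_id(canonical_id, group, preferred=PREFERRED_ALIASES, suffixes=GROUP_SUFFIXES):
    """Return the identifier shown to users for a stored canonical site ID."""
    # Use the configured preferred alias if there is one.
    site_id = preferred.get(canonical_id, canonical_id)
    # Within a group the suffix is redundant, so drop it for display.
    suffix = suffixes.get(group, "")
    if suffix and site_id.endswith(suffix):
        site_id = site_id[: -len(suffix)]
    # Generic rule: underscores are shown to humans as hyphens.
    return site_id.replace("_", "-")
```

The stored value is never changed here; only the string presented to the user differs from the canonical site ID.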

Original bug: (includes the description of possible approach, not expected to be followed)

Wikibase (and MediaWiki) need a more flexible way to handle site ids. In particular:

For API input (for wbaddsitelink, etc) several aliases should be supported per wiki.

  • In addition to the global ID, at least the domain should be usable as a wiki id
  • it should be possible to define additional aliases for input, for use when wikis get renamed, as was recently the case for be-x-old -> be-tarask.

For manual input in the UI, at least the above aliases should be supported

  • in addition, per-group IDs/Aliases should be supported (e.g. "en" means "enwiki" in context of the "wikipedia" group)
  • these aliases should be provided to the UI by the SitesModule

For output, two "labels" should be available:

  • a long, globally unique label, which would also work as input to the UI widget and API module. The full domain name of the wiki should do.
  • a per-group shorthand, which would also work as input to the UI widget. This would usually be the language code, e.g. "en" for en.wikipedia.org

To achieve the above, we need a service (or several services) that provide the following functions:

getGlobalAliases( $globalSiteId ): string[] // all globally unique aliases for $globalId
getLocalAliases( $groupId, $globalSiteId ): string[] // all aliases unique within the given group (including the global ones)
getGlobalName( $globalSiteId ): string // the preferred name that is also a globally unique alias 
getLocalName( $groupId, $globalSiteId ): string // the preferred name that is also an alias unique in the given group

getAllGlobalAliases(): string[][] // map siteId -> list of globally unique aliases
getAllLocalAliases( $group ): string[][] // map siteId -> list of all locally unique aliases for members of the given group

resolveAlias( $alias, $group = null ): string // return the global site ID for the given alias. Local aliases are supported if $group is given.

These functions would probably be implemented on top of a SiteList. SiteList and Site may have to be extended to provide access to additional information. The schema of the sites table should be flexible enough to accommodate all we need. The information in the SiteList can be mapped as follows:

  • the global ID is used as the primary identifier, as well as a global alias (and thus also a local alias).
  • all "local ids" (navigation ids, interwiki ids) would also count as global ids. Note the different meaning of "local" in this context
  • a site's domain name would act as a global id, as well as the "global label"
  • a site's subdomain would act as a local id, as well as the "local label" (alternatively, we could use the language code)
  • additional aliases can be stored as "extra data"
  • the site's global and local label can be overwritten by "extra data"
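
A rough Python sketch of how some of the functions listed earlier could sit on top of such site data is given below. It is not the proposed PHP service. `SiteRecord` and its fields are hypothetical stand-ins for Site, and the sketch assumes that global aliases always take precedence over local ones.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """Illustrative stand-in for a Site entry; the field names are assumptions."""
    global_id: str   # e.g. "be_x_oldwiki"
    group: str       # e.g. "wikipedia"
    domain: str      # e.g. "be-tarask.wikipedia.org"
    extra_aliases: list = field(default_factory=list)  # from "extra data"


def get_global_aliases(site):
    # The global ID and the domain are globally unique; extra aliases are
    # assumed to be globally unique as well.
    return [site.global_id, site.domain, *site.extra_aliases]


def get_local_aliases(site):
    # The subdomain is only unique within the group; global aliases also count.
    subdomain = site.domain.split(".", 1)[0]
    return [subdomain, *get_global_aliases(site)]


def resolve_alias(sites, alias, group=None):
    """Return the global site ID for an alias, or None if nothing matches."""
    for site in sites:
        if alias in get_global_aliases(site):
            return site.global_id
    # Local aliases are only considered when a group is given.
    if group is not None:
        for site in sites:
            if site.group == group and alias in get_local_aliases(site):
                return site.global_id
    return None
```

In this sketch "en" alone resolves to nothing, while "en" in the context of the "wikipedia" group resolves to enwiki, which mirrors the per-group shorthand described above.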

Notes
This would be created in Wikibase where it could be proved and later pushed to MediaWiki core.
As noted above regardless of which approach is taken some core objects such as Site and SiteList might need modifications.
Data about the mappings needs to come from somewhere. Core currently stores most of this in a DB table, but that is painful. Perhaps this approach should just take this from config or a file.

This task is only about having a tested merged implementation of the service.
Usage of the service would be specified in separate tasks.

Event Timeline

daniel triaged this task as High priority. Feb 14 2016, 4:52 PM

This blocks wiki renaming, see T21986

@Ladsgroup I am looking. Not immediately for certain.

Addshore renamed this task from [Task] Implement SiteIdMapper service to [Task] Implement SiteIdMapper PHP service. Aug 3 2020, 8:38 AM

@Addshore @Ladsgroup:
Having looked at this task with developers, we were not clear about certain details which are probably obvious, but we didn't find answers just by looking at this task. Incomplete list below:

  • While the task provides a quite detailed outline of the contract of the requested service, we didn't find any details on where/how the service should be used. I suspect it might be in one of the 200+ linked tasks, but I didn't get to digest them.
  • It is unclear whether the said service should be implemented in Wikibase (with the further possibility of upstreaming to MediaWiki later on), or whether it should be implemented straight away in MediaWiki.
  • I personally didn't catch how this task connects to Wikibase/Wikidata. Those "Sites" seem to me like a generic MediaWiki concept, and the service as a generic MediaWiki thing. Is this correct? A comment above (T114772#6214993) mentions this blocking wiki renames, which, in my ignorance, sounds to me like something WMF does, not Wikidata/WMDE.
  • Rephrasing the above: why should WMDE implement this and not the relevant WMF team? What is WMDE's use case for this service?

I am not sure if I missed any other essential question. Possibly @Pablo-WMDE @Lucas_Werkmeister_WMDE @Michael, who were active in the conversation on this, remember something extra.

To clarify on the point

I personally didn't catch how this task connects to Wikibase/Wikidata.

If the reason this service is considered needed in Wikibase is the pieces of new functionality described in the description as

For API input (for wbaddsitelink, etc) several aliases should be supported per wiki.
For manual input in the UI, at least the above aliases should be supported
For output, two "labels" should be available:

then those look like new features to be added. As such, this task is tackling the problem backwards, i.e. it prescribes the solution instead of describing the intended new/changed behaviour of the API/other product.

@Addshore @Ladsgroup:
Having looked at this task with developers, we were not clear about certain details which are probably obvious, but we didn't find answers just by looking at this task. Incomplete list below:

  • While the task provides a quite detailed outline of the contract of the requested service, we didn't find any details on where/how the service should be used. I suspect it might be in one of the 200+ linked tasks, but I didn't get to digest them.

More than anything, it's about wikis that have a different dbname and DNS record. For example, be_x_oldwiki is actually be-tarask.wikipedia.org, and editors of this wiki have to enter be_x_oldwiki in Wikidata to be able to enter their articles (while on paper, they might not even know what be_x_old is). This also blocks renaming further wikis (e.g. zh-classical.wikipedia.org to lzh.wikipedia.org). It gets even worse, as the language code for zh-classical is lzh (in the termbox, for example), but the sitelink has to stay zh_classical if we rename the wiki and don't fix this issue in Wikidata.

Here's an example of a bug this task would tackle T112426: [Bug] Querying Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old

  • It is unclear whether the said service should be implemented in Wikibase (with the further possibility of upstreaming to MediaWiki later on), or whether it should be implemented straight away in MediaWiki.

I personally think it should be in Wikibase. Our narrow use case for this is a PHP service in data access that takes a language code and a sitelink group and returns a dbname for it, e.g. returning "fawiki" for ("fa", "wikipedia"), "be_x_oldwiki" for ("be-tarask", "wikipedia") and "dewikiquote" for ("de", "wikiquote"). (I assume exceptions like be-x-old to be-tarask would be configurable hard-coded values.)

This would also simplify and centralize the logic that has become spread out.
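
To make that narrow use case concrete, here is a minimal Python sketch (the real service would be PHP in Wikibase data access). `DBNAME_EXCEPTIONS`, `GROUP_SUFFIXES` and `dbname_for` are hypothetical names, and the per-group suffix table is an assumption rather than something taken from the codebase.

```python
# Hypothetical configurable exceptions: (language code, group) -> dbname.
DBNAME_EXCEPTIONS = {("be-tarask", "wikipedia"): "be_x_oldwiki"}

# Assumed dbname suffix for each sitelink group (illustrative subset).
GROUP_SUFFIXES = {"wikipedia": "wiki", "wikiquote": "wikiquote"}


def dbname_for(language_code, group, exceptions=DBNAME_EXCEPTIONS, suffixes=GROUP_SUFFIXES):
    """Return the dbname for a language code within a sitelink group."""
    # Configured exceptions win over the generic rule.
    if (language_code, group) in exceptions:
        return exceptions[(language_code, group)]
    # Generic rule: dashes become underscores, then the group suffix is added.
    return language_code.replace("-", "_") + suffixes[group]
```

Keeping the exceptions in a single mapping is what would let a renamed wiki be reachable under its new code without touching the stored dbname.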

  • I personally didn't catch how this task connects to Wikibase/Wikidata. Those "Sites" seem to me like a generic MediaWiki concept, and the service as a generic MediaWiki thing. Is this correct? A comment above (T114772#6214993) mentions this blocking wiki renames, which, in my ignorance, sounds to me like something WMF does, not Wikidata/WMDE.

It's blocking renaming wikis due to renamed wikis not being accessible under new name in Wikidata. Fixing that I assume is our job.

  • Rephrasing the above: why should WMDE implement this and not the relevant WMF team? What is WMDE's use case for this service?

As I explained above, this is an issue with the Wikibase codebase, which assumes there is a strict mapping between language code and db name (e.g. fa, wikipedia -> fawiki), while no such mapping holds 100% of the time in reality (be-tarask, wikipedia -> be_x_oldwiki).

I am not sure if I missed any other essential question. Possibly @Pablo-WMDE @Lucas_Werkmeister_WMDE @Michael, who were active in the conversation on this, remember something extra.

HTH

Ladsgroup updated the task description.
WMDE-leszek renamed this task from [Task] Implement SiteIdMapper PHP service to Allow storing Wikidata sitelinks to wikis that have non-typical wiki ID (not matching the database name). Oct 14 2020, 10:08 AM

So we always want to store a canonical site id for the JSON and various other indexes (rather than changing it as renames happen).
Do we always want to display the canonical one in the UI, or would we prefer to show something else?
This is probably something that needs a little product consideration cc @Lydia_Pintscher
Once we have that answer I think we want to write some ACs ready for pickup and perhaps strip down some of the old description leaving a potential example of service implementation.

WMDE-leszek updated the task description.
WMDE-leszek updated the task description.
WMDE-leszek updated the task description.

Okay, it looks like the one remaining open question is around this example:

be_taraskwiki and be-taraskwiki are used to refer to the wiki whose canonical site ID is be_x_oldwiki
be-tarask is the expected way of communicating the wiki ID of the "Belarusian Taraškievica orthography" Wikipedia

This needs more than just an alias for input to be converted to a canonical ID.
It also needs a preferred representation / alias for display for some site IDs.

I'd raise the question to @Lydia_Pintscher about this then. We can technically do this in the UI, but would keep the JSON with the stored canonical ID.
Doing anything beyond that would start to expand the scope of this task a bit too much, and might make sense to be done separately.
@Ladsgroup to me this issue of display seems like a smaller part of the whole rename problem anyway?

@Ladsgroup I've adjusted the task description to focus on the issue more than on the possible solution. As an expert here, do you know that zh_classicalwiki and be_x_oldwiki are the only problematic cases in the current WMF wikis infrastructure, or were those just examples, and there is more to consider/include?

@WMDE-leszek Looks like this was not picked up in story time again, were there more open questions? (Sorry I could not attend)

It was not discussed given I assumed there are open questions in your comment T114772#6541988

For a given site we have at least the following info we want to have in the UI:

  • which group does it belong to (Wikipedia, Wikivoyage, misc, etc.)
  • which display title does it have (en, meta, wikidata, etc) <- this conversion from enwiki to en is currently done by the Site ID to interwiki gadget on Wikidata and therefore not available on other WB installations
  • which names it can be found under for the selector when adding a new sitelink (en, English, specieswiki, etc)

Hope that helps.

Okay, it looks like the one remaining open question is around this example:

be_taraskwiki and be-taraskwiki are used to refer to the wiki whose canonical site ID is be_x_oldwiki
be-tarask is the expected way of communicating the wiki ID of the "Belarusian Taraškievica orthography" Wikipedia

This needs more than just an alias for input to be converted to a canonical ID.
It also needs a preferred representation / alias for display for some site IDs.

I'd raise the question to @Lydia_Pintscher about this then. We can technically do this in the UI, but would keep the JSON with the stored canonical ID.
Doing anything beyond that would start to expand the scope of this task a bit too much, and might make sense to be done separately.

Internally, it should only be stored with its canonical value. The site global ID (the db name) never changes for a wiki, even if it gets renamed.

@Ladsgroup to me this issue of display seems like a smaller part of the whole rename problem anyway?

Especially given that most of it is in the gadget, I assume it's not our job, but we need to make sure the validation in the frontend code at least accepts aliases.

@Ladsgroup I've adjusted the task description to focus on the issue more than on the possible solution. As an expert here, do you know that zh_classicalwiki and be_x_oldwiki are the only problematic cases in the current WMF wikis infrastructure, or were those just examples, and there is more to consider/include?

  • zh-classical is one example of many wikis that have a dash in their language code. According to the SiteMatrix, the rest are cbk-zam (-> cbk_zamwiki), bat-smg, fiu-vro, map-bms, nds-nl, roa-rup, roa-tara, zh-min-nan and zh-yue, but we don't need to keep a hard-coded list, just the rule of - -> _
  • The only one that's completely different is be-tarask ( -> be_x_oldwiki), but if this is resolved, I assume more will follow. IIRC, zh-classical (to lzh), no (to nb) and another of the Chinese wikis will be renamed (but that's for the future)

@Ladsgroup to me this issue of display seems like a smaller part of the whole rename problem anyway?

Especially given that most of it is in the gadget, I assume it's not our job, but we need to make sure the validation in the frontend code at least accepts aliases.

Great we don't need to be concerned with that then. (for now / for this ticket)

@Ladsgroup I've adjusted the task description to focus on the issue more than on the possible solution. As an expert here, do you know that zh_classicalwiki and be_x_oldwiki are the only problematic cases in the current WMF wikis infrastructure, or were those just examples, and there is more to consider/include?

  • zh-classical is one example of many wikis that have a dash in their language code. According to the SiteMatrix, the rest are cbk-zam (-> cbk_zamwiki), bat-smg, fiu-vro, map-bms, nds-nl, roa-rup, roa-tara, zh-min-nan and zh-yue, but we don't need to keep a hard-coded list, just the rule of - -> _
  • The only one that's completely different is be-tarask ( -> be_x_oldwiki), but if this is resolved, I assume more will follow. IIRC, zh-classical (to lzh), no (to nb) and another of the Chinese wikis will be renamed (but that's for the future)

If - -> _ is to be a general path for aliases then I think the description needs to be updated to reflect that.
Currently, reading the task description, I would assume that this is just a list of ID1 -> ID2 mappings, for example.
Having something generic like - -> _ also opens up the question of whether this should happen on other Wikibases (probably not by default); this type of replacement would also need to be done in a generic way.
@Ladsgroup mind having a go at updating the ACs etc to reflect that?

I believe that would be the only remaining open point.

Addshore renamed this task from Allow storing Wikidata sitelinks to wikis that have non-typical wiki ID (not matching the database name) to Allow entering Wikidata sitelinks to wikis that have non-typical wiki ID (not matching the database name). Oct 14 2020, 2:08 PM

If - -> _ is to be a general path for aliases then I think the description needs to be updated to reflect that.

Yeah, it's a general thing, since the name of a database can't contain - in MySQL.

Currently, reading the task description, I would assume that this is just a list of ID1 -> ID2 mappings, for example.
Having something generic like - -> _ also opens up the question of whether this should happen on other Wikibases (probably not by default); this type of replacement would also need to be done in a generic way.

I assume third parties have to follow the same pattern (maybe replace it with another allowed character? that seems very unlikely) given that MySQL wouldn't let you have the dash inside. I think this is even assumed in our codebase already (can't find it though)

@Ladsgroup mind having a go at updating the ACs etc to reflect that?

Sure.

I believe that would be the only remaining open point.

Changed the AC, feel free to edit mercilessly.

For a given site we have at least the following info we want to have in the UI:

  • which group does it belong to (Wikipedia, Wikivoyage, misc, etc.)
  • which display title does it have (en, meta, wikidata, etc) <- this conversion from enwiki to en is currently done by the Site ID to interwiki gadget on Wikidata and therefore not available on other WB installations
  • which names it can be found under for the selector when adding a new sitelink (en, English, specieswiki, etc)

Hope that helps.

Thanks @Lydia_Pintscher. For the purpose of this task/feature request, could you please define what the expected behaviour should be for showing the "wiki ID"/"site ID", e.g. when adding a sitelink (in the site ID selector), for the following wikis:

  • Chinese classical Wikipedia: currently zh_classical. Should it be zh-classicalwiki? Or some other value?
  • Chinese Southern Min Wikipedia: currently zh_min_nan. Should it be zh-min-nan? Or some other value?
  • "Belarusian Taraškievica orthography" Wikipedia: currently be_x_oldwiki. Should it be be-tarask? Or be_tarask? Or some other value?

thanks

Thanks @Lydia_Pintscher. For the purpose of this task/feature request, could you please define what the expected behaviour should be for showing the "wiki ID"/"site ID", e.g. when adding a sitelink (in the site ID selector), for the following wikis:

  • Chinese classical Wikipedia: currently zh_classical. Should it be zh-classicalwiki? Or some other value?
  • Chinese Southern Min Wikipedia: currently zh_min_nan. Should it be zh-min-nan? Or some other value?
  • "Belarusian Taraškievica orthography" Wikipedia: currently be_x_oldwiki. Should it be be-tarask? Or be_tarask? Or some other value?

thanks

I believe we should follow this behaviour:

image.png (54×183 px, 2 KB)

And I think - is better than _ as a separator we show to humans.

So then we'd end up with zh-classical, zh-min-nan and be-tarask unless @Ladsgroup has objections.

The user input part of this seems to be working already – I can type be-tarask and get a suggestion for be_x_oldwiki:

be-tarask.png (107×372 px, 22 KB)

zh-classical and zh-min-nan also work.

The user input part of this seems to be working already – I can type be-tarask and get a suggestion for be_x_oldwiki:

be-tarask.png (107×372 px, 22 KB)

zh-classical and zh-min-nan also work.

Two notes:

  • That part is handled by the UI. I assume we need something that also handles API requests (especially to reduce the drift between UI and API behavior).
  • Also, I think it should be the other way around: the UI should not suggest the old (canonical db) name; for the sake of a better user experience, it should suggest the alias instead. Does that make sense?
  • Also, I think it should be the other way around: the UI should not suggest the old (canonical db) name; for the sake of a better user experience, it should suggest the alias instead. Does that make sense?

this is what is described in the task currently. good it has been confirmed

I assume we need something that also handles API requests (especially to reduce the drift between UI and API behavior).

Noted for the task - it is unclear what the current API behaviour is

WMDE-leszek updated the task description.
WMDE-leszek updated the task description.

Superseded by more "atomic" T267791 and T267793

re-opening per the claim this task is being observed by the larger audience

What is the question @Ladsgroup?
The topic has been investigated by the WMDE team: T269139 and next steps are to be outlined afterwards.

I was asking for an update. That aforementioned ticket has not been connected to this one and thus I haven't seen any update on this for two months.

zh-classical, zh-min-nan and be-tarask are the IDs presented in the JSON output provided by Wikibase APIs (including Special:EntityData)

This is listed as one of the ACs, but it can't be done: the full ID must be used in external output, where grouping of sitelinks is not done.
Grouping is only done in the UI, so we can only strip things like "wiki" there.