[Bug] Querying Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old
Open, High, Public

Description

be-x-old.wikipedia.org was renamed to be-tarask.wikipedia.org (T11823).

Now, querying the API with langlinks for be-tarask doesn't work, but querying for be-x-old does work.

See:

I'd expect both to work: be-tarask as the current code and be-x-old for backward compatibility.
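
For reference, the two queries being compared can be built along these lines. This is only a sketch: the endpoint (English Wikipedia) and the page title "Belarus" are assumptions chosen for illustration, and `build_langlinks_url` is a hypothetical helper. `action=query`, `prop=langlinks` and `lllang` are the standard Action API parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; other Wikimedia wikis expose the same path.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_langlinks_url(title: str, lang: str) -> str:
    """Build an Action API URL asking for the langlink of `title` in `lang`."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": lang,  # the language code under discussion in this task
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Print one URL per code so the two responses can be compared by hand.
    for code in ("be-tarask", "be-x-old"):
        print(build_langlinks_url("Belarus", code))
```

The two URLs differ only in the `lllang` value, so any difference in the responses comes from how that code is matched against the stored langlinks.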

I'm not sure whether it's a bug in core, in Wikimedia configuration, in Wikidata, or elsewhere. This might be related to T111822, but I'm creating this bug in case it isn't.


Specification:

  • One-ILL-per-language
  • Do not replace ILL prefixes, so that the URL is resolved correctly from the interwiki / interlanguage table ( interwiki iw_prefix field )
  • Allow multiple ILL prefixes for the same ILL language code
    • InterlanguageLinkCodeMap
  • Allow treating a deprecated language code as a different ILL language, especially for codes that overlap with BCP 47 ( ExtraInterlanguageLinkPrefixes )
  • Make API lang parameters case-insensitive
    • Decide whether to support BCP 47 language codes for API lang parameters
  • Split lang parameters into lang and prefix
    • lang (used as a prefix) => prefix ( only one of llprefix=be-x-old or llprefix=be-tarask works )
    • lang (used as a language) => lang ( both lllang=be-tarask and lllang=be-x-old should work )
  • Output both the prefix and the BCP 47 language code for API queries
    • Decide whether to output the MediaWiki internal language code for API queries
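
The proposed split between lang and prefix could behave roughly as sketched below. Everything here is hypothetical: the alias map, the function names and the row shape are assumptions for illustration, not the MediaWiki implementation, and the canonical code merely stands in for the BCP 47 code in the output.

```python
# Hypothetical alias map: deprecated code -> current code.
DEPRECATED_LANG_CODES = {"be-x-old": "be-tarask"}


def canonical_lang(code: str) -> str:
    """Lower-case the code and resolve a deprecated alias, if any."""
    code = code.lower()
    return DEPRECATED_LANG_CODES.get(code, code)


def filter_langlinks(rows, lang=None, prefix=None):
    """Filter stored langlinks rows (dicts with 'prefix' and 'title').

    `prefix` must match the stored prefix exactly (case-insensitively);
    `lang` matches any prefix that resolves to the same language.
    """
    result = []
    for row in rows:
        if prefix is not None and row["prefix"] != prefix.lower():
            continue
        if lang is not None and canonical_lang(row["prefix"]) != canonical_lang(lang):
            continue
        # Output both the stored prefix and the resolved language code.
        result.append(
            {
                "prefix": row["prefix"],
                "lang": canonical_lang(row["prefix"]),
                "title": row["title"],
            }
        )
    return result


if __name__ == "__main__":
    rows = [
        {"prefix": "be-x-old", "title": "Беларусь"},
        {"prefix": "de", "title": "Belarus"},
    ]
    print(filter_langlinks(rows, lang="be-tarask"))    # matched via the alias
    print(filter_langlinks(rows, prefix="be-tarask"))  # no row has this prefix
```

With this shape, no duplicate rows are needed in the langlinks table: the aliasing happens at query time.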

Event Timeline

This is because be-x-old is still in the wb_items_per_site table and other places in the database (e.g. the entity JSON blobs).

We need to figure out a migration strategy for this and such renames in general.

This might include adding a new entry in the sites table with be_tarask as the site id and keeping the old id, but marking it somehow as inactive.

When new site links are added, don't allow new site links to be_x_old(?) and don't allow there to be site links to both be_x_old and be_tarask?

Have a maintenance script update existing site links to the new site id.

In old revisions, we still have the old site id and everything (e.g. viewing diffs, etc.) still needs to work with it.
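
The maintenance-script idea above might look roughly like this for a single item's site links. It is a sketch only: the new site ID `be_taraskwiki`, the dict shape and the function name are assumptions, and a real script would operate on the database and the entity JSON rather than on an in-memory dict.

```python
OLD_SITE_ID = "be_x_oldwiki"
NEW_SITE_ID = "be_taraskwiki"  # assumed spelling of the new site ID


def migrate_sitelinks(sitelinks, old=OLD_SITE_ID, new=NEW_SITE_ID):
    """Return a copy of `sitelinks` (site ID -> page title) with `old` renamed to `new`.

    Raises ValueError when both IDs are present with different titles, since
    the thread suggests an item should not link to both sites at once.
    """
    if old not in sitelinks:
        return dict(sitelinks)  # nothing to migrate
    if new in sitelinks and sitelinks[new] != sitelinks[old]:
        raise ValueError(f"conflicting site links for {old} and {new}")
    migrated = {site: title for site, title in sitelinks.items() if site != old}
    migrated[new] = sitelinks[old]
    return migrated


if __name__ == "__main__":
    print(migrate_sitelinks({"enwiki": "Belarus", "be_x_oldwiki": "Беларусь"}))
```

Old revisions are deliberately left out of this sketch; as noted above, they keep the old site ID and still need to render.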

Possibly the rename of existing site links could be a bot task.

Probably not something we can fix tomorrow, but I have suggested this as something high priority for our next sprint (starting Tuesday).

It seems it's not possible to add a sitelink to be_tarask yet. If it becomes possible, I will do the bot job :)

Best

As far as the API goes, it just uses whatever is in the langlinks table (which, as seems to be implicitly noted already, comes from wikidata.org now). I doubt we'll end up having both working as queries, as that would require duplicate entries in said table.

For reference: when generating langlinks based on information from Wikidata, Wikibase relies on information from the site_identifiers table. The respective code could be added or replaced there. Not sure what would happen if multiple interlanguage prefixes were defined there for the same wiki, though.

JanZerebecki renamed this task from "Querying English Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old" to "[Bug] Querying English Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old". Sep 18 2015, 1:37 PM
JanZerebecki lowered the priority of this task from Unbreak Now! to High.
JanZerebecki moved this task from incoming to consider for next sprint on the Wikidata board.
Ricordisamoa renamed this task from "[Bug] Querying English Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old" to "[Bug] Querying Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old". Sep 18 2015, 2:02 PM

Also entries like https://www.wikidata.org/wiki/Q8937989 probably need to be updated if the language codes changed.

Mmmm... any update about this? This breaks some ContentTranslation features (such as T112285), and delays the renaming of more Wikimedia domains to standard language codes (T21986).

Urbanecm added a subscriber: Urbanecm.

Bugs aren't for Wikimedia-Site-requests I think.

I just ran into this issue while debugging why c:Template:Label is not working correctly for users coming from the "be-tarask" wiki. I finally tracked it down to the fact that mw.wikibase.getEntity('Q1'):getSitelink( 'be-taraskwiki' ) does not return anything while mw.wikibase.getEntity('Q1'):getSitelink( 'be_x_oldwiki' ) does. Even more confusing, mw.wikibase.getEntity('Q1'):getLabel('be-tarask') returns a label but mw.wikibase.getEntity('Q1'):getLabel('be_x_old') does not. So for the same language, "be-tarask" a.k.a. "be_x_old", different functions require different language codes to work.

What is the current progress? From what I see mentioned here, a simple database migration has to be written to change be-x-old to be-tarask in several tables as well as in already generated json-blobs. Is it that difficult a task? I realise that a CLI script doing something like this might end up running for a while but I do not see evidence that it is really a blocker here. What can be done to facilitate this moving forward?

This may or may not be good news for us: T209089. Although it literally looks like it only fixes the sidebar, its principles also apply here.

https://github.com/wikimedia/Wikibase/blob/8dbd84e/client/includes/Hooks/LangLinkHandler.php#L322-L344

T137537: Ensure correct information about Wikimedia sites in the Sites facility on the Wikimedia cluster.

client/includes/Hooks/LangLinkHandler.php#L322-L344
	/**
	 * Extracts the local interwiki code, which in case of the
	 * wikimedia site groups, is always the global id's prefix.
	 *
	 * @fixme put somewhere more sane and use site identifiers data,
	 * so that this works in non-wikimedia cases where the assumption
	 * is not true.
	 *
	 * @param Site $site
	 *
	 * @return string
	 */
	public function getInterwikiCodeFromSite( Site $site ) {
		// FIXME: We should use $site->getInterwikiIds, but the interwiki ids in
		// the sites table are wrong currently, see T137537.
		$id = $site->getGlobalId();
		$id = preg_replace( '/(wiki\w*|wiktionary)$/', '', $id );
		$id = strtr( $id, [ '_' => '-' ] );
		if ( !$id ) {
			$id = $site->getLanguageCode();
		}
		return $id;
	}
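
To make the effect of this method concrete, here is a small Python port of the same string logic. It is a sketch for illustration, not Wikibase code: the function name and the plain-string arguments (standing in for `$site->getGlobalId()` and `$site->getLanguageCode()`) are assumptions.

```python
import re

# Same pattern as the PHP preg_replace call above.
_SITE_GROUP_SUFFIX = re.compile(r"(wiki\w*|wiktionary)$")


def interwiki_code_from_global_id(global_id: str, language_code: str) -> str:
    """Derive the interwiki prefix from a site's global ID."""
    code = _SITE_GROUP_SUFFIX.sub("", global_id)  # strip the site-group suffix
    code = code.replace("_", "-")                 # database-style -> prefix-style
    return code or language_code                  # fall back if nothing is left


if __name__ == "__main__":
    # While the global ID is still be_x_oldwiki, the derived prefix is be-x-old.
    print(interwiki_code_from_global_id("be_x_oldwiki", "be-tarask"))
```

Because the prefix is derived from the global ID alone, the site's language code is ignored unless the whole ID is stripped away, which is why renaming the domain did not change the prefix written to the langlinks table.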

Change 876295 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/extensions/Wikibase@master] Temporary fix for Wikibase Client getInterwikiCodeFromSite

https://gerrit.wikimedia.org/r/876295

Change 879580 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] Aliasing deprecated language codes for QueryLangLinks API

https://gerrit.wikimedia.org/r/879580

Together with the MediaWiki core change mentioned above, I think this change is viable and (once the core change is reviewed) ready to merge, but it will change what’s recorded in the langlinks database table, which might confuse existing Quarry queries, tools etc. – I think this needs approval from a Wikidata PM (@Lydia_Pintscher, @Manuel, @Arian_Bozorg?), and we should consider announcing it somewhere (Tech News might work).

Change 887854 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] Support gradual migration of language links from deprecated codes - tracking categories

https://gerrit.wikimedia.org/r/887854

Mmmm... any update about this? This breaks some ContentTranslation features (such as T112285), and delays the renaming of more Wikimedia domains to standard language codes (T21986).

@Amire80 could you test to see if https://gerrit.wikimedia.org/r/c/mediawiki/core/+/879580 solves your problem? I'd like to give that patch a bit more of a test before merging it and crossing my fingers it helps.

In the task description of T112426, @Winston Sung wrote:

Specification:

  • One-ILL-per-language
  • Do not replace ILL prefixes, so that the URL is resolved correctly from the interwiki / interlanguage table ( interwiki iw_prefix field )
  • Allow multiple ILL prefixes for the same ILL language code
    • InterlanguageLinkCodeMap
  • Allow treating a deprecated language code as a different ILL language, especially for codes that overlap with BCP 47 ( ExtraInterlanguageLinkPrefixes )
  • Make API lang parameters case-insensitive
    • Decide whether to support BCP 47 language codes for API lang parameters
  • Split lang parameters into lang and prefix
    • lang (used as a prefix) => prefix ( only one of llprefix=be-x-old or llprefix=be-tarask works )
    • lang (used as a language) => lang ( both lllang=be-tarask and lllang=be-x-old should work )
  • Output both the prefix and the BCP 47 language code for API queries
    • Decide whether to output the MediaWiki internal language code for API queries

Does this sound good to you?

CC:
@Amire80
@cscott
@Lucas_Werkmeister_WMDE