Add lexeme language codes ja-hira, ja-kana, ja-hrkt
Closed, Resolved, Public


These codes mark representations of Japanese lexemes and forms by script, using the ISO 15924 codes (albeit lowercased) for Hiragana (Hira), Katakana (Kana), and the combination of Hiragana and Katakana (Hrkt).

At the moment these (save for the third) are each represented with Wikidata Qids attached to "ja-x-" (so that "Japanese written in hiragana" became "ja-x-q53979341" or "ja-x-q48332", and "Japanese written in katakana" became "ja-x-q53979342" or "ja-x-q82946").

(This request does not include ja-hani as the distinction between kyujitai and shinjitai is not reflected in ISO 15924 and for this the Qids for those character sets should remain in use; note that "Hans" and "Hant" do not map cleanly to those two sets of characters.)

(EDIT: As it turns out that using ja-jpan is discouraged (search for "Suppress-Script: Jpan" on that page), I have removed it from this request and will instead adjust uses of "ja-x-Q10997505" to use "ja" instead.)
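The replacements described above can be sketched as a simple lookup table. This is only an illustration of the intended mapping, not code from WikibaseLexeme: the names `PRIVATE_USE_TO_SCRIPT_TAG` and `migrate_tag` are hypothetical, and the table contains only the tags mentioned in this description (ja-hrkt has no private-use equivalent in use at the moment, so it does not appear).

```python
# Hypothetical lookup table: private-use tags named in this task description,
# mapped to the codes that should replace them. Keys are lowercased.
PRIVATE_USE_TO_SCRIPT_TAG = {
    "ja-x-q53979341": "ja-hira",  # Japanese written in hiragana
    "ja-x-q48332": "ja-hira",
    "ja-x-q53979342": "ja-kana",  # Japanese written in katakana
    "ja-x-q82946": "ja-kana",
    "ja-x-q10997505": "ja",       # ja-jpan is discouraged, so plain "ja"
}


def migrate_tag(tag: str) -> str:
    """Return the replacement for a known private-use tag, else the tag unchanged."""
    return PRIVATE_USE_TO_SCRIPT_TAG.get(tag.lower(), tag)
```

Tags that are not in the table, such as the Qid-based ones for kyujitai and shinjitai, pass through untouched, which matches the scope of this request.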

Event Timeline

Shouldn't it be like in Hebrew, where the same lexeme has a default spelling (probably the same as ja-jpan) and variants identified by Q numbers?

I'm not really opposed to it, but what's the advantage of ja-hira, etc.?

@Amire80 To my knowledge it is not mandated anywhere that all variants of the representation of a lexeme lemma/form must use Q number private use subtags; rather, such uses are possible if other existing subtags within BCP47 cannot adequately indicate the necessary differences. The indication of Japanese written in different scripts can already be done with the BCP47 script subtag, so, strictly within the scope of language codes, the items I mentioned which are currently being used for those indications are redundant. Also, as I noted above, the distinction between kyujitai and shinjitai does not lend itself to a non-private-use indicator within the set of possible "ja" language tags, so this task is not meant to discourage the use of those private use subtags in that case.

And besides, with respect to items, we already have language tags with different ISO 15924 codes for Tunisian Arabic, Crimean Tatar, Goan Konkani, Eastern Canadian Inuktitut, Kazakh, Kashmiri, Kurmanji, Megleno-Romanian, Tachelhit, Serbian, Tajik, Tatar, Uyghur, and Uzbek, so there is clearly a precedent for the inclusion of codes of the sort which are the subject of this task.

I was going to say the same as @Mahir256: We use xx-x-Q### tags only when there is no valid BCP47 code available for use (in MediaWiki), but using the proper codes is definitely preferable to these private use codes. So these codes seem like perfectly good additions to me.
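To make the distinction discussed in the comments above concrete, here is a minimal sketch that splits a tag into its language, script subtag, and private-use part. It assumes a deliberately simplified tag shape (language, optional four-letter script, optional "x-..." tail) and is not a full BCP47 parser; `describe_tag` is a hypothetical helper, not part of MediaWiki or Wikibase. It also does not enforce the BCP47 length limit on private-use subtags, since the Qid-based tags quoted in this task would exceed it.

```python
import re

# Simplified shape only: language, optional 4-letter script subtag,
# optional private-use tail introduced by "x". Not a full BCP47 grammar.
_TAG = re.compile(
    r"^(?P<lang>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-x-(?P<private>[a-z0-9]+(?:-[a-z0-9]+)*))?$",
    re.IGNORECASE,
)


def describe_tag(tag):
    """Split a tag into language, script and private-use parts, or return None."""
    m = _TAG.match(tag)
    if m is None:
        return None
    return {
        "language": m["lang"].lower(),
        # ISO 15924 codes are conventionally title-cased, e.g. "Hira".
        "script": m["script"].title() if m["script"] else None,
        "private_use": m["private"].lower() if m["private"] else None,
    }
```

For example, `describe_tag("ja-hira")` reports a script subtag and no private-use part, while `describe_tag("ja-x-q48332")` reports the opposite.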

Mahir256 renamed this task from Add lexeme language codes ja-hira, ja-kana, ja-hrkt, ja-jpan to Add lexeme language codes ja-hira, ja-kana, ja-hrkt. Sep 10 2020, 12:05 AM
Mahir256 updated the task description.

Since @Amire80 didn't object to a patch, I will make one once T254968 gets merged. I refactored the code for extra languages a bit, and it's easier to make a patch when I can base it on that code.

Change 627568 had a related patch set uploaded (by Mbch331; owner: Mbch331):
[mediawiki/extensions/WikibaseLexeme@master] Add several lexeme language codes

Change 627568 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add several lexeme language codes