
Add monolingual language codes nan-hani, cdo-hani, hak-hans, hak-hant
Closed, Resolved (Public)

Description

Please add the following language codes to the list of language codes supported for monolingual text values.

The language code: nan-Hani
Language name in the language itself or English: Min Nan (Hanji)
The used script, if not obvious: Hani
Where and when the language was or is used: Minnan-speaking area, modern era
The Wikidata item id: Q15901848

The language code: cdo-Hani
Language name in the language itself or English: Min Dong (Chinese characters)
The used script, if not obvious: Hani
Where and when the language was or is used: Min-Dong-speaking people. Modern era.
The Wikidata item id: Q5365165

The language code: hak-Hans
Language name in the language itself or English: Hakka (Chinese character, Simplified)
The used script, if not obvious: Hans
Where and when the language was or is used: mainland China, modern era
The Wikidata item id: Q22827960

The language code: hak-Hant
Language name in the language itself or English: Hakka (Chinese character, Traditional)
The used script, if not obvious: Hant
Where and when the language was or is used: Taiwan, Hong Kong, etc., modern era
The Wikidata item id: Q18165189

Usage example: Use for Wikidata items like Q865. Sample statements:

  • nan-hani: missing
  • cdo-hani: missing
  • hak-hans: missing
  • hak-hant: missing

Event Timeline

Ab6399 added a subscriber: Ab6399.

I will work on this issue

Change 555688 had a related patch set uploaded (by Ab6399; owner: Ab6399):
[mediawiki/extensions/Wikibase@master] Add several monolingual languages

https://gerrit.wikimedia.org/r/555688

Does langcom approve this? I couldn't find any clear approval so far.

I started a discussion in the Langcom about this.

In T180771#4544573, @C933103 wrote:

And then for hak... Can someone verify that "Hakka (Traditional Han script)" and "Hakka (Simplified Han Script)" are the proper way to describe how Hakka speakers would write their language in Han scripts?

Of course it is one of the correct ways to write this language. The Ministry of Education of the ROC presents the Literary Award of Taiwanese and Hakka (教育部閩客語文學獎; their website is https://www.edu.tw) every year. You can see hak-hant here.

Hello, my question was NOT whether Hakka can be written in Chinese script (I know it can). My question was whether there are meaningful differences between "Hakka with Simplified characters" and "Hakka with Traditional characters". Some people mentioned earlier that in certain other Chinese languages, characters used by the Simplified script serve a different function in the written form of that language, which makes it almost impossible to write the language in Simplified script, so there is no need to distinguish Simplified from Traditional Chinese for that language. What I would like to know is whether Hakka also fits this situation.

I'm not hearing any objections from the Language Committee, so I'm probably going to start adding these codes.

Let's start with nan-hani. What will be the autonym for it?

Lydia_Pintscher changed the task status from Open to Stalled.Sep 18 2020, 7:08 PM
Lydia_Pintscher added a subscriber: Lydia_Pintscher.

Let's start with nan-hani. What will be the autonym for it?

Can someone answer this?
Marking as stalled until we have an answer.

I'm not hearing any objections from the Language Committee, so I'm probably going to start adding these codes.

Let's start with nan-hani. What will be the autonym for it?

Based on the other Chinese autonyms in langdb.yaml, I would suggest "閩南語(漢字)". The first three characters are the language name in Chinese characters (as given on zh-min-nan.wikipedia.org) and the two characters inside the brackets are the word for Chinese characters (again, as used on zh-min-nan.wikipedia.org, see the last two characters of the autonym for cdo-hani too).

Mbch331 changed the task status from Stalled to Open.Dec 31 2020, 10:04 AM

Shouldn't they be lowercase for consistency?

Shouldn't they be lowercase for consistency?

Yes. If langcom agrees on all codes, I'll submit a patch with all of them in lowercase, otherwise only with the approved languages.
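For illustration only, the lowercasing discussed here can be sketched as below. This is a minimal sketch, assuming the only difference between the codes in the task description (e.g. nan-Hani) and the form used in the patches (e.g. nan-hani) is letter case. The helper name to_internal_code is hypothetical and is not part of Wikibase or cldr.

```python
def to_internal_code(code: str) -> str:
    """Lowercase a requested code such as 'nan-Hani' (hypothetical helper)."""
    return code.strip().lower()


# Codes exactly as written in the task description.
requested = ["nan-Hani", "cdo-Hani", "hak-Hans", "hak-Hant"]

# Lowercased forms, matching the spelling used in the task title.
internal = [to_internal_code(code) for code in requested]
```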

Let's do nan-hani and see how it works.

Actually no, a moment.

If the autonym for cdo-hani doesn't have parentheses, should nan-hani have parentheses? I'd really love to hear from someone who knows Chinese well.

OK, I received several comments from speakers saying that parentheses are OK (example), so let's do nan-hani with 閩南語(漢字).

Change 669822 had a related patch set uploaded (by Amire80; owner: Amire80):
[mediawiki/extensions/UniversalLanguageSelector@master] Update jquery.uls from upstream

https://gerrit.wikimedia.org/r/669822

Change 669822 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Update jquery.uls from upstream

https://gerrit.wikimedia.org/r/669822

Change 669925 had a related patch set uploaded (by Mbch331; owner: Mbch331):
[mediawiki/extensions/Wikibase@master] Add monolingual language code nan-hani

https://gerrit.wikimedia.org/r/669925

Change 669930 had a related patch set uploaded (by Mbch331; owner: Mbch331):
[mediawiki/extensions/cldr@master] Add monolingual language code nan-hani

https://gerrit.wikimedia.org/r/669930

Change 669925 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add monolingual language code nan-hani

https://gerrit.wikimedia.org/r/669925

Change 669930 merged by jenkins-bot:
[mediawiki/extensions/cldr@master] Add monolingual language code nan-hani

https://gerrit.wikimedia.org/r/669930

This comment was removed by Yejianfei.

Great job! We have added the language code nan-hani.

Now it is time to add the language code cdo-hani.

Code | English name | Autonym | Autonym (alternatives)
cdo-latn | Min Dong Chinese (Foochow Romanized) | Mìng-dĕ̤ng-ngṳ̄ (Bàng-uâ-cê) | Mìng-dĕ̤ng-ngṳ̄ Bàng-uâ-cê
cdo-hani | Min Dong Chinese (Chinese characters) | 閩東語(漢字) | 閩東語漢字
noarave added a subscriber: noarave.

nan-hani is merged, stalling this on the campsite board until the additional language codes are approved by LangCom.

@Yejianfei There is no Langcom approval yet to add those languages.

Change 672648 had a related patch set uploaded (by Aklapper; owner: Yejianfei):
[mediawiki/extensions/Wikibase@master] Add monolingual code cdo-hani and cdo-latn

https://gerrit.wikimedia.org/r/672648

@Yejianfei There is no Langcom approval yet to add those languages.

To clarify, the keyword here is "yet". I'm not against cdo-hani in principle. I just wanted to make sure that when nan-hani is deployed, it works as expected. Is nan-hani now deployed? Does it work as expected? Can anyone give some examples?

@Amire80 Yes, nan-hani "Min Nan (Hanji)" was deployed and seems to be working as expected, thank you. I did not see any real live examples of usage yet.

@Manuel: Any examples yet? @Amire80: Is cdo-hani ok as well, or do you still want to wait?

Retracted in reaction to @Nikki's comment T180771#7158281

@Mbch331: nan-hani monolingual language code has 0 uses in Wikidata to date.

Retracted in reaction to @Nikki's comment T180771#7158281

Statistics about monolingual language code use in Wikidata (15 June 2021)

Our SPARQL queries timed out (e.g. https://w.wiki/3Tpz). So @Ladsgroup ended up running a dump-based query instead. We used the opportunity to get a broader look at monolingual use in general if you are interested:

https://gist.github.com/Ladsgroup/ccc7d885f8f57f32b52e969920b4a3a3

Autonym for nan-hani:

@Yejianfei There is no Langcom approval yet to add those languages.

To clarify, the keyword here is "yet". I'm not against cdo-hani in principle. I just wanted to make sure that when nan-hani is deployed, it works as expected. Is nan-hani now deployed? Does it work as expected? Can anyone give some examples?

I have just added the nan-hani label to a few Wikidata items, based either on the Hani version of the article title on nan Wikipedia, or on the Hani lang template used for the title in Latin-character articles there. Examples include Q703914, Q127031, Q45190, Q660947, Q36778, Q2914034. I think it is working as expected.

p.s. It seems that nan Wikipedia is trying to use either a namespace or a category to group articles written in Hani, but neither appears to be comprehensive. Due to a problem in Wikidata, those articles are also undiscoverable from Wikidata, making them hard to find ...

p.p.s. Should someone post about this on nan wikipedia Village pump?

Statistics about monolingual language code use in Wikidata (15 June 2021)

Our SPARQL queries timed out (e.g. https://w.wiki/3Tpz). So @Ladsgroup ended up running a dump-based query instead. We used the opportunity to get a broader look at monolingual use in general if you are interested:

https://gist.github.com/Ladsgroup/ccc7d885f8f57f32b52e969920b4a3a3

That list doesn't seem to be accurate. I can't find nod in the list, but it's used twice on https://www.wikidata.org/wiki/Q565110 (added in 2016 and 2019, so not new either).

Thank you @Nikki for making me aware of this! I have now retracted my original comments.

Thanks for flagging this. I ran the code on that particular item and it recorded the nod usages. It seems the dumps on the stat machines are somewhat broken. I will look into it.

Okay, I have had enough of the json dumps moving around breaking the script. I ran this hadoop query (that's basically two weeks old)

-- Count monolingual-text claims on items, grouped by language code.
SELECT regexp_extract(claim.mainsnak.datavalue.value, ',\"language\"\\:"(.+?)"', 1),
       count(*) AS hitcount
FROM wmf.wikidata_entity
LATERAL VIEW explode(claims) t AS claim
WHERE snapshot = '2021-06-07'
  AND typ = 'item'
  AND claim.mainsnak.datatype = 'monolingualtext'
GROUP BY regexp_extract(claim.mainsnak.datavalue.value, ',\"language\"\\:"(.+?)"', 1)
ORDER BY hitcount DESC
LIMIT 1000;

The result is P16694
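To make the extraction step in the query above easier to follow, here is a rough Python sketch of the same idea: pull the language code out of each serialized monolingual-text value with a regular expression, then count how often each code occurs. It assumes the value is a JSON string in which the "language" key follows another key (hence the leading comma in the pattern). The sample values and the function name count_languages are made up for illustration and are not taken from the actual dump.

```python
import re
from collections import Counter

# Same pattern as in the Hive query, written as a Python raw string.
LANGUAGE_RE = re.compile(r',"language":"(.+?)"')


def count_languages(values):
    """Count language codes in serialized monolingual-text values.

    Values without a match are counted under the empty string; this is an
    assumption about how non-matching rows should be grouped.
    """
    counts = Counter()
    for value in values:
        match = LANGUAGE_RE.search(value)
        counts[match.group(1) if match else ""] += 1
    # Highest count first, like ORDER BY hitcount DESC.
    return counts.most_common()


# Hypothetical sample values, not real dump rows.
sample_values = [
    '{"text":"閩南語","language":"nan-hani"}',
    '{"text":"Hakka","language":"en"}',
    '{"text":"Min Nan","language":"en"}',
]

result = count_languages(sample_values)
```

With these sample values, result lists "en" with a count of 2 ahead of "nan-hani" with a count of 1.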

Statistics about monolingual language code use in Wikidata (7 June 2021)

Thanks to @Ladsgroup we now have basic data on monolingual code use in Wikidata:

Frequency of monolingual language code uses in Wikidata items
(N=46,334,060 claims with monolingual language code)

See the last comment for the Hadoop hive query that was used.

As the current Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) unstaller, what is this task stalled on?
How can we unblock it?
Should this remain as part of the iteration? Can this be reviewed now? Or should the remaining parts be split out?

As the current Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) unstaller, what is this task stalled on?

This is blocked on LangCom approval for the remaining languages.

The request isn't complete: it lacks samples.

The request isn't complete: it lacks samples.

I added them in the comments above?

Now I understand this better, and I support them all.

Esc3300 renamed this task from Add monolingual language code nan-hani, cdo-hani, hak-hans, hak-hant to Add monolingual language codes nan-hani, cdo-hani, hak-hans, hak-hant.Jul 11 2021, 12:37 PM
Esc3300 updated the task description. (Show Details)

Statistics about monolingual language code use in Wikidata (7 June 2021)

Thanks to @Ladsgroup we now have basic data on monolingual code use in Wikidata:

Great. I included them in Help:Wikimedia_language_codes/lists/all through Template:Tr_langcodes_counts_monolingual_text. It would be good if the data there could occasionally be updated.

Change 705161 had a related patch set uploaded (by Mbch331; author: Mbch331):

[mediawiki/extensions/cldr@master] Add monolingual codes mix, cdo-hani, hak-hans, hak-hant

https://gerrit.wikimedia.org/r/705161

Change 705162 had a related patch set uploaded (by Mbch331; author: Mbch331):

[mediawiki/extensions/Wikibase@master] Add monolingual codes mix, cdo-hani, hak-hans, hak-hant

https://gerrit.wikimedia.org/r/705162

Change 705161 merged by jenkins-bot:

[mediawiki/extensions/cldr@master] Add monolingual codes mix, cdo-hani, hak-hans, hak-hant

https://gerrit.wikimedia.org/r/705161

Change 705162 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Add monolingual codes cdo-hani, hak-hans, hak-hant

https://gerrit.wikimedia.org/r/705162

Seems to be live, at least nan-hani

That one was done before. The other languages aren't live yet, because the patch was merged after the cutoff date for this week's train. They will be rolled out with next week's train.

Seems to be live, at least nan-hani

That one was done before. The other languages aren't live yet, because the patch was merged after the cutoff date for this week's train. They will be rolled out with next week's train.

These should be rolling out today then? :)