
Add lexeme language codes ccp, ccp-beng, rhg-rohg
Closed, Resolved, Public

Description

This ticket requests language codes for the representations of lemmata and forms in 1) Chakma: ccp, unqualified, for the Chakma script, and ccp-beng, with the lowercased ISO 15924 subtag, for the Bengali script; and 2) Rohingya: rhg-rohg, with the subtag for the Hanifi script.

(There is evidence of Rohingya being written in scripts other than Hanifi, such as Arabic, Latin, and sometimes Burmese, and the request for the specific script subtag for Rohingya is intended to avoid an unexpressed preference, via an unqualified language code, for one script over another, and to allow for future requests for other script codes when desired by others.)
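The naming scheme requested above (a language subtag, optionally followed by a lowercased ISO 15924 script subtag) can be sketched as follows. This is only an illustration: the helper name `lexeme_language_code` is hypothetical and is not part of WikibaseLexeme, and the validation is deliberately minimal.

```python
import re
from typing import Optional


def lexeme_language_code(language: str, script: Optional[str] = None) -> str:
    """Join a language subtag and an optional script subtag, all lowercased.

    Hypothetical helper for illustration; not WikibaseLexeme's own code.
    """
    if not re.fullmatch(r"[A-Za-z]{2,3}", language):
        raise ValueError(f"unexpected language subtag: {language!r}")
    if script is None:
        # Unqualified code, e.g. "ccp" for Chakma in the Chakma script.
        return language.lower()
    if not re.fullmatch(r"[A-Za-z]{4}", script):
        raise ValueError(f"ISO 15924 script subtags have four letters: {script!r}")
    return f"{language.lower()}-{script.lower()}"


# The three codes requested in this ticket.
for language, script in [("ccp", None), ("ccp", "Beng"), ("rhg", "Rohg")]:
    print(lexeme_language_code(language, script))
# prints: ccp, ccp-beng, rhg-rohg (one per line)
```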

Event Timeline

The languages are probably legit, but is there anyone who actually knows them and plans to add lexemes in them?

> The languages are probably legit, but is there anyone who actually knows them and plans to add lexemes in them?

@Mahir256 Can you answer this question?

> The languages are probably legit, but is there anyone who actually knows them and plans to add lexemes in them?

Both Chakma and Rohingya are quite alive, and to my knowledge Chakma is well (whether Rohingya is also well depends on whether the people who speak it are, I suppose). I am not aware of individuals besides myself planning to add lexemes in these languages, and I have not got in touch with anyone from these communities, especially the latter, who are working on documenting their languages (although I am open to doing so once the framework of Wikidata's lexicographical data can be better explained to them using examples from their own languages).

I am interested in modeling information within the framework of Wikidata's lexicographical data about languoids in the Eastern Indo-Aryan dialect continuum ranging from Manbhumi in the southwest to Sylheti in the northeast, from Rohingya in the southeast to Rangpuri in the northwest. To this end I've collected a number of references about varieties across this region, which I am citing when creating lexemes in these varieties, but such references for Chakma and Rohingya are rather scattered and less stable (disparate PDFs, blogs, and other sites) than those for the other varieties (mostly books with WorldCat entries and scans).

Rather than waiting for more concentrated and stable resources to appear (by which time the fate of these peoples is less certain), I plan to incorporate these scattered resources as individual citations on lexemes. I do not intend to make my own judgments regarding grammatical inflections, not even by extending paradigms given in a reference, as irregularities may not be consistent. The lexemes themselves may therefore be comparatively empty (with only one form and only one sense), but the circumstances of their use, as well as their relationships with other lexemes in related varieties and with other languages for etymological reasons, can be better drawn out with lexicographical data. A first step in making this entire endeavor easier would be to add the appropriate language codes for these languages.

I hope my efforts in this regard can serve as a model for those wishing to handle other dialect continua using Wikidata's lexicographical data in their respective parts of the world. (It's interesting that these questions were not also asked of T271589.)

Thanks a lot for the detailed explanation!

The languages are clearly legitimate, but I'm not entirely sure what to do about the scripts here. For what it's worth, there's a bit of content in both of them in the Incubator, in their own scripts. If at the moment no particular person is actually planning to add written lexemes in them, and codes are only used for linking between languages, it makes sense to me to simply add ccp and rhg. When anyone actually wants to add written content, adjustments can be made if needed.

Did I misunderstand anything? Does anyone have another opinion?

> (It's interesting that these questions were not also asked of T271589.)

That's because there isn't much to say about those languages' writing systems. I guess there is some variation there, too, but it appears clear that they are mostly written in the Bengali script (but do correct me if I'm wrong).

In the case of Chakma, some resources, such as those hosted by https://github.com/kalpataruboiChakma/, use both the Bengali script and the Chakma script, and maintaining correspondences between lemmata/forms across these different representations would be useful. Having ccp unqualified (in line with the script used on Incubator) and ccp-beng as language codes would thus not be controversial in my view.

In the case of Rohingya, however, disparate groups within the community separately use the Latin and Hanifi scripts. Among those using the Latin script, most outside the community (such as the Australian government, the SBS, and even Milwaukee) use the "Rohingyalish" system, while some others may use a somewhat different system. Because of this split in script usage, I'd like to avoid courting controversy on Wikidata by suggesting, through an unqualified language code, that one script should be preferred over the others (especially since the Latin script is pursued by those who may find it easier to disseminate for both simplicity and technical reasons). I do not plan to add representations for lemmata/forms in the Latin script, as I personally find it somewhat deficient in capturing some sounds in the language, although I'm not against someone else later proposing the rhg-latn code if they can enforce the use of one Latin-script system over another.

> If at the moment no particular person is actually planning to add written lexemes in them, and codes are only used for linking between languages

I'm not sure how you're imagining the codes being used. What does it have to do with linking between languages?

We do already have one lexeme for Chakma: https://www.wikidata.org/wiki/Lexeme:L230063 - as you can see, we've had to enter it using the language codes mis and mis-x-Q756802 because it won't let us use the right ones.
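The workaround described above can be sketched roughly as follows: use the proper code when it is supported, otherwise fall back to a private-use code built from `mis` and the language's item ID. The function name and the sets of supported codes are assumptions made for illustration; only the resulting code `mis-x-Q756802` comes from the lexeme linked above.

```python
from typing import Iterable


def representation_code(preferred: str, language_item: str,
                        supported: Iterable[str]) -> str:
    """Return the preferred code if supported, else a mis-x-<item> fallback.

    Hypothetical helper; the supported codes are supplied by the caller.
    """
    if preferred in set(supported):
        return preferred
    # Private-use fallback of the shape seen on L230063.
    return f"mis-x-{language_item}"


# Before the requested codes exist (illustrative set of supported codes):
print(representation_code("ccp", "Q756802", {"en", "bn"}))
# prints: mis-x-Q756802

# Once "ccp" has been added:
print(representation_code("ccp", "Q756802", {"en", "bn", "ccp"}))
# prints: ccp
```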

Change 661183 had a related patch set uploaded (by Mbch331; owner: Mbch331):
[mediawiki/extensions/WikibaseLexeme@master] Add language codes ccp, ccp-beng, rhg-rohg and syl-beng

https://gerrit.wikimedia.org/r/661183

Change 661183 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add language codes ccp, ccp-beng, rhg-rohg and syl-beng

https://gerrit.wikimedia.org/r/661183

Addshore added a subscriber: Addshore.

Leaving in verification until this is deployed (could be this week)

(@Mbch331 in your next set of patches for language codes, you should fix the typo in the name of the interface messages for the Rohingya code--both "en.json" and "qqq.json" use "rhog" instead of the correct "rohg".)
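A typo of this kind could be caught by cross-checking the message keys against the list of language codes. The sketch below assumes a message-key prefix (`lexeme-language-name-`) and sample message texts that are purely hypothetical; the real key names in "en.json" and "qqq.json" may differ.

```python
# Hypothetical prefix; the actual interface message names may differ.
PREFIX = "lexeme-language-name-"


def check_message_keys(codes, messages):
    """Return (missing, unexpected) message keys for the given codes."""
    expected = {PREFIX + code for code in codes}
    present = {key for key in messages if key.startswith(PREFIX)}
    missing = sorted(expected - present)      # codes without a message
    unexpected = sorted(present - expected)   # messages without a code
    return missing, unexpected


codes = ["ccp", "ccp-beng", "rhg-rohg"]
# Illustrative message file contents, with the "rhog" typo reproduced.
en_messages = {
    PREFIX + "ccp": "Chakma",
    PREFIX + "ccp-beng": "Chakma (Bengali script)",
    PREFIX + "rhg-rhog": "Rohingya (Hanifi script)",
}

missing, unexpected = check_message_keys(codes, en_messages)
print(missing)     # ['lexeme-language-name-rhg-rohg']
print(unexpected)  # ['lexeme-language-name-rhg-rhog']
```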

@Mbch331 why did you move this back to Peer Review? I don’t see any remaining changes to be reviewed.

Change 663943 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Mbch331):
[mediawiki/extensions/WikibaseLexeme@master] Add lexeme languages ms-arab and rah

https://gerrit.wikimedia.org/r/663943

Change 663943 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add lexeme languages ms-arab and rah

https://gerrit.wikimedia.org/r/663943