add monolingual code "und-latn"
Open, Stalled, Needs Triage, Public

Description

Per request from the community, we should consider adding the monolingual code "und-latn".

Request: https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#add_monolingual_code_%22und-latn%22
Community discussion and usecases: https://www.wikidata.org/wiki/Property_talk:P969#%22und%22_or_%22und-latn%22

Event Timeline

@Amire80 @jhsoby Before we move forward, could you have a look and let us know if it looks relevant from your side? Thanks!

It's been two weeks and we didn't hear any veto from @Amire80 or @jhsoby, so I think we can move forward with it. @Mbch331 will you prepare a patch that we can merge?

Oh, I missed it, sorry! It's not so usual. Give me a day to check it. If you don't hear more in 24 hours, go ahead.

Actually, please, no.

At least not without better examples.

None of the examples in the community discussion justify a different code. All the examples are just not-so-correct values of a property that is marked as "deprecated". Are there any other examples?

Lea_Lacroix_WMDE changed the task status from Open to Stalled. Nov 30 2020, 11:52 AM

Waiting for the result of discussions here and positive feedback from Amir or another LangCom person.

I think stuff was added to "und" for now.

And again, are there any direct examples? Like with "mul", I don't understand what it is for.

The linked page lists several samples now archived at Wikidata. These have been converted to P6375 statements:

So, actually, not only "und" was used.

All of these are just wrong, and "und" is not necessary in any of them. It's supposed to be Ukrainian, Russian, Tajik, Japanese. In all these cases, a dedicated code would just perpetuate data that is sloppy and easily fixable.

I think the sloppiness was caused by the lack of adequate language codes. "und-latn" would have been that and still could be (but it's now harder to apply). As @Lydia_Pintscher mentioned, it's not a life-or-death situation, but inaction and delays in the addition of the IETF language tags to Wikidata can lead to a deterioration of data quality at Wikidata.

Not sure where you want to go with "It's supposed to be Ukrainian, Russian, Tajik, Japanese":

  • technically it would be correct to use "ru" or "uk" for Latin script text in these languages, but I don't think this is desirable at Wikidata. AFAIK, it's generally not being used that way in Wikidata.
  • if you think that Wikidata shouldn't store structured data for the samples given above, that is something you should propose and discuss as a Wikidata contributor in the adequate forum (e.g. Project chat). Here we try to determine the appropriate language code for the sample texts with help of a review by langcom.
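To make the codes under discussion concrete, here is a minimal sketch of how an IETF (BCP 47) tag splits into a language subtag and an optional script subtag. It is only an illustration: the names `LanguageTag`, `parse_tag` and `is_undetermined` are hypothetical, the pattern covers just the language, script and region subtags, and nothing here reflects how Wikidata actually validates monolingual text codes.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern (an assumption for illustration, not a full BCP 47 parser):
# a 2-3 letter language subtag, an optional 4-letter script subtag,
# and an optional region subtag (2 letters or 3 digits).
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LanguageTag:
    """Split a tag such as 'und-latn' into its subtags, normalising case."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match["script"]
    region = match["region"]
    return LanguageTag(
        language=match["language"].lower(),        # e.g. 'und', 'ru', 'uk'
        script=script.title() if script else None,  # e.g. 'Latn'
        region=region.upper() if region else None,
    )


def is_undetermined(tag: str) -> bool:
    """True when the language subtag is 'und' (undetermined language)."""
    return parse_tag(tag).language == "und"
```

With this sketch, `parse_tag("und-latn")` gives language `und` with script `Latn`, while `parse_tag("ru-Latn")` keeps the language as `ru` and records only that the script is Latin. That is the difference between the two options weighed in the bullets above.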

"und" is for undetermined languages. These languages are determined. This was discussed on Wikidata pages and in Phabricator, and I haven't yet seen a single example of a value in an undetermined language.

Any text is in an undetermined language until the actual code is set.

Can you list the codes you deem appropriate for the 5 samples given?

I already said this today, and in the past on Wikidata: Ukrainian, Russian, Tajik, Japanese. These addresses can be written in the respective scripts of the respective languages. If someone wants to write them in transliteration, then it's a transliteration to a certain language, probably English. None of these are undetermined. I'm not going to repeat this yet again.

I don't think you are answering the question directly. The question is merely about the text at hand.

If you think it's "probably English" and we should be using "en", at least that's an answer to this task.

This would be useful for quoting mistranscribed or poorly transcribed text, as well as the titles of works that consist of invented words.