Paste P9640

Language metadata at Wikimedia Hackathon 2019 notes

Authored by Nikerabbit on Nov 14 2019, 10:16 PM.
Attendees: Niklas, Amir, Kelson, Leszek, Emmanuel
list of places that contain language metadata
MW: languages/data/Names.php
2x mobile apps: language lists for RTL
MobileFrontend/Minerva: rtl languages
Wikidata monolingual codes
MW: special CSS rules for line-height in some languages (should be for writing systems)
Wikidata Lexeme
Override for ULS
ISO code
where it is spoken (continent)
writing system (incl. directionality)
fallbacks (planned)
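The fields listed above (code, where it is spoken, writing system with directionality, planned fallbacks) could be kept in one record per language, so that RTL lists are derived from the writing system instead of being maintained separately in each app. This is only a sketch: the names LanguageRecord, SCRIPT_DIRECTION and rtl_codes are made up for illustration and are not the API of any existing library, and the three entries are sample data.

```python
from dataclasses import dataclass

# Directionality is a property of the writing system, not of the language.
# Sample entries only; a real table would cover every script in use.
SCRIPT_DIRECTION = {"Latn": "ltr", "Arab": "rtl", "Hebr": "rtl"}


@dataclass(frozen=True)
class LanguageRecord:
    """Hypothetical record holding the metadata fields from the notes."""
    code: str               # language code
    script: str             # writing system
    regions: tuple = ()     # where it is spoken (left empty in this sketch)
    fallbacks: tuple = ()   # planned field, left empty in this sketch

    @property
    def direction(self) -> str:
        # Derived from the script, so there is no separate RTL list to maintain.
        return SCRIPT_DIRECTION[self.script]


LANGUAGES = {
    rec.code: rec
    for rec in (
        LanguageRecord("en", "Latn"),
        LanguageRecord("ar", "Arab"),
        LanguageRecord("he", "Hebr"),
    )
}


def rtl_codes(languages: dict) -> list:
    """Return the sorted codes of all languages written right-to-left."""
    return sorted(code for code, rec in languages.items() if rec.direction == "rtl")
```

A consumer that only needs the RTL list would call rtl_codes(LANGUAGES) rather than keeping its own copy.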
Amir: The following should be merged into central place:
MW: languages/data/Names.php
2x mobile apps: language lists for RTL
MobileFrontend/Minerva: rtl languages
Wikidata monolingual codes
MW: special CSS rules for line-height in some languages (should be for writing systems)
Q: Why not use ICU?
A: Might not have all languages? Slow to upgrade.
ACTION: Consider further using/integration with ICU?
Panlex people claim Unicode contains grammatical rules for various languages. Would these also be in CLDR?
Amir: not sure, would need to check.
Amir: Why does Wikidata maintain a custom list of language codes for monolingual codes?
Leszek: To allow language codes beyond the list provided by MediaWiki to be used in Wikidata statements.
?: Okay, so we have this library. Why not use other standard language libraries? Those are backed by big consortia, which could update and maintain the data.
ACTION: PHP binding for language data
A: We might have more languages
A: Also corporate parties are generally not interested in smaller languages as these might not have monetary value
Niklas: Wikimedia is actually a member of Unicode. We also have a contact person at CLDR.
Niklas: CLDR might also require that a language has a written code
Amir: Also for MediaWiki we don't want all languages from CLDR (e.g. extinct ones)
Emmanuel What does
ACTION: Mark which languages in language data can be content language for MW
N: We should make it clear which lists serve which context. If we just merge all the lists together, we would make it even harder to understand which language/language code is suited for which context.
ACTION: Share knowledge of how Kiwix uses ICU.
Why does Wikidata have its own restricted language list?
ACTION: Document the policy for adding entries to language data
Discussed specifics of Wikidata Lexicographical Data. It currently does allow adding data in non-MW language codes (using the "mis" language code)
There are better sources defining language codes/languages than MediaWiki, like Ethnologue,
N: How many of those different language lists do we need?
1. MW content languages
2. Languages that would be translation targets
3. Wikidata monolingual language
How about a Sumerian-language Wikisource? Sumerian is currently not a MW language.
Language allows defining language codes with dashes, which are considered variants
A: Maybe we could have a matrix/table: language code - allowed for content, allowed for localization, allowed for Wikidata
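The matrix idea above could look roughly like the following sketch, with one row per language code and one boolean column per context. The codes "xx-example" and "yy-example" and all values are placeholders, and the names Allowed, MATRIX and codes_for are hypothetical, not part of any existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allowed:
    """One row of the matrix: where a language code may be used."""
    content: bool        # allowed as MW content language
    localization: bool   # allowed as translation target
    wikidata: bool       # allowed for Wikidata


# Placeholder rows for illustration only; these are not real policy decisions.
MATRIX = {
    "en": Allowed(content=True, localization=True, wikidata=True),
    "xx-example": Allowed(content=False, localization=True, wikidata=True),
    "yy-example": Allowed(content=False, localization=False, wikidata=True),
}


def codes_for(context: str) -> list:
    """Return the sorted codes allowed in the given context (a column name)."""
    return sorted(code for code, row in MATRIX.items() if getattr(row, context))
```

With a single table like this, each consumer asks for the column it needs, for example codes_for("content"), instead of maintaining its own list.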
ACTION: Task for polite grammar de, nl, hu, jv, su
It is difficult for third party software like Kiwix when non-standard language codes are used
What language list does Commons App use?
We use the device language, users can also change
When you support structured data on Commons, how are you going to match this language code with the possibly non-standard Wikibase language code?
N: This is also a problem in MW, as structured data can use language codes that are MW allowed languages
A: What about fallbacks? They are also a kind of metadata. Do we have a task to add fallback data to language data?
N: It is in the task T190129. The provided list of fallbacks should probably be reviewed, as some of them might not make sense in certain use cases.
ACTION: add fallback information to language data
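As a rough illustration of what fallback data would be used for, the sketch below resolves the full chain of languages to try for a given code. The function name fallback_chain, the shape of the table, and the final "en" default are assumptions for this example; the codes "xx" and "xx-variant" are placeholders and not taken from T190129.

```python
# Placeholder fallback table: each code maps to the codes to try next, in order.
FALLBACKS = {
    "xx-variant": ["xx"],
    "xx": [],
}


def fallback_chain(code: str, fallbacks: dict, final: str = "en") -> list:
    """Return the ordered list of codes to try after `code`.

    Follows fallbacks transitively, skips codes already visited (so a cycle
    in the data cannot loop forever), and appends `final` as the last resort
    unless it is already part of the chain or is the starting code.
    """
    chain = []
    seen = {code}
    queue = list(fallbacks.get(code, []))
    while queue:
        nxt = queue.pop(0)
        if nxt in seen:
            continue
        seen.add(nxt)
        chain.append(nxt)
        queue.extend(fallbacks.get(nxt, []))
    if final not in seen:
        chain.append(final)
    return chain
```

Reviewing the fallback list, as suggested above, would then amount to reviewing the table, while the resolution logic stays the same for every consumer.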
language data library is maintained/owned by the WMF Language
When you are not logged in and go to Wikidata, the UI is in English
Q: When do we get Wikidata monolingual lang codes to language data?
There should be a way to distinguish language code lists between different "contexts"
