Investigate and decide the representation of languages and variants in Lexeme entities
Open, Medium, Public

Assigned To

None

Authored By

	daniel
	Nov 25 2016, 12:05 PM

Description

We need to specify:

the "language" of a Lexeme
the "language" of each Term of a multi-variant lemma (and Term representation, and Sense gloss)

There are essentially two options for representing a language or variant:

an ISO code, or an IETF language tag
a Q-Item reference (possibly mapped to an ISO code or IETF tag somehow)

There are several kinds of variants, mainly:

spellings
scripts
dialects

Of these combinations:

some have valid IETF tags
others have a valid fall-back tag
others don't have valid tags (or would have internal tags)

There are several aspects that need to be considered:

(A) input of language/variant for a term
(B) definition of applicable IETF tag/update of tag for a language variant
(C) output

(C) For representing lemmas (and representations and glosses) in HTML and RDF, we need IETF tags ~~ISO language codes~~. We can put ISO codes in Statements on items about languages, but then these statements must not change. We can make up ISO codes from Q-Ids, e.g. qqd-Q7832478, but that's not very useful, since consumers don't understand them.
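
To illustrate where the tag ends up on output, here is a minimal sketch in Python. The helper names `html_span` and `turtle_literal` are made up for this example and are not part of Wikibase; the only assumptions are that HTML carries the tag in the `lang` attribute and that RDF (Turtle) carries it as an `@tag` suffix on the literal.

```python
from html import escape


def html_span(text: str, tag: str) -> str:
    """Wrap text in a span carrying the language tag (hypothetical helper)."""
    return f'<span lang="{escape(tag, quote=True)}">{escape(text)}</span>'


def turtle_literal(text: str, tag: str) -> str:
    """Format a language-tagged Turtle literal (hypothetical helper)."""
    # Escape the characters that cannot appear raw in a short Turtle string.
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"@{tag}'


print(html_span("colour", "en-GB"))
print(turtle_literal("colour", "en-GB"))
```

Both outputs only help a consumer if the tag is one the consumer recognizes, which is the problem with codes made up from Q-Ids.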

Currently, monolingual text uses:

(A) input in the form of code or selection from list of names
(B) definition in a special (internal) table, mostly based on existing interface languages, and update through a somewhat time-consuming and opaque procedure. Accordingly, only a few codes have been added.
(C) storage in the form of codes

Units (for quantities) might initially have followed a model similar to monolingual text; this has since been simplified to a model using QIDs.

Related Objects

Status	Subtype	Assigned	Task
Open	Feature	None	T13996 A way to select which parts of Wiktionary articles to show
Open	Feature	None	T14213 Following a link to a language entry in Wiktionary should display only that entry
Open	Feature	None	T13998 A way to show only those languages on Wiktionary that the user is interested in
Open	Feature	None	T38881 Wiktionary needs usable API
Open		None	T31229 Extension to provide access via the dict protocol
Open		None	T109579 [Epic] Give more sister projects access to Wikidata
Resolved		Lydia_Pintscher	T986 Use structured data on Wiktionary
Resolved		Lydia_Pintscher	T988 Phase 1: Represent Wiktionary lexicon using structured data
Resolved		Lydia_Pintscher	T146637 Wikidata 2016 Q4 goals
Resolved		Lydia_Pintscher	T150179 Wikidata 2017 Q1 goals
Resolved		Lydia_Pintscher	T146662 [Story] new entity type for Lexemes (baseline)
Open		None	T151626 Investigate and decide the representation of languages and variants in Lexeme entities

Event Timeline

daniel created this task. Nov 25 2016, 12:05 PM

Restricted Application added a subscriber: Aklapper. Nov 25 2016, 12:05 PM

daniel triaged this task as Medium priority. Nov 25 2016, 12:08 PM

daniel added a parent task: T146662: [Story] new entity type for Lexemes (baseline).

daniel mentioned this in T151582: Decide whether a Lexeme's lemma is a single Term, or a TermList (multi-variant)..

daniel updated the task description. Nov 25 2016, 12:10 PM

daniel added subscribers: • iecetcwcpggwqpgciazwvzpfjpwomjxn, Denny, thiemowmde and 2 others.

Reedy added projects: Wikidata, I18n. Nov 25 2016, 2:18 PM

Reedy removed a subscriber: Wikidata.

Nikki subscribed. Nov 25 2016, 4:18 PM

daniel mentioned this in T152019: Decide whether a Lexeme's lemma can have multiple representations for the same language code. Nov 30 2016, 5:34 PM

Jakob_WMDE mentioned this in T153674: Create validator for lemma. Dec 19 2016, 12:26 PM

• iecetcwcpggwqpgciazwvzpfjpwomjxn added a comment. Jan 6 2017, 1:16 PM

This comment was removed by • iecetcwcpggwqpgciazwvzpfjpwomjxn.

The French Wiktionary is currently using the fixed list approach for languages, administered by the community. We use ISO codes to represent languages almost everywhere (e.g. the title of a language section is {{langue|fr}}, which is displayed as "French"). We do not have identifiers for dialects and we usually use a geographic term. Not sure if those should be as controlled as languages.

Esc3300 updated the task description. Jan 6 2017, 8:48 PM

I expanded the description above. Here are a few clarifications and an explanation of a possible solution.

The other day we looked into the way that the dialect of Finnish place names should be specified (discussion, WP article on some).
It seems that a few may have ISO codes/IETF tags, while others don't. With the way monolingual text codes currently work, I think one would first have to ask the relevant IETF-appointed administrator for these tags to define a variant tag and then wait for WMDE to add it. Alternatively, if we leave the interface open as we do for new items, one could just make up a code on the fly. The latter isn't really desirable.
To ensure that the variant is identified, one would need to define it through the QID of an item, since not all variants have IETF tags, or have IETF tags known to Wikidata.
Whenever possible, output should use IETF tags. These could be defined on the item of the language.
If no IETF tag is defined, the item of the language could also offer a fallback (just "fi" instead of "fi-hämäläismurteet"). Possibly a fallback could also be "fi-x-QID".
Once in a while, we could request new IETF tags for variants that need them.
To simplify input by users (and bots/tools), it should be possible to input IETF tags (e.g. "de") instead of a QID.
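
A rough sketch of the lookup described above, in Python, just to make the proposal concrete. Everything here is an assumption for illustration: the tables `IETF_TAG` and `FALLBACK_TAG` stand in for statements on language items, the function names are made up, and the dialect QID `Q9999901` is invented. Q1412 (Finnish) and Q188 (German) are used as sample language items.

```python
import re
from typing import Optional

# Stand-in for "IETF tag" statements on language items (illustrative data).
IETF_TAG = {"Q1412": "fi", "Q188": "de"}

# Stand-in for fallback statements on variants without a tag of their own.
# Q9999901 is an invented QID for a Finnish dialect group.
FALLBACK_TAG = {"Q9999901": "fi"}

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def output_tag(qid: str, private_use: bool = False) -> Optional[str]:
    """Return the tag to emit for a language or variant item, if any."""
    if qid in IETF_TAG:
        return IETF_TAG[qid]
    base = FALLBACK_TAG.get(qid)
    if base is None:
        return None  # nothing defined; the caller decides what to do
    # "fi-x-QID" style fallback. Note that a BCP 47 subtag is limited to
    # 8 characters, so this only works as-is for QIDs of up to 7 digits.
    return f"{base}-x-{qid}" if private_use else base


def resolve_input(value: str) -> str:
    """Accept either a QID or a known IETF tag and return the QID."""
    if QID_PATTERN.fullmatch(value):
        return value
    for qid, tag in IETF_TAG.items():
        if tag.lower() == value.lower():
            return qid
    raise ValueError(f"unknown language tag: {value!r}")


print(output_tag("Q1412"))
print(output_tag("Q9999901"))
print(output_tag("Q9999901", private_use=True))
print(resolve_input("de"))
```

Storage would always be the QID; the tag is only derived on output, so requesting a real IETF tag later just means adding a statement to the item.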

Esc3300 added a subscriber: Susannaanas. Jan 6 2017, 9:06 PM

Restricted Application added a subscriber: PokestarFan. Jul 26 2017, 11:52 PM

Aklapper added a project: Wikidata Lexicographical data. May 3 2020, 8:26 PM

Aklapper removed subscribers: • iecetcwcpggwqpgciazwvzpfjpwomjxn, Wikidata Lexicographical data.
