
wbsearchentities for lexemes returns 'und' match language unless the language has a two-letter language code
Open, High, Public

Description

On testwikidata, wbsearchentities results for lexemes report the match language code 'und' (“undetermined” per ISO 639-3) instead of the actual language code of the lexeme.

$ curl -sG https://{test,www}.wikidata.org/w/api.php -d action=wbsearchentities -d search=Ding -d language=de -d type=lexeme -d format=json | jq -r .search[0].match.language
und
de

One effect of this is that the Wikidata Lexeme Forms duplicate detection doesn’t work for templates targeting test.wikidata.org (like german-noun-neuter-test), since the tool thinks the search results are false positives for lexemes in another language and therefore not duplicates.
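A minimal sketch of the duplicate check described above (the function name and result shape are simplifications for illustration, not the tool's actual code) shows how the wrong match language turns a genuine duplicate into a false positive:

```python
# Hypothetical simplification of the Lexeme Forms duplicate check:
# a search result only counts as a potential duplicate if its match
# language agrees with the language code of the template.

def is_potential_duplicate(result: dict, template_language: str) -> bool:
    """Return True if a wbsearchentities result may duplicate the template's lexeme."""
    return result["match"]["language"] == template_language

# On test.wikidata.org the API reports 'und', so a real duplicate is rejected:
test_result = {"match": {"type": "label", "language": "und", "text": "Ding"}}
print(is_potential_duplicate(test_result, "de"))  # False

# On www.wikidata.org the same search reports 'de' and the duplicate is kept:
prod_result = {"match": {"type": "label", "language": "de", "text": "Ding"}}
print(is_potential_duplicate(prod_result, "de"))  # True
```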

Event Timeline

I think this is also affecting real Wikidata (though I’m not completely sure it’s the same issue). Searching for “das” with language code ku (Kurdish) returns L221811 and L221957, but with language code und instead of ku. This might be because the language item used for the lexemes is Kurmanji, not Kurdish languages – only the latter has a Wikimedia language code statement. (The difference between the two still isn’t quite clear to me – perhaps we should have used the more general item after all. This was discussed (in German) here.)

In general, I suspect the language of the match is the Wikimedia language code of the language item, not the language code of the lemma matched. You can also see this when searching for “colour” – the match language is en (or, for the Middle English version, und) rather than en-gb (or mis).

> In general, I suspect the language of the match is the Wikimedia language code of the language item, not the language code of the lemma matched. You can also see this when searching for “colour” – the match language is en (or, for the Middle English version, und) rather than en-gb (or mis).

(On the other hand, if it actually returned en-gb, Wikidata Lexeme Forms would no longer recognize it as a duplicate. Hm.)

Mentioned in SAL (#wikimedia-cloud) [2019-11-11T23:30:26Z] <wm-bot> <lucaswerkmeister> deployed cd4239904a (work around T230833)

My guess would be that it's returning und for any language where the lexeme's language item does not have a P218 (ISO 639-1 code) statement. Test Wikidata doesn't have the same properties, let alone the same statements, so it would never find a code there, whereas on Wikidata itself it only affects the languages without ISO 639-1 codes.

Another example has Scots (ISO 639-3 sco) and Northern Frisian (ISO 639-3 frr), and yet another has Australian English (IETF language tag en-au); all of them show up as und.

I believe Nikki is right.
My first reaction is that it should return und, as it does now, and get people to add the correct statements. Thoughts from others?

But what would the correct statements be? We can't add an ISO 639-1 code if the language doesn't have one! :) All the ISO 639-1 codes which exist are (or should be) already in Wikidata - there's only ~200 and new ones are not being assigned any more. If we want to return anything other than und for the thousands of other potential languages, we would need to use something else like P220 (ISO 639-3 code) or P305 (IETF language tag).

Using IETF language tags seems like the most useful solution to me. ISO 639-1 is too limited. Only using ISO 639-3 would be awkward because those are always three-letter codes (e.g. en would turn into eng, de into deu). Falling back to ISO 639-3 if there's no ISO 639-1 code would be an improvement, but that's essentially how IETF language tags are assigned - English has the ISO 639-1 code en, ISO 639-3 code eng and its IETF language tag is en, Scots only has the ISO 639-3 code sco and its IETF language tag is sco.
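The fallback order described above can be sketched as follows. The code values are hard-coded for illustration (they are assumptions standing in for P218/P220 statements fetched from the language items, using the English and Scots examples from this thread):

```python
# Hand-picked stand-ins for ISO 639-1 (P218) / ISO 639-3 (P220) statements.
# Q1860 is English; Q14549 is Scots (mentioned above); unknown items get 'und'.
LANGUAGE_ITEMS = {
    "Q1860":  {"iso639_1": "en",  "iso639_3": "eng"},  # English
    "Q14549": {"iso639_1": None,  "iso639_3": "sco"},  # Scots: no ISO 639-1 code
}

def match_language(item_id: str) -> str:
    """Prefer the ISO 639-1 code, fall back to ISO 639-3, else 'und'."""
    item = LANGUAGE_ITEMS.get(item_id, {})
    return item.get("iso639_1") or item.get("iso639_3") or "und"

print(match_language("Q1860"))   # en
print(match_language("Q14549"))  # sco  (currently the API reports und)
print(match_language("Q99999"))  # und  (no code known at all)
```

As noted above, this fallback is essentially how IETF language tags are assigned anyway, which is why using P305 directly would give the same results with less logic.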

Here's a query of all the languages being used for lexemes right now which don't have an ISO 639-1 code: https://w.wiki/PXQ

> Using IETF language tags seems like the most useful solution to me

I dimly recall a similar discussion from years ago. IIRC, IETF is extensible, and we came up with a way to encode item IDs in language tags, something like qid-36163 (by fortunate coincidence, "qid" lies within the range for private use tags, between "qaa" and "qtz"), or und-x-wikidata-Q36163 (the "mis" code should not be used, according to BCP47). Isn't Wikibase using this kind of encoding somewhere already?

This would be my solution for determining a language tag for Items that do not specify one. I don't understand the use case well enough to tell whether this would actually solve the problem at hand.
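The two encodings mentioned above can be sketched as follows. The exact spellings (qid-36163, und-x-wikidata-Q36163) follow the comment, not any established convention:

```python
import re

def qid_tag(item_id: str) -> str:
    """Encode an item ID using a private-use primary subtag, e.g. qid-36163."""
    return "qid-" + item_id.lstrip("Q")

def und_tag(item_id: str) -> str:
    """Encode an item ID in a private-use extension on 'und', e.g. und-x-wikidata-Q36163."""
    return f"und-x-wikidata-{item_id}"

# "qid" does fall inside the BCP 47 private-use primary-subtag range qaa..qtz:
assert re.fullmatch(r"q[a-t][a-z]", "qid")

print(qid_tag("Q36163"))  # qid-36163
print(und_tag("Q36163"))  # und-x-wikidata-Q36163
```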

> I dimly recall a similar discussion from years ago. IIRC, IETF is extensible, and we came up with a way to encode item IDs in language tags, something like qid-36163 (by fortunate coincidence, "qid" lies within the range for private use tags, between "qaa" and "qtz"), or und-x-wikidata-Q36163 (the "mis" code should not be used, according to BCP47). Isn't Wikibase using this kind of encoding somewhere already?

It uses it for lexemes - people can add -x-qid to an existing code (which doesn't always produce a valid tag but that's a separate issue :))

> This would be my solution for determining a language tag for Items that do not specify one. I don't understand the use case well enough to tell whether this would actually solve the problem at hand.

There are some languages which don't have any usable language tags, but the cases being discussed here do.

Inventing tags would be better than nothing, since it would provide a way of distinguishing all the languages currently lumped under und, but only doing that wouldn't be a proper solution because we would be inventing tags for languages which already have them, e.g. Wikidata normally uses the assigned code sco for Scots, it would be weird and inconsistent for wbsearchentities to return und-x-q14549 instead.

Ahh, the property to use for language codes on test.wikidata.org is P220 - I added it to https://test.wikidata.org/wiki/Q348 and after I edited https://test.wikidata.org/wiki/Lexeme:L76 it started showing up in the API results correctly. The variable for it seems to be wgLexemeLanguageCodePropertyId.

(Of course, we'll still get und whenever we can't add a statement, just wanted to point out how to get any language codes on test.wikidata.org to work at all)

Mentioned in SAL (#wikimedia-cloud) [2021-02-11T22:25:50Z] <wm-bot> <lucaswerkmeister> deployed 81166d5c17 (reduce T230833 workaround / "und" language codes)

Nikki renamed this task from wbsearchentities for lexemes returns 'und' match language on Test Wikidata to wbsearchentities for lexemes returns 'und' match language unless the language has a two-letter language code. Jul 15 2021, 7:56 AM

We discussed this today in the bug triage hour and decided that we need to look at this again once T284882 is solved.

> We discussed this today in the bug triage hour and decided that we need to look at this again once T284882 is solved.

Looking at what wbsearchentities actually returns these days, I think the way wbsearchentities provides language information is bad, independent of T284882.

Take this query, for example (made by the interface when following the instructions in T340407). For L481 it returns:

{
	"id": "L481",
	"title": "Lexeme:L481",
	"pageid": 54394603,
	"display": {
		"label": {
			"value": "माता",
			"language": "und"
		},
		"description": {
			"value": "Hindustani, noun",
			"language": "en"
		}
	},
	"repository": "wikidata",
	"url": "//www.wikidata.org/wiki/Lexeme:L481",
	"concepturi": "http://www.wikidata.org/entity/L481",
	"label": "माता",
	"description": "Hindustani, noun",
	"match": {
		"type": "label",
		"language": "und",
		"text": "ماتا"
	},
	"aliases": [
		"ماتا"
	]
}

The hi lemma is returned as the label. The ur lemma is returned as the match. Both are incorrectly tagged as und.
These use different scripts, which are written in different directions and need different fonts/styling, so it should be using the language code from the lemma, not attempting to fetch it from the language item.
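The per-lemma language codes are already present in the lexeme data itself, since lemmas are keyed by language code. A sketch using an abridged version of L481's lemma data (only the two lemmas mentioned above, hi and ur):

```python
import json

# Abridged lexeme data for L481: each lemma carries its own language code,
# which is exactly what wbsearchentities should report instead of 'und'.
lexeme = json.loads("""
{
  "id": "L481",
  "lemmas": {
    "hi": {"language": "hi", "value": "माता"},
    "ur": {"language": "ur", "value": "ماتا"}
  }
}
""")

for code, lemma in lexeme["lemmas"].items():
    print(code, lemma["value"])
```

Picking the code from the matched lemma would also give the right script and text direction for display, without any lookup on the language item.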

The language of the lexeme is not included in an unambiguous (let alone machine-readable) way at all. I would expect it to include the QID, because neither a language name nor a language code can be guaranteed to only match one item.