Support language wildcard in wikibase:label service
Closed, Resolved · Public

Description

As this query example indicates, we need a way to specify a language list such as "a,b,*", which means: give me "a" if it exists, otherwise "b" if it exists, otherwise a label in any available language, whichever the service decides.

Language lists without the wildcard (e.g. "a,b") should continue to work as they do now and provide QIDs when none of the listed languages has a label.
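
The requested behavior can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not how the label service is implemented: the function name `resolve_label`, the dictionary of labels and the QID are hypothetical, and the rule used for "*" (take the label with the alphabetically smallest language code) is an assumption chosen only so that the result is repeatable.

```python
def resolve_label(qid, labels, language_list):
    """Pick a label for `qid` according to an "a,b,*"-style language list.

    `labels` maps language codes to label strings (hypothetical input shape).
    """
    codes = [code.strip() for code in language_list.split(",") if code.strip()]
    for code in codes:
        if code == "*":
            if labels:
                # Assumption: "any" means the smallest language code, so the
                # same query always yields the same label.
                return labels[min(labels)]
            continue
        if code in labels:
            return labels[code]
    # No listed language matched and no wildcard applied: keep the QID.
    return qid


# Illustrative data only; the QID is a placeholder.
labels = {"uk": "Київ", "zh": "基辅"}
print(resolve_label("Q999999", labels, "en,fr,*"))  # some label, never the QID
print(resolve_label("Q999999", labels, "en,fr"))    # the QID, as today
```

Without the wildcard the function behaves like the current service and returns the QID, which is the backward-compatible behavior asked for above.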

Event Timeline

Restricted Application added a subscriber: Aklapper.

"Any" is rather poorly defined, and I'm not sure how this would be implemented. For example, would it be OK if the same query returned different results? How useful is it when the same item is named once in Chinese, once in Hindi and once in Ukrainian by the same query?

What is the use case for this? How is it intended to be used?

@Smalyshev the goal here is to get anything readable instead of the Wikidata ID, without requiring the user to list all 500 language codes. From my personal perspective, anything based on the Latin character set should be shown ahead of all others, followed by Cyrillic, then Greek, and then everything else. The order might be different for others, of course, but I suspect the Latin character set has wider appeal than a random script that a particular user does not know. It might make sense for the query service to have a predefined internal list of languages to make results consistent, or to simply sort by language family + code and return the first available label.
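
A rough sketch of the "sort by script, then code, and return the first available" idea follows. Everything in it is an assumption made for illustration: the script ranking simply mirrors the Latin, Cyrillic, Greek, everything-else order suggested above, the language-to-script table is a tiny hand-written sample (using ISO 15924 script codes), and `fallback_key` and `pick_fallback_label` are hypothetical names rather than anything the query service provides.

```python
# Assumed priority of scripts; unknown scripts share the last rank.
SCRIPT_RANK = {"Latn": 0, "Cyrl": 1, "Grek": 2}

# Small illustrative sample; a real list would need every supported language.
LANGUAGE_SCRIPT = {
    "en": "Latn", "fr": "Latn", "de": "Latn",
    "ru": "Cyrl", "uk": "Cyrl",
    "el": "Grek",
    "zh": "Hans", "hi": "Deva",
}


def fallback_key(code):
    """Sort key: script rank first, then the language code as a tie-breaker."""
    script = LANGUAGE_SCRIPT.get(code)
    rank = SCRIPT_RANK.get(script, len(SCRIPT_RANK))
    return (rank, code)


def pick_fallback_label(labels):
    """Return the label whose language sorts first, or None if there are none."""
    if not labels:
        return None
    return labels[min(labels, key=fallback_key)]


print(sorted(["zh", "el", "ru", "fr", "en"], key=fallback_key))
```

Because the key is a fixed function of the language code, the same set of labels always produces the same choice, which addresses the consistency concern at the cost of hard-coding a ranking.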

the goal here is to get anything readable

I get this, but how is a Chinese label readable for a person who doesn't read Chinese?

anything based on the Latin character set

OK, this is a bit presumptuous, but I guess it's a workable heuristic. This means we probably need a predetermined prioritized list of languages, probably a configurable one. I wonder how much value we would get beyond "always fall back to English", though. That is, how many entities do not have an English label but do have a label that would be readable by a significant percentage of readers, and in how many of those cases would we make the right choice?

In short, I understand the idea now, but I am not sure how feasible it is to implement consistently, how much value it would add (we need more data on that), and how well users would receive a fixed ranking of languages. I'd think going beyond the Latin charset has the most potential for upsetting people.

I think it depends also which items one has in mind:

  • The majority of items have labels in just one (maybe two) languages.

For these, it helps to see the script of the label, even if one doesn't understand it.

I also think that this behavior should be the default for wikidata.org itself. It's very annoying to see a list of 5 undefined labels and need to expand it. Even worse is to see a list of Q numbers instead of ANY labels at all.

I think it's better to see Q - that makes you go and update the label :)

I disagree - it makes people frustrated. When I try to find "what links here", and all I see is a large list of Q numbers, I will give up and not deal with it. If I look at an item and see a long list of administrative subdivisions, again, same thing.

In case it wasn't clear, I added a description of the expected behavior when no wildcard is provided, so that no one gets Chinese without asking for it.

See discussion at T89213: Allow fallback to any language.

Probably we should define a "default label" per https://phabricator.wikimedia.org/T89213#1065554 and show it if wildcard is used.

It's relatively easy to show a random language. What still sounds questionable to me is whether, if you want a French label and there is none, showing a Chinese one would be useful to you. Also note that "any" would not be stable (a stable order requires some rule), so the label could be Chinese one time, Russian another time and Farsi yet another, all for the same query. It's unclear to me which use case would work well with this.

@Smalyshev you are right that it shouldn't be random. Instead, we could establish a well-known list of fallback languages. I would argue that Latin-based languages should be first in that list, followed by others ordered by "closeness" to the Latin alphabet. E.g. if there is no label in a Latin-script language, use the script closest to Latin that has the highest number of speakers or Wikipedia readers: Russian probably before Greek, but Greek before Chinese. Or something along those lines. It really doesn't matter what order we choose, as long as there is a way to get something. Having nothing is always the worst.

Personally, I'd pick the language of the first label ever added to the item.

Smalyshev triaged this task as Medium priority. Jul 25 2017, 6:12 PM

@Esc3300 that would be a nice heuristic, except WDQS unfortunately does not have this information - labels do not carry dates attached to them, they are just simple triples.

Gehel claimed this task.

How is this resolved? "en,*" still doesn’t work as far as I can tell.

(Was the task supposed to be closed as declined instead?)