[Bug] wbsearchentities API defaults to something else than the given language parameter
Open, High · Public

Description

The change in https://gerrit.wikimedia.org/r/#/c/219168/14/repo/includes/api/SearchEntities.php broke the wbsearchentities API. It now partly ignores the "language" parameter. The search is done in the language given via the parameter, but the returned labels and descriptions are always in the user's current language, as set in their wikidata.org cookie.

To reproduce:

This bug is related to T98172: [Story] Implement Unit Selector widget because it relies on this API and its ability to be localized.

thiemowmde updated the task description.
thiemowmde raised the priority of this task from to High.
thiemowmde added a project: Wikidata.
Restricted Application added a subscriber: Aklapper. · Aug 19 2015, 2:14 PM

It does not ignore the language param; it searches in that language, as the description says:
https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/repo/i18n/en.json#L576

The issue mentioned is a use case that we did not consider.
I have opened a new Task @ https://phabricator.wikimedia.org/T109595

Addshore closed this task as Invalid. · Aug 19 2015, 4:35 PM
Addshore claimed this task.

Invalid as not a bug

And the task I just opened I have also just closed as we can use &uselang=LANGCODE in this case

thiemowmde reopened this task as Open. · Edited · Aug 19 2015, 5:04 PM

Sure, it searches in that language but returns something else. In https://www.wikidata.org/w/api.php?format=jsonfm&action=wbsearchentities&search=Meter&language=de I am not asking for results in English or French or something. I am explicitly asking for results in German. This is ignored.
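For illustration, the two request variants under discussion can be built as follows. This is a minimal sketch: the helper name `build_search_url` is hypothetical, and it only uses parameters already mentioned in this task (`action`, `search`, `language`, `uselang`, `format`). It builds URLs and does not send any request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_search_url(search, language, uselang=None, fmt="json"):
    """Build a wbsearchentities URL (hypothetical helper, not part of any library)."""
    params = {
        "action": "wbsearchentities",
        "search": search,
        "language": language,  # language the search runs in
        "format": fmt,
    }
    if uselang is not None:
        # Workaround mentioned in this task: request the label and
        # description language explicitly instead of relying on the cookie.
        params["uselang"] = uselang
    return API_ENDPOINT + "?" + urlencode(params)


# Without uselang: labels and descriptions follow the user's interface language.
print(build_search_url("Meter", "de"))
# With uselang: labels and descriptions are requested in German as well.
print(build_search_url("Meter", "de", uselang="de"))
```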

Change 232724 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Fix incomplete wbsearchentities options in entityselector

https://gerrit.wikimedia.org/r/232724

Change 232725 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Fix incomplete wbsearchentities options in entityselector

https://gerrit.wikimedia.org/r/232725

API sandbox

I enter a search term and fill in the required "language" parameter:


The "language" parameter is described as "Search in this language". But this is not what is happening. The results I get are in another language. A user of this API has no idea what is going on. They cannot tell, not when looking at the API sandbox, not when looking at the wbsearchentities API documentation, not even when looking directly at the SearchEntities.php code (that's what I did in the end).

There is a "Generic parameters" section with the "uselang" parameter. But this is described as "Language to use for message translations". Ok, so it is for exception messages? Fine. How are the labels and descriptions in the search results I requested "messages" that need "translation"? Search results are content. "uselang" is not for content.

Defaults

Using the user's language as a default is fine. It's clearly better than simply falling back to English.

But the "language" parameter does not even allow a default. It's a required parameter. The user must specify the language they want. So why is this ignored, and why are the results delivered in another language by default? Why is the default not the language the user explicitly requested?

Use cases

What's the use case for the "feature" to ask for something in language A and get it back in language B? Which application currently uses this, why and for what?

I could think of a rare use case where a user happens to know the name of a city in Korean, does not know what it means but knows it's Korean, and wants an intermediate translation in the search result: https://www.wikidata.org/w/api.php?format=jsonfm&action=wbsearchentities&search=서울특별시&language=ko&uselang=en. Two problems:

So again, what's the use case for having a default that returns search results in a language different from the language of the search term? As an option, why not. But why by default?

Solutions

  • Why is language A suppressed? Why aren't both included in the result, the original language of the search space and the translated language the user asked for (if he asked for one)?
  • Simply default to the language given by the "language" parameter if no "uselang" parameter was set.
  • Do not use "uselang" for content and do not confuse it with the language explicitly requested via the "language" API parameter.
  • Introduce a new parameter and give it an expressive name, for example "translate" or "output-language".
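The second option can be sketched as a small precedence rule. This is only a simplified model of the behaviour described in this task, not the actual SearchEntities.php logic; both function names are hypothetical.

```python
from typing import Optional


def current_output_language(language: str, uselang: Optional[str],
                            interface_language: str) -> str:
    """Behaviour as reported in this task: the required 'language' parameter
    has no influence on the language of labels and descriptions."""
    return uselang if uselang else interface_language


def proposed_output_language(language: str, uselang: Optional[str] = None) -> str:
    """Proposed rule: an explicit 'uselang' wins, otherwise use 'language'."""
    return uselang if uselang else language


# A user with an English interface searching in German:
print(current_output_language("de", None, "en"))   # en
print(proposed_output_language("de"))              # de
# The "intermediate translation" case stays possible as an explicit option:
print(proposed_output_language("ko", "en"))        # en
```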

Change 232724 merged by jenkins-bot:
Fix incomplete wbsearchentities options in entityselector

https://gerrit.wikimedia.org/r/232724

Change 232725 merged by jenkins-bot:
Fix incomplete wbsearchentities options in entityselector

https://gerrit.wikimedia.org/r/232725

I am still confused as to why this is still open.
The bug described is not a bug!

I wrote a whole page to describe the issue and possible ways out of it. The bug is that the user's implicit interface language overrides the language the user explicitly requested. If no uselang parameter is given, the language parameter must have higher precedence. I do not know how much simpler this bug description could be.

thiemowmde renamed this task from [Bug] wbsearchentities API ignores given language parameter to [Bug] wbsearchentities API defaults to something else than the given language parameter. · Sep 2 2015, 6:26 PM
thiemowmde removed Addshore as the assignee of this task.
thiemowmde set Security to None.
thiemowmde removed a subscriber: gerritbot.

So when no uselang param is explicitly passed to the API, then fall back to the language param as the display language?

What does "display language" refer to? Error messages should be in the user's language. Search results should be in the language the user is searching in. Again, I don't know how this bug's description could be more simple and obvious.

The module returns:

  1. the matched term in the language of the search (language param)
  2. details about the entity matched in the display language (uselang param)

@Addshore wrote:

The module returns [...] details about the entity matched in the display language

Did you ever try to build a renderer on top of this?

{
    "search": [
        {
            "match": {
                "type": "label",
                "language": "ko",
                "text": "\uc11c\uc6b8\ud2b9\ubcc4\uc2dc"
            }
        },
        {
            "match": {
                "type": "alias",
                "language": "ko",
                "text": "\uc11c\uc6b8\ud2b9\ubcc4\uc2dc \uc9c0\ud558\ucca0\uacf5\uc0ac"
            }
        }
    ]
}

It may be a stretch, but I do not think anybody is using these bits of the JSON. People use the "label" and "aliases" fields.

But even without that, what's the point of returning some fields in one language and other fields in the same data structure in another language? What's the use case for that? As I said, as an "immediate translation" option, why not. But why by default?

I also stumbled upon this weird behaviour. The Accept-Language HTTP header is also ignored, but the response language for some fields comes from a cookie (?!). Luckily, the uselang parameter can be used to control the response language, but this should definitely be included in the API documentation.

I found an example where even uselang and language do not help, and part of the response is still in English:

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%C3%84rzte%20ohne&language=de&format=json&uselang=de

contains the result

{
    "id":"Q49330",
    "concepturi":"http://www.wikidata.org/entity/Q49330",
    "url":"//www.wikidata.org/wiki/Q49330",
    "title":"Q49330",
    "pageid":51315,
    "label":"\u00c4rzte ohne Grenzen",
    "description":"organization",
    "match":{"type":"label","language":"de","text":"\u00c4rzte ohne Grenzen"}
},

The current entity Q49330 has no German "description" field so the English description is returned instead.

The API is definitely broken. Either the full result should be in one language or all strings should be marked with a language, e.g.

"label": { "de": "...", "en": "..." },
"alias": { "de": ["..."] },
"description": { "en": "..." },
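With strings tagged like this, a client could make a language mismatch visible instead of silently mixing languages. The following is a rough sketch only: the language-tagged shape is the proposal above, not something the API returns today, and `render` is a hypothetical helper.

```python
def render(tagged, wanted):
    """Render a language-tagged string, marking text that is not in the
    wanted language. 'tagged' maps language codes to text."""
    if wanted in tagged:
        return tagged[wanted]
    if not tagged:
        return ""
    # No text in the wanted language: show what is there, with its language.
    lang, text = next(iter(tagged.items()))
    return f"{text} [{lang}]"


# Values taken from the Q49330 result above, in the proposed tagged shape.
result = {
    "label": {"de": "Ärzte ohne Grenzen"},
    "description": {"en": "organization"},
}

print(render(result["label"], "de"))        # Ärzte ohne Grenzen
print(render(result["description"], "de"))  # organization [en]
```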

I'd like to specify multiple languages to search in and multiple languages to get results in. For instance, search in Greek and Turkish (when looking for a name that could be in either of these two languages) and get results also in English (because I don't read the first two languages).

So this is due to language fallback.
DE falls back to EN.
Again I don't see anything in the comment that we did not explicitly design into this version of the API module.

You search in a language (with fallback)
The matched result is always returned to you, with what the result is, the language it is in, and the actual result.
You are given the entity ID, the concept and url, as well as the title and pageid.

For convenience you are then also given a label and description to display in the current user language, or language specified with uselang (which again has fallback)
This is, as said, provided for convenience...

If we excluded the label and description fields from the result, would you still see the module as broken?

At the very least, the documentation lacks a clear description of this complex language negotiation mechanism (uselang is not mentioned, the meaning of 'strictlanguage' is unclear, and the examples only refer to English).

The broken part of the design, however, is the lack of language tags in the response format. Either the language should correspond to uselang or it should be marked explicitly. How should a client know that DE falls back to EN, or know the language inferred from a user cookie? As I understand it now, a response can include strings in at least three different languages at the same time, so it would be better to tag every string with its language.
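As a rough illustration of what tagging could look like when fallback is applied, the sketch below resolves a term along a fallback chain and reports the language that was actually used. The chain table, the function name and the `{"language", "value"}` shape are assumptions made for this example; only "DE falls back to EN" comes from this discussion.

```python
# Assumed fallback chains for illustration; only de -> en is taken from
# the comments above.
FALLBACK_CHAINS = {"de": ["de", "en"]}


def resolve_with_fallback(terms, requested):
    """Return the first available term along the fallback chain, together
    with the language it is in, or None if nothing matches.
    'terms' maps language codes to text."""
    chain = FALLBACK_CHAINS.get(requested, [requested])
    for lang in chain:
        if lang in terms:
            return {"language": lang, "value": terms[lang]}
    return None


# Q49330 as described above: a German label, but only an English description.
labels = {"de": "Ärzte ohne Grenzen"}
descriptions = {"en": "organization"}

print(resolve_with_fallback(labels, "de"))
print(resolve_with_fallback(descriptions, "de"))
```

With the language reported next to each value, a client no longer needs to know the fallback rules to tell that the description is in English.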