Selecting user language in the REST API
Open, Needs Triage, Public

Description

The action API allows setting the user language with the uselang parameter. This is distinct from the API's own "interface language" (i.e. the language in which, for example, error messages are written), and influences content returned by the API (primarily, it influences how the MediaWiki parser behaves, and how MediaWiki interface messages are looked up). The difference is most relevant on multilingual wikis like Commons, where the content itself significantly depends on which language you are reading it in, but API calls that are used to create user interface dynamically also rely on this.

In practice, uselang is used to override MediaWiki's internal user interface language setting, which is used by the i18n system, the parser and a couple other things. The action API framework does this (in ApiMain) by setting the language of the main RequestContext, which is then queried by various MediaWiki classes in various places in a hard-to-predict manner (which is unfortunate but something we'll have to live with for a while).

How should this happen in the REST API? One option is to leave it to the handler to define the language as an explicit API parameter, and then set it. But 1) this would mean that every handler would have to predict that it has a dependency on the language, and declare it (arguably not a bad thing), 2) this would mean that anything that depends on a service that takes a language parameter (and gets that from RequestContext::getMain() at service instantiation time) behaves unpredictably as the language will depend on whether the service got instantiated before the handler was called.
(In theory, ServiceWiring should not use RequestContext. But also, in theory, services should not directly call RequestContext::getMain() themselves, which is static coupling and makes unit testing hard. There is no way today to avoid both and I don't think there's much consistency in which one a given piece of code prioritizes avoiding.)
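
To make the ordering problem in point 2 concrete, here is a minimal Python sketch. It is not MediaWiki code: `GlobalContext`, `GreetingService` and `get_greeting_service` are hypothetical stand-ins for the main RequestContext, a language-dependent service and lazy service wiring. It only illustrates how the result can depend on whether the service was created before the handler set the language.

```python
class GlobalContext:
    """Stand-in for a process-wide request context (hypothetical)."""
    language = "en"  # assumed site default


class GreetingService:
    """Captures the language once, at construction time."""

    def __init__(self, language):
        self.language = language

    def greet(self):
        return {"en": "Hello", "de": "Hallo"}.get(self.language, "Hello")


_instances = {}


def get_greeting_service():
    # Lazy wiring: the global language is read only on first use.
    if "greeting" not in _instances:
        _instances["greeting"] = GreetingService(GlobalContext.language)
    return _instances["greeting"]


def handle_request(requested_language, service_created_early):
    # Start every simulated request from a clean state.
    _instances.clear()
    GlobalContext.language = "en"
    if service_created_early:
        get_greeting_service()  # some earlier code instantiates the service
    GlobalContext.language = requested_language  # the handler sets the language
    return get_greeting_service().greet()


print(handle_request("de", service_created_early=False))  # Hallo
print(handle_request("de", service_created_early=True))   # Hello
```

The same request yields German in one case and English in the other, purely because of instantiation order.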

So, there would be value in providing a signal (an HTTP header, maybe; or it could rely on the user's language for authenticated requests) for setting the interface language early enough that it's guaranteed to be consistent during the service setup phase.

Related Objects

StatusSubtypeAssignedTask
In ProgressNone
OpenBPirkle

Event Timeline

See also T264777: Include error message translations in the user language in the REST API's error response. That's a different use case, but it might make sense to share the mechanism, and there will be caching considerations which should be fairly similar in the two cases.

I don't know much about the actual differences between the API's interface language and the "uselang language", but I would expect clients to be able to affect at least one of those with the Accept-Language header. Currently, providing the header has no effect on anything, not even error messages, AFAICS.

We have several tasks that mention different but related aspects of Core REST API language handling. It seems reasonable to discuss them together, so that we have a coherent, comprehensive plan. (We may still choose to split implementation into parts/phases.) Here are the tasks I see:

T269492: Selecting user language in the REST API (this task)
T311423: Provide a convenient way to obtain localized error messages in the JS REST API framework
T264777: Include error message translations in the user language in the REST API's error response

Combining the tasks, we need:

  1. a way for API callers to specify a language for the returned content (aka the "user language", similar in purpose to Action API's uselang)
  2. a way for API callers to specify a language for internal messages (aka "API interface language", similar in purpose to Action API's errorlang)
  3. reasonable fallbacks if one or both of these is not specified

Some relevant documentation:

https://www.mediawiki.org/wiki/API:Localisation
https://www.mediawiki.org/wiki/API:Errors_and_warnings
https://www.mediawiki.org/wiki/Manual:Language#User_interface_language
https://www.mediawiki.org/wiki/API:REST_API/Status_codes

Some implementation possibilities and considerations (mostly taken from suggestions by various people in the various tasks):

  • deal with language in individual handlers via explicit API parameter(s) (potentially suffers from inconsistent behavior)
  • use the user's language preference (only applicable to authenticated requests)
  • use one or more HTTP headers, maybe including Accept-Language

Presumably, as with Action API, we will set the language of the main RequestContext, which is then queried by various MediaWiki classes in various places in a hard-to-predict manner. It would be best if whatever we chose was processed early enough that it's guaranteed to be consistent during the service setup phase. Otherwise, anything that depends on a service that takes a language parameter (and gets that from RequestContext::getMain() at service instantiation time) would behave unpredictably as the language will depend on whether the service got instantiated before the language specification was processed. (Most of the preceding paragraph is mix-and-match cut-and-paste from Tgr's task description.)

I did a bit of internet searching to see how others had dealt with this, to no avail. I saw people who simply didn't provide translations for error messages at all, and people who used the content language (usually specified via the Accept-Language header, sometimes via an explicit URL parameter). That was no help, but if anyone is aware of prior art that we might take inspiration from, I'd be happy to see it.

I'm tempted to use Accept-Language for the equivalent of uselang, but I'm not aware of an equivalent standard header that we could use to specify the desired language for error messages, and it seems inconsistent to use a header for a uselang equivalent and something else for the errorlang equivalent. I suppose we could define a nonstandard header, although I'm not very enthusiastic about it.

In the interest of moving the discussion forward, and assuming we did use a nonstandard header, things could work something like this:

Content (aka "user language"):

  1. is Accept-Language specified? Try to use it.
  2. else, is the user authenticated? Try to use their language preference
  3. else, use the site content language

The code for whatever language is used can be returned to the client via the Content-Language header.
Errors (aka "API interface language")

  1. is Our-Custom-Error-Language-Header specified? Try to use it.
  2. else, is the user authenticated? Try to use their language preference
  3. else, use the site content language

The code for whatever language is used can be returned to the client as part of a new field in the message response schema, maybe something like "MessageLanguage" or "MessageLanguageCode".
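
The two fallback chains above could be sketched roughly as follows. This is an illustration in Python, not a proposed implementation: the `SUPPORTED` set, the `Error-Language` header name and the helper functions are all placeholders, and "try to use it" is interpreted here as "use it if it is a known language code, otherwise fall through".

```python
SITE_CONTENT_LANGUAGE = "en"              # assumed site default
SUPPORTED = {"en", "de", "fr", "hu"}      # assumed set of usable language codes
ERROR_LANGUAGE_HEADER = "Error-Language"  # placeholder name for the custom header


def usable(code):
    """Return the normalized code if it is usable, else None."""
    if code and code.strip().lower() in SUPPORTED:
        return code.strip().lower()
    return None


def first_accept_language(value):
    """Take the first tag of an Accept-Language value, ignoring q-values."""
    if not value:
        return None
    return value.split(",")[0].split(";")[0].strip()


def resolve(header_value, user_preference):
    # 1. explicit header, 2. preference of an authenticated user, 3. site default
    return usable(header_value) or usable(user_preference) or SITE_CONTENT_LANGUAGE


def content_language(headers, user_preference=None):
    """user_preference is None for anonymous requests."""
    return resolve(first_accept_language(headers.get("Accept-Language")), user_preference)


def error_language(headers, user_preference=None):
    return resolve(headers.get(ERROR_LANGUAGE_HEADER), user_preference)


print(content_language({"Accept-Language": "de,fr;q=0.9"}))  # de
print(content_language({}, user_preference="hu"))            # hu
print(error_language({}))                                    # en
```

The value returned by `content_language` is what would be echoed back in Content-Language under this proposal.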

Note: I propose the above mostly to give us something tangible to criticize. Please ruthlessly pick it apart and suggest alternatives.

What did I miss (or misrepresent) in that giant wall of text?

The action API defaults to the user's language preference for uselang (and then defaults to the value of uselang for errorlang). I think that was a mistake; it makes all logged-in API requests uncacheable. That's not that tragic for the action API which has a cache-unfriendly URL structure anyway (although T138093: Investigate query parameter normalization for MW/services, which I think @ori is now working on, could fix that), but caching is supposed to be the core strength of the REST API. I would rather shove the responsibility of always setting the language headers/parameters to the client.

Similarly, the problem with using Accept-Language is that it's a noisy header, which makes using it for Vary: not really cache-friendly. E.g. my browser automatically adds Accept-Language: en-US,en;q=0.9,hu;q=0.8 to all AJAX requests. Someone who has configured their browser to accept English but not Hungarian would get Accept-Language: en-US,en;q=0.9 instead - a cache split, even though both responses are served in English.
(Maybe that's not that much of a problem in practice because few people customize their browsers' language settings? We can probably pull some stats on this from the analytics webrequest data. Also, I suppose Varnish could normalize the header, although that would still leave vanilla MediaWiki installations with not-so-useful API mechanics.)
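
As an illustration of the normalization idea, the following Python sketch collapses a noisy Accept-Language value to a single supported language code, so that the two example headers above map to the same cache key. The `SUPPORTED` set and the function names are assumptions made for the example; this is not how Varnish or MediaWiki currently behave.

```python
SUPPORTED = {"en", "hu", "de"}  # assumed set of languages the site can serve


def parse_accept_language(value):
    """Return (tag, q) pairs sorted by descending q; ties keep header order."""
    entries = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        entries.append((tag.strip().lower(), q))
    return sorted(entries, key=lambda entry: -entry[1])  # sorted() is stable


def normalize(value, default="en"):
    """Collapse an Accept-Language value to one supported code."""
    for tag, q in parse_accept_language(value):
        if q <= 0:
            continue
        if tag in SUPPORTED:
            return tag
        base = tag.split("-")[0]
        if base in SUPPORTED:
            return base
    return default


print(normalize("en-US,en;q=0.9,hu;q=0.8"))  # en
print(normalize("en-US,en;q=0.9"))           # en
```

A cache that varies on the normalized value instead of the raw header would store one English response for both clients.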

Another, smaller issue with Accept-Language is that it is essentially a fallback chain, but MediaWiki has its own fallback mechanism - there isn't really a concept of unsupported language (unless it's a language in which MediaWiki localization has never even started), so if the browser sends Accept-Language: de,fr;q=0.9 then the expected behavior according to the spec would be to send the response in German if it's available and French if it's not, but MediaWiki's message layer, parser etc. just take the de parameter and return a string in who-knows-what language (English in that case, if the German version does not exist, but some other languages have more sophisticated fallback chains). It won't break anything but it is kinda unintuitive behavior.
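
The mismatch can be shown with a small Python sketch. The message table and fallback chains below are made-up sample data (only the German-to-English fallback is taken from the example above), and neither function is MediaWiki code.

```python
# Sample data: the "greeting" message has no German translation.
AVAILABLE = {"greeting": {"en": "Hello", "fr": "Bonjour"}}
FALLBACKS = {"de": ["en"], "fr": ["en"]}  # illustrative server-side fallback chains


def spec_style(key, preferences):
    """Walk the client's preference list; return the first available translation."""
    for lang in preferences:
        if lang in AVAILABLE[key]:
            return lang, AVAILABLE[key][lang]
    return "en", AVAILABLE[key]["en"]


def server_fallback_style(key, preferences):
    """Take only the top preference, then follow the server's own fallback chain."""
    top = preferences[0]
    for lang in [top] + FALLBACKS.get(top, []) + ["en"]:
        if lang in AVAILABLE[key]:
            return lang, AVAILABLE[key][lang]
    raise KeyError(key)


# Client sent "Accept-Language: de,fr;q=0.9"
print(spec_style("greeting", ["de", "fr"]))             # ('fr', 'Bonjour')
print(server_fallback_style("greeting", ["de", "fr"]))  # ('en', 'Hello')
```

The client would expect French as its second choice, but a server that only looks at the first tag and then applies its own chain answers in English.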

The code for whatever language is used can be returned to the client via the Content-Language header.
...
The code for whatever language is used can be returned to the client as part of a new field in the message response schema, maybe something like "MessageLanguage" or "MessageLanguageCode".

Per above, this wouldn't necessarily be the language that was actually used. I think that's fine (or at least infeasible to improve), just noting.

Errors (aka "API interface language")

  1. is Our-Custom-Error-Language-Header specified? Try to use it.
  2. else, is the user authenticated? Try to use their language preference
  3. else, use the site content language

Nit: I would probably fall back to Accept-Language if that's set but the custom header is not. No extra caching pains there, and it seems rare that someone would want the content in a given language but the errors in another language. We could always add the English error message as an extra field for developer-friendliness.

@Krinkle maybe you have thoughts on the importance of making logged-in REST API calls cacheable; AIUI there are long-term plans / aspirations to make the logged-in UI rely more on the API and client-side composition and get rid of the currently fairly significant performance penalty for being logged in.

I guess one thing in favor of uncacheable-if-logged-in APIs is that it's harder to cause accidental info leaks, where the API author does not realize that some service that's being used returns different content based on user permissions (e.g. page history for oversighters) and the response gets cached and shown for unprivileged users. But there are probably better ways of protecting against that.

Thanks for looking into this. I don't have strong opinions on the approach. I would maybe prefer a standardized solution (like Accept-Language) to a custom one (e.g., custom header). At the same time, if we were to use the standard solution we would have to make sure that we comply with the standards. If that is not the case (as noted in T269492#8110881 re fallback chains), then I'd prefer the custom solution.

En passant, this conversation about Accept-Language, Vary, and caching reminded me of T294848, a complex bug that affects the action API. I assume the REST API could also be affected.

Daimona said:

I would maybe prefer a standardized solution (like Accept-Language) to a custom one (e.g., custom header). At the same time, if we were to use the standard solution we would have to make sure that we comply with the standards.

Agreed. As I understand it, we want the ability to specify different languages for content vs errors. I couldn't find a standard solution for that. Accept-Language was the closest I could find, but it only does half the job, and Tgr did a very effective job of pointing out its shortcomings.

Also, thank you for referencing T294848. The fact that the task includes a flowchart with eleven symbols and numerous branches, then also says the flowchart is a "simplified version", really highlights the complexity of all this. I also was not previously aware that $wgUsePigLatinVariant was a thing, so there's that. More practically, this task brings forward some caching-related pitfalls that we should be careful to not replicate.

I didn't even talk about variants in my above comment. While I'm unsure how valuable variant support for error messages is (compared to variant support for content) it would seem surprising and confusing if we had two systems for specifying language/variant that worked differently from each other. So unless there's a compelling reason to do otherwise, I'd prefer the language/variant specification for both content and errors to be as consistent as possible.

Tgr said:

AIUI there are long-term plans / aspirations to make the logged-in UI rely more on the API and client-side composition and get rid of the currently fairly significant performance penalty for being logged in.

Even if we're not doing that in the short or medium term, we should avoid any decisions that would preclude/complicate it. Hopefully we can design a comprehensive solution that would support the use case that Tgr mentions, even if we don't implement all of it at this time.

Tgr also said:

I guess one thing in favor of uncacheable-if-logged-in APIs is that it's harder to cause accidental info leaks, where the API author does not realize that some service that's being used returns different content based on user permissions (e.g. page history for oversighters) and the response gets cached and shown for unprivileged users. But there are probably better ways of protecting against that.

I wonder if making caching for logged-in users opt-in, by requiring API authors to take special steps to enable it for their endpoint/handler/functionality would help. It certainly wouldn't eliminate the possibility of leakage, as mistakes are always possible, especially as code is changed over time. And I haven't even started to think about how opt-in would be implemented. But it might be worth considering.

With all that said, I'm now leaning away from Accept-Language, if for no other reason than that having our own custom fallback system makes it problematic to respect Accept-Language in a standard way.

Tgr said this in the task description:

there would be value in providing a signal (an HTTP header, maybe; or it could rely on the user's language for authenticated requests) for setting the interface language early enough that it's guaranteed to be consistent during the service setup phase.

@Tgr, did you specifically mention a header because that would be more readily examined very early compared to a url parameter? Does that lead us towards (cringe) one complex custom header that supports language/variant choices for both content and errors? Or even (double-cringe) two custom headers?

I'd like to add some points that come to mind in the context of the work I'm currently doing on REST endpoints that return page content HTML:

When serving page content HTML, we have content in language A1 but the client may want language A2. We may then apply some transformation to convert the output.

In addition, the page may contain bits and pieces in the user's UI language B, which we may want to be able to override to be C using some parameter. Errors would also be reported in B or C, respectively.

The Parsoid endpoints are currently using Accept-Language to trigger language variant conversion. Page content in the user language is not supported at all in Parsoid, afaik. We will have to find a solution for that soon, I suppose. Not sure how it handles the language of error messages...

I have only skimmed the discussion above. I expect that we'll dig into the topic of language variants some time in September.

I didn't even talk about variants in my above comment. While I'm unsure how valuable variant support for error messages is (compared to variant support for content) it would seem surprising and confusing if we had two systems for specifying language/variant that worked differently from each other. So unless there's a compelling reason to do otherwise, I'd prefer the language/variant specification for both content and errors to be as consistent as possible.

I think as far as i18n messages are concerned, variants are just separate languages. We have separate message files and separate Translatewiki message pages for most variant languages (including auto-convertible variants like sr-ec/sr-el) although seemingly not all (there's just one message file for shi for example)? Not sure what's up with that.

I wonder if making caching for logged-in users opt-in, by requiring API authors to take special steps to enable it for their endpoint/handler/functionality would help. It certainly wouldn't eliminate the possibility of leakage, as mistakes are always possible, especially as code is changed over time. And I haven't even started to think about how opt-in would be implemented. But it might be worth considering.

Making caching opt-in would mean sending Cache-Control: private or such for authenticated requests AND sending Vary: Authorization, Cookie (or such; AuthManager can provide the exact list of headers) on all requests (since otherwise some downstream cache might serve a previous anonymous response to a logged-in user - that's not a security leak, but an error nonetheless). The problem is that Vary is a very blunt tool: varying on all cookies would break caching entirely (Wikimedia production has Varnish logic to deal with this but the average installation wouldn't), and the Key header that would improve this is very far from being standardized, so as long as the REST API supports cookie-based authentication, I don't think this is plausible.
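
For reference, the header combination described above could look roughly like this. It is a Python sketch under stated assumptions: the `handler_opted_in` flag, the `max_age` default and the use of `public` for cacheable responses are illustrative choices, not an existing REST framework API.

```python
def cache_headers(authenticated, handler_opted_in, max_age=300):
    """Build caching headers for a response.

    Vary is sent on every response so that a downstream cache does not
    serve a stored anonymous response to a logged-in user.
    """
    headers = {"Vary": "Authorization, Cookie"}
    if authenticated and not handler_opted_in:
        # Default for logged-in requests: shared caches must not store this.
        headers["Cache-Control"] = "private"
    else:
        headers["Cache-Control"] = f"public, max-age={max_age}"
    return headers


print(cache_headers(authenticated=True, handler_opted_in=False))
print(cache_headers(authenticated=False, handler_opted_in=False))
```

The sketch also shows why this is unattractive: the Vary line is present on anonymous responses too, which is exactly what fragments the cache on every cookie.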

The alternative would be to require opt-in for authentication - by default there is no Vary header but the framework prevents the user from being detected, even if there is a cookie or bearer token in the request. That seems reasonably easy to do (and might also allow setting MW_NO_SESSION for such endpoints which is nice for performance, and would fix problems like T264631) but seems potentially quite confusing when authentication just does not work for no easily discernible reason. I guess some kind of warning could be used for that.

@Tgr, did you specifically mention a header because that would be more readily examined very early compared to a url parameter? Does that lead us towards (cringe) one complex custom header that supports language/variant choices for both content and errors? Or even (double-cringe) two custom headers?

I don't think there is a difference between headers and query parameters in how easy it is to examine them early on. There is not much difference between them in general beyond aesthetics, other than URL being marginally more manual-debugging-friendly and easier to communicate (you can't put headers in a hyperlink), and headers being cache-friendlier (but we do want to split the cache on language, so it doesn't matter).

The Parsoid endpoints are currently using Accept-Language to trigger language variant conversion. Page content in the user language is not supported at all in Parsoid, afaik. We will have to find a solution for that soon, I suppose. Not sure how it handles the language of error messages...

I don't think the REST API supports error message localization at all.

By "page content in the user language" you mean passing a non-default language to the parser, right? For things like {{int:lang}} or <translate>?

it would seem surprising and confusing if we had two systems for specifying language/variant that worked differently from each other.

Unfortunately, we do have two systems, and they do work differently.

I don't think the REST API supports error message localization at all.

It does, see for example https://de.wikipedia.org/w/rest.php/v1/page/as,djfasdkj/with_html

By "page content in the user language" you mean passing a non-default language to the parser, right? For things like {{int:lang}} or <translate>?

Yes, exactly.

The output may depend on two languages: the page's content language (with variant applied) and the user's interface language (possibly overridden with uselang). I don't see a way to do this with a header.

In the task description, you write:

In theory, ServiceWiring should not use RequestContext. But also, in theory, services should not directly call RequestContext::getMain() themselves, which is static coupling and makes unit testing hard. There is no way today to avoid both and I don't think there's much consistency in which one a given piece of code prioritizes avoiding.

The way to avoid both is to pass the language as a parameter to the service, directly or indirectly. This would typically take the form of an IContextSource being passed to a factory method, so the thing returned by the factory has access to a context.
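
A minimal Python sketch of that pattern, with hypothetical names (`Context` stands in for something like an IContextSource, and `FormatterFactory` for a request-independent service):

```python
class Context:
    """Carries per-request state such as the language (stand-in type)."""

    def __init__(self, language):
        self.language = language


class Formatter:
    """Request-scoped object; reads the language from the context it was given."""

    _MESSAGES = {"en": {"greeting": "Hello"}, "de": {"greeting": "Hallo"}}

    def __init__(self, context):
        self._context = context

    def format(self, key):
        messages = self._MESSAGES.get(self._context.language, self._MESSAGES["en"])
        return messages[key]


class FormatterFactory:
    """Holds no request state, so it is safe to create during service wiring."""

    def new_formatter(self, context):
        return Formatter(context)


factory = FormatterFactory()  # created once, before any language is known
print(factory.new_formatter(Context("de")).format("greeting"))  # Hallo
print(factory.new_formatter(Context("en")).format("greeting"))  # Hello
```

Because the factory never reads a global context, the order in which services are instantiated no longer affects the result.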

I would hope we can fix any service objects that rely on the interface language, rather than design our REST API around their shortcomings.

It does make sense to me to think about the usage of Accept-Language in the REST API, but it also seems like the "desired language" may mean very different things to different endpoints: e.g. an endpoint returning rendered page content could use it for variant support (this is what Parsoid does in RESTBase, and we'll probably port that to core); but an endpoint returning interface messages for use on the client side would interpret it as the desired interface language, and apply language fallback rather than variant conversion. But for that endpoint, perhaps the language should be a path parameter, and not come from a header? I see no one size that would fit all.

Btw, returning error messages in localized form to the client removes a big burden from the client-side code. The need to inform the user about an error in their own language is very common, and it's next to impossible to do right on the client side, at least without making another request to the server (which would specify the user language). I could imagine an Error-Language header that could be used for this.

Perhaps we should have a mechanism in the REST framework that ensures that localized error responses can never be cacheable. But 4xx responses aren't commonly cached anyway, right?

I will just throw into the mix that localized error messages aren't always very useful/used. I feel it useful to call the two languages supported by MW the "user interface language" and the "content language" (which can be set on a page-by-page basis by eg the translate extension, but has a per-wiki default). Often extensions will spit out error messages in the "content language" not the "UX language" and indeed I think that's not an obviously incorrect choice: extensions appear in the content area, it makes sense to use the content language. *Most* of the use of the user interface language is in the skin, not the content area.

Of course there are exceptions, esp where workflow is implemented with templates or in multilingual wikis where editors are explicitly authoring "UX" not "content". And there are some (not really well defined) mechanisms to do that. But I think those should be treated as exceptions, not the rule, and if they end up uncatchable as a result so be it.

And as noted, content and UX also differ in their treatment of variants. UX messages are manually localized into variants (possibly with a fallback chain) so there's an explicit localized message for (eg) zh-cn which we use. Content is always "pan lingual" so we need to invoke language converter to get zh-cn output from the content area (which is thought of as "ZH" but in reality is a random mix of all possible ZH variants interleaved, and language converter is supposed to sort the mix out).

Accept-Language in the content APIs affects the variant of the content because in general (exceptions noted above) the content APIs don't contain "UX" content.

It would probably make more sense for the top-level "with skin" APIs to have Accept-Language affect the "UX" language, and that is what uselang currently does, I believe. I think there's also a "variant" URL parameter that can be specified as well, but that's probably a misfeature rather than a feature. The fully-specified variant name always implicitly defines a base language name as well, so there's no need to specify both. (I vaguely recall that this wasn't always the case and I had to fix an instance of duplicate variant names back in the dark ages, which is why historically we might have had separate uselang and variant parameters.)

The way to avoid both is to pass the language as a parameter to the service, directly or indirectly. This would typically take the form of an IContextSource being passed to a factory method, so the thing returned by the factory has access to a context.

On paper, maybe. In reality the language will be needed whenever a Message object is stringified, which is basically everywhere, and it's not reasonable to add a language parameter to every public method in every interface. Removing hidden dependencies on user language is a good long-term goal, but I don't think we are anywhere near it.

It does make sense to me to think about the usage of Accept-Language in the REST API, but it also seems like the "desired language" may mean very different things to different endpoints: e.g. an endpoint returning rendered page content could use it for variant support (this is what Parsoid does in RESTBase, and we'll probably port that to core); but an endpoint returning interface messages for use on the client side would interpret it as the desired interface language, and apply language fallback rather than variant conversion. But for that endpoint, perhaps the language should be a path parameter, and not come from a header? I see no one size that would fit all.

I don't think those cases are conceptually that different. Certainly not from the caller's point of view: you pass a BCP-47 language tag, you want to get the response in a language as close to it as possible. Whether that involves some kind of transformation is an internal detail (which can be pretty different from language variant to language variant - some involve the same kind of "include multiple languages in wikitext and select the right one" logic as Commons templates, just with different wikitext markup; some do fully automatic transcription).

I'll also add that it's hard for me to imagine a future for MediaWiki where machine translation (during reading, not as an authoring helper tool) won't be involved. The strategy forum uses machine translation for discussion, and MediaWiki not supporting something similar is clearly seen by people as a shortcoming. So I don't think automatic conversion of content is conceptually limited to variants of the same language, we just don't support it for other things currently.

More pragmatically, if we want to set the user language (RequestContext::getLanguage()), which in an ideal world wouldn't be a thing, but I don't think we can get rid of it soon – that cannot be a concern left to individual API handlers, as they are probably invoked too late for that.

Perhaps we should have a mechanism in the REST framework that ensures that localized error responses can never be cacheable. But 4xx responses aren't commonly cached anyway, right?

5xx responses should not be cached. For 4xx it's not uncommon (with suitably short expiry), and it's useful when generating the error response is not cheap (e.g. article gets renamed without redirect during a massive traffic spike, and then you can't afford doing a DB lookup for every 404).

I will just throw into the mix that localized error messages aren't always very useful/used. I feel it useful to call the two languages supported by MW the "user interface language" and the "content language" (which can be set on a page-by-page basis by eg the translate extension, but has a per-wiki default). Often extensions will spit out error messages in the "content language" not the "UX language" and indeed I think that's not an obviously incorrect choice: extensions appear in the content area, it makes sense to use the content language. *Most* of the use of the user interface language is in the skin, not the content area.

Should an API really have to know whether it's being used to construct UI or content, though? It seems like the caller's responsibility to decide which language to use (and as I said above we should move away from the user preferences being invisibly applied on the server side anyway, as it makes the API uncacheable).

Of course there are exceptions, esp where workflow is implemented with templates or in multilingual wikis where editors are explicitly authoring "UX" not "content". And there are some (not really well defined) mechanisms to do that. But I think those should be treated as exceptions, not the rule, and if they end up uncatchable as a result so be it.

I don't think it's reasonable to call them exceptions; it's how multilingual wikis work. That there doesn't exist a non-hacky mechanism for multi-linguality in wikitext that's too dynamic for the Translate extension is a shortcoming of MediaWiki, not a shortcoming of the use case. And API responses for Commons ending up uncacheable (not sure if that's what you meant to say, but that would be my concern about treating it as an exception) is not really an acceptable outcome IMO.

It would probably make more sense for the top-level "with skin" APIs to have Accept-Language affect the "UX" language, and that is what uselang currently does, I believe.

Do we want to have "with skin" REST APIs? I would instead expect separate API calls used for content and UI (which have very different caching characteristics), with the client piecing its own "skin" together.

I think where that leaves us is that Accept-Language (or whatever mechanism we end up with for "main" language) should both set user language (ie. RequestContext::getLanguage()) for parsing and such, and apply language conversion if that makes sense for the given endpoint. And the error language should be set with a separate mechanism, and ignore RequestContext::getLanguage() (which I think it thankfully does, as REST handlers use the new MessageValue mechanism, not Message).

I think there's also a "variant" URL parameter that can be specified as well, but that's probably a misfeature rather than a feature.

It would be nice to use a single BCP-47 language tag instead of separate base language + variant. But we would have to make sure that language tag is easily accessible to the client, which I don't think is necessarily the case currently.

Here are some loose thoughts from a Platform Team discussion on this topic today. (If anyone in that discussion sees anything I misrepresented, please correct me.) I have not yet carefully read the preceding two replies, so some of this may contradict them or have already been answered. Just wanted to get these thoughts posted before I lose the sticky note that I recorded them on. :)

  • making service initialization not depend on the request is a stated, intentional goal. Some things discussed on this task would be contrary to that goal
  • MediaWiki bootstrap (as it applies to this task) goes basically like this:
    1. load config
    2. load extensions
    3. service initialization
    4. initialize session
    5. load user
    6. load user preferences
  • nothing that involves user preferences can happen until #6 above, which puts it after all those other things. (That doesn't mean, of course, that we couldn't do some things with language earlier and others later, but then we're spreading out language-related code.)
  • some of our questions may be solvable via endpoint design. Rather than making endpoints handle multiple languages (e.g. UX and content), can we split into multiple endpoints, each of which only deals with one language (e.g. separate endpoints for content and UX)? There are well-known cases where this wouldn't work (such as where UX is embedded in content so you can't realistically ask for it separately), but are there so many that they couldn't be handled as exceptions rather than making language handling more complex for all endpoints?
  • errors can happen very early, and outside the "normal" code flow, so error language may deserve different treatment than content/ux language
  • a quick log search suggests that about 600 of 10.5 million Action API requests used the errorlang parameter. That's about 0.006%. So people do use it, but apparently not much.
  • as mentioned in previous comments, Accept-Language is not very cache-friendly. Could the edge cache be made aware enough of the relationship between request and response to mitigate this by splitting the cache only on relevant differences and not on every permutation of Accept-Language?
  • as mentioned in previous comments, headers are generally more cache friendly than query parameters

@Tgr said:

The action API defaults to the user's language preference for uselang (and then defaults to the value of uselang for errorlang). I think that was a mistake; it makes all logged-in API requests uncacheable. That's not that tragic for the action API which has a cache-unfriendly URL structure anyway (although T138093: Investigate query parameter normalization for MW/services, which I think @ori is now working on, could fix that), but caching is supposed to be the core strength of the REST API. I would rather shove the responsibility of always setting the language headers/parameters to the client.

Reading back over the comments, @Tgr, can you clarify exactly which part you think was a mistake? My understanding of your meaning is that falling back to user preference means that, for logged in users, the response does not predictably depend on the request and the response is therefore uncacheable, and that this could be avoided in the REST API by not falling back to the user preference. Presumably, if the request did not specify the language, we'd fall back to the project language. Did I understand correctly, or did you mean something else?

If I did understand correctly, wouldn't other user preferences influence the response, thereby at least sometimes still rendering responses for logged-in users uncacheable? I'm wondering if we'd introduce confusion by honoring some user preferences but not others, for insufficient gain. But maybe language is significant enough and/or other preferences affect a small enough number of requests to justify handling language differently?

Of course there are exceptions, especially where workflow is implemented with templates or in multilingual wikis where editors are explicitly authoring "UX" not "content". And there are some (not really well defined) mechanisms to do that. But I think those should be treated as exceptions, not the rule, and if they end up uncacheable as a result so be it.

I don't think it's reasonable to call them exceptions; it's how multilingual wikis work. That there doesn't exist a non-hacky mechanism for multi-linguality in wikitext that's too dynamic for the Translate extension is a shortcoming of MediaWiki, not a shortcoming of the use case. And API responses for Commons ending up uncacheable (not sure if that's what you meant to say, but that would be my concern about treating it as an exception) is not really an acceptable outcome IMO.

I know embarrassingly little about how Commons handles multiple languages. I think you're referring to (at least) the sort of thing described here, and maybe also here. Am I on the right track, or am I misunderstanding? Are there any particular docs I should look at to better understand?

  • mediawiki bootstrap (as it applies to this task) goes basically like this:
    1. load config
    2. load extensions
    3. service initialization
    4. initialize session
    5. load user
    6. load user preferences

The last two steps actually happen on demand, whenever the current user is accessed (usually via RequestContext::getMain()->getUser()). If it is accessed before session initialization, which can happen in some hooks, an anonymous user / default preferences are silently returned. Fun times.

Reading back over the comments, @Tgr, can you clarify exactly which part you think was a mistake? My understanding of your meaning is that falling back to user preference means that, for logged in users, the response does not predictably depend on the request and the response is therefore uncacheable, and that this could be avoided in the REST API by not falling back to the user preference.

Exactly.

Presumably, if the request did not specify the language, we'd fall back to the project language.

Arguably to English for error language (or maybe that should be in multiple languages in the first place); to content language for the other kind(s?) of language.

If I did understand correctly, wouldn't other user preferences influence the response, thereby at least sometimes still rendering responses for logged-in users uncacheable? I'm wondering if we'd introduce confusion by honoring some user preferences but not others, for insufficient gain. But maybe language is significant enough and/or other preferences affect a small enough number of requests to justify handling language differently?

Yes, that's a potential problem for user preferences, and also for user privileges (a more common problem, I think, since most of our APIs are content-related and most content is affected by privileges in some way, revdel/oversight if nothing else); both can influence behavior deep down in the bowels of MediaWiki, in some service dependency the API handler might not even be aware of. Making things uncacheable for logged-in users is how the action API handles it, and the web interface as well. That results in a fairly significant performance penalty for logged-in users, who are always served by an appserver, never by edge cache. Areas which are far from our main data centers rely more on edge cache (ie. most of the world that's not the US), so the penalty is even more significant there. Moving away from that situation should be a long-term goal IMO, and the plan that's usually given is moving away from the PHP frontend layer and relying on client-side composition of the UI from API calls (in the glorious future where we can run the same JS code in the client and the server, so we have a non-JS fallback). Those API calls then need to be cacheable, at least the ones which need to be repeated every time the user visits a new page (while the ones specific to the user and not the content would be remembered by the client across requests).

This is certainly not a short-term concern, but also not something where it's easy to course-correct later, I think. (Although maybe this is the kind of thing API versioning is for, and we shouldn't worry too much about it for now?)

I think you're referring to (at least) the sort of thing described here, and maybe also here.

Yes, https://commons.wikimedia.org/wiki/Commons:Localization#Content_internationalization_methods specifically. Which, as you can see there, is a mess – they use several different methods (including making the MediaWiki parser split by user language; including all translations in wikitext and letting client JS select the right one; or relying on Wikibase language logic), whichever the author of the given template likes best. The one that's relevant to this discussion is the parser-based one (used in {{LangSwitch}} IIRC) - creating a user-defined MediaWiki system message (MediaWiki:Lang) which is just the language code, adding it to $wgForceUIMsgAsContentMsg (a good contender for worst named MediaWiki configuration variable), and then using {{int:lang}} in wikitext which will resolve to the user's language (and split parser cache). It's a horrible hack. Also, most of the Commons UI relies on it.

Special:MyLanguage (the Translate extension's solution for showing content in the user's language) is also something that would be nice to make cacheable in the long-term future.

FWIW: Special:MyLanguage was moved to MediaWiki core a while ago.

There is also T58464: Allow anonymous users to change interface language on Wikimedia wikis with ULS to address a big gap in functionality. Maybe there could be some benefit if the solution for both ULS and REST API could be the same. It seems a cookie would be the solution for the former in the great future where it is possible to vary by a specific cookie.

I'm going to make another attempt at summarizing discussion and making a proposal, based on comments on this thread, plus some of my own thoughts. Thoughtful disagreement is encouraged.

SUMMARY
(sorry, this is long, but we've talked about a lot)

We have the following distinct things that can be affected by language choice:

  • content language (the language used for the main content of the page)
  • UX language (user-facing language for the non-content portion of the page, mostly skin)
  • API interface language (aka error messages, which are not necessarily user facing, but could be)

There are some complex cases (such as templates on Commons, or certain Parsoid needs) that mix those things. A general case solution to mixed things is challenging.

The Accept-Language header is the standard way of specifying language. It has strengths and weaknesses for our situation.

Pros:

  • Standard, well-known header
  • Supported by browsers
  • Includes a fallback mechanism
  • Generally more cache friendly than a query parameter

Cons:

  • Insufficient for the totality of our language specification needs
  • Fallback mechanism does not match MediaWiki's fallback mechanism, resulting in confusing behavior that could be considered a bug
  • Artificially splits the cache, which could be even worse than no caching (if it interferes with existing caching)

The Accept-Language header (if we use it) need not mean the same for every endpoint. Different endpoints could interpret it in different ways. For example, an endpoint that provides content could understand it to refer to content language and an endpoint that provides UX could understand it to refer to the UX language. The consistent thing would be that Accept-Language specifies the language used by the endpoint for whatever the endpoint's main function is.

If we choose to deviate from Accept-Language's fallback mechanism (by using our own existing one instead) we could document this. Alternatively, if we decide that using Accept-Language would mean we must faithfully adhere to its standard fallback system, then we cannot realistically use Accept-Language.
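
To make the mismatch between the two fallback mechanisms concrete, here is a minimal sketch. It assumes the header has already been parsed into an ordered list of tags, uses exact matching only, and relies on a made-up `SITE_FALLBACKS` table; the function names and the chains are hypothetical and not taken from MediaWiki.

```python
# Illustrative fallback chains only; not copied from any real configuration.
SITE_FALLBACKS = {
    "bar": ["de", "en"],
    "pt-br": ["pt", "en"],
}


def pick_by_header_order(preferences, available, default="en"):
    """Walk the client's preference list in order (exact matches only)."""
    for tag in preferences:
        if tag in available:
            return tag
    return default


def pick_by_site_fallback(preferences, available, default="en"):
    """Take only the client's first choice, then follow the site's own chain."""
    if not preferences:
        return default
    first = preferences[0]
    for tag in [first] + SITE_FALLBACKS.get(first, []):
        if tag in available:
            return tag
    return default


if __name__ == "__main__":
    prefs = ["bar", "en"]
    available = {"de", "en"}
    print(pick_by_header_order(prefs, available))
    print(pick_by_site_fallback(prefs, available))
```

With these inputs the two strategies disagree: walking the header picks "en", while the site chain picks "de". That is the kind of difference that would need to be documented.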

We would prefer that endpoints return either content or UX, not both. This means they need only one (non-error) language to be specified. Accept-Language would therefore be sufficient for both the content and UX cases.

Even if we decide to use Accept-Language for content/UX (aka "main language"), it is not sufficient to also handle the "API interface language" (aka error language) in the same request. This is because callers may want to specify different languages for the "main language" vs the "error language". However, if the Action API's errorlang can be used as a guide, then specifying a different error language is very rarely used (a fraction of a percent of calls). This is rare enough that it might be acceptable for them to be uncacheable (although we should still prefer them to be cached when it makes sense to do so).

Implementation may be tricky, whatever we decide. MediaWiki bootstrap/service initialization is moving in a favorable direction (away from global state and towards dependency injection), but that transition is not complete.

We may or may not want to respect the language preference for logged in users. Respecting it is problematic for caching, ignoring it while still respecting other user preferences could be confusing. In general, caching for logged-in users is desirable, but challenging, and deserves careful thought. However, as long as whatever we decide on language specification does not limit our possibilities, we can consider that question separately.

We do not necessarily need to implement every bit of whatever we plan all at once. Incremental implementations may be possible.

Providing a general mechanism for language handling does not prohibit special-case handling by individual REST endpoint handlers that have an unavoidable need to deviate. Some Parsoid endpoints that need to separately specify the content and UX languages may need to do this.

We should keep in mind that our decisions may affect third parties differently than WMF.

PROPOSAL FOR LANGUAGE SPECIFICATION

  • in the general case, use Accept-Language for all non-API-interface language specification (in practical terms, that means use Accept-Language for everything except error language)
  • document that MediaWiki may use its own fallback mechanism (instead of Accept-Language's)
  • in the special case of endpoints that need to separately specify content and UX language, let the endpoint define its own behavior
  • document that endpoints should generally return either content or UX, and therefore generally only need one language to be specified
  • use a custom Error-Language header for specifying API interface language (aka error language)
  • document that callers should only specify Error-Language when truly needed, as it may have negative effects on performance/caching

Note: I intentionally did not propose anything regarding implementation details, as I wanted to first see how much consensus (if any) we have on the "what".

Seems like a good proposal to me (but it would be nice to get wider feedback at some point).

Wrt caching: ideally we want to split cache on the first language only, as opposed to the full header. Sites with an edge cache (including Wikimedia) should be able to do that easily (as long as we clearly document the need). Sites without an edge cache probably don't have much reason to care about it: the site's own infrastructure won't be affected, the browser's caching ability won't be affected since it always sends the same languages, and any midway caches (probably operated by the ISP or the phone/browser vendor) can deal with that much extra storage need. So on second thought I think we'd be fine there (with the help of some custom Varnish logic).
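
As a rough illustration of splitting the cache on the first language only, the sketch below reduces an Accept-Language header to its highest-weighted tag before building a cache key. In practice this logic would live in the edge cache (for example as Varnish configuration), not in Python; `cache_language` and `cache_key` are hypothetical names, and the key format is an assumption.

```python
def cache_language(accept_language, default="en"):
    """Return the highest-weighted tag of an Accept-Language header."""
    best_tag, best_q = default, 0.0
    for part in accept_language.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue  # ignore entries with a malformed weight
        if q > best_q:  # strict comparison: the earliest tag wins ties
            best_tag, best_q = tag, q
    return best_tag


def cache_key(path, accept_language):
    """Hypothetical key format: vary on one language, not the full header."""
    return f"{path}|lang={cache_language(accept_language)}"


if __name__ == "__main__":
    print(cache_key("/page/Foo", "de, en;q=0.5"))
    print(cache_key("/page/Foo", "de, fr;q=0.7, en;q=0.3"))
```

Both headers map to the same key, so the two requests would share one cache entry instead of two.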

Wrt error language, I think we should always include the error in English (lingua franca for software development), and the language(s) specified to the API (ie. the accept language, or both the content language and the user language in the rare case where the API lets both be specified separately). At that point the error language should be mostly unnecessary; maybe completely unnecessary and we could just omit it entirely.

Seems like a good proposal to me (but it would be nice to get wider feedback at some point).

Agreed. In particular, I'd like to hear from the people who are actually responsible for our caching infrastructure. I'll see if I can get some of those voices in here.

Wrt error language, I think we should always include the error in English (lingua franca for software development), and the language(s) specified to the API (ie. the accept language, or both the content language and the user language in the rare case where the API lets both be specified separately). At that point the error language should be mostly unnecessary; maybe completely unnecessary and we could just omit it entirely.

I really like that idea.

Should an API really have to know whether it's being used to construct UI or content, though? It seems like the caller's responsibility to decide which language to use (and as I said above we should move away from the user preferences being invisibly applied on the server side anyway, as it makes the API uncacheable).

Yes, that is how MediaWiki is architected. It's a parser option.

If your API *only* returns a content area, then sure use the content language. If your API *only* returns skin content, then use the interface language. But if your API can possibly return a combination of the two, then you really need to take and expose both language parameters.

Of course there are exceptions, especially where workflow is implemented with templates or in multilingual wikis where editors are explicitly authoring "UX" not "content". And there are some (not really well defined) mechanisms to do that. But I think those should be treated as exceptions, not the rule, and if they end up uncacheable as a result so be it.

I don't think it's reasonable to call them exceptions; it's how multilingual wikis work. That there doesn't exist a non-hacky mechanism for multi-linguality in wikitext that's too dynamic for the Translate extension is a shortcoming of MediaWiki, not a shortcoming of the use case. And API responses for Commons ending up uncacheable (not sure if that's what you meant to say, but that would be my concern about treating it as an exception) is not really an acceptable outcome IMO.

*Generally* the Translate extension works on a *page* basis. A *page* has a given content language, which may or may not match the wiki's content language. You can use transclusions to construct a multilingual page, but the vast majority of our pages do *not* mix multiple language content together in one page.

The main exception, as you point out, being "user interface" code implemented with templates, where the template has all languages mixed together and dynamically switches between them, as was historically done with Commons. But this is pretty broken if you look at the fundamental architecture (T114640, but see also T308487 T109705 T68051). We need to support this better, but the long-term solution (IMO) is to hoist such "interface markup" out of "content pages", and probably to allow localized templates as a first-class feature (T238411). And while multilingual fragments do complicate "fragment rendering" -- I believe that eventually fragments should also have their own content/UX language context, which might not match the parent --- this doesn't change the fact that the core parser only supports "content language" and "user interface language" and all the dynamic stuff happening is keyed off those two languages.

When I said "treat it as an exception" I'm mostly talking about the fragment case -- it's possible that *some* APIs will need to pass "(page) content language", "user interface language", *and* "currently active target language" (distinguished from the page content language) in order to handle those cases where content-handling code needs to swap into a different language to render a fragment -- but that should be an exception /from the API standpoint/, aka an infrequently used parameter, and the majority of our APIs should accept just one or both of the canonical pair of "(page) content language" and "interface language".

Do we want to have "with skin" REST APIs? I would instead expect separate API calls used for content and UI (which have very different caching characteristics), with the client piecing its own "skin" together.

Long-term I agree. In the short term, we might not be able to boil the ocean immediately. DiscussionTools and MobileFrontend rely pretty heavily on the "with skin" APIs IIRC and in particular there are OutputPage hooks which are only triggered when "with skin" is passed. Basically I think of "with skin" as being more "I want to trigger OutputPage hooks" than "I want a skin", since there's a "no skin" skin I can pass to get "content with OutputPage hooks triggered".

I think where that leaves us is that Accept-Language (or whatever mechanism we end up with for "main" language) should both set user language (ie. RequestContext::getLanguage()) for parsing and such, and apply language conversion if that makes sense for the given endpoint. And the error language should be set with a separate mechanism, and ignore RequestContext::getLanguage() (which I think it thankfully does, as REST handlers use the new MessageValue mechanism, not Message).

Again, I don't think it's useful to refer to this as "error language". *User interface language* is better, if the setting can affect stuff returned as "content", and I think that's clearer even if you're talking about localizing HTTP error codes.

I think there's also a "variant" url parameter that can be specified as well, but that's probably a misfeature rather than a feature.

It would be nice to use a single BCP-47 language tag instead of separate base language + variant. But we would have to make sure that language tag is easily accessible to the client, which I don't think is necessarily the case currently.

I'm not sure what this means?

PROPOSAL FOR LANGUAGE SPECIFICATION

  • in the general case, use Accept-Language for all non-API-interface language specification (in practical terms, that means use Accept-Language for everything except error language)
  • document that MediaWiki may use its own fallback mechanism (instead of Accept-Language's)
  • in the special case of endpoints that need to separately specify content and UX language, let the endpoint define its own behavior
  • document that endpoints should generally return either content or UX, and therefore generally only need one language to be specified
  • use a custom Error-Language header for specifying API interface language (aka error language)
  • document that callers should only specify Error-Language when truly needed, as it may have negative effects on performance/caching

I think you still need to clearly distinguish "user-independent responses" from "user-dependent responses". Parsoid was deliberately engineered so that its first-level "cacheable" HTML does not depend on any properties of the User; to the extent that user preferences are reflected in the result (including message localization) this is done in a post-processing pass. Therefore the main Parsoid API response is "user independent". Further, the first-level Parsoid HTML depends only on the page language, and not the current variant etc., so the first-level Parsoid API responses don't need to take a language parameter at all. Then there are two additional endpoints: one does language conversion -- it takes a single parameter, which is currently passed using Accept-Language and is the BCP 47 language code of the desired variant. This is still cacheable, and still user-independent. In this case Accept-Language sets the content language, not the UX language.

However, most legacy endpoints are user-specific. The action=parse endpoint of the legacy action API will (eventually) return /user-specific/ Parsoid HTML which will be generated via postprocessing the cached user-independent Parsoid HTML. It needs to be explicitly designed into the API if the results are not to depend on a particular user.

Probably the smallest possible patch to your proposal is something like:

  • in the special case of endpoints that need to separately specify content and UX language or depend on user preferences, let the endpoint define its own behavior
  • document that endpoints should generally be user independent and return either content or UX, and therefore generally only need one language to be specified
  • document that callers should only depend on user preferences when truly needed, as it may have negative effects on performance/caching

I think the broad stroke API guidance I'm gesturing at is that rather than implicitly depending on all preferences of a user, it is better to put the burden of user preference lookup on the client and APIs to be constructed to narrowly pass those specific options (including user interface language) as headers/parameters. /foo?userlang=de&variant=de-x-formal&thumbsize=200 is cacheable, /foo?user=Bar is not.
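
A small sketch of why explicitly passed options are cache-friendly: if the key is derived only from the path and the options the client sends, two different users with the same options share an entry. The `cache_key` helper and its key format are hypothetical; the parameter names are the ones from the example URL above.

```python
from urllib.parse import urlencode


def cache_key(path, params):
    """Build a canonical key from the path and the explicitly passed options.

    Sorting the parameters makes the key independent of the order in which
    the client happened to send them.
    """
    return path + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    # Two different users whose clients looked up the same preferences.
    alice = {"userlang": "de", "variant": "de-x-formal", "thumbsize": 200}
    bob = {"thumbsize": 200, "variant": "de-x-formal", "userlang": "de"}
    print(cache_key("/foo", alice))
    print(cache_key("/foo", alice) == cache_key("/foo", bob))
```

A key built from `user=Bar` instead would differ for every user, which is the uncacheable case described above.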

(Worth noting that this particular API choice biases for cacheability at the expense of flexibility. If we add some future new "blah" preference to user, it will appear that "user preferences are being ignored" until every single client which could possibly be affected is refactored to explicitly pass &blah=yes/no in its API requests. I personally think that's the right tradeoff, but it is debatable.)

Sorry for slow response, I was on vacation last week.

I think you still need to clearly distinguish "user-independent responses" from "user-dependent responses".

I like that phrasing. This task is specifically about language selection, but in some ways that's just a special case of user dependence and it'd be preferable for whatever we come up with for language selection to at least not be at odds with the larger considerations.

@Tgr mentioned this in an earlier comment, but for some endpoints responses could depend on not only user preferences, but also user privileges, which brings in another layer of complexity.

I think the broad stroke API guidance I'm gesturing at is that rather than implicitly depending on all preferences of a user, it is better to put the burden of user preference lookup on the client and APIs to be constructed to narrowly pass those specific options (including user interface language) as headers/parameters. /foo?userlang=de&variant=de-x-formal&thumbsize=200 is cacheable, /foo?user=Bar is not.

That makes sense for Parsoid's use case. Given that the REST API is an extensible framework that who-knows-what might be built on, including in third-party installs by custom extensions, there may be other situations where different tradeoffs are preferred. Your suggested change to the proposal allows for that, while leading people toward preferred solutions where possible.

Putting that all together, calling out user privileges in addition to user preferences, and incorporating Tgr's suggestion on API interface (aka "error") language, I think we have:

PROPOSAL

  • in the general case, use Accept-Language for language specification
  • document that MediaWiki may use its own fallback mechanism (instead of Accept-Language's)
  • in the special case of endpoints that need to separately specify content and UX language, or be user dependent, let the endpoint define its own behavior
  • document that endpoints should generally be user independent and return either content or UX, and therefore generally only need one language to be specified
  • document that callers should only be user dependent when truly needed, as it may have negative effects on performance/caching
  • return errors in English and any additional language specified by Accept-Language (per MediaWiki's fallback mechanism)
  • document that endpoints that separately specify content and UX language should return errors in all of: English, content language, UX language
  • allow a custom Error-Language header for specifying API interface language (aka error language). This allows specifying an additional language in which errors are returned, without suppressing the default behavior.
  • document that callers should only specify Error-Language when truly needed, as it may have negative effects on performance/caching
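
The error-language bullets above could look roughly like the following sketch: English is always included, followed by the language resolved from Accept-Language and, optionally, the one from Error-Language. The message catalog, the key, the field names and the `build_error` helper are all hypothetical; nothing here reflects an existing REST framework API.

```python
# Hypothetical message catalog, for illustration only.
MESSAGES = {
    "example-not-found": {
        "en": "The page was not found.",
        "de": "Die Seite wurde nicht gefunden.",
    },
}


def build_error(key, requested=(), error_language=None):
    """Return an error body with English plus any other requested languages."""
    languages = ["en"]  # English is always included
    extras = list(requested)
    if error_language:
        extras.append(error_language)
    for code in extras:
        if code not in languages:
            languages.append(code)
    translations = MESSAGES[key]
    return {
        "errorKey": key,
        # Languages without a translation are skipped; English still applies.
        "messages": {c: translations[c] for c in languages if c in translations},
    }


if __name__ == "__main__":
    print(build_error("example-not-found", requested=["de"]))
    print(build_error("example-not-found", requested=["fr"]))
```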

NOTES

  • user dependence includes both user preferences and user privileges. Where user preferences affect the response, it is preferred for the client to pass them as parameters. This makes the endpoint depend only on the parameters passed, rather than actually depending on the user. This facilitates caching.
  • for security reasons, user privileges must be verified on the server side rather than being passed by the client. This may have negative caching implications.
  • while cache-friendly implementations are strongly preferred, handlers should not assume any particular caching behavior. In particular, it is still the responsibility of handlers to be as efficient as possible, even if they are also cache-friendly.

Final thought: I'm tempted to drop the Error-Language bit completely, as it is starting to feel unnecessary, and the equivalent functionality in Action API is almost never used. Nothing in the proposal prohibits adding Error-Language later should we discover that we do indeed need it.

@BPirkle I'm throwing in a couple random thoughts I have from our conversation. I haven't read the whole conversation above, so apologies if these appear a bit disconnected or repeat things that were already said.

First, about the pros and cons of forcing the client to provide the language explicitly even when they only want to use the user's language. I'm not sure how a thing such as "use the user's language" would be coded (special header, special value of some header etc.), but in general it seems something extra that would have to be special-cased. So I think maybe, at first, we could disallow it. Adding it later on (assuming that people will need it) seems easier than adding it now and dealing with all the caching issues etc. Also, clients have a way to grab the user preference anyway.

As for returning raw message keys instead of formatted errors: I'm not sure if it's a good idea. It might be acceptably good in the context of MediaWiki, but I'm not sure if it'd really be useful for other clients. Even in the most basic case (simple message with no parameters and no formatting), they'd still need to make another API request (to api.php, action=query&meta=allmessages) to obtain the message text, which could be cumbersome. If you start factoring in things like different message formats (e.g., whether to parse it as wikitext), and then parameter replacement, with many different parameter types, I'm not sure how clean the end result would be for the client -- unless the client is MediaWiki, that is. I guess everything should be possible by using meta=allmessages, but possible doesn't mean good.

Thank you so much, @Daimona.
I know it will not be the best option for other clients, but as far as I can see it would be good for our extension and for any other extension that uses the API to make requests and wants to show internationalized messages to its users. I still think it would make sense to have this option as well; in my mind it would be easier, and it is just one more parameter to be returned by the API, something like:

{
    "messageKey": "message_key",
    "messageTranslations": {
        "en": "A message in English",
        "fr": "Un message en Français"
    }
}

What do you think? @BPirkle @Daimona

Thank you so much, @Daimona.
I know it will not be the best option for other clients, but as far as I can see it would be good to be used by our extension and any other extension that will use the API to make requests and wants to give an internationalization message to the users.

I'm still not convinced that it is a good idea. If we implement something permanent (i.e., a long-term solution) in a way that we already know won't work that well for certain clients, that's a red(dish) flag for me. Our use case is not special in any way, and I believe that the framework should support all clients, not just MW clients. Sure, that solution could be implemented together with something else that works for everyone, but then I guess it would just be unnecessary extra work.

I still think it would make sense to have this option as well, in my mind it would be easier, and it is just one more parameter to be returned by API, something like: [...]

(emphasis mine)

No, it's not. You also have to include information about whether the message should be passed through the wikitext parser, and a list of parameters to replace into the message, each one with its own (optional) type information (plaintext, number, raw, date, list, etc.). Then the client would have to apply any transformation to the parameters and pass everything back to the allmessages module (whose capabilities are mostly unknown to me). This would probably already be sufficiently ugly in MW clients, without even having to think about other clients.
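As a rough illustration of how much more than a key the response would need to carry, the sketch below models such a payload. Every field name here (`parse_as_wikitext`, `params`, `type`) is hypothetical and not an existing or proposed format; it only mirrors the pieces of information listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MessageParam:
    value: str
    # Hypothetical type tag: plaintext, number, raw, date, list, ...
    type: str = "plaintext"


@dataclass
class MessageSpec:
    key: str
    # Whether the client must run the text through the wikitext parser.
    parse_as_wikitext: bool = False
    params: list = field(default_factory=list)


spec = MessageSpec(
    key="message_key",
    parse_as_wikitext=True,
    params=[MessageParam("Example"), MessageParam("3", "number")],
)

# asdict() converts nested dataclasses, so the result is JSON-serializable.
payload = json.dumps(asdict(spec), indent=4)
print(payload)
```

The client would then still have to interpret each parameter type itself before it could render anything.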

Adding some Traffic Team members for visibility and feedback per Slack discussion.

From a caching perspective, splitting the cache by language has a negative impact on hit rate and fragmentation. Considering this is an API (hence not to be consumed by users directly), translating/duplicating the error message (English + content language) will negatively impact response size and debuggability. Please note that server-side errors (status >= 500) are never cached, so those responses will always be returned from the API origin server through the CDN. Errors triggered by some kind of issue with the request (400 <= status <= 499) can potentially be cached, so splitting these by content and/or error language is also harmful.

Varying on Accept-Language (Vary: Accept-Language) requires some aggressive normalization to avoid hurting cache performance badly. As an example, the current normalization performed for /api/rest_v1/ takes the first language variant and ignores the rest, so something like Accept-Language: sv-SE,sv;q=0.9,en-US;q=0.8,en;q=0.7,da;q=0.6,nb;q=0.5 gets normalized to Accept-Language: sv-se (more context available in T195327 / https://gerrit.wikimedia.org/r/c/operations/puppet/+/434558).
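The normalization described above can be approximated in a few lines. This is only a sketch of the described behaviour (keep the first variant, drop everything else, lowercase it), not the actual rule deployed in the caching layer; the function name is made up for the example.

```python
def normalize_accept_language(header):
    """Keep only the first language variant of an Accept-Language value."""
    # Everything after the first comma is ignored, including q-values.
    first = header.split(",", 1)[0]
    # Drop any ";q=..." weight attached to the first entry itself.
    tag = first.split(";", 1)[0].strip()
    return tag.lower()


header = "sv-SE,sv;q=0.9,en-US;q=0.8,en;q=0.7,da;q=0.6,nb;q=0.5"
print(normalize_accept_language(header))  # sv-se
```

Note that the q-values play no role here, so a client listing a low-priority language first would get that language.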

It's also worth mentioning that the backend caching layer (ATS [Apache Traffic Server]) currently has a limit of 5 alternates per cache key, so it would cache only 5 different values of Accept-Language, assuming that's the only header listed in Vary.

My personal opinion is that the error language bit should be dropped for the reasons given above.
Regarding language selection, I'd encourage explicit language selection per request (something like /foo?lang=de). This will:

  • Avoid unexpected results from the client's point of view due to unexpected Accept-Language header normalization
  • Help portability of the API, as it won't depend on custom optimizations implemented in WMF's caching layer.
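A small sketch of why the explicit parameter is friendlier to caches: the language becomes part of the URL, so each language gets its own cache key without any Vary handling. The `/foo` path and `lang` parameter come from the example above; the helper names and the assumption that a cache keys on path plus query string are illustrative only.

```python
from urllib.parse import urlencode, urlsplit


def with_language(url, lang):
    """Append an explicit lang parameter to a request URL."""
    sep = "&" if urlsplit(url).query else "?"
    return f"{url}{sep}{urlencode({'lang': lang})}"


def cache_key(url):
    # Assumption: the cache keys on path and query string only,
    # ignoring request headers such as Accept-Language.
    parts = urlsplit(url)
    return f"{parts.path}?{parts.query}"


german = with_language("/foo", "de")
french = with_language("/foo", "fr")
print(german, french)
print(cache_key(german) != cache_key(french))
```

With this shape, the client always knows which language it asked for, and no header normalization sits between the request and the result.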