
RFC: Support language variants in the REST API
Closed, ResolvedPublic

Description

Status: Accepted per ArchCom meeting 2017-03-01, based on the discussion on this task.

Currently there is no way to specify a language variant via the REST API, and without knowing the specific language variant, Parsoid cannot produce exactly the same content as the Wikipedia page, e.g.:
https://zh.wikipedia.org/api/rest_v1/page/html/%E4%B8%AD%E5%9C%8B

The response contains raw wikitext conversion markup ("-{ }-"), for example:

-{H|zh:繁体字;zh-cn:繁体字;zh-tw:正體字;zh-hk:繁體字;zh-mo:繁體字;zh-sg:繁体字;}-

What we want is to allow the client to specify the language variant explicitly, so that Parsoid returns the exact same content as displayed on Wikipedia, e.g.
https://zh.wikipedia.org/api/rest_v1/page/html/%E4%B8%AD%E5%9C%8B?lang=zh-cn
will return the same content as:
https://zh.wikipedia.org/zh-cn/%E4%B8%AD%E5%9C%8B
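To make the problem concrete, here is a toy sketch of what variant conversion does with an inline rule like the one above. This is not MediaWiki's actual LanguageConverter (which supports many more flags and rule forms); it only illustrates the per-variant selection the REST API currently leaves unapplied:

```python
def select_variant(rule: str, variant: str) -> str:
    """Toy parser for an inline LanguageConverter rule such as
    '-{H|zh-cn:...;zh-tw:...;}-'. Picks the text for the requested
    variant; the real converter handles far more syntax than this."""
    body = rule.strip()
    if body.startswith("-{") and body.endswith("}-"):
        body = body[2:-2]
    if "|" in body:                     # drop conversion flags like 'H'
        body = body.split("|", 1)[1]
    mapping = {}
    for part in body.split(";"):
        if ":" in part:
            code, text = part.split(":", 1)
            mapping[code.strip()] = text.strip()
    return mapping.get(variant, rule)   # unknown variant: raw markup

rule = "-{H|zh:繁体字;zh-cn:繁体字;zh-tw:正體字;zh-hk:繁體字;zh-mo:繁體字;zh-sg:繁体字;}-"
```

With a variant selected, `select_variant(rule, "zh-tw")` yields the Traditional Chinese (Taiwan) text; without one, the client is stuck with the raw markup shown above.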

Requirements

  • Continue to expose the original content, without variant conversions applied (as is the case right now).
  • Additionally, offer content with variant conversions applied for read-only use cases.
  • Follow the general REST API philosophy:
    • Play well with caching.
    • Predictable and simple request construction.

Candidate solutions



1. Domains

The REST API is very much built around domains as the primary means of selecting project, storage & general configuration. As such, it would be fairly straightforward to assign separate domains to variants. Examples:

  • `zh.wikipedia.org/api/rest_v1/..`: Un-translated content. Used for editing.
  • `zh-cn.wikipedia.org/api/rest_v1/..`: Simplified Chinese. Read-only.
  • `zh-tw.wikipedia.org/api/rest_v1/..`: Traditional Chinese. Read-only.

Considerations

  • Wildcard certs are tied to a single sub-domain level, so introducing a second level for variants (ex: `cn.zh.wikipedia.org`) would not be easy.

Advantages

  • Simple to implement in the REST API; does not require Varnish changes.

Disadvantages

  • Requires new domains.
  • Does not support listings of variants.

2. Path prefixes

Instead of using domains, use special path prefixes to select variants. The REST API currently uses /api/rest_v1/, which makes fitting variants into this scheme a bit awkward. T114662 proposes a scheme like /wiki-cn/, which could be adapted to /api-cn/.

The Chinese Wikipedia currently replaces /wiki/ with the variant, as in zh.wikipedia.org/zh-cn/Sometitle. Fitting the API into this scheme without conflicts is tricky. The best I can think of is zh.wikipedia.org/api/zh-cn/Sometitle.

Alternatively, a scheme like https://{domain}{/variant}/api/rest_v1/ could also be used. Note the optional {variant} part: if it is missing, no variant is applied.
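A sketch of how a client would construct URLs under this optional-variant scheme (purely illustrative; this candidate was not the one ultimately accepted, and the path layout here is an assumption based on the pattern above):

```python
from typing import Optional
from urllib.parse import quote

def rest_url(domain: str, title: str, variant: Optional[str] = None) -> str:
    """Build a page-HTML URL under the {domain}{/variant}/api/rest_v1
    scheme; omitting 'variant' requests the unconverted content."""
    prefix = f"/{variant}" if variant else ""
    return f"https://{domain}{prefix}/api/rest_v1/page/html/{quote(title, safe='')}"
```

For example, `rest_url("zh.wikipedia.org", "中國", "zh-cn")` yields the variant URL, while dropping the third argument yields the plain editing URL.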

Advantages
  • Closer to current usage on Chinese Wikipedia.
Disadvantages
  • Does not really support listings of variants either.
  • Overloads root path namespace, opening the door to conflicts or less-than-obvious variant path names.

3. Accept-language header

Use the standard Accept-Language header to select content languages. To avoid cache fragmentation, normalize the Accept-Language header in Varnish, so that only meaningful values are considered & varied on.
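The normalization step might look roughly like this. This is only a sketch of the logic (simplified q-value handling, per-wiki `supported` set assumed); in production it would live in Varnish VCL, not application code:

```python
from typing import Optional

def normalize_accept_language(header: str, supported: set) -> Optional[str]:
    """Collapse an arbitrary Accept-Language header to one of the few
    values the cache actually varies on, or None for the default
    (unconverted) content."""
    ranked = []
    for pos, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        lang = parts[0].strip().lower()
        q = 1.0
        for param in parts[1:]:
            param = param.strip()
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        ranked.append((-q, pos, lang))
    for _, _, lang in sorted(ranked):   # highest q first, stable order
        if lang in supported:
            return lang
    return None
```

A browser sending `zh-TW,zh;q=0.9,en;q=0.8` would be normalized to `zh-tw` on zhwiki, while `en-US,en;q=0.9` would fall through to the default content, so the cache only ever fragments on the handful of meaningful values.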

Advantages
  • Established standard (when using accept-language).
  • Usually, automatically does the right thing for reading (more common than editing).
  • For end user links, avoids sharing / construction of broken URLs (see several comments in T114662).
  • Avoids fragmenting the API documentation by language, but requires more documentation for API subsetting. Swagger can support the accept-language header with value dropdowns (as with accept).
  • Relatively easy to support across end points. Does not require URL layout changes.
Disadvantages
  • Can be harder to debug / less obvious.
  • Needs to be unset to be sure that content is editable. However, this is easy to do in XHR / fetch (CORS whitelisted).
  • Requires more documentation on supported languages in individual API end points.

Proposal

Accept-Language headers and paths are not mutually exclusive. Even when using path based selection primarily, we will want to set up redirects using Accept-Language. This suggests the following pragmatic approach for the REST API:

  • Start by supporting Accept-Language headers in the REST API.
    • Normalize Accept-Language headers in Varnish, and vary on it.
    • Document and support Accept-Language header use in REST API.
  • Consider adding explicit URLs at a later point, once / if we have established a uniform language selection URL scheme (see T114662). For caching purposes, URL requests can be rewritten to Accept-Language requests, or vice versa.

See also

Related Objects

Status | Subtype | Assigned
Invalid | None |
Resolved | | GWicke

Event Timeline


there is also a cantonese-specific wiki (yue.wikipedia.org)

This is https://zh-yue.wikipedia.org/, in line with RFC 5646.

That is, the zhwiki community can elect to support a cantonese variant (yue) and there is also a cantonese-specific wiki (yue.wikipedia.org).

From a user perspective, having several different wikipedias cover the same language & variant sounds pretty confusing. It is not clear to me which user benefits would outweigh the costs of such duplication.

I'd like to make sure I understand this.

  • Candidate solution #1: Domains
    • assign separate domains to variants
  • Candidate solution #2: Path prefixes
    • use special path prefixes to select variants

Are these the only two viable choices with support, or are there others? @cscott: your comment appears to propose "candidate solution #3", and also, with "Alternatively...", candidate solution #4.

I updated the description in E168, suggesting that our 2016-04-27 meeting will be a "field narrowing" discussion. Let's endeavor to come away from that meeting with a clearer understanding of what the viable options are, and update the description of this task to make the choices clearer.

@RobLa-WMF I find candidate solution #1 a non-starter. The rest.wikimedia.org endpoints have been deprecated, and as outlined above there are some political reasons why multiple different wikis cover the same language codes, either natively or via variant conversion. (And consider simple.wikipedia.org, and what happens if both it and en.wikipedia.org get an en-gb or en-piglatin variant.)

I proposed "candidate solution #3" (special endpoint for variant conversion) and "candidate solution #4" (query string). Of these, @GWicke has indicated that query strings are problematic from a caching standpoint: the query parameters need to always occur in exactly the same order for the contents to be cached. We want variant content to be reliably cached.

So I think the remaining candidates are #2 and #3, and I believe that we should be able to reach consensus on one or the other.

To summarize @cscott's last comment, the solutions suggested in this RFC are:

  • Candidate solution #1 - Domains
  • Candidate solution #2 - Path prefixes
    • Possible?
  • Candidate solution #3 - special endpoint for variant conversion
    • Possible?
  • Candidate solution #4 - query string

Is that the consensus?

There are (at least) three different things that could be meant when a client requests X in languages foo and bar:

  1. give me content about X in foo and bar. The content may be totally different, like Wikipedia pages about the same subject in different languages.
  2. give me the translation of X in foo and bar. The content of the pages may diverge, but the intention is that they are faithful translations. This is what the translate extension does. It's how system messages and translatable help pages work.
  3. give me the content on X represented as foo and bar: this denotes a transformation, either of language neutral content (as on wikidata), or of content in a base language (as we do for language variants).

These three things should be distinguishable from looking at the URL we use to request them. Mixing them is going to be painful, I think. As far as I understand, this RFC is mainly concerned with the third point: the target language / variant transformation.

@daniel, your points 2) and 3) mostly overlap, I think. For example, the language variant conversion tool uses inline markup to select different content depending on the requested content language, not unlike the translate extension. The main difference is that one is more aimed at variants of a single language (and provides some general tools for this beyond what's available in translate), while the other limits itself to translations between major languages.

To a user interested in content in language (and variant) X-Y, the underlying mechanisms probably don't matter beyond whether the content is editable or not.

Your case 1) is currently handled by different domains & separate requests. We do not expose content from different projects in a single API.

We just finished discussing this in E168 (transcript: P2968). @cscott just posted this recap

Recapping some discussion in E168: RFC Meeting: Support language variants in the REST API (2016-04-27, #wikimedia-office), there's the question of whether the "target language" and "user interface language" need to be distinct and/or specified separately. In T114640: make Parser::getTargetLanguage aware of multilingual wikis we let {{int}} expand to a label localized in the UX language, independent of the targeted language variant (which is set via the initial path prefix or user preference).

My strawman example is a user on zhwiki who has a target variant set to zh-hant but has the UX language (image metadata labels, {{int}} output, page UI) set to, say, de. Is this something we ought to account for? If you specify de alone, is language converter just turned off? (The result is an incomprehensible mix of character sets and variant terms.) Or do we fall back to some default (politics alert) and acknowledge this is nonideal but it's a corner case and unusual in practice?

In that meeting, @tstarling asked @GWicke to update the RFC to clarify the viable alternatives we are considering (candidate solutions #2 and #3 above, I believe). @daniel had broader concerns about URL structure described in T122942#2244990, which @cscott agreed was a valid concern, so I suggested it should be a separate RFC.

Any progress on this? This is blocking the apps from using RESTBase for zhwiki, as indicated by the three tasks above that mention this task. Until this is resolved, we would effectively need two code paths for retrieving page content: one for zhwiki and another for the rest.

Yes, @cscott has a patch in gerrit that needs finalizing, review, merge, and deploy. Next steps are to fix bugs and enable the variant fetch endpoint. Timeline depends on what kind of things we trip on, but we expect some (if not all) parts of this to be deployed by end of year.

Excellent! Which one of the candidate solutions is it going to be (#2 or #3)? Is there another task where the patch is linked to? Or should it be linked here? I'm not sure about the latter since this is an RFC.

Yes, @cscott has a patch in gerrit that needs finalizing, review, merge, and deploy. Next steps are to fix bugs and enable the variant fetch endpoint.

That's good to hear.

Timeline depends on what kind of things we trip on, but we expect some (if not all) parts of this to be deployed by end of year.

This presumably means setting up APIs for those variants, as well as change propagation rules that update each variant's content after updates to the source language. We should have a conversation about this some time soon, so that we can adjust our planning for this quarter as well.

In T122942#2244988, @RobLa-WMF wrote (on 2016-04-27):

To summarize @cscott's last comment (T122942#2241839), the solutions suggested in this RFC are:

  • Candidate solution #1 - Domains
  • Candidate solution #2 - Path prefixes
    • Possible?
  • Candidate solution #3 - special endpoint for variant conversion
    • Possible?
  • Candidate solution #4 - query string

Excellent! Which one of the candidate solutions is it going to be (#2 or #3)? Is there another task where the patch is linked to? Or should it be linked here?

It would be helpful to have the link. I'm very interested in the result as well.

Great to hear there's forward progress on this!

https://gerrit.wikimedia.org/r/#/c/140235/ (and couple followups) are the WIP patches.

These patches don't yet implement any variant rendering API scheme -- that is the next step once the actual parsing of language variant markup and any associated tweaks are done. The actual API rendering scheme is a followup which will involve services and/or another RFC -- hence my qualifier "Timeline depends on what kind of things we trip on".

Sounds like we're still open to deciding between #2 and #3. The two look quite similar to me, in the sense that both basically require a different path in the endpoint from the standard structure. The difference is really where in the endpoint hierarchy the variant is located.

#2a) zh.wikipedia.org/api/zh-hans/rest_v1/html/{title}
#2b) zh.wikipedia.org/api/rest_v1/zh-hans/html/{title} (swapped rest_v1 and zh-hans)

#3a) zh.wikipedia.org/api/rest_v1/variant/html/zh-hans/{title}
#3b) zh.wikipedia.org/api/rest_v1/variant/zh-hans/html/{title} (swapped html and zh-hans)

(I hope I've translated GET /variant/html/{languagecode}/{title}/{revision}{/tid} correctly for #3a.)

Is there another significant difference I missed?
The zh-hans is only an example. I'm not sure if it's really necessary to repeat the zh part.

We will discuss the various options in the Services team, and formulate a plan for exposing variants across all REST end points.

This problem is coming up quite regularly in different contexts, so we really need to move on this soon. A recent example (for the action API in that case) is https://lists.wikimedia.org/pipermail/mediawiki-api/2017-January/003882.html.

From the Reading perspective:

API consistency is pretty high on the list of priorities. Having a consistent way of constructing API URLs makes for cleaner code and simplifies the use of the API.

Saying that, placing the language variant in the domain makes the most sense:

zh.wikipedia.org/api/rest_v1/..: Un-translated content. Used for editing.
zh-cn.wikipedia.org/api/rest_v1/..: Simplified Chinese. Read-only.
zh-tw.wikipedia.org/api/rest_v1/..: Traditional Chinese. Read-only.

Currently, the iOS app must special-case variants, which adds a lot of code complexity.

Less concrete… this also seems natural and understandable IMO.

I would propose to move on option 2 (path prefixes) with the variant being optional:

  • untranslated: https://zh.wp.org/api/rest_v1, https://zh.wp.org/wiki/Foo
  • simplified: https://zh.wp.org/zh-cn/api/rest_v1, https://zh.wp.org/zh-cn/wiki/Foo
  • traditional: https://zh.wp.org/zh-tw/api/rest_v1, https://zh.wp.org/zh-tw/wiki/Foo

I think we need to settle on a solution ASAP. What do you guys think? Can we have the ArchCom push this forward?

how about https://zh.wp.org/var/cn/api/rest_v1 and https://zh.wp.org/var/cn/wiki/Foo?

@Fjalapeno Variant in the domain is a nonstarter. We have single wikis which support multiple languages (and hence variants) like commons and meta. We also have political differences between projects over which should be responsible for a particular language (ie, is cantonese a "dialect" of zhwiki, or its own wiki?).

@mobrovac, @Arlolra It seems very odd to me that the variant comes at the root of the path, before /api/. Is that just to be "compatible" with the way path-rewriting works for the /wiki/ paths? But in fact the path is https://zh.wp.org/zh-cn/Foo -- it completely replaces the /wiki/ part. I don't think that's great. I don't think we need to be compatible with it.

All the options proposed by @bearND above seem reasonable to me, although I'm not too fond of #2a. It seems that /api/rest_v1 ought to be at the root, since it gives the essential context for the request (it's to the REST v1 API).

how about https://zh.wp.org/var/cn/api/rest_v1 and https://zh.wp.org/var/cn/wiki/Foo?

Why /var/? The variant itself can be easily deduced, since the other possibilities that can appear at that level are fixed: api, w or wiki, none of which corresponds to any known language variant code in any standard.

@cscott I would prefer the variant to be at the root because that way we have a uniform way of expressing it for everything. /zh-tw/w/api.php for example. Then, it is up to the Action API to respect it or not. Furthermore, if we put it after /api/rest_v1/ then we have to find a way to cover /wiki/ and /w/*.php, which is non-trivial. Also, /wiki/zh-tw/Foo: is that an article named zh-tw/Foo or the article Foo in the zh-tw variant?

Why /var/ ?

It's shorter than /variant/ and explicit. It's also as many characters as zh- which was removed as redundant with the subdomain.

The variant itself can be easily deduced since the other possibilities that can appear at that level are fixed: api, w or wiki, neither of which correspond to any known language variant code in any standard.

But is that the full set we're ever going to want in the first position of the path? Some level of future proofing.

Why /var/ ?

It's shorter than /variant/ and explicit. It's also as many characters as zh- which was removed as redundant with the subdomain.

Hehe, my point was that IMHO it's superfluous altogether.

The variant itself can be easily deduced since the other possibilities that can appear at that level are fixed: api, w or wiki, neither of which correspond to any known language variant code in any standard.

But is that the full set we're ever going to want in the first position of the path? Some level of future proofing.

Probably not, but I find it hard to believe MW would have a need for an en-gb path :) Ofc, I might be wrong, but this is why I think we should put the full code of the variant, i.e. use zh-tw and not just tw, making it extremely unlikely the paths would ever conflict.

That said, I wouldn't be opposed to your /var/ idea as it makes things slightly more explicit.

There are variants which don't share the prefix of the host wiki. For example, the host language for commons is en, but LanguageConverter is enabled, and you can have pages where the page language is zh and you need to select a particular variant of that. commons.wikipedia.org/zh-tw/...

There's no guarantee that api won't eventually be a valid language code for WP; eg https://www.ethnologue.com/language/api . The whole /wiki/ prefix is really bad design in any case: meaningless and not localized. But it's the only "human exposed" part of the URL scheme. I think we can reasonably make it a special case---humans are worth it---and hide the parts important only to machines behind a more logical (if verbose) scheme. Let's not boil the oceans: it makes sense to me to concentrate on the part of the URL after the magic /api/ prefix and leave the harder problem of human-visible URLs for another day.

We're only looking for a solution for the REST API. User-facing pages and the Action API already have their own mechanisms to specify variant information. Let's not break what's working already.

@Fjalapeno Variant in the domain is a nonstarter. We have single wikis which support multiple languages (and hence variants) like commons and meta.

@cscott I'm not sure why this is a non-starter. Commons and meta are special cases of wikis which support languages differently than most other Wikipedia projects. Not sure why these special cases should dictate our design choices here.

We also have political differences between projects over which should be responsible for a particular language (ie, is cantonese a "dialect" of zhwiki, or its own wiki?).

@cscott Possibly, but dismissing it out of hand seems excessive. Nothing about this seems intractable if we think this is the better designed solution.

It still seems to me that we have an opportunity to design the API to be sensical from the ground up here with the REST API. It seems like a perfect time to shed some legacy issues that stem from the original design choices of how variants were implemented.

From what I can see the first option of using domains as @GWicke proposed makes perfect sense from a purely technical perspective.

I'm not saying I can't be convinced otherwise, but simply dismissing anything contrary to how things have always been AND political reasons don't seem like great reasons to inform the design here.

@Fjalapeno Variant in the domain is a nonstarter. We have single wikis which support multiple languages (and hence variants) like commons and meta.

@cscott I'm not sure why this is a non-starter. Commons and meta are special cases of wikis which support languages differently than most other Wikipedia projects. Not sure why these special cases should dictate our design choices here.

I also wouldn't define domains a non-starter, but @cscott does have a good point about these special wikis. If we were to use sub-domains, how would we tackle such wikis? Also, what about 3rd party users that would like to have the same functionality? We would have to come up with a different scheme for them. Note that while en.commons.wm.org might seem like a viable option, it really isn't because our TLS certificates do not cover such cases.

We also have political differences between projects over which should be responsible for a particular language (ie, is cantonese a "dialect" of zhwiki, or its own wiki?).

@cscott Possibly, but dismissing it out of hand seems excessive. Nothing about this seems intractable if we think this is the better designed solution.

Euh, politics, politics, politics. Even though we are here to make the best possible (technical) solutions to preserve and advance the communities and their projects, I would be reserved when it comes to decisions like this. It really is up to the communities to decide for themselves. I am not saying this is a good thing, but it's a social and organisational debt I don't think we can attack from a technical perspective.

It still seems to me that we have an opportunity to design the API to be sensical from the ground up here with the REST API. It seems like a perfect time to shed some legacy issues that stem from the original design choices of how variants were implemented.

+1, hence the long bike-shedding process :) This is also the reason why I don't agree with @cscott in:

We're only looking for a solution for the REST API. User-facing pages and the Action API already have their own mechanisms to specify variant information. Let's not break what's working already.

I fail to see how providing a consistent way of accessing language variants across the board breaks anything. As you point out, these already have their mechanisms to specify the variant, which means that all we would need to do is map the new layout to the existing functionality. IMHO, a small price to pay for achieving consistency.

To sum up, as previously stated, we really need to move on this ASAP, and we need to find a balance between having the right solution(TM) and advancement. I propose we move with @Arlolra's proposal of having en.wp.org/var/{variant}/[api|wiki|w]/....

Ping? Any objections/comments on:

I propose we move with @Arlolra's proposal of having en.wp.org/var/{variant}/[api|wiki|w]/....

Commons, wikidata and meta are interesting cases to consider. All of these can serve content in different languages, but there is actually no way to select the *content language* separately from the user interface language (uselang query string parameter). T114662 is discussing options for addressing this, with the main proposals being a path prefix, or some header based content negotiation. Also, note that none of these wikis is concerned with language *variants*.

This brings up a few questions in the context of this task:

  1. Should we distinguish between language variants and content language?
  2. Should we clearly communicate which languages are available to avoid cache fragmentation, or should we encourage content negotiation, and avoid fragmentation server side by normalizing requests in Varnish?
  3. Should we use URL-based addressing (path or domain), or less intrusive methods like headers?
  4. For API responses in particular, how do we clearly communicate which responses are editable?

My take on 1) is that I have a hard time imagining users actually caring about this distinction. I would much prefer to frame this consistently as a "content language" selection. If we go with path prefixes, I would prefer something like "/lang/" or "/content-language/" over "/var/".

For 2), it seems that a listing of supported languages would be desirable, but it would also complicate clients, especially those interacting with several projects. For most projects, only the regular project language will be supported, so in most cases only a single option will be returned. At least for the main language, we will likely want to automatically normalize language selections in Varnish, so that we would not fragment on an explicit request for the "en" content-language on en.wikipedia.org. From there, it seems like a relatively small step to fully normalize "accept-language" headers sent by most browsers. This header is CORS-whitelisted (as is accept), and is sent by default by all browsers, based on operating system and user preferences.

Resolving 3) (headers vs. URL) seems to be quite complex. Here are some pros / cons of using headers over URLs:

  • + Established standard (when using [accept-language](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language)).
  • + Usually, automatically does the right thing for reading (more common than editing).
  • + For end user links, avoids sharing / construction of broken URLs (see several comments in T114662).
  • - Can be harder to debug / less obvious.
  • +- Needs to be unset to be sure that content is editable. However, this is easy to do in XHR / fetch (CORS whitelisted).
  • +- Avoids fragmenting the API documentation by language, but requires more documentation for API subsetting. Swagger can support the accept-language header with value dropdowns (as with accept).
  • + Relatively easy to support across end points. Does not require URL layout changes.

While I didn't initially think the use of headers was worth considering, I am increasingly warming up to the idea. Especially the ease of retrofitting header support without URL layout changes looks attractive.

On indicating editability (4): In some cases like wikidata or commons, it seems conceivable that all languages would be equally editable. In others (Chinese language variants), editing in a specific content language seems to be impossible to support. To allow clients to make the right call, the best I can think of at this point would be to a) hide edit end points in specific content languages (if we choose different URLs), or b) add a response header indicating whether the content is editable. Another solution could be to always disallow editing when a non-default content language was selected. This would improve predictability for clients, but could cost some usability.
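Option (b) could be consumed client-side roughly like this. The header name `X-Content-Editable` is purely hypothetical; this RFC did not standardize one:

```python
def is_editable(response_headers: dict) -> bool:
    """Sketch of option (b): the API adds a response header telling
    clients whether the returned content round-trips through editing.
    'X-Content-Editable' is a hypothetical name, not a real header."""
    return response_headers.get("X-Content-Editable", "true").lower() != "false"
```

A client would then refuse to feed converted (read-only) content back into an edit flow whenever the header says `false`.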

Another solution could be to always disallow editing when a non-default content language was selected. This would improve predictability for clients, but could cost some usability.

re "cost some usability", I guess it's not very hard to mitigate the cost by working with the community on something like a browser plugin? @liangent

In an offline conversation @daniel mentioned the possibility of supporting both Accept-Language headers *and* path selection, at least for some applications mentioned in T114662. This could perhaps be a pragmatic way to make progress on this now:

  • Start by supporting Accept-Language headers in the REST API.
  • Consider adding explicit URLs at a later point, once / if we have established a uniform language selection URL scheme. For caching purposes, URL requests can be rewritten to Accept-Language requests, or vice versa.

I like that strategy. It allows us to progress on the issue at hand, while at the same time it doesn't prevent the REST API from adopting a URI scheme compatible with the outcome of T114662: RFC: Per-language URLs for multilingual wiki pages at a later time. Currently, the activities being blocked are related to programmatic clients, such as MCS, which are able to provide the language variant as a header.

3-2-1 sold?

I updated the description with a recommendation to start by using Accept-Language headers, reflecting the recent discussion.

Per the architecture committee meeting today this RFC is entering its one-week last call period. Please speak up now if you have any concerns or +1s to share. The ArchCom will review the discussion on this task in its meeting on February 23rd, and will either decide based on it, or extend the discussion if more time is needed.

I agree Accept-Language is the way to go for now. It preserves the orthogonality of *project* and *language* (ie, zhwiki/zh.wikipedia.org is the name of a project, like commons/commons.wikipedia.org and zhwikisource/zh.wikisource.org are; some projects have content in multiple languages/variants). It does (a) conflate content language and variant, as @GWicke mentioned in his point (1) above, and (b) conflate user-interface language with both of these (see discussion at T114662#2245197). Point (b) would be especially problematic for end-user links, but I think is unobjectionable for an API which is not intended to return user-interface elements of the page. (And end-user links have other issues with Accept-Language, as gwicke mentions above.) Point (a) ought to be *mostly* unproblematic: zh-yue is the exception here, as @GWicke mentioned above, since it is a "language" (on zh_yuewiki) but has the appearance of a zhwiki variant. However, the intent is to distinguish these: yue is the language code (see T30441: Rename zh-yue -> yue and InitialiseSettings.php where the language code for zh-yue.wikipedia.org is forced to be yue), and in mediawiki the variant on zhwiki is actually named zh-hk, not zh-yue (although RFC 5646 would allow either). So in practice Accept-Language: yue would be the language and Accept-Language: zh-hk would be the variant of zh. We do have to be a little careful about naively stripping the dash from a language code to get the base language; be-tarask is a language, not a variant of be; see https://meta.wikimedia.org/wiki/Special_language_codes for some other similar exceptions, including de-formal.
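The caveat about naively stripping the dash can be sketched like so. The exception list is partial and illustrative, drawn from the codes mentioned above (see the "Special language codes" page for the full picture):

```python
# Codes that contain a dash but are languages in their own right,
# not variants of the prefix language (partial, illustrative list).
NOT_VARIANTS = {"be-tarask", "de-formal", "zh-yue"}

def base_language(code: str) -> str:
    """Return the base language for a variant code, without naively
    stripping the suffix from codes that are full languages."""
    code = code.lower()
    if code in NOT_VARIANTS or "-" not in code:
        return code
    return code.split("-", 1)[0]
```

So `zh-hk` resolves to base language `zh`, while `be-tarask` stays whole instead of wrongly collapsing to `be`.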

Returning to point (a) briefly -- note that some templates on commons generate parts of the page which are considered "user interface" and should match the user interface language. That is discussed in T114662: RFC: Per-language URLs for multilingual wiki pages. PHP uses a cookie AIUI to set user interface language independent of the page language and variant selected, but this is a mess currently and what T114662 was primarily intended to address. I think that issue is separable from this task. I would certainly prefer to see the UI parts of commons content factored out into some other mechanism, so that content APIs like rest could remain blissfully ignorant of UI language and other UI preferences, and my understanding is that the current work to migrate commons metadata into wikidata will more or less accomplish this.

So let's go for Accept-Language and punt the harder URL path issues into task T114662.

+1

I updated the description with a recommendation to start by using Accept-Language headers, reflecting the recent discussion.

Per the architecture committee meeting today this RFC is entering its one-week last call period. Please speak up now if you have any concerns or +1s to share. The ArchCom will review the discussion on this task in its meeting on February 23rd, and will either decide based on it, or extend the discussion if more time is needed.

+1 for using Accept-Language header. That should work well for APIs.

Last week's TechCom meeting didn't happen due to a conflicting org-wide event, so this was bumped to today's meeting.

Based on the discussion in this task, the ArchCom decided to accept this RFC. This means that the REST API will support the selection of the language variant (and potentially the content language for multi-language projects like commons and wikidata) using Accept-Language headers.

Thanks to everyone for contributing to this RFC discussion!

Next steps:

  • The Parsing-Team--ARCHIVED is currently finalizing language variant support in Parsoid, which is a precondition for exposing variants in the REST API.
  • Once this support has landed in Parsoid, we will work with Traffic to add Accept-Language support in the REST API, and document / enable this for projects with language variants.

Resolving this task, as the RFC was accepted. The actual implementation work is tracked in T159985.

T159985: Implement language variant support in the REST API may be a duplicate. Just mentioning it here so the two bugs get linked together at least.

The Accept-Language header seems like a bad idea for situations where the language or variant is not normally selectable in browser or OS settings. (For example, you can't pick anything yue-related on Chrome's language settings page, nor in the Microsoft Windows settings that IE seems to read from, nor in the smartphone settings that mobile browsers and apps read from.) So there is no way for users to configure these client programs to send an Accept-Language header for the language/variant they would like to use.
[Note: This is relevant as there are requests to implement Hans-Hant conversion for yue.wp too]
[Note 2: It can be a way to detect which variant the user initially wants, but probably not a good way to fixate the variant selection]

Edit: It is also bad in situations where the user temporarily wants to access the page in another variant.

Edit 2: Also, what happens when a user on a guest machine, configured with only English and no Chinese variant in its language settings, accesses the Chinese Wikipedia via the API?

@C933103 Just an FYI, this is only for APIs, not for user facing web URLs.

If I understand correctly, while this API should be internal to applications and not used directly by users, it would still be used by things like mobile clients, VisualEditor, content scrapers, and such to obtain information for the user's viewing, which might still be subject to some of the limitations I mentioned above?

@C933103 JavaScript AJAX requests can set the Accept-Language header. So special-purpose clients like mobile apps, VisualEditor, content scrapers, etc. will be able to accept the user's preference (with their choice of UX) and use the correct header when making the AJAX request to the backend API.
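Shown here in Python rather than browser JS, but the idea is identical to passing a headers object to fetch/XMLHttpRequest: the client pins the variant explicitly, independent of any browser/OS locale settings:

```python
from urllib.request import Request

# Request Traditional Chinese (Taiwan) content explicitly, regardless
# of what the surrounding environment's locale settings say.
req = Request(
    "https://zh.wikipedia.org/api/rest_v1/page/html/%E4%B8%AD%E5%9C%8B",
    headers={"Accept-Language": "zh-tw"},
)
# urllib.request.urlopen(req) would then fetch the converted HTML.
```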