
Rest API endpoints do not respect Accept-Language header for some Wikis
Open, Medium | Public | BUG REPORT

Description

Steps to Reproduce:

Make these calls with a Latin language variant code in the Accept-Language header:

Accept-Language: ike-latn
GET https://iu.wikipedia.org/api/rest_v1/feed/featured/2021/03/23
GET https://iu.wikipedia.org/api/rest_v1/page/summary/%E1%96%83%E1%93%AA%E1%93%97%E1%93%88%E1%91%8E%E1%91%90%E1%91%A6
GET https://iu.wikipedia.org/api/rest_v1/page/mobile-html/%E1%96%83%E1%93%AA%E1%93%97%E1%93%88%E1%91%8E%E1%91%90%E1%91%A6

"extract": "ᖃᓪᓗᓈᑎᑐᑦ (English) —ᖃᓪᓗᓈᖅ|ᖃᓪᓗᓈᑦ ᐅᖃᐅᓯᓕᕆᔨ.",
"extract_html": "<p><b>ᖃᓪᓗᓈᑎᑐᑦ</b> (English) —ᖃᓪᓗᓈᖅ|ᖃᓪᓗᓈᑦ ᐅᖃᐅᓯᓕᕆᔨ.</p>"

Accept-Language: kk-latn
https://kk.wikipedia.org/api/rest_v1/feed/featured/2021/03/23
https://kk.wikipedia.org/api/rest_v1/page/summary/%D2%9A%D1%8B%D0%B7%D1%8B%D0%BB%D0%B0%D2%93%D0%B0%D1%88_%D0%BE%D2%9B%D0%B8%D2%93%D0%B0%D1%81%D1%8B
https://kk.wikipedia.org/api/rest_v1/page/mobile-html/%D0%9D%D0%B0%D1%83%D1%80%D1%8B%D0%B7_%D0%BC%D0%B5%D0%B9%D1%80%D0%B0%D0%BC%D1%8B

 "extract": "Наурыз мейрамы — ежелгі заманнан қалыптасқан жыл бастау мейрамы. Қазіргі күнтізбе бойынша күн мен түннің теңесуі кезіне келеді. Көне парсы тілінде нава=жаңа + рәзаңһ=күн, «жаңа күн» мағынасында, қазіргі парсы тілінде де сол мағынамен қалған, яғни «жаңа жылды» білдіреді.",
"extract_html": "<p><b>Наурыз мейрамы</b> — ежелгі заманнан қалыптасқан жыл бастау мейрамы. Қазіргі күнтізбе бойынша күн мен түннің теңесуі кезіне келеді. Көне парсы тілінде <i>нава</i>=жаңа + <i>рәзаңһ</i>=күн, «жаңа күн» мағынасында, қазіргі парсы тілінде де сол мағынамен қалған, яғни «жаңа жылды» білдіреді.</p>"

Actual Results:
The responses from the featured and summary endpoints contain non-Latin characters in the values of the extract and extract_html keys. The body HTML in the mobile-html endpoint response also contains non-Latin characters.

Expected Results:
The extract and extract_html values and the mobile-html endpoint response should contain Latin characters.
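
One way to check the expected result automatically is to look for letters outside the Latin script in a returned value. The following is a minimal sketch, not part of any Wikimedia tooling: the helper name `non_latin_letters` is hypothetical, and it relies on the assumption that a letter is Latin exactly when its Unicode character name starts with "LATIN". Digits, punctuation and whitespace are ignored.

```python
import unicodedata


def non_latin_letters(text):
    """Return the letters in text whose Unicode name does not start with 'LATIN'."""
    return [
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


# Sample values copied from the Serbian responses in this report.
latin_extract = "Danijela Š'tajnfeld je srpska filmska, televizijska i pozorišna glumica."
cyrillic_extract = "Данијела Штајнфелд је српска филмска, телевизијска и позоришна глумица."

print(non_latin_letters(latin_extract))           # empty when the text is fully Latin
print(bool(non_latin_letters(cyrillic_extract)))  # True when other scripts remain
```

A converted extract may legitimately keep a few non-Latin letters (for example a quoted original name), so a non-empty result is a hint to inspect the response rather than proof of a failure.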

Note this works fine with Serbian:

Accept-Language: sr-el
https://sr.wikipedia.org/api/rest_v1/feed/featured/2021/03/23
https://sr.wikipedia.org/api/rest_v1/page/mobile-html/%D0%91%D1%80%D0%B0%D0%BD%D0%B8%D1%81%D0%BB%D0%B0%D0%B2_%D0%9B%D0%B5%D1%87%D0%B8%D1%9B
https://sr.wikipedia.org/api/rest_v1/page/summary/%D0%91%D1%80%D0%B0%D0%BD%D0%B8%D1%81%D0%BB%D0%B0%D0%B2_%D0%9B%D0%B5%D1%87%D0%B8%D1%9B

"extract": "Danijela Š'tajnfeld je srpska filmska, televizijska i pozorišna glumica.",
"extract_html": "<p><b>Danijela Š'tajnfeld</b> je srpska filmska, televizijska i pozorišna glumica.</p>"

vs.

Accept-Language: sr-ec
https://sr.wikipedia.org/api/rest_v1/feed/featured/2021/03/23
https://sr.wikipedia.org/api/rest_v1/page/mobile-html/%D0%91%D1%80%D0%B0%D0%BD%D0%B8%D1%81%D0%BB%D0%B0%D0%B2_%D0%9B%D0%B5%D1%87%D0%B8%D1%9B
https://sr.wikipedia.org/api/rest_v1/page/summary/%D0%91%D1%80%D0%B0%D0%BD%D0%B8%D1%81%D0%BB%D0%B0%D0%B2_%D0%9B%D0%B5%D1%87%D0%B8%D1%9B

"extract": "Данијела Штајнфелд је српска филмска, телевизијска и позоришна глумица.",
"extract_html": "<p><b>Данијела Штајнфелд</b> је српска филмска, телевизијска и позоришна глумица.</p>"

So far I have tested only the sr (works), iu (doesn't work), and kk (doesn't work) wikis, but I can test further variants if needed.
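
The calls in the steps to reproduce can be sketched in Python. This is only an illustration under stated assumptions: the helper name `build_request` and the `CASES` list are hypothetical, and only the URL patterns, titles and Accept-Language values come from this report. The requests are built but not sent, so the script runs offline; pass a request to `urllib.request.urlopen` to actually perform the call.

```python
import urllib.parse
import urllib.request

# (wiki, variant, title) combinations taken from the report.
CASES = [
    ("sr", "sr-el", "Бранислав_Лечић"),
    ("sr", "sr-ec", "Бранислав_Лечић"),
    ("kk", "kk-latn", "Наурыз_мейрамы"),
]


def build_request(wiki, endpoint, title, variant):
    """Build (but do not send) a REST request that asks for a language variant."""
    encoded_title = urllib.parse.quote(title, safe="")
    url = f"https://{wiki}.wikipedia.org/api/rest_v1/page/{endpoint}/{encoded_title}"
    return urllib.request.Request(url, headers={"Accept-Language": variant})


requests_to_send = [
    build_request(wiki, endpoint, title, variant)
    for wiki, variant, title in CASES
    for endpoint in ("summary", "mobile-html")
]

for req in requests_to_send:
    # urllib stores header names capitalised, hence "Accept-language".
    print(req.get_header("Accept-language"), req.full_url)
```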

Event Timeline

Hi @Tsevener ! I reproduced this issue and also checked response output for different language variants:

  1. kk-latn and kk-arab, kk-cn, kk-cyrl, kk-kz, kk-tr;
  2. ike-latn and ike-cans;

(More language codes here - https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all)

When querying the /summary endpoint, I found that only the displaytitle and titles.display properties change. The same is true for the /feed/featured endpoint.
Note: the /summary and /mobile-html endpoints are served by mobileapps, but /feed/featured is served by RESTBase.

Properties such as title, titles.canonical and titles.normalized seem to be the same for different language codes (even for sr-el and sr-ec).

When querying /mobile-html, the language code is applied to the <meta> tag as expected (though the text content inside the HTML is the same, except for the title inside the header).
I've checked the sr wiki and yes, it works as expected.

I'm not sure about the extract and extract_html properties. They seem to come from Parsoid output, but where are they set? If, for example, an sr article has sr-el and sr-ec versions with the same pageid, how can I be sure that a kk article has versions for all relevant language codes?

Example from desktop wiki:

https://sr.wikipedia.org/sr-el/Бранислав_Лечић
https://sr.wikipedia.org/sr-ec/Бранислав_Лечић
^ This Serbian article has Latin and Cyrillic versions

https://kk.wikipedia.org/kk-latn/Наурыз_мейрамы
https://kk.wikipedia.org/kk-cyrl/Наурыз_мейрамы
^ Neither URL can be retrieved (but this one works: https://kk.wikipedia.org/wiki/%D0%9D%D0%B0%D1%83%D1%80%D1%8B%D0%B7_%D0%BC%D0%B5%D0%B9%D1%80%D0%B0%D0%BC%D1%8B)

https://iu.wikipedia.org/ike-latn/%E1%96%83%E1%93%AA%E1%93%97%E1%93%88%E1%91%8E%E1%91%90%E1%91%A6
https://iu.wikipedia.org/ike-cans/%E1%96%83%E1%93%AA%E1%93%97%E1%93%88%E1%91%8E%E1%91%90%E1%91%A6
^ Neither URL can be retrieved (but this one works: https://iu.wikipedia.org/wiki/%E1%96%83%E1%93%AA%E1%93%97%E1%93%88%E1%91%8E%E1%91%90%E1%91%A6)

There is probably some misconception here; I think I need someone from the Language team to clarify this issue.

cc: @MSantos , @Jgiannelos

Maybe @Nikerabbit could help us to understand this from the Language Team perspective.

This is probably related to language converter, which isn't our area of expertise. I usually point language converter related things to the Parsing team.

The code.wikipedia.org/variant/Title URLs are not canonical, it seems. Only some wikis, like Serbian, have them configured. https://kk.wikipedia.org/w/index.php?title=%D0%9D%D0%B0%D1%83%D1%80%D1%8B%D0%B7_%D0%BC%D0%B5%D0%B9%D1%80%D0%B0%D0%BC%D1%8B&variant=kk-latn is how the same looks in kk. But this is probably not relevant unless something in the REST APIs assumes that URL pattern.

T159985: Implement language variant support in the REST API says that Accept-Language is the way to select variants, so maybe this is a regression in functionality?

Thanks @Nikerabbit, this is very helpful! I'll just add more information to what you've already confirmed, for posterity:

The reason the language variant path doesn't work on kkwiki, but does on srwiki, is that kkwiki doesn't have $wgVariantArticlePath [1] set in InitialiseSettings [2], which means that:

https://kk.wikipedia.org/kk-latn/Наурыз_мейрамы - won't work
https://kk.wikipedia.org/w/index.php?title=Наурыз_мейрамы&uselang=kk-latn - works

[1] https://www.mediawiki.org/wiki/Manual:$wgVariantArticlePath
[2] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings.php#14580
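
The two URL shapes discussed here can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and the query form uses the `variant` parameter from the kk URL quoted earlier in the thread (the `uselang` form shown above could be built the same way). Whether the path form resolves depends on the wiki's $wgVariantArticlePath configuration, which this code cannot know.

```python
import urllib.parse


def variant_query_url(wiki, title, variant):
    """index.php form: the variant is passed as a query parameter."""
    query = urllib.parse.urlencode({"title": title, "variant": variant})
    return f"https://{wiki}.wikipedia.org/w/index.php?{query}"


def variant_path_url(wiki, title, variant):
    """Path form: expected to resolve only where $wgVariantArticlePath is configured."""
    return f"https://{wiki}.wikipedia.org/{variant}/{urllib.parse.quote(title)}"


print(variant_query_url("kk", "Наурыз_мейрамы", "kk-latn"))
print(variant_path_url("sr", "Бранислав_Лечић", "sr-el"))
```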

Now, from mobileapps code we request HTML from parsoid passing the Accept-Language header, maybe parsoid depends on the structure ${code}.wikipedia.org/${variant}/${title}? cc/ @cscott and @ssastry

Now, from mobileapps code we request HTML from parsoid passing the Accept-Language header, maybe parsoid depends on the structure ${code}.wikipedia.org/${variant}/${title}?

No, I think the header is the way to go. Parsoid doesn't reuse the language converter implementation from MediaWiki core. It has its own, which I believe has more limited support so far:
https://github.com/wikimedia/parsoid/tree/master/src/Language
https://github.com/wikimedia/langconv/tree/master/fst

There are open tasks tracking the progress, starting with T204966.

A few different issues here, I think:

  1. It seems this discussion is mixing a few different endpoints: we have "standard" article URLs like https://sr.wikipedia.org/sr-el/Бранислав_Лечић, and REST API requests that start with https://iu.wikipedia.org/api/rest_v1/. Among REST API requests, there are /feed/featured/, /page/summary, and /page/mobile-html, all of which are served from different services, as well as /page/html, which is the underlying Parsoid service. I'm not certain whether all of these endpoints support Accept-Language, but we should separate them out and file separate bugs for each service.
  2. Parsoid (and thus parsoid-backed services exposed via the REST API) doesn't support all of the languages from core's language converter (yet). There will be gaps in language variant support.
  3. When you use the HTTP "Accept-Language" header, you need to use the 'official' language codes for the languages in question. 'sr-el' and 'sr-ec', in particular, are not valid BCP 47 language codes (see T117845). You should use sr-Cyrl in your Accept-Language header.

Any or all of 1-3 could explain a failure to get converted output. You could be using a REST endpoint which doesn't support Accept-Language (even if other REST endpoints do). You could be trying to fetch a language which is not yet supported by Parsoid. And/or you could be using an invalid BCP 47 code for the language variant in question. (Finally: the site name of the request selects a particular wiki configuration, and it's possible that language converter is disabled on the named wiki, or that it has a slightly different language configuration.)
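
Point 3 can be illustrated with a small normalisation helper. This is a sketch under assumptions, not MediaWiki's or Parsoid's actual mapping: the function name `to_bcp47` and the `LEGACY_TO_BCP47` table are hypothetical, the sr-ec to sr-Cyrl entry comes from the comment above, and sr-el to sr-Latn is assumed as its Latin counterpart. The casing rules follow the usual BCP 47 convention (title-case script subtags, upper-case two-letter region subtags); BCP 47 tags are case-insensitive, so the casing is cosmetic.

```python
# Legacy variant codes that are not valid BCP 47 tags.
# sr-ec -> sr-Cyrl is from the thread; sr-el -> sr-Latn is an assumption.
LEGACY_TO_BCP47 = {"sr-ec": "sr-Cyrl", "sr-el": "sr-Latn"}


def to_bcp47(code):
    """Map a MediaWiki-style variant code to a conventionally cased BCP 47 tag."""
    code = code.lower()
    if code in LEGACY_TO_BCP47:
        return LEGACY_TO_BCP47[code]
    language, *rest = code.split("-")
    parts = [language]
    for subtag in rest:
        if len(subtag) == 4 and subtag.isalpha():
            parts.append(subtag.title())  # script subtag, e.g. Latn
        elif len(subtag) == 2 and subtag.isalpha():
            parts.append(subtag.upper())  # region subtag, e.g. CN
        else:
            parts.append(subtag)          # anything else is left unchanged
    return "-".join(parts)


for code in ("sr-ec", "sr-el", "kk-latn", "kk-cn", "ike-cans"):
    print(code, "->", to_bcp47(code))
```

Whether a given service accepts the legacy code, the BCP 47 tag, or both is a property of that service and is not established by this sketch.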

In order to make progress here I think we need to be very precise about what the bug(s) are, and file appropriate tasks -- each service is maintained by a separate team, so lumping issues with different services together makes it hard to assign an owner to the task.

  1. /feed/featured/, /page/summary/, and /page/mobile-html support the Accept-Language header or have support provided by restbase
  2. ...
  3. Oddly, even though sr-el and sr-ec aren't BCP 47 compliant, they work. Other variants such as kk-latn and ike-latn are BCP 47 compliant but don't work when requesting the /page/html endpoint. The languages have MediaWiki support though, as can be confirmed by the following examples:

@cscott Maybe we are talking about a parsoid language coverage gap?