The apostrophe after the letter "Š"
Open, Low | Public | Bug Report

Description

I noticed that the /page/summary/{title} service returns content with an obvious error in Serbian Latin output: an apostrophe ("'") is inserted after the uppercase letter "Š" at the start of a word.

In the following example, "Š'abac" should be "Šabac."

Request:

curl -X 'GET' \
  'https://sr.wikipedia.org/api/rest_v1/page/summary/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_%28%D0%A8%D0%B0%D0%B1%D0%B0%D1%86%29' \
  -H 'accept: application/json; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/Summary/1.4.2"' \
  -H 'Accept-Language: sr-el'

Response:

{
  "type": "standard",
  "title": "Слатина (Шабац)",
  "displaytitle": "<span class=\"mw-page-title-main\">Slatina (Šabac)</span>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q2736370",
  "titles": {
    "canonical": "Слатина_(Шабац)",
    "normalized": "Слатина (Шабац)",
    "display": "<span class=\"mw-page-title-main\">Slatina (Šabac)</span>"
  },
  "pageid": 179381,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Slatina_002.jpg/320px-Slatina_002.jpg",
    "width": 320,
    "height": 480
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/6/6e/Slatina_002.jpg",
    "width": 2304,
    "height": 3456
  },
  "lang": "sr",
  "dir": "ltr",
  "revision": "27928519",
  "tid": "75c0823a-3f95-11ef-943a-965eb61dbf2b",
  "timestamp": "2024-07-11T14:54:23Z",
  "description": "насеље у општини Шабац, Мачвански округ, Србија",
  "description_source": "central",
  "coordinates": {
    "lat": 44.725333,
    "lon": 19.7715
  },
  "content_urls": {
    "desktop": {
      "page": "https://sr.wikipedia.org/wiki/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)",
      "revisions": "https://sr.wikipedia.org/wiki/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)?action=history",
      "edit": "https://sr.wikipedia.org/wiki/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)?action=edit",
      "talk": "https://sr.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B7%D0%B3%D0%BE%D0%B2%D0%BE%D1%80:%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)"
    },
    "mobile": {
      "page": "https://sr.m.wikipedia.org/wiki/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)",
      "revisions": "https://sr.m.wikipedia.org/wiki/Special:History/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)",
      "edit": "https://sr.m.wikipedia.org/wiki/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)?action=edit",
      "talk": "https://sr.m.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B7%D0%B3%D0%BE%D0%B2%D0%BE%D1%80:%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)"
    }
  },
  "extract": "Slatina je naselje u Srbiji u opštini Š'abac u Mačvanskom okrugu. Prema popisu iz 2011. bilo je 215 stanovnika.",
  "extract_html": "<p><b>Slatina</b> je naselje u Srbiji u opštini Š'abac u Mačvanskom okrugu. Prema popisu iz 2011. bilo je 215 stanovnika.</p>"
}
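
A minimal sketch of how the symptom could be checked automatically, assuming the bug always shows up as a word beginning with "Š'". The helper names (`find_stray_apostrophes`, `fetch_summary`) and the User-Agent string are hypothetical and not part of any Wikimedia tooling; only the URL and the Accept-Language header come from the request above.

```python
import json
import re
import urllib.request
from typing import List, Optional

# Matches a word that starts with "Š" immediately followed by an apostrophe,
# e.g. "Š'abac". Assumption: no legitimate Serbian Latin word has this shape.
STRAY_APOSTROPHE = re.compile(r"\bŠ'\w+")


def find_stray_apostrophes(text: str) -> List[str]:
    """Return every word in `text` that starts with "Š'"."""
    return STRAY_APOSTROPHE.findall(text)


def fetch_summary(url: str, accept_language: Optional[str] = None) -> dict:
    """Fetch a summary response as a dict (hypothetical helper, needs network)."""
    headers = {
        "Accept": "application/json",
        # Placeholder User-Agent; replace it with one that identifies your client.
        "User-Agent": "stray-apostrophe-check/0.1 (example)",
    }
    if accept_language:
        headers["Accept-Language"] = accept_language
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `fetch_summary(url, "sr-el")` with the URL from the request above and passing the returned `extract` field to `find_stray_apostrophes` should list the affected words, if any.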

Event Timeline

Aklapper changed the subtype of this task from "Task" to "Bug Report". (Oct 9 2024, 12:59 PM)
daniel subscribed.

Pinging Content-Transform-Team and Language and Product Localization to see if they have any ideas about what the issue might be. IIRC, /page/summary/{title} is backed by the Page Content Service. I also note that the page is written in Cyrillic, so perhaps the issue is with MediaWiki-Language-converter?

I think this might have something to do with converting between the Serbian Cyrillic and Latin scripts. As I understand it, sr is the version of Serbian that uses Cyrillic, whereas sh is the version that uses the Latin alphabet.

When I request the same page as described above (https://sr.wikipedia.org/api/rest_v1/page/summary/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_%28%D0%A8%D0%B0%D0%B1%D0%B0%D1%86%29), but without specifying Accept-Language, I get extract and extract_html in Serbian Cyrillic (instead of Latin characters), and the apostrophe is absent.

Similarly, if I request the Latin-alphabet version (same URL as above, but with sr.wikipedia.org replaced by sh.wikipedia.org), I get the extract and extract_html values in Latin characters, but without the unwanted apostrophe.

Cyrillic version without Accept-Language header specified:

extract: "Слатина је насеље у Србији у општини Шабац у Мачванском округу. Према попису из 2011. било је 215 становника."
extract_html: "<p><b>Слатина</b> је насеље у Србији у општини Шабац у Мачванском округу. Према попису из 2011. било је 215 становника.</p>"

Latin version (also without Accept-Language header specified):

extract: "\n\nSlatina je naselje u Srbiji u opštini Šabac u Mačvanskom okrugu. Prema popisu iz 2011. bilo je 215 stanovnika."
extract_html: "<p>\n\n<b>Slatina</b> je naselje u Srbiji u opštini Šabac u Mačvanskom okrugu. Prema popisu iz 2011. bilo je 215 stanovnika.</p>"
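
The comparison above can be sketched in code: if the converted sr-el extract and the sh extract are the same once whitespace and apostrophes are ignored, the apostrophe is the only thing the conversion gets wrong. This is only an illustration using the extract strings quoted in this task; `differs_only_by_apostrophes` is a hypothetical helper name.

```python
def differs_only_by_apostrophes(converted: str, reference: str) -> bool:
    """True if the two texts differ, but only in where apostrophes appear.

    Leading and trailing whitespace is ignored, because the sh.wikipedia.org
    extract starts with two newlines.
    """
    a = converted.strip()
    b = reference.strip()
    return a != b and a.replace("'", "") == b.replace("'", "")


# Extracts copied from the responses quoted in this task.
SR_EL_EXTRACT = (
    "Slatina je naselje u Srbiji u opštini Š'abac u Mačvanskom okrugu. "
    "Prema popisu iz 2011. bilo je 215 stanovnika."
)
SH_EXTRACT = (
    "\n\nSlatina je naselje u Srbiji u opštini Šabac u Mačvanskom okrugu. "
    "Prema popisu iz 2011. bilo je 215 stanovnika."
)

RESULT = differs_only_by_apostrophes(SR_EL_EXTRACT, SH_EXTRACT)
```

For these two strings the check is true, which is consistent with the stray apostrophe being introduced during script conversion rather than being present in the source text.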

I think the problem here might be the handling of language variants after we switched over srwiki PCS from RESTBase to REST-gateway.

Yeah, I just verified it on staging without caching:

jgiannelos@deploy2002:~$ curl -q https://staging.svc.eqiad.wmnet:4102/sr.wikipedia.org/v1/page/summary/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_%28%D0%A8%D0%B0%D0%B1%D0%B0%D1%86%29 -H "cache-control: no-cache" -H "Accept-Language: sr-el" | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2354  100  2354    0     0   2636      0 --:--:-- --:--:-- --:--:--  2633
{
...
  "extract": "Slatina je naselje u Srbiji u opštini Š'abac u Mačvanskom okrugu. Prema popisu iz 2011. bilo je 215 stanovnika.",
  "extract_html": "<p><b>Slatina</b> je naselje u Srbiji u opštini Š'abac u Mačvanskom okrugu. Prema popisu iz 2011. bilo je 215 stanovnika.</p>"
}
jgiannelos@deploy2002:~$ curl -q https://staging.svc.eqiad.wmnet:4102/sr.wikipedia.org/v1/page/summary/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_%28%D0%A8%D0%B0%D0%B1%D0%B0%D1%86%29 -H "cache-control: no-cache" | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2528  100  2528    0     0   4137      0 --:--:-- --:--:-- --:--:--  4130
{
...
  "extract": "Слатина је насеље у Србији у општини Шабац у Мачванском округу. Према попису из 2011. било је 215 становника.",
  "extract_html": "<p><b>Слатина</b> је насеље у Србији у општини Шабац у Мачванском округу. Према попису из 2011. било је 215 становника.</p>"
}

I think the problem is at the Parsoid output level.
Check:

curl --request GET \
  --url https://sr.wikipedia.org/api/rest_v1/page/html/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_%28%D0%A8%D0%B0%D0%B1%D0%B0%D1%86%29 \
  --header 'accept-language: sr-el'

image.png (778×2 px, 1 MB)

From https://sr.wikipedia.org/wiki/%D0%A1%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B0_(%D0%A8%D0%B0%D0%B1%D0%B0%D1%86)

image.png (778×2 px, 553 KB)

HCoplin-WMF subscribed.

Removing the MW Interfaces tag since this seems to be a Parsoid issue, per the comment above.

This issue can be reproduced with:

echo "Шабац" | php bin/parse.php --wt2html --htmlVariantLanguage sr-Latn --domain sr.wikipedia.org