[[w:ru:Янская стоянка]] is not translatable via CX
Open, Medium, Public

Description

Steps to reproduce

Actual result:

  • Critical error: Content translation failed to load due to internal error.

Expected result:

  • Page loads for translation

Investigation and impact

Pages using the map frame feature from the Kartographer extension fail to load in Content Translation. This is because the feature uses the data-mw HTML attribute in ways not expected by cxserver, causing cxserver to fail when processing the page.

Event Timeline

CXServer response:

Page ru:Янская_стоянка could not be found. SyntaxError: Unexpected token i in JSON at position 0

This is for:

GET	https://cxserver.wikimedia.org/v2/page/ru/fi/%D0%AF%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%BA%D0%B0

Relevant code:

lib/routes/v2.js
fetchPage( req, res ) {
	const title = req.params.title,
		revision = req.params.revision;

	// In case of wikimedia service hosting, cxserver is configured behind
	// xx.wikipedia.org/api/.. instead of language code, we get domain name.
	// Split by . seems safe here. But note that this is not respecting
	// domain pattern configured by mw_host
	const sourceLanguage = req.params.sourcelanguage.split( '.' )[ 0 ];
	const targetLanguage = req.params.targetlanguage.split( '.' )[ 0 ];
	if ( !languageData.isKnown( sourceLanguage ) ) {
		return res.status( 400 )
			.end( `Invalid language code for page fetch: ${sourceLanguage}` );
	}
	if ( !languageData.isKnown( targetLanguage ) ) {
		return res.status( 400 )
			.end( `Invalid language code for target language in page fetch: ${targetLanguage}` );
	}
	const pageLoader = new MWPageLoader( {
		context: this.app,
		sourceLanguage,
		targetLanguage
	} );

	this.app.logger.log( 'debug', `Getting page ${sourceLanguage}:${title} for ${targetLanguage}` );
	return pageLoader.getPage( title, revision, true /* wrapSections */ ).then(
		( response ) => {
			res.send( {
				sourceLanguage,
				title,
				revision: response.revision,
				segmentedContent: response.content,
				categories: response.categories
			} );
			this.app.logger.log( 'debug', 'Page sent' );
		},
		( error ) => {
			res.status( 404 )
				.end( `Page ${sourceLanguage}:${title} could not be found. ` + error.toString() );
		}
	);

Rest API looks okay: https://ru.wikipedia.org/api/rest_v1/page/html/%D0%AF%D0%BD%D1%81%D0%BA%D0%B0%D1%8F_%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%BA%D0%B0

This feels like category adaptation is a suspect.

Putting on breaking bugs for investigation. Currently no idea how common this is, as logging is not catching these.

I was able to reproduce this locally and to isolate the JSON parsing error: it originates inside cxserver, in LinearDoc.parser. That code has only two instances of JSON parsing; the one that fails is the data-mw parser. Example output:

{"name":"cxserver","hostname":"jadekukka","pid":28672,"level":20,"levelPath":"debug","msg":"Getting page ru:Янская_стоянка for en","time":"2021-02-08T13:54:15.372Z","v":0}
AAA
CCC
DDD
EEE
XXX1 {"parts":[{"template":{"target":{"wt":"coord","href":"./Шаблон:Coord"},"params":{"1":{"wt":"70.720168"},"2":{"wt":"N"},"3":{"wt":"135.385254"},"4":{"wt":"E"},"display":{"wt":"title"}},"i":0}}]}
XXX2
XXX1 {"parts":[{"template":{"target":{"wt":"coord","href":"./Шаблон:Coord"},"params":{"1":{"wt":"70.720168"},"2":{"wt":"N"},"3":{"wt":"135.385254"},"4":{"wt":"E"},"display":{"wt":"title"}},"i":0}}]}
XXX2
XXX1 {"parts":[{"template":{"target":{"wt":"coord","href":"./Шаблон:Coord"},"params":{"1":{"wt":"70.720168"},"2":{"wt":"N"},"3":{"wt":"135.385254"},"4":{"wt":"E"},"display":{"wt":"title"}},"i":0}}]}
XXX2
XXX1 {"parts":[{"template":{"target":{"wt":"ПозКарта+","href":"./Шаблон:ПозКарта+"},"params":{"1":{"wt":"Россия Якутия"},"width":{"wt":"200"},"float":{"wt":"right"},"caption":{"wt":"Янская стоянка"},"places":{"wt":"{{ПозКарта~|Россия Якутия|lat=70.720168|lon=135.385254|position=left|background= |label=}}"}},"i":0}}]}
XXX2
XXX1 {"parts":[{"template":{"target":{"wt":"ПозКарта+","href":"./Шаблон:ПозКарта+"},"params":{"1":{"wt":"Россия Якутия"},"width":{"wt":"200"},"float":{"wt":"right"},"caption":{"wt":"Янская стоянка"},"places":{"wt":"{{ПозКарта~|Россия Якутия|lat=70.720168|lon=135.385254|position=left|background= |label=}}"}},"i":0}}]}
XXX2
XXX1 {"caption":"Янская стоянка (Якутия)"}
XXX2
XXX1 {"caption":"Янская стоянка (Якутия)"}
XXX2
XXX1 {"caption":"Янская стоянка (Якутия)"}
XXX2
XXX1 {"parts":[{"template":{"target":{"wt":"ПозКарта+","href":"./Шаблон:ПозКарта+"},"params":{"1":{"wt":"Россия Якутия"},"width":{"wt":"200"},"float":{"wt":"right"},"caption":{"wt":"Янская стоянка"},"places":{"wt":"{{ПозКарта~|Россия Якутия|lat=70.720168|lon=135.385254|position=left|background= |label=}}"}},"i":0}}]}
XXX2
XXX1 {"name":"mapframe","attrs":{"latitude":"70.720168","longitude":"135.385254","zoom":"12","width":"288","height":"350","align":"right"}}
XXX2
XXX1 {"name":"mapframe","attrs":{"latitude":"70.720168","longitude":"135.385254","zoom":"12","width":"288","height":"350","align":"right"}}
XXX2
XXX1 interface
<end of output - error happens here>

Clearly, interface is not a valid JSON blob. This originates from the Parsoid output:

<div class="mw-kartographer-container thumb tright" typeof="mw:Extension/mapframe" about="#mwt6" data-mw='{"name":"mapframe","attrs":{"latitude":"70.720168","longitude":"135.385254","zoom":"12","width":"288","height":"350","align":"right"}}' id="mwBQ">
  <div class="thumbinner" style="width: 288px;">
    <a class="mw-kartographer-map" style="width: 288px; height: 350px;" data-mw="interface" data-style="osm-intl" data-width="288" data-height="350" data-zoom="12" data-lat="70.720168" data-lon="135.385254" href="/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Map/12/70.720168/135.385254/ru">
    <img src="https://maps.wikimedia.org/img/osm-intl,12,70.720168,135.385254,288x350.png?lang=ru" alt="" width="288" height="350" decoding="async" srcset="https://maps.wikimedia.org/img/osm-intl,12,70.720168,135.385254,288x350@2x.png?lang=ru 2x"/></a><div class="thumbcaption">
  </div>
</div>

data-mw="interface" is on the a element of a map frame. It is generated by the Kartographer extension, specifically at https://gerrit.wikimedia.org/g/mediawiki/extensions/Kartographer/+/3e9034c0c6231f0fec6dd6538b09d1f701ff25f9/includes/Tag/MapFrame.php#106
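To make the mismatch concrete, here is a minimal sketch that lists the data-mw values in an abbreviated copy of the markup quoted above and checks which of them parse as JSON. The helper name classifyDataMw is hypothetical, and the regular expression is only a shortcut for illustration; it is not how cxserver reads the HTML.

```javascript
// Abbreviated copy of the Parsoid markup quoted above (most attributes trimmed).
const html = [
	'<div class="mw-kartographer-container thumb tright" typeof="mw:Extension/mapframe"',
	' data-mw=\'{"name":"mapframe","attrs":{"zoom":"12"}}\' id="mwBQ">',
	'<a class="mw-kartographer-map" data-mw="interface" data-zoom="12"></a>',
	'</div>'
].join( '' );

// Hypothetical helper: collect every data-mw value and note whether it is JSON.
// The regex handles single- and double-quoted attribute values only.
function classifyDataMw( markup ) {
	const pattern = /\sdata-mw=(?:'([^']*)'|"([^"]*)")/g;
	const results = [];
	let match;
	while ( ( match = pattern.exec( markup ) ) !== null ) {
		const value = match[ 1 ] !== undefined ? match[ 1 ] : match[ 2 ];
		let isJson = true;
		try {
			JSON.parse( value );
		} catch ( e ) {
			// JSON.parse throws a SyntaxError for values such as "interface".
			isJson = false;
		}
		results.push( { value, isJson } );
	}
	return results;
}

const found = classifyDataMw( html );
console.log( found );
```

In this sketch the wrapper div's value parses, while the value on the a element does not. The exact wording of the SyntaxError message depends on the JavaScript engine version, so it may differ from the one in the CXServer response.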

Based on this, the bug affects all pages with map frames or map links. Supposedly all of these: https://ru.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F:%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D1%8B_%D1%81_%D0%BA%D0%B0%D1%80%D1%82%D0%B0%D0%BC%D0%B8 (see also the interlanguage links).

As for how to fix this:

  1. Cxserver probably shouldn't be parsing the data-mw attribute on elements not marked as templates.
  2. Cxserver should fail gracefully on invalid JSON and log it
  3. Kartographer should not be using data-mw: it is reserved for parser/parsoid per https://gerrit.wikimedia.org/g/mediawiki/core/+/c2d66ab8179dd33f17478f059ffa0ee01e4df4db/includes/parser/Sanitizer.php#520:
 		// data-ooui is reserved for ooui.
		// data-mw and data-parsoid are reserved for parsoid.
		// data-mw-<name here> is reserved for extensions (or core) if
		// they need to communicate some data to the client and want to be
		// sure that it isn't coming from an untrusted user.
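Option 2 could look roughly like the following. This is a minimal sketch and not the actual cxserver change: the function name parseDataMw and the 'warn' log level are assumptions, and the logger is only assumed to expose a log( level, message ) method like the one used in lib/routes/v2.js.

```javascript
// Hypothetical helper, not the actual cxserver patch: parse a data-mw value and
// return null (after logging) instead of throwing when the value is not JSON.
function parseDataMw( value, logger ) {
	if ( typeof value !== 'string' || value === '' ) {
		return null;
	}
	try {
		return JSON.parse( value );
	} catch ( e ) {
		// Assumed log level and message format; adjust to the real logger.
		logger.log( 'warn', `Ignoring non-JSON data-mw attribute: ${value}` );
		return null;
	}
}

// Stand-in logger that records messages so the behaviour can be inspected.
const messages = [];
const logger = { log: ( level, msg ) => messages.push( `${level}: ${msg}` ) };

const mapframe = parseDataMw( '{"name":"mapframe","attrs":{"zoom":"12"}}', logger );
const broken = parseDataMw( 'interface', logger );
console.log( mapframe.name, broken, messages );
```

Returning null lets the caller treat the element as having no usable data-mw, so one unexpected attribute no longer aborts the whole page load, while the log entry makes such cases visible.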

Tagging Parsing people for feedback and awareness, and tagging the Product Infrastructure team per maintainers with the ask of migrating away from data-mw in Kartographer (unless Parsing advises otherwise). Leaving for @Pginer-WMF to prioritize. I suggest that we implement 1 and/or 2 in cxserver regardless of other actions.

Supposedly all of these

It seems that maplink does not affect this, e.g. https://ru.wikipedia.org/wiki/Улица_21_Сентября which has maplink in its coordinates seems to be translatable

That was my assumption from https://codesearch.wmcloud.org/search/?q=data-mw&i=nope&files=&excludeFiles=&repos=Extension:Kartographer - I did not verify. It is possible that only map frame triggers this code path.

Though perhaps you are right and it might be both; it is just that I do not see ( https://ru.wikipedia.org/w/index.php?title=Служебная:Поиск&limit=500&offset=0&ns0=1&search=insource%3A%2F\maplink%2F&advancedSearch-current={} ) any maplinks used in articles outside of an infobox. As for infoboxes, it seems that when a mapframe is inside one, e.g. https://ru.wikipedia.org/wiki/Таран_(маяк) , translatability is not affected, but when it is outside an infobox, e.g. https://ru.wikipedia.org/wiki/Ландшафтный_парк_«Митино» , the same error occurs.

Hm, though https://en.wikipedia.org/wiki/Pennsylvania (has mapframe outside infobox) causes error, and https://en.wikipedia.org/wiki/Kochi_Metro (mapframe in infobox, maplink outside infobox) seemingly does not.

Pginer-WMF lowered the priority of this task from High to Medium.

I think the proper resolution of this task was outlined by @Nikerabbit:

As for how to fix this:
[...]

  1. Kartographer should not be using data-mw: it is reserved for parser/parsoid per https://gerrit.wikimedia.org/g/mediawiki/core/+/c2d66ab8179dd33f17478f059ffa0ee01e4df4db/includes/parser/Sanitizer.php#520:

[...]
Tagging Parsing people for feedback and awareness, and tagging the Product Infrastructure team per maintainers with the ask of migrating away from data-mw in Kartographer (unless Parsing advises otherwise). Leaving for @Pginer-WMF to prioritize. I suggest that we implement 1 and/or 2 in cxserver regardless of other actions.

data-mw should be reserved for Parsoid; Kartographer should use data-mw-kartographer (or something). We just need to figure out "whose job it is" to fix Kartographer.

As for how to fix this:

  1. Cxserver probably shouldn't be parsing the data-mw attribute on elements not marked as templates.

...
I suggest that we implement 1 and/or 2 in cxserver regardless of other actions.

Note that data-mw is present on elements other than templates. It is present on extensions as well as media (images, audio, video). ( See https://www.mediawiki.org/wiki/Specs/HTML )

Change 662943 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/services/cxserver@master] LinearDoc: Do not fatal on non-JSON mw-data attribute

https://gerrit.wikimedia.org/r/662943

Note that data-mw is present on elements other than templates. It is present on extensions as well as media (images, audio, video). ( See https://www.mediawiki.org/wiki/Specs/HTML )

In LinearDoc, where this error happens, the code only checks data-mw for the wikitext name of the template in order to decide whether it should be stripped. You are right that data-mw is not only for templates.
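As an illustration of that lookup, the sketch below reads the template's wikitext name from a data-mw value shaped like the debug output earlier in this task (parts[].template.target.wt). The function getTemplateName is hypothetical and is not taken from LinearDoc; it simply returns null for anything that is not a template or not JSON.

```javascript
// Hypothetical helper, not LinearDoc's actual code: return the wikitext name of
// the first template described in a data-mw value, or null if there is none.
function getTemplateName( dataMwValue ) {
	let dataMw;
	try {
		dataMw = JSON.parse( dataMwValue );
	} catch ( e ) {
		// Not JSON at all, e.g. Kartographer's data-mw="interface".
		return null;
	}
	const parts = dataMw && Array.isArray( dataMw.parts ) ? dataMw.parts : [];
	for ( const part of parts ) {
		// Entries without a template object are skipped.
		if ( part && part.template && part.template.target &&
			typeof part.template.target.wt === 'string' ) {
			return part.template.target.wt.trim();
		}
	}
	return null;
}

const coord = '{"parts":[{"template":{"target":{"wt":"coord","href":"./Шаблон:Coord"},"params":{"1":{"wt":"70.720168"}},"i":0}}]}';
const mapframeData = '{"name":"mapframe","attrs":{"zoom":"12"}}';

console.log( getTemplateName( coord ) );
console.log( getTemplateName( mapframeData ) );
console.log( getTemplateName( 'interface' ) );
```

Extension data such as the mapframe blob has no parts array, so a lookup written this way would ignore it rather than fail.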

@Nikerabbit and @Pginer-WMF given that this bug has been around from the beginning, I take it this is not an ultra high priority issue and it is okay for it to take a few weeks to get done on the Kartographer end.

Yes it's not urgent, especially because we are implementing a workaround in cxserver.

Change 662943 merged by jenkins-bot:
[mediawiki/services/cxserver@master] LinearDoc: Do not fatal on non-JSON data-mw attribute

https://gerrit.wikimedia.org/r/662943

Change 663213 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2021-02-10-134029-production

https://gerrit.wikimedia.org/r/663213

Change 663213 merged by jenkins-bot:
[operations/deployment-charts@master] Update cxserver to 2021-02-10-134029-production

https://gerrit.wikimedia.org/r/663213

Mentioned in SAL (#wikimedia-operations) [2021-02-11T06:45:19Z] <kart_> Updated cxserver to 2021-02-10-134029-production (T274133, T273456, T271980)

Nikerabbit added a subscriber: Jpita.

@Jpita Can you test this? If it's okay, I suggest we drop the Language team sprint tag from this project and let Product Infrastructure use this task.

MSantos added subscribers: MSantos, thiemowmde, awight, WMDE-Fisch.

Dropping old team tag that we shouldn't be using anymore.

@MSantos: Could you please elaborate on why, who exactly "we" is, and which other team tag is used instead (or maybe none)? Or who else could elaborate? Thanks a lot! :)