Cannot load a saved translation, JS error from Sizzle at $section.data( 'source' )
Closed, Resolved · Public

Description

Today I observed a user who was trying to load an article whose translation was in progress. Loading failed, and the article's content couldn't be seen. The JS console showed an error:
Uncaught Error: Syntax error, unrecognized expression: #.D7.9E.D7.A7 ... etc.

This happened in translation of "פתח תקווה" from Hebrew to Spanish.

This comes from ext.cx.translation.loader.js line 311.

jQuery probably tries to parse #.D7.9E.D7.A7 as a selector, and it's a non-ASCII string from article content that was encoded in a way that jQuery cannot parse.

This causes apparent data loss—the data is probably stored in corpora, but cannot actually be loaded. Non-ASCII strings are very common in Wikipedia, so it's an important issue to fix.

There were many similar complaints on Talk:CX lately about such problems, and this is probably the reason.

Amire80 created this task. Dec 29 2016, 10:18 AM
Restricted Application added a subscriber: Aklapper. Dec 29 2016, 10:18 AM
Amire80 triaged this task as High priority. Dec 29 2016, 10:18 AM
Amire80 updated the task description.
Arrbee added a subscriber: Arrbee. Jan 2 2017, 7:13 AM

An article has been requested by the devs as an example of this problem. Thanks.

I'm not sure whether this example is exactly this bug, but you can check Biological network from English to Persian (fa), which is being translated by User:Z.navidi.

Amire80 updated the task description. Jan 2 2017, 7:55 AM
KartikMistry added a subscriber: KartikMistry. Edited · Jan 3 2017, 5:41 AM

Another example:

  • Source Wiki: enwiki
  • Target Wiki: guwiki
  • Article: Citron

I don't have many details, as this is a report from another user.

I don't know where to start here....

The issue is caused by a CSS query selector that starts with "#." (a hash immediately followed by a period). That is invalid and throws an error. In other code we long ago stopped doing $( '#' + random_input ) precisely because of this issue, but the loader code still had calls like that.
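To make the failure condition concrete, here is a minimal sketch of a check for whether an ID can be appended to '#' without escaping. The function name isPlainCssIdentifier and the sample ID 'cxSome..._heading' are hypothetical, and the regular expression is deliberately conservative: it accepts only plain CSS identifiers and ignores backslash escapes.

```javascript
// Hypothetical helper: returns true only when '#' + id is a plain,
// unescaped CSS ID selector. A leading period or any embedded period
// makes the selector unparseable, which is what Sizzle reports.
function isPlainCssIdentifier( id ) {
	return /^-?[_a-zA-Z\u00A0-\uFFFF][_a-zA-Z0-9\u00A0-\uFFFF-]*$/.test( id );
}

console.log( isPlainCssIdentifier( 'mwAy' ) ); // true
console.log( isPlainCssIdentifier( '.D7.9E.D7.A7' ) ); // false
console.log( isPlainCssIdentifier( 'cxSome..._heading' ) ); // false
```

Any ID that fails this check needs either escaping or a lookup that does not go through the selector parser at all, such as document.getElementById.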

And the reason the CSS selector starts with "#." is that, for non-ASCII headings, MediaWiki uses this .DE.AD.BE.EF style of hex encoding for legacy reasons.
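The encoding can be approximated as follows. This is a sketch, not MediaWiki's actual implementation: legacyAnchor is a hypothetical name, and encodeURIComponent is assumed to be a close enough stand-in for the PHP-side URL encoding (the two differ for a few ASCII punctuation characters).

```javascript
// Approximation of the legacy anchor encoding: spaces become underscores,
// the text is percent-encoded as UTF-8, and each '%' is replaced by '.'.
function legacyAnchor( heading ) {
	return encodeURIComponent( heading.replace( / /g, '_' ) )
		.replace( /%3A/g, ':' )
		.replace( /%/g, '.' );
}

// The two Hebrew letters below are what the bytes D7 9E D7 A7 decode to.
console.log( legacyAnchor( 'מק' ) ); // .D7.9E.D7.A7
```

Each two-byte UTF-8 character turns into six ASCII characters, so the resulting ID always begins with a period whenever the heading starts with a non-ASCII character.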

One more condition is needed to trigger this error: the cxc_section_id column is varbinary(30). On production sites the IDs get truncated to 30 bytes, and given that the encoding takes ~6 bytes for each non-ASCII character, that affects most of the headings. On my wiki I get SQL errors that prevent the article from being saved, because I have strict mode enabled.
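The arithmetic can be sketched like this: at 6 bytes per character, only the first five two-byte characters of a heading fit into 30 bytes. The helper truncateBytes is hypothetical and only imitates what a non-strict database does on insert; the ID is built with the same encodeURIComponent approximation of the encoding, which is an assumption.

```javascript
// Width of cxc_section_id, varbinary(30), as stated in the comment above.
const MAX_BYTES = 30;

// Hypothetical helper: cut a string to at most maxBytes bytes of UTF-8.
// The encoded IDs are pure ASCII, so no character is split here.
function truncateBytes( id, maxBytes ) {
	const bytes = new TextEncoder().encode( id );
	return new TextDecoder().decode( bytes.slice( 0, maxBytes ) );
}

const fullId = encodeURIComponent( 'אוכלוסייה' ).replace( /%/g, '.' );
const storedId = truncateBytes( fullId, MAX_BYTES );

console.log( fullId.length > MAX_BYTES ); // true
console.log( storedId.length ); // 30
console.log( storedId === fullId ); // false, so an exact lookup by stored ID misses
```

The stored value is only a prefix of the real ID, which is why restoring by exact ID cannot find the section.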

After fixing the CSS selector issue with a patch, to avoid the JavaScript error, this is what I got:

I had three translated sections:

  • אוכלוסייה (see [1])
  • חינוך (now completely disappeared, but it has a corresponding empty section in the source column: the line without text)
  • the image (which is fine)

[1] Which is still associated with the correct heading, just mis-aligned. Curiously I can re-translate this heading now, so I have three sections which are connected:

This leads me to conclude that, unless my change broke the section alignment code (which is highly unlikely, as the only change is using document.getElementById instead of querySelectorAll), the restoration code is borked and is causing data loss.

TL;DR

  1. We need to come up with a way to deal with IDs longer than 30 bytes (in fact, as far as I know, these can be of unlimited length, so just increasing the column length won't work)
  2. We need to fix the restoration code.

Curiously, this may also resolve some alignment bugs such as T152098.

Change 330379 had a related patch set uploaded (by Nikerabbit):
Workaround to fix restoration for truncated section ids

https://gerrit.wikimedia.org/r/330379

Sorry, the Citron example I reported above (enwiki to guwiki) seems to be a different bug.

The <h{1,2,3,4,5,6}> tags used to have Parsoid-style ids (example: mwAy), and we did not have this problem. Recently Parsoid changed the ids of headers to match the section anchors produced by the PHP parser. See the change: https://github.com/wikimedia/parsoid/commit/082ea420cf73e53a855e5503e4f4fd0f04b5ad74 based on T102209: Derive heading ids from heading name, the same way MW core does

santhosh added a comment. Edited · Jan 5 2017, 7:25 AM

T102209: Derive heading ids from heading name, the same way MW core does has several issues affecting CX:

  1. The header tag ids are no longer valid. An id starting with a period, like .DE.AD.BE.EF, is not a valid ID, and jQuery Sizzle cannot work with such ids. Span tags inside header tags used to have these ids, but the spans have now been removed from header tags and the ids moved to the header tag, replacing the Parsoid-generated ids.
  2. These ids for section headers cannot be used as a database primary key, since they are not length-limited and can be arbitrarily long.
  3. Section header ids will always change if header text changes.

Even if the ids were valid, the change from Parsoid-generated ids to the new system affected our translation restore feature for all non-ASCII translations.
Basically an H2 tag having id like mwAz changed to .D8.B4.D8.A8.DA.A9.D9.87.D9.94_.D8.B2.DB.8C.D8.B3.D8.AA.DB.8C_.D9.88_.D8.A8.DB.8C.D9.88.D8.A7.D9.86.D9.81.D9.88.D8.B1.D9.85.D8.A7.D8.AA.DB.8C.DA.A9

Change 330379 merged by jenkins-bot:
Workaround to fix restoration for truncated section ids

https://gerrit.wikimedia.org/r/330379

Change 330669 had a related patch set uploaded (by KartikMistry):
Workaround to fix restoration for truncated section ids

https://gerrit.wikimedia.org/r/330669

Change 330669 merged by jenkins-bot:
Workaround to fix restoration for truncated section ids

https://gerrit.wikimedia.org/r/330669

Mentioned in SAL (#wikimedia-operations) [2017-01-05T14:15:09Z] <hashar@tin> Synchronized php-1.29.0-wmf.7/extensions/ContentTranslation: Workaround to fix restoration for truncated section ids - T154279 (duration: 02m 10s)

I tested, and this seems to work now in production. Leaving it open in case more work is needed.

Amire80 added a comment. Edited · Jan 10 2017, 6:35 PM

Another instance of a very similar issue, which still appears today, so it's not entirely fixed. I have a reproducible example; sorry it's silly, but that's what I have ;)

To reproduce:

  • Translate Cats and the Internet from English to Portuguese. (This probably happens with any target language.)
  • Go to the section heading "Everytime you masturbate... God kills a kitten"
  • Close the tab
  • Try to load the translation

Observed: Error: Syntax error, unrecognized expression: #cxEverytime_you_masturbate..._God_kills_a_kitten

Tested in Firefox.

Could you post file and line numbers, or rather the whole backtrace with debug mode enabled?

Error: Syntax error, unrecognized expression: #cxEverytime_you_masturbate..._God_kills_a_kitten  load.php:1496:8
	Sizzle</Sizzle.error https://fi.wikipedia.org/w/load.php:1496:8
	Sizzle</Sizzle.tokenize https://fi.wikipedia.org/w/load.php:2113:4
	Sizzle https://fi.wikipedia.org/w/load.php:859:14
	.find https://fi.wikipedia.org/w/load.php:2733:4
	ContentTranslationLoader.prototype.restoreSection https://fi.wikipedia.org/w/extensions/ContentTranslation/modules/translation/ext.cx.translation.loader.js:411:14
	ContentTranslationLoader.prototype.restore https://fi.wikipedia.org/w/extensions/ContentTranslation/modules/translation/ext.cx.translation.loader.js:226:23
	ContentTranslationLoader.prototype.fetch/</< https://fi.wikipedia.org/w/extensions/ContentTranslation/modules/translation/ext.cx.translation.loader.js:108:5
	jQuery.Callbacks/fire https://fi.wikipedia.org/w/load.php:3148:10
	jQuery.Callbacks/self.fireWith https://fi.wikipedia.org/w/load.php:3260:7
	mw.hook/<.fire https://fi.wikipedia.org/w/load.php:13064:14
	ContentTranslationEditor.prototype.listen/</< https://fi.wikipedia.org/w/extensions/ContentTranslation/modules/translation/ext.cx.translation.js:151:5

Looks like I missed a .find( '#' + ... ) pattern:

$section = this.$translationColumn.find( '#' + targetSectionId );
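A minimal sketch of the [id="..."] alternative follows. The helper name idAttributeSelector is hypothetical and this is not the code from the patch; it only shows that a quoted attribute selector tolerates periods in the ID, as long as backslashes and double quotes are escaped.

```javascript
// Hypothetical helper: build an attribute selector that matches an element
// by its exact id, without relying on '#' + id being parseable.
function idAttributeSelector( id ) {
	const escaped = String( id )
		.replace( /\\/g, '\\\\' ) // escape backslashes first
		.replace( /"/g, '\\"' ); // then escape double quotes
	return '[id="' + escaped + '"]';
}

console.log( idAttributeSelector( '.D7.9E.D7.A7' ) ); // [id=".D7.9E.D7.A7"]

// With jQuery available, the failing call could then be written as:
// $section = this.$translationColumn.find( idAttributeSelector( targetSectionId ) );
```

Where the lookup covers the whole document, document.getElementById avoids selector parsing entirely; in browsers that provide it, CSS.escape (or jQuery.escapeSelector in jQuery 3) is another option for keeping the '#' form.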

Change 331674 had a related patch set uploaded (by Santhosh):
Change the h1,h2.. ids to fixed length md5 based ids

https://gerrit.wikimedia.org/r/331674

Change 332367 had a related patch set uploaded (by Nikerabbit):
Also convert find( '#' ... ) to use [id="..."] type of selector.

https://gerrit.wikimedia.org/r/332367

Change 332388 had a related patch set uploaded (by Nikerabbit):
Avoid database errors for too long section ids

https://gerrit.wikimedia.org/r/332388

Change 332367 merged by jenkins-bot:
Also convert find( '#' ... ) to use [id="..."] type of selector.

https://gerrit.wikimedia.org/r/332367

Change 332388 merged by jenkins-bot:
Avoid database errors for too long section ids

https://gerrit.wikimedia.org/r/332388

Change 331674 merged by jenkins-bot:
Change the h1,h2.. ids to fixed length sha256 based ids

https://gerrit.wikimedia.org/r/331674
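The idea of fixed-length, hash-based ids can be sketched as below. This is an illustration under assumptions, not the merged change: the function name fixedLengthId, the 'cx' prefix, and the choice of 24 hex characters (to stay under the 30-byte column) are all hypothetical, and the example uses Node's crypto module.

```javascript
const crypto = require( 'crypto' );

// Hypothetical helper: derive an id of constant length from heading text.
// The result starts with a letter and contains only [a-z0-9], so it is
// safe in a '#' selector and fits a varbinary(30) column.
function fixedLengthId( headingText, prefix = 'cx', hexChars = 24 ) {
	const digest = crypto
		.createHash( 'sha256' )
		.update( headingText, 'utf8' )
		.digest( 'hex' );
	return prefix + digest.slice( 0, hexChars );
}

console.log( fixedLengthId( 'אוכלוסייה' ).length ); // 26
console.log( fixedLengthId( 'a' ) === fixedLengthId( 'a' ) ); // true
```

Truncating the digest raises the collision probability compared with the full hash, so the number of characters kept is a trade-off against the column width. Such ids still change whenever the heading text changes.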

The fixes were done by @Nikerabbit and @santhosh. Waiting for deployment.

Amire80 closed this task as Resolved. Feb 20 2017, 12:54 PM