Content copied from Content Translation into Visual Editor exposes internal attributes
Closed, Resolved (Public)

Description

Based on T144167#5075490, it seems some users copy content from Content translation and paste it into Visual Editor. As a result, unnecessary internal attributes leak into the final result. In this example you can see that unnecessary HTML markup, such as the following, was removed:

<span data-segmentid="9" class="cx-segment">...</span>

This task is intended to:

  • Explore if there is a way for Content translation to reliably clean up contents when they are copied.
    • Check that any clean-up approach does not cause issues when pasting the contents in Content translation itself because of the lost metadata. This would limit the ability users have to move content around.
    • Check that the solution works with both clipboard copy & paste and drag & drop.
  • If there is no reliable solution on the Content translation side, explore how to clean up the contents when they are pasted into Visual Editor. Similar approaches may already be in place for pasting content from other tools such as Microsoft Office.

Users may be doing this as a shortcut to expand existing articles with a translation of some new content, but that's just a guess. We don't know how common this behaviour is.

Event Timeline

Change 521521 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/extensions/ContentTranslation@master] Don't generate HTML for segments when copying

https://gerrit.wikimedia.org/r/521521

The content can be cleaned up when copying, since the converter has separate modes for generating HTML for Parsoid and for the clipboard.

Internal copy/paste and drag/drop don't use clipboard HTML so won't be affected.

Change 521521 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Don't generate HTML for segments when copying

https://gerrit.wikimedia.org/r/521521

@Esanders @santhosh can you give me an example of where those tags are used so I can see if they're still being passed correctly?

@Jpita Every paragraph in a Content translation target document has them (they are what make the sentences appear yellow when you hover on them). So just copy anything out of a CX translation, and paste it into a normal VE instance to test this.

I had to revert this fix because of T229906: Sentence pair highlighting broken
Revert patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/528406

We use the clipboard mode to extract HTML for MT when adding a section to the target language. Clipboard mode allows the reference content to be localized, so it is important to use that mode. Since sentence annotations need to be present in the translated content to highlight sentence pairs, the highlighting is now broken. Beyond highlighting, some of the MT annotation mapping for plain-text MT services depends on sentences, so those are also misbehaving (T228498#5395412).

@Esanders If your fix can be applied only to the target language, that would be good. But since ve.dm.CXSentenceSegmentAnnotation.static.toDomElements has no information about the language, that is not easy. What do you suggest?

If we just did it based on the document language you could still write segments to the clipboard by copying from the source document.

Thinking about what the converter modes mean, I think clipboard mode might still be the correct mode to use, as essentially it means "for export to another VE instance, via some serialised storage".

I think what we should do is pass an additional flag to the converter saying isForTranslation, and then check for this in ve.dm.CXSentenceSegmentAnnotation.static.toDomElements

Strictly speaking we should extend the converter and create a new mode to do this, but we can just hack it for now:

ve.dm.converter.isForTranslation = true;
html = ve.dm.converter.getDomFromNode( ... );
ve.dm.converter.isForTranslation = false;
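
As an illustration of the flag check suggested above, here is a minimal, self-contained sketch. It does not use the real VisualEditor classes: `converter`, `segmentToDomElements`, `withTranslationFlag` and the `segmentId` attribute name are hypothetical stand-ins, and plain objects stand in for DOM elements. It only shows the idea of emitting the segment wrapper when the flag is set, and of resetting the flag even if serialisation throws.

```javascript
// Hypothetical stand-in for the shared converter; not the real ve.dm.converter.
const converter = { isForTranslation: false };

// Plays the role of CXSentenceSegmentAnnotation.static.toDomElements:
// returns descriptors for the wrapper elements, or [] to emit no wrapper.
function segmentToDomElements( dataElement, conv ) {
	if ( !conv.isForTranslation ) {
		// Plain copy: leave the segment wrapper out of the clipboard HTML.
		return [];
	}
	return [ {
		tag: 'span',
		attributes: {
			class: 'cx-segment',
			// 'segmentId' is an assumed attribute name, for illustration only.
			'data-segmentid': String( dataElement.attributes.segmentId )
		}
	} ];
}

// Sets the flag for the duration of fn and restores it even if fn throws.
function withTranslationFlag( conv, fn ) {
	const previous = conv.isForTranslation;
	conv.isForTranslation = true;
	try {
		return fn();
	} finally {
		conv.isForTranslation = previous;
	}
}

const segment = { attributes: { segmentId: 9 } };
// Ordinary copy: no wrapper elements are produced.
const copied = segmentToDomElements( segment, converter );
// Export for MT: the wrapper is kept so sentence pairs can still be matched.
const forMT = withTranslationFlag( converter, () => segmentToDomElements( segment, converter ) );
```

Wrapping the flag in a helper with try/finally avoids leaving the shared converter stuck in translation mode if the conversion fails part-way, which is the main risk of toggling a property by hand.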

Change 528471 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/extensions/ContentTranslation@master] Re-apply "Don't generate HTML for segments when copying"

https://gerrit.wikimedia.org/r/528471

Change 528471 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Re-apply "Don't generate HTML for segments when copying"

https://gerrit.wikimedia.org/r/528471

I'd recommend adding some unit tests that assert that segments are preserved when translating.
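
A rough sketch of what such tests could assert, under stated assumptions: `serializeSegment` is a hypothetical stand-in for the real converter path (it is not an existing ContentTranslation function), and the cases are written against a generic `assertEqual( actual, expected, message )` callback so they can be run standalone or plugged into a test framework.

```javascript
// Hypothetical serializer standing in for the CX converter path.
// The sample text contains no HTML special characters, so no escaping is done.
function serializeSegment( text, segmentId, options ) {
	if ( options.isForTranslation ) {
		return '<span class="cx-segment" data-segmentid="' + segmentId + '">' + text + '</span>';
	}
	return text;
}

const cases = [
	{
		msg: 'segments are preserved when translating',
		options: { isForTranslation: true },
		expected: '<span class="cx-segment" data-segmentid="9">Hello.</span>'
	},
	{
		msg: 'segments are dropped on an ordinary copy',
		options: { isForTranslation: false },
		expected: 'Hello.'
	}
];

// Runs every case through the supplied assertion callback.
function runCases( assertEqual ) {
	cases.forEach( ( c ) => {
		assertEqual( serializeSegment( 'Hello.', 9, c.options ), c.expected, c.msg );
	} );
}

// Standalone run with a minimal throwing comparator.
runCases( ( actual, expected, msg ) => {
	if ( actual !== expected ) {
		throw new Error( msg + ': got ' + actual );
	}
} );
```

Inside a QUnit test callback the same cases could be run with `runCases( assert.strictEqual.bind( assert ) )`, with `serializeSegment` replaced by a call into the actual converter.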