Page MenuHomePhabricator

Use content matching of source sections for improving the translation restore
Open, HighPublic

Description

As per the current CX2 code, all sections are restored, even though the restored sections do not always match with the original source section. This is a bug and blocks T168287: CX2: Warning about article having changed too much.

Currently, section numbers are used to match sections. Section numbers are assigned by cxserver(?) in the order of sections in the source text, starting from 1. So if there is an article with 20 sections with translations, during restoration the first translated section matches the first source section, second translated section matches the second source section and so on. This works well only if the source article did not change at all.

But, if the original source article changed, imagine 4th and 5th section swapped. Currently, the 4th source section still gets the 4th saved translation, even though the content of fourth section is original source article's 5th section.

Translated sections should not be restored based on the section number. It does not correspond to the content in that section. Parsoid also assigns ids. Those ids aim to be stable with minor content changes across revisions. But in my (@santhosh) experience, even changes that look small may cause the parsoid id to change. We cannot rely 100% on them either for section restoration. There will be sections without a matching parsoid id across revisions of source article.

I propose the following improvement for the section restoration:

  1. Use parsoid id to locate a source section for the saved translation. For CX2 translations without section wrapping, this is the id of the block tags. For example, <p id="mwAc">..</p>. Here the parsoid id for the section is mwAc. For the sections with <section> tag wrapping, the parsoid id is the id of first immediate child of that section. For a section like <section rel="cxSection" id="cxSourceSection34"><p id="mwAc">..</p><section>, the parsoid id is mwAc. If any of the saved translation section has parsoid id mwAc , then it get restored for a source section with same parsoid id.
  2. If the source article changed a lot, we might not see a matching parsoid id in any of the source sections. We cannot use the linear order of the section as fallback either, because it has the same issue as with section numbers. Sometimes a section heading will restored against a figure or paragraph if we do this. Instead, we need to find a source section that has common tokens with the saved sections.
    1. Common tokens are simply the words that are common in source section of new revision of source article and in the source section we saved along with translation.
    2. We will define a threshold ratio to say if two sections are very similar enough to match. I propose threshold > 0.5.
    3. Tokenization is done in the same way we do when calculating section progress. It is based on the text value of section. It uses language aware tokenization. So for languages that do not use spaces, tokens are characters.
    4. If the old source section and new source section have different tags (e.g. <p> vs. <h1>), we can immediately reject the pair as not matching.
  3. If we still did not find a source section for a saved section, proceed with T168287: CX2: Warning about article having changed too much

A test case:

  1. Take https://en.wikipedia.org/wiki/Phantosmia translation to simple english as example. Use 'Source text' as translation method. Do a translation with one of its older revisions. I used a revision that is 1 year old. https://en.wikipedia.org/w/index.php?title=Phantosmia&oldid=800766238. To use this particular revision for source, add revision=800766238 in the translation URL. Example: title=Special:ContentTranslation&page=Phantosmia&from=en&to=simple&targettitle=Phantosmia&version=2
  2. Translate all sections. I translated 68 sections. All saved.
  3. Just reload the translation editor. You will see all 68 sections restored. This is because we are using a particular revision, nothing changed in source article. So section number based restore works.
  4. Remove the revision=800766238 from URL and load the translation again.
  5. You will see all sections restored. But with lot of misalignment, heading restored against paragraphs, sections restored against source sections which has no relation etc.
  6. If you do the previous step of loading the translation after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460206 the section alignment will be correct. But the source-translation is not matching at all.
  7. I implemented the above proposed approach in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460479. With that patch, you should get all 68 sections restored without any issue.

Here is an example, showing different parsoid id mwCQ, mwCA got restored based on the content match.

Source articleRestored translation

The section content is given in below screenshot. You can see that there are some reference changes, which might have caused a new parsoid id.

I also found that if we don't do the common content based matching then about 32 sections were not restored.

Event Timeline

santhosh created this task.Sep 14 2018, 9:33 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 14 2018, 9:33 AM

Change 460479 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ContentTranslation@master] Content match based section restore improvement

https://gerrit.wikimedia.org/r/460479

Nikerabbit updated the task description. (Show Details)Sep 17 2018, 6:08 AM
santhosh added a subscriber: Nikerabbit.EditedSep 17 2018, 10:05 AM

@Nikerabbit pointed out this issue while reviewing:

Doesn't this mess up parallel corpora even more badly now that sections ids get reassigned? I think all sections, where restoration does not have full match, should have new (preferably unused) id assigned and source text re-saved

Note that source content is resaved in every translation section. This make sure that the source content in cx_corpora is not from any old revision of article.

About the state of cx_corpora table, let us analyse

For a translation, assume we have a swap for sections 2 and 3 in new revision of article. And there is a 4th section for which no corresponding source section is found

From old translation

cxc_idcxc_translation_idcxc_section_idcxc_origincxc_content
235610002sourcesource content from earlier revision for second section
235710002ApertiumTranslation of second section
235810003sourcesource content from earlier revision for third section
235910003ApertiumTranslation of third section
236010004sourcesource content from earlier revision for fourth section
236110004ApertiumTranslation of fourth section

After restoring and saving:

cxc_idcxc_translation_idcxc_section_idcxc_origincxc_content
235610002sourceUpdated source for second section
235710002ApertiumTranslation of third section
235810003sourcesource content from third section
235910003ApertiumTranslation of second section
236010004sourcesource content from fourth section which now changed completely and translation not restored.
236110004ApertiumTranslation of fourth section - now orphan, will get restored if translator use old revision of source article.

Just to understand, are you proposing to update the record 2361 with a new (unused) secion id, so that when parallel corpora is created it wont go with 2360(source section=4) ?

Nikerabbit updated the task description. (Show Details)Sep 18 2018, 12:13 PM

@santhosh Is the mention of CX1 in the summary a typo? We should not touch CX1 code at this point to avoid regressions. Is the source as cxc_origin in for cxc_id 2358 in the second table a typo? I would expect it to say Apertium.

Note that source content is resaved in every translation section.

When exactly? On every change? On restoration without other changes?

Just to understand, are you proposing to update the record 2361 with a new (unused) secion id, so that when parallel corpora is created it wont go with 2360(source section=4) ?

My concern is that parallel corpora will be (close to) useless if we have rows like:

cxc_translation_idcsc_section_idcxc_origincxc_content
11sourceFoo
11userTranslation of "Bar" as opposed to "Foo"

I consider it as requirement that for all rows with same cxc_section_id then for all values of cxc_origin for those rows must be related to same text "Foo". Only exception is if the user puts different content in the translation that what is in the source, which we cannot control.

This requirement can be fulfilled in many ways. The simplest way is to always save all parts of the triplet (source, MT, user) when saving a section. This way we will override any old data that could be related to an old section. On top of this we can optimize, to not require saving of parts which have not changed.

I did not mean any change to CX1. Where is the summary? I corrected cxc_origin value

Note that source content is resaved in every translation section.

When exactly? On every change? On restoration without other changes?

In the first save of a changed section in a restored translation.

Nikerabbit updated the task description. (Show Details)Sep 19 2018, 6:46 AM

With the patch, the restored sections are nicely matched, but sentence highlight is non-existent in many cases or way off in others, like shown on the following screenshot:

@Petar.petkovic That is a general problem with any restore. We have not worked on such a sentence matching algorithm. Infact, we did not promise accuracy for the sentence highlight than 'as best as we can do'.
+cc @Pginer-WMF please confirm.

There may be something we can do to avoid highlighting sentence in non-connected sections(as shown in image above). But if we do, it need to be a new ticket.

@Petar.petkovic That is a general problem with any restore. We have not worked on such a sentence matching algorithm. Infact, we did not promise accuracy for the sentence highlight than 'as best as we can do'.
+cc @Pginer-WMF please confirm.

That's right. The translation is editable and such modifications are going to make it hard to keep sentence highlighting always consistent. If there are ways to improve this we can explore them later, but it is not a priority right now and should not be a blocker for other work.

There is a patch for this ticket, unmerged for quite some time. This comment should reflect the current review status.

@Nikerabbit started reviewing the patch and had some back and forth with @santhosh on this ticket, after which the review stalled. Months after, I did review and testing of the patch and came up with mostly the same results as @Nikerabbit.

I used the test example @Nikerabbit came up with and prepared test article:

  • Start translating original revision of Käyttäjä:Nikerabbit/seeäks from fiwiki
  • Translate all the sections with "Copy original content" MT option, so that we can easily see the mapping of sections upon restore
  • After saving, return to dashboard and open the draft again. revision parameter is now gone and we have the latest revision of article as a source now
  • In that latest revision, third section was removed and translation of that third section from the original revision is paired with what was fourth section. The reason for this mismatch is that Parsoid IDs seemingly got reassigned and that is the first criteria by which we try to restore. I don't know if there is something we can do about this and if this is known behavior in Parsoid. Translation of what used to be fourth section in original revision is replaced in DB and there is no way to make use of it.

Other sections are restored properly, but there are some different questions to answer. @santhosh and @Nikerabbit had conversation above on what should happen on restore. Whether all sections should be re-saved again to reflect all the changes in parallel corpora. Currently, we save source of every section the first time changes are made to the corresponding target section. That way, if sections are restored improperly, like in example above, we can have translation of section B paired with source for section A. Note that behavior to save source on first target edit is not being changed in the patch for this ticket, and that without the patch, mismatching of restored sections is only greater, making bigger problems of wrongly paired sections in corpora. The question when we should update the corpora is different. I think we should keep doing it after the first edit for that section and not re-save all the sections upon restore.

Additionally, @Nikerabbit came up with an idea to continue using the old revision as source article when user continues the draft and show the message to user that they are using old revision and may want to switch to the latest one. That is the reverse of what we currently do. Determining how much the new revision is different from the old one (changed significantly) can be hard or at least I don't know of any API that could help, but that can also be the job cxserver can do.

Pginer-WMF added a comment.EditedFeb 25 2019, 4:54 PM

Additionally, @Nikerabbit came up with an idea to continue using the old revision as source article when user continues the draft and show the message to user that they are using old revision and may want to switch to the latest one. That is the reverse of what we currently do. Determining how much the new revision is different from the old one (changed significantly) can be hard or at least I don't know of any API that could help, but that can also be the job cxserver can do.

Restoring a saved translation should be a safe operation. I prefer users to work on an outdated version of an article if there is any doubt that we can reliably load a more updated version of the content, rather than risking the contents to be juggled or lost.

I think we can be flexible on what "changed significantly" means. If that means that there was any change in any source paragraph that the user added to the translation (or even, any change at all in the source article), that's ok. That would make the tool less tolerant to source updates but safer when restoring content. Having said that, I don't consider this to be the opposite of what we do, but just to raise the bar in terms of the kind of modifications that we accept.

What I think it is important from my perspective is to identify (a) which is the safest approach we can support, and (b) which are the limitations that prevents us from supporting a better one (e.g., capturing any proposed improvement to Parsoid for the corresponding team to consider).

santhosh added a comment.EditedFeb 27 2019, 10:57 AM

Thanks for the comments. From what I read, let me summarize the issues:

  1. As per the proposed patch and algorithm, we consider a matching parsoid id as a strong indicator for match. From testing. we see that parsoid ids are not that stable to consider strong indicator.

This is a good observation. If we remove the strong dependency on parsoid id and use the token matching algorithm in conjunction with parsoid id, we can address this. Perhaps we can completely remove the parsoid id check and rely on token matching. I will attempt this and do some tests.

  1. When we restore using our new logic, we need to make sure that the corpora alignment in backend database also updated with the new alignment. We are doing that, but only when the target section get a first edit.

The conflict here is, we have an understanding that restoring a translation itself does not cause any saving, untill a manual edit. If we are ready to give up that, we can mark these restored sections for saving.

  1. There is a proposal to stick with old revision when restoring and optionally use the new version.

This is independent of the ticket and can be done in a different ticket - If we decide to approach the restore in that way. My recommendation is to attempt what we started with this ticket and if we see that it is not improving the situation, go with the proposed suggestion of sticking with old revision.

Not seen in the review, but this patch is a rare patch for CX front end with unit tests. So feel free to suggest any crazy combinations to challenge and test the algorithm.

Thanks @Nikerabbit for preparing a good test content at https://fi.wikipedia.org/w/index.php?title=K%C3%A4ytt%C3%A4j%C3%A4%3ANikerabbit%2Fsee%C3%A4ks&type=revision&diff=17718127&oldid=17718110

In the latest patch set I removed the dependency on parsoid id and using only the token match algorithm. Here is the output for the above revision changes. I kept the parsoid id check for CX1 sections.

Change 460479 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ContentTranslation@master] Content match based section restore improvement

https://gerrit.wikimedia.org/r/460479

We try to find if there is a section that can be restored by comparing the source section to the saved source section (one paired with translated paragraph).
Previously, we tried to match section numbers/Parsoid IDs, but now we compare the similarity of the content. If nothing changes in the source article (there are no new revisions since translation was started) it is expected to work fine.

We should examine what happens when source article changes with some new revisions. I divided the cases in three categories of changes and did some experiments on what can happen.

  • Section deleted from source article: That section will not be restored and counter of unrestored sections increases. Three unrestored sections trigger warning about article changed too much - T168287. This is not considered as content lost.
  • Section moved to a different place: Content has not changed, so it should be restored properly.
  • New section added: Tests from the above comments discussed new section being added with completely distinct content. I have created a test page where I added new sections in second revision, but with content similar to content of some sections which existed in the initial revision, where more than 50% of the words are same. That caused section to be restored at a wrong place. Me and @santhosh discussed this section restoration algorithm and he told me that lost content isn't acceptable, but misplaced one is, because you can copy the content to some different section. There are cases where content can be lost, which leads us to the last scenario.
  • Existing section is modified: Section is restored if source section and saved source section have more than 50% of common words. I have created a test for this. In that test, one section is removed, two new added, but that is already discussed. Those sections that were modified have their content increase by more than double (in terms of word count) or decrease by more than half. In both cases, the translation is lost. I think these are valid cases where section is expanded or cleaned up, and not somewhat of an edge case. @Pginer-WMF, @santhosh what is your take on this?

Existing section is modified: Section is restored if source section and saved source section have more than 50% of common words. I have created a test for this. In that test, one section is removed, two new added, but that is already discussed. Those sections that were modified have their content increase by more than double (in terms of word count) or decrease by more than half. In both cases, the translation is lost. I think these are valid cases where section is expanded or cleaned up, and not somewhat of an edge case. @Pginer-WMF, @santhosh what is your take on this?

Updated the code to take care of this, added tests too.

  • Section deleted from source article: That section will not be restored and counter of unrestored sections increases. Three unrestored sections trigger warning about article changed too much - T168287. This is not considered as content lost.

I'm also curious about this case. Are we treating in the same way the case where (a) the deleted source section has an associated translation and (b) when it does not?

From my perspective:

  • If some sections are removed from the source article but I have not provided a translation for them, I'm totally ok for those to be gone without further notice when I load the translation to continue the work. As a result, I'll get the latest version and my translated content will be preserved.
  • If sections that I translated are gone, then I think it is justified to remain with an older version of the translation instead of the most updated one. As a result, I'll get an old version of the article but my translation content will be preserved.
  • Section deleted from source article: That section will not be restored and counter of unrestored sections increases. Three unrestored sections trigger warning about article changed too much - T168287. This is not considered as content lost.

I'm also curious about this case. Are we treating in the same way the case where (a) the deleted source section has an associated translation and (b) when it does not?

From my perspective:

  • If some sections are removed from the source article but I have not provided a translation for them, I'm totally ok for those to be gone without further notice when I load the translation to continue the work. As a result, I'll get the latest version and my translated content will be preserved.
  • If sections that I translated are gone, then I think it is justified to remain with an older version of the translation instead of the most updated one. As a result, I'll get an old version of the article but my translation content will be preserved.

We always use the latest version of source article when loading into CX. There is an option to use some older revisions, but you need to put revision=12345 in URL, which is not obvious.
Given that, if some section is deleted from source article:

  • If we haven't translated it previously when we were using older revision as source, nothing happens, no work is lost.
  • If we translated the section, which is removed in newer revision, our translation is gone when we load latest revision, as there is no source section with which we can pair it.

Taking into account that you had different assumptions about how system works when source content changes, I took another look and our warning about article being changed significantly (T168287) is not working as expected.

I created a test example with dozen of sections. Translated every section using that initial revision as source. Then, I removed three sections and loaded the draft. Every section from source has a corresponding section in target and translations for three sections which are removed are just lost. Those three unrestored sections are making warning about article being changed significantly show up. Message strings in that warning imply that old revision is still used as source and diff link points to a newer revision and compares it to the latest revision, which makes no sense, since newer version is the latest one and diff shows nothing changed.
Even worse, "Restart the translation" action button restarts translation while using the same revision, just with all the work lost.

Taking into account that you had different assumptions about how system works when source content changes, I took another look and our warning about article being changed significantly (T168287) is not working as expected.

To summarize it in the most concise form, the propose behaviour for restoring is:

  • If the new version of the source can be loaded without causing any data loss to the user translation, load the new version.
  • If the new version cannot be loaded without causing any data loss to the user translation, load the original version and show the warning (T168287).

I think this seems the safest approach. Is there another approach we should consider?

To summarize it in the most concise form, the proposed behavior for restoring is:

  • If the new version of the source can be loaded without causing any data loss to the user translation, load the new version.
  • If the new version cannot be loaded without causing any data loss to the user translation, load the original version and show the warning (T168287).

I think this seems the safest approach. Is there another approach we should consider?

This is not how restoring works currently. We always load the latest revision and try matching the translated sections saved as draft. After that matching finishes, we can know how many sections failed to pair. As part of this ticket, @santhosh proposed changes to that matching algorithm. But, we do that matching client side and know if there is data loss only when we start matching and find it. If we would try doing what you proposed, we'd need to reload the translation using old revision immediately when first section is lost.
We definitely need a new ticket for that, as the process is not working per your specifications and since warning about content being changed too much is broken.

This is not how restoring works currently. We always load the latest revision and try matching the translated sections saved as draft. After that matching finishes, we can know how many sections failed to pair. As part of this ticket, @santhosh proposed changes to that matching algorithm. But, we do that matching client side and know if there is data loss only when we start matching and find it. If we would try doing what you proposed, we'd need to reload the translation using old revision immediately when first section is lost.
We definitely need a new ticket for that, as the process is not working per your specifications and since warning about content being changed too much is broken.

Ok. Thanks for confirming this is not the current behaviour.
From what I read it seems that we currently try to match all paragraphs and let those not matching to just go away (with a warning that tells we are using the old version instead when we are not).

Improving the matching algorithm is already an important improvement, but I think we need to adjust the general approach to make this process more solid. I created a separate ticket: T220790: Restore using updated content only when no user data is lost, using the original version as a fallback

Please, if the proposed behaviour is not clear, or there are any considerations I may not have anticipated please comment in the ticket.

This ticket is about an algorithm to find match between sections. We need to implement first and then discuss the UX and messages around the result of that matching, and edge cases we need to take . If we discuss all at once and keep the iterations blocked, we are not progressing. I had clarified this in https://phabricator.wikimedia.org/T204308#4987605 already.

This ticket is about an algorithm to find match between sections. We need to implement first and then discuss the UX and messages around the result of that matching, and edge cases we need to take . If we discuss all at once and keep the iterations blocked, we are not progressing. I had clarified this in https://phabricator.wikimedia.org/T204308#4987605 already.

@Petar.petkovic considering this as the scope of the current ticket, is there anything based on your review that is missing? Do you have enough detail to continue the review (when available) or anything needs further clarification?

This ticket is about an algorithm to find match between sections. We need to implement first and then discuss the UX and messages around the result of that matching, and edge cases we need to take . If we discuss all at once and keep the iterations blocked, we are not progressing. I had clarified this in https://phabricator.wikimedia.org/T204308#4987605 already.

@Petar.petkovic considering this as the scope of the current ticket, is there anything based on your review that is missing? Do you have enough detail to continue the review (when available) or anything needs further clarification?

Understanding the scope of this ticket is not a problem to progress with this ticket. I'm aware this one is about algorithm to restore the sections, and after I surfaced scenarios which we can face while doing the restoring of the sections in T204308#5079891, conversation drifted into another direction, from which T220790 is the outcome. We're not mixing the scopes of the two tickets, the latter is only the products of the discussions on this one.

Considering only the algorithm for restoring sections, we can have following situation:

  1. We have one section. Let's say its content is "The quikc brown fox jumps over the lazy dog"
  2. This revision is used in Content Translation to translate the section to Spanish and not published (saved as draft)
  3. Meanwhile, the article changes, so that our section becomes "The quick brown fox jumps over the lazy dog" (typo fixed)
  4. After that, section gets expanded and section volume (word count) increases by more than double. Let's say new content is "The quick brown fox jumps over the lazy dog. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sed aliquam nisl."
  5. Now (in real situation, this can be months away) user gets back to the saved draft and restores it. Translation of section from point #1 of this list is lost.

If that typo fixing from point #3 didn't exist, translation would be successfully restored. In other words, when section gets expanded, it can be restored only if expanding the section means appending/prepeding new text without altering the existing one.

Maybe I'm pushing this algorithm to the extreme, situation is already improved by @santhosh after my previous comments. This isn't unreal scenario what I described here, but I'll let @Pginer-WMF have a word on this.

I have created examples of these workflows:

Maybe I'm pushing this algorithm to the extreme, situation is already improved by @santhosh after my previous comments. This isn't unreal scenario what I described here, but I'll let @Pginer-WMF have a word on this.

Thanks for the clarifications, @Petar.petkovic. If the described scenarios are not better supported with the current approach in production, I think we can consider the proposed patch an improvement worth delivering to the users. We should capture the support for these scenarios as a follow-up ticket to keep improving the approach in future iterations.

Change 460479 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Content match based section restore improvement

https://gerrit.wikimedia.org/r/460479