Change Details

As per the current CX2 code, no section is left out from restore. Even though the restored sections does not always match with the original source section. This is a bug(and blocks {T168287}).

Section numbers are used as the first method to restore a section. So if there is an article with 20 sections, all saved, and tried to restore, the first translated section goes to first source, second goes to second source and so on.. The section numbers that are inserted by CX is used as the identifier for the CX2 section. This will work if the source article did not change at all. But If the original source article changed and imagine 4th and 5th section swapped. Currently, the 4th source section still get the 4th saved translation, even though the content of fourth section is original source articles 5th section. Translated sections should not restore based on matching section number. Section numbers are just sequential order of sections in an article. It does not correspond to the content in that section.

Parsoid ids tries to be stable with minor content changes across revisions. But in my experience, that stability is very conditional and cannot depend 100% for section restore. There will be sections without a matching parsoid id across revisions of source article.

So I propose the following improvement for the section restore.

1. Use parsoid id to locate a source section for the saved translation. For CX1 translations without section wrapping, this is the id of the block tags. For example, `<p id="mwAc">..</p>`. Here the parsoid id for the section is `mwAc`. For the sections with `<section>` tag wrapping, the parsoid id is the id of first immediate child of that section. For a section like `<section rel="cxSection" id="cxSourceSection34"><p id="mwAc">..</p><section>`, the parsoid id is `mwAc`. If any of the saved translation section has parsoid id `mwAc` , then it get restored for a source section with same parsoid id.
2. If the source article changed a lot, we might not see a matching parsoid id in any of the source section. We should NOT use the linear order of the section as fallback, because that is a blind restore. Sometimes a section heading will restored against a figure or paragraph if we do this. Instead, find a source section that has **common tokens** with the saved sections.
  1. **Common tokens** are simply the words that are common in source section of new revision of source article and in the source section we saved along with translation.
  2. We will define a threshold ratio to say if two sections are very similar and section can be restored or not. I propose ratio greater than 0.5
  3. Tokenization is done in the same way we do for section progress. It is based on the text value of section. It is language aware tokenization. So for languages that does not use spaces, tokens are characters.
  4. If the old source section and new source section have //different tags//, we can immediately reject the section for restore.
3. If we still did not find a source section for a saved section, proceed with {T168287}

**A test case:**

# Take https://en.wikipedia.org/wiki/Phantosmia translation to simple english as example. Use 'Source text' as translation method. Do a translation with one of its older revisions. I used a revision that is 1 year old. https://en.wikipedia.org/w/index.php?title=Phantosmia&oldid=800766238. To use this particular revision for source, add revision=800766238 in the translation URL. Example: `title=Special:ContentTranslation&page=Phantosmia&from=en&to=simple&targettitle=Phantosmia&version=2`
# Translate all sections. I translated 68 sections. All saved.
# Just reload the translation editor. You will see all 68 sections restored. This is because we are using a particular revision, nothing changed in source article. So section number based restore works.
# Remove the revision=800766238 from URL and load the translation again.
# You will see all sections restored. But with lot of misalignment, heading restored against paragraphs, sections restored against source sections which has no relation etc.
# If you do the previous step of loading the translation after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460206 the section alignment will be correct. But the source-translation is not matching at all.
# I implemented the above proposed approach in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460479. With that patch, you should get all 68 sections restored without any issue.

Here is an example, showing different parsoid id mwCQ, mwCA got restored based on the content match.

| Source article | Restored translation |
| {F25848146 size=full} | {F25848271 size=full} |

The section content is given in below screenshot. You can see that there are some reference changes, which might have caused a new parsoid id.

{F25848193 size=full}

I also found that if we don't do the common content based matching about 32 sections were not restored at all.

As per the current CX2 code, all sections are restored, even though the restored sections do not always match the original source sections. This is a bug and blocks {T168287}.

Currently, section numbers are used to match sections. Section numbers are assigned by cxserver(?) in the order of sections in the source text, starting from 1. So if there is an article with 20 translated sections, during restoration the first translated section is matched to the first source section, the second translated section to the second source section, and so on. This works well only if the source article did not change at all. But suppose the source article changed and its 4th and 5th sections were swapped. Currently, the 4th source section still gets the 4th saved translation, even though the content of the 4th section is now what used to be the 5th section of the original source article. Translated sections should not be restored based on the section number, because it does not correspond to the content of that section.

Parsoid also assigns ids. Those ids aim to be stable under minor content changes across revisions. But in my (ST) experience, even changes that look small may cause the parsoid id to change, so we cannot rely 100% on them for section restoration either. There will be sections without a matching parsoid id across revisions of the source article.

I propose the following improvement for section restoration:

1. Use the parsoid id to locate a source section for the saved translation. For CX1 translations without section wrapping, this is the id of the block tag. For example, in `<p id="mwAc">..</p>` the parsoid id for the section is `mwAc`. For sections with `<section>` tag wrapping, the parsoid id is the id of the first immediate child of that section. For a section like `<section rel="cxSection" id="cxSourceSection34"><p id="mwAc">..</p></section>`, the parsoid id is `mwAc`. If a saved translation section has parsoid id `mwAc`, it gets restored against the source section with the same parsoid id.
2. If the source article changed a lot, we might not see a matching parsoid id in any of the source sections. We cannot use the linear order of the sections as a fallback either, because it has the same issue as section numbers: sometimes a section heading would be restored against a figure or paragraph. Instead, we need to find a source section that has **common tokens** with the saved section.
  1. **Common tokens** are simply the words that appear both in the source section of the new revision of the source article and in the source section we saved along with the translation.
  2. We will define a threshold ratio that decides whether two sections are similar enough to match. I propose a ratio greater than 0.5.
  3. Tokenization is done in the same way as when calculating section progress. It is based on the text value of the section and uses language-aware tokenization, so for languages that do not use spaces, tokens are characters.
  4. If the old source section and the new source section have //different tags// (e.g. `<p>` vs. `<h1>`), we can immediately reject the pair as not matching.
3. If we still did not find a source section for a saved section, proceed with {T168287}.

**A test case:**

# Take the translation of https://en.wikipedia.org/wiki/Phantosmia to Simple English as an example. Use 'Source text' as the translation method. Do the translation with one of its older revisions. I used a revision that is 1 year old: https://en.wikipedia.org/w/index.php?title=Phantosmia&oldid=800766238. To use this particular revision as the source, add revision=800766238 to the translation URL. Example: `title=Special:ContentTranslation&page=Phantosmia&from=en&to=simple&targettitle=Phantosmia&version=2`
# Translate all sections. I translated 68 sections. All saved.
# Just reload the translation editor. You will see all 68 sections restored. This is because we are using a particular revision, so nothing changed in the source article and section-number-based restore works.
# Remove revision=800766238 from the URL and load the translation again.
# You will see all sections restored, but with a lot of misalignment: headings restored against paragraphs, sections restored against unrelated source sections, etc.
# If you repeat the previous step of loading the translation after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460206 the section alignment will be correct, but the source and the translation do not match at all.
# I implemented the approach proposed above in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460479. With that patch, you should get all 68 sections restored without any issue.

Here is an example in which sections with different parsoid ids (mwCQ and mwCA) were restored based on the content match.

| Source article | Restored translation |
| {F25848146 size=full} | {F25848271 size=full} |

The section content is shown in the screenshot below. You can see that there are some reference changes, which might have caused the new parsoid id.

{F25848193 size=full}

I also found that without the common-content-based matching, about 32 sections were not restored.
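
To make the proposed restore order concrete, here is a minimal Python sketch of it: match by parsoid id first, then fall back to common tokens for sections with the same tag. It is an illustration only and not the code in the Gerrit patch. The `Section` class, the function names, the regex-based tokenizer and the sample texts are all hypothetical. The description does not say how the ratio is computed, so the sketch assumes a Jaccard ratio (common tokens divided by all distinct tokens of both sections).

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

THRESHOLD = 0.5  # ratio proposed above; a match needs a ratio greater than this


@dataclass
class Section:
    parsoid_id: str  # id of the block tag, or of the first child of <section>
    tag: str         # tag name of that element, e.g. "p" or "h2"
    text: str        # plain text value of the section


def tokenize(text: str, uses_spaces: bool = True) -> set:
    """Words for space-delimited languages, single characters otherwise."""
    if uses_spaces:
        return set(re.findall(r"\w+", text.lower()))
    return {ch for ch in text if not ch.isspace()}


def similarity(a: str, b: str, uses_spaces: bool = True) -> float:
    """Jaccard ratio of common tokens (an assumed definition of the ratio)."""
    ta, tb = tokenize(a, uses_spaces), tokenize(b, uses_spaces)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_sections(saved: List[Section],
                   current: List[Section]) -> Dict[str, Optional[str]]:
    """Map each saved section's parsoid id to a current parsoid id, or None."""
    result: Dict[str, Optional[str]] = {}
    free = {s.parsoid_id: s for s in current}  # current sections not yet claimed
    unmatched = []
    # Step 1: restore against the source section with the same parsoid id.
    for s in saved:
        if s.parsoid_id in free:
            result[s.parsoid_id] = s.parsoid_id
            del free[s.parsoid_id]
        else:
            unmatched.append(s)
    # Step 2: fall back to common tokens, never to the linear order.
    for s in unmatched:
        best_id, best_score = None, THRESHOLD
        for cid, c in free.items():
            if c.tag != s.tag:  # different tags: reject the pair immediately
                continue
            score = similarity(s.text, c.text)
            if score > best_score:
                best_id, best_score = cid, score
        result[s.parsoid_id] = best_id  # None means: left for step 3
        if best_id is not None:
            del free[best_id]
    return result


# Made-up sample data: one id survives, one id changes, one heading has no match.
saved = [
    Section("mwAc", "p", "An unchanged paragraph"),
    Section("mwCQ", "p", "It can occur in one nostril or both"),
    Section("mwXX", "h2", "Causes"),
]
current = [
    Section("mwAc", "p", "An unchanged paragraph"),
    Section("mwCA", "p", "It can occur in one nostril or in both of them"),
    Section("mwZZ", "p", "Causes"),
]
print(match_sections(saved, current))
```

In this sample, `mwCQ` is restored against `mwCA` because the two paragraphs share most of their words, while the heading `mwXX` is not restored against the paragraph `mwZZ` even though the text is identical, because the tags differ. Sections that end up mapped to `None` are the ones left for {T168287}.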
