In the current CX2 code, all saved sections are restored, even when a restored section does not match its original source section. This is a bug and blocks T168287: CX2: Warning about article having changed too much.
Currently, section numbers are used to match sections. Section numbers are assigned by cxserver(?) in the order of sections in the source text, starting from 1. So if there is an article with 20 sections with translations, during restoration the first translated section matches the first source section, second translated section matches the second source section and so on. This works well only if the source article did not change at all.
But suppose the source article changed, for example the 4th and 5th sections were swapped. Currently, the 4th source section still gets the 4th saved translation, even though the content of that 4th section is now what used to be the 5th section of the original source article.
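The misalignment can be reproduced with a small sketch. This is not the CX2 code; the section titles and the `restore_by_number` helper are hypothetical and only illustrate matching by position.

```python
# Source sections at the time of translation, numbered from 1.
old_source = ["Intro", "Causes", "Diagnosis", "Treatment", "History"]

# Saved translations keyed by section number (hypothetical storage layout).
saved = {n: "translation of " + title for n, title in enumerate(old_source, start=1)}

# New revision of the source: the 4th and 5th sections are swapped.
new_source = ["Intro", "Causes", "Diagnosis", "History", "Treatment"]


def restore_by_number(source_sections, saved_translations):
    """Pair each source section with the saved translation of the same number."""
    return [
        (title, saved_translations.get(n))
        for n, title in enumerate(source_sections, start=1)
    ]


restored = restore_by_number(new_source, saved)
for title, translation in restored:
    print(title, "<-", translation)
```

Every section is "restored", but sections 4 and 5 receive each other's translation, which is the behaviour described above.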
Translated sections should not be restored based on the section number, because the number does not correspond to the content of that section. Parsoid also assigns ids. Those ids aim to be stable under minor content changes across revisions. But in my (@santhosh) experience, even changes that look small may cause the parsoid id to change, so we cannot rely 100% on them for section restoration either. There will be sections without a matching parsoid id across revisions of the source article.
I propose the following improvement for the section restoration:
- Use the parsoid id to locate a source section for the saved translation. For CX2 translations without section wrapping, this is the id of the block tag. For example, in <p id="mwAc">..</p> the parsoid id for the section is mwAc. For sections with <section> tag wrapping, the parsoid id is the id of the first immediate child of that section. For a section like <section rel="cxSection" id="cxSourceSection34"><p id="mwAc">..</p></section>, the parsoid id is mwAc. If any saved translation section has the parsoid id mwAc, it gets restored to the source section with the same parsoid id.
- If the source article changed a lot, we might not see a matching parsoid id in any of the source sections. We cannot use the linear order of the sections as a fallback either, because it has the same issue as section numbers: a section heading will sometimes be restored against a figure or paragraph if we do this. Instead, we need to find a source section that has common tokens with the saved section.
- Common tokens are simply the words that appear both in the source section of the new revision of the source article and in the source section we saved along with the translation.
- We will define a threshold ratio to decide whether two sections are similar enough to match. I propose threshold > 0.5.
- Tokenization is done in the same way as when calculating section progress. It is based on the text value of the section and uses language-aware tokenization, so for languages that do not use spaces, tokens are characters.
- If the old source section and new source section have different tags (e.g. <p> vs. <h1>), we can immediately reject the pair as not matching.
- If we still did not find a source section for a saved section, proceed with T168287: CX2: Warning about article having changed too much
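The steps above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: the `Section` type and all function names are hypothetical, the tokenizer is a plain whitespace split standing in for the language-aware one, the section markup is assumed to be well-formed enough for an XML parser, and the ratio is assumed to be the number of shared tokens divided by the size of the larger token set (the proposal does not define it).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

THRESHOLD = 0.5  # proposed above: a pair matches only if its ratio is > 0.5


@dataclass
class Section:
    tag: str                   # block tag, e.g. "p" or "h2"
    parsoid_id: Optional[str]  # e.g. "mwAc"
    text: str                  # plain text value of the section


def parse_section(html: str) -> Section:
    """Read tag, parsoid id and text from one section's markup.

    Assumption: the markup parses as XML.
    """
    root = ET.fromstring(html)
    # For <section>-wrapped content, use the first immediate child.
    block = root[0] if root.tag == "section" and len(root) else root
    return Section(block.tag, block.get("id"), "".join(root.itertext()))


def tokenize(text: str) -> set:
    # Stand-in for language-aware tokenization: lowercase, split on whitespace.
    return set(text.lower().split())


def common_token_ratio(old_text: str, new_text: str) -> float:
    old, new = tokenize(old_text), tokenize(new_text)
    if not old or not new:
        return 0.0
    # Assumed definition: shared tokens relative to the larger token set.
    return len(old & new) / max(len(old), len(new))


def find_source_section(saved: Section, candidates: List[Section]) -> Optional[Section]:
    """Return the source section a saved section should be restored against."""
    # Step 1: exact parsoid id match.
    if saved.parsoid_id is not None:
        for candidate in candidates:
            if candidate.parsoid_id == saved.parsoid_id:
                return candidate
    # Step 2: content-based fallback using common tokens.
    best, best_ratio = None, THRESHOLD
    for candidate in candidates:
        if candidate.tag != saved.tag:
            continue  # different tags (e.g. <p> vs. <h1>) never match
        ratio = common_token_ratio(saved.text, candidate.text)
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    # Step 3: None means no match; the caller would fall through to T168287.
    return best
```

Here `saved` is built from the source section stored along with the translation and `candidates` from the sections of the new source revision. Picking the highest-scoring candidate rather than the first one above the threshold is a design choice of this sketch.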
A test case:
- Take the translation of https://en.wikipedia.org/wiki/Phantosmia to Simple English as an example. Use 'Source text' as the translation method. Start the translation from one of its older revisions. I used a revision that is 1 year old: https://en.wikipedia.org/w/index.php?title=Phantosmia&oldid=800766238. To use this particular revision as the source, add revision=800766238 to the translation URL. Example: title=Special:ContentTranslation&page=Phantosmia&from=en&to=simple&targettitle=Phantosmia&version=2
- Translate all sections. I translated 68 sections. All saved.
- Just reload the translation editor. You will see all 68 sections restored. Because we are using a particular revision, nothing changed in the source article, so the section-number-based restore works.
- Remove the revision=800766238 from URL and load the translation again.
- You will see all sections restored, but with a lot of misalignment: headings restored against paragraphs, sections restored against unrelated source sections, etc.
- If you repeat the previous step of loading the translation after applying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460206, the section alignment will be correct, but the source and the translation do not match at all.
- I implemented the above proposed approach in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460479. With that patch, you should get all 68 sections restored without any issue.
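For repeating the test with and without the pinned revision, the query string used in the steps above can be generated with a small helper. This is only a convenience sketch; `translation_query` is a hypothetical name, and the parameter set is copied from the example URL in the first step.

```python
from urllib.parse import urlencode


def translation_query(page, source, target, revision=None):
    """Build the Special:ContentTranslation query string for a page."""
    params = {
        "title": "Special:ContentTranslation",
        "page": page,
        "from": source,
        "to": target,
        "targettitle": page,
        "version": 2,
    }
    if revision is not None:
        # Pin the source article to a specific revision.
        params["revision"] = revision
    # Keep ":" unescaped so the title stays readable.
    return urlencode(params, safe=":")


print(translation_query("Phantosmia", "en", "simple", revision=800766238))
print(translation_query("Phantosmia", "en", "simple"))
```

The first call gives the query for translating from the old revision, the second the query for reloading against the latest revision.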
Here is an example showing that sections with different parsoid ids (mwCQ and mwCA) were restored based on the content match.
[Screenshots: source article | restored translation]
The section content is shown in the screenshot below. You can see that there are some reference changes, which might have caused a new parsoid id.
I also found that without the common-content-based matching, about 32 sections were not restored.