With the current CX2 code, no saved section is left out during restore, but a restored section does not always match its original source section. This is a bug (and blocks {T168287}). Section numbers are the first method used to restore a section.
So if an article has 20 sections, all translated and saved, then on restore the first translated section goes to the first source section, the second to the second source section, and so on. The section number inserted by CX is used as the identifier of the CX2 section. This works only if the source article did not change at all.
But suppose the source article changed, and imagine that its 4th and 5th sections were swapped. Currently, the 4th source section still gets the 4th saved translation, even though the content of the 4th section is now what used to be the 5th section of the original source article.
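The swap scenario can be sketched in a few lines. This is only an illustration with made-up section titles, not CX code; `old_source`, `new_source` and `translations` are hypothetical names.

```python
# Hypothetical section titles; the real data would be CX source sections.
old_source = ["Intro", "Causes", "Diagnosis", "Treatment", "History"]

# Translations were saved against the old revision, identified only by position.
translations = [f"translation of {title}" for title in old_source]

# In the new revision the 4th and 5th sections are swapped.
new_source = ["Intro", "Causes", "Diagnosis", "History", "Treatment"]

# Restore by section number: pair the n-th source section with the n-th translation.
restored = list(zip(new_source, translations))

# Collect the pairs where the translation belongs to a different section.
mismatched = [
    (source, translation)
    for source, translation in restored
    if translation != f"translation of {source}"
]
```

Every section is restored, but `mismatched` holds the two swapped sections, each paired with the other's translation.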
Translated sections should not be restored based on matching section numbers. Section numbers are just the sequential order of sections in an article; they do not correspond to the content of a section. Parsoid ids try to be stable under minor content changes across revisions. But in my experience that stability is very conditional, and we cannot depend on it 100% for section restore. There will be sections without a matching parsoid id across revisions of the source article.
So I propose the following improvements to section restore:
1. Use the parsoid id to locate the source section for a saved translation. For CX1 translations without section wrapping, this is the id of the block tag. For example, in `<p id="mwAc">..</p>` the parsoid id of the section is `mwAc`. For sections wrapped in a `<section>` tag, the parsoid id is the id of the first immediate child of that section. For a section like `<section rel="cx:Section" id="cxSourceSection34"><p id="mwAc">..</p></section>`, the parsoid id is `mwAc`. If a saved translation section has parsoid id `mwAc`, it gets restored to the source section with the same parsoid id.
2. If the source article changed a lot, we might not find a matching parsoid id in any of the source sections. We should NOT use the linear order of the sections as a fallback, because that is a blind restore. If we do this, a section heading will sometimes be restored against a figure or a paragraph. Instead, find a source section that has **common tokens** with the saved section.
  1. **Common tokens** are simply the words that occur both in the source section of the new revision of the source article and in the source section we saved along with the translation.
  2. We will define a threshold ratio to decide whether two sections are similar enough for the section to be restored. I propose a ratio greater than 0.5.
  3. Tokenization is done in the same way as for section progress. It is based on the text value of the section and is language aware, so for languages that do not use spaces, tokens are characters.
  4. If the old source section and the new source section have //differing tags//, we can immediately reject the section for restore.
3. If we still cannot find a source section for a saved section, proceed with {T168287}.
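The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch. The `Section` type, the function names and the `CHARACTER_TOKEN_LANGUAGES` set are hypothetical. The proposal does not say what the ratio is measured against, so the sketch assumes it is the number of common tokens divided by the size of the larger token set.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set

# Assumption: an illustrative set of languages tokenized per character.
# This is not the list CX actually uses.
CHARACTER_TOKEN_LANGUAGES = {"ja", "zh", "th"}
SIMILARITY_THRESHOLD = 0.5  # the ratio proposed above


@dataclass
class Section:
    parsoid_id: str  # id of the block tag or first child, e.g. "mwAc"
    tag: str         # tag name of that element, e.g. "p" or "h2"
    text: str        # plain text value of the section


def tokenize(text: str, language: str) -> Set[str]:
    """Words for space-separated languages, characters otherwise."""
    if language in CHARACTER_TOKEN_LANGUAGES:
        return {ch for ch in text if not ch.isspace()}
    return set(re.findall(r"\w+", text.lower()))


def similarity(saved: Section, candidate: Section, language: str) -> float:
    """Assumed ratio: common tokens over the larger of the two token sets."""
    saved_tokens = tokenize(saved.text, language)
    candidate_tokens = tokenize(candidate.text, language)
    if not saved_tokens or not candidate_tokens:
        return 0.0
    common = saved_tokens & candidate_tokens
    return len(common) / max(len(saved_tokens), len(candidate_tokens))


def find_source_section(
    saved: Section, candidates: List[Section], language: str
) -> Optional[Section]:
    # Step 1: prefer a source section with the same parsoid id.
    for candidate in candidates:
        if candidate.parsoid_id == saved.parsoid_id:
            return candidate
    # Step 2: fall back to the most similar section above the threshold.
    best, best_score = None, SIMILARITY_THRESHOLD
    for candidate in candidates:
        if candidate.tag != saved.tag:
            continue  # differing tags: reject immediately
        score = similarity(saved, candidate, language)
        if score > best_score:
            best, best_score = candidate, score
    # None means no match was found; step 3 ({T168287}) would take over.
    return best
```

Here `saved` stands for the source section stored along with the translation, and `candidates` for the source sections of the newly loaded revision. A real implementation would probably also need to keep two saved sections from claiming the same source section, which this sketch ignores.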
**A test case:**
# Take the translation of https://en.wikipedia.org/wiki/Phantosmia to Simple English as an example. Use 'Source text' as the translation method. Do the translation with one of its older revisions. I used a revision that is 1 year old: https://en.wikipedia.org/w/index.php?title=Phantosmia&oldid=800766238. To use this particular revision as the source, add `revision=800766238` to the translation URL, for example to `title=Special:ContentTranslation&page=Phantosmia&from=en&to=simple&targettitle=Phantosmia&version=2`
# Translate all sections. I translated 68 sections, and all of them were saved.
# Just reload the translation editor. You will see all 68 sections restored. This is because we are using a particular revision, so nothing changed in the source article and the section number based restore works.
# Remove `revision=800766238` from the URL and load the translation again.
# You will see all sections restored, but with a lot of misalignment: headings restored against paragraphs, sections restored against unrelated source sections, etc.
# If you do the previous step of loading the translation after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460206, the section alignment will be correct, but the source and the translation will not match at all.
# I implemented the approach proposed above in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/460479. With that patch, you should get all 68 sections restored without any issue.
Here is an example showing sections with different parsoid ids, `mwCQ` and `mwCA`, restored based on the content match.
| Source article | Restored translation |
| {F25848146 size=full} | {F25848271 size=full} |
The section content is shown in the screenshot below. You can see that there are some reference changes, which might have caused the new parsoid id.
{F25848193 size=full}
I also found that without the common content based matching, about 32 sections were not restored at all.