Incorporate translated sections into the parallel corpora when published
Closed, Resolved · Public · 8 Estimated Story Points
Assigned To

Authored By

	Pginer-WMF
	Dec 18 2020, 3:23 PM

Description

When an article is published with Content Translation, contents are also publicly exposed as part of the parallel corpora. In this way, anyone can use APIs or data dumps to get information about the translated paragraphs (original content, initial MT, user modifications, etc.)

We want sections published with SectionTranslation to also contribute to this useful data resource.

Metadata changes

As part of the work in this ticket we need to define any changes in the metadata to distinguish section translation from article translation, and how they impact external services using the data (e.g., Opus project). In particular, the current data schema assumes there will be IDs for the translation and the translator for each translation. This does not align with the current support for Section Translation, where translations are not persisted yet and anonymous translation may be supported in the future.

QA Notes
This ticket does not result in many visible changes to the user, but we may want to verify that:

The usual publishing process is not broken. Making some translations and verifying that they could be published without issues.
Data is available in the corpora. By inspecting the database, check that information for the previous translations was added to the corpora.
Published translations with Section Translation should become visible in the "Published" view of Content Translation for the same user.

Details

Subject	Repo	Branch	Lines +/-
CX3 Build 0.2.0+20220718	mediawiki/extensions/ContentTranslation	master	+119 -117
SX parallel corpora: Fix parallelCorporaMTContent calculation	mediawiki/extensions/ContentTranslation	master	+6 -2
CX3 Build 0.2.0+20220629	mediawiki/extensions/ContentTranslation	master	+287 -73
SX: Add request to "sxsave" api inside publishTranslation action	mediawiki/extensions/ContentTranslation	master	+217 -8
Add saveTranslation method inside translatorAPI module	mediawiki/extensions/ContentTranslation	master	+70 -0
SX: Add sxsave API action	mediawiki/extensions/ContentTranslation	master	+339 -2
SX: Add translationUnitPayload DTO and parallelCorporaUnits getter	mediawiki/extensions/ContentTranslation	master	+156 -53
SX edit before publishing: Nest subsections into <section> elements	mediawiki/extensions/ContentTranslation	master	+175 -40
SX: Store the MT provider used for the applied translation	mediawiki/extensions/ContentTranslation	master	+20 -6
SX subSection model: Add translationOrigin getter	mediawiki/extensions/ContentTranslation	master	+54 -0
Add table for section translation	mediawiki/extensions/ContentTranslation	master	+63 -0

Related Objects
Status	Assigned	Task
Open	None	T243495 [Epic] Support for Section Translation
Open	None	T252542 Section Translation Editor: Preview and publish
Resolved	ngkountas	T270499 Incorporate translated sections into the parallel corpora when published

Event Timeline

Pginer-WMF created this task.Dec 18 2020, 3:23 PM

Restricted Application added a subscriber: Aklapper. · Dec 18 2020, 3:23 PM

Pginer-WMF triaged this task as Medium priority.Dec 18 2020, 3:24 PM

Pginer-WMF added a parent task: T252542: Section Translation Editor: Preview and publish.

Pginer-WMF added subscribers: ngkountas, santhosh.Dec 18 2020, 3:27 PM

Pginer-WMF edited projects, added Language-Team (Language-2021-January-March); removed Language-Team (Language-2020-October-December).Jan 4 2021, 1:19 PM

Pginer-WMF moved this task from Backlog to General infrastructure on the SectionTranslation board.Jan 29 2021, 9:42 AM

Pginer-WMF moved this task from Quarter Backlog to Priority Backlog on the Language-Team (Language-2021-January-March) board.Feb 1 2021, 11:22 AM

Pginer-WMF moved this task from Priority Backlog to Quarter Backlog on the Language-Team (Language-2021-January-March) board.Mar 24 2021, 11:43 AM

Pginer-WMF updated the task description. · Mar 24 2021, 11:49 AM

Pginer-WMF edited projects, added Language-Team (Language-2021-April-June); removed Language-Team (Language-2021-January-March).Apr 1 2021, 4:16 PM

As we need to adjust the parallel corpora produced for this task we may want to consider also fixing T245607: CX Published parallel corpus is invalid json

Pginer-WMF edited projects, added Language-Team (Language-2021-July-September); removed Language-Team (Language-2021-April-June).Jul 12 2021, 3:27 PM

Pginer-WMF edited projects, added Language-Team (Language-2021-October-December); removed Language-Team (Language-2021-July-September).Oct 1 2021, 12:36 PM

Pginer-WMF moved this task from Quarter Backlog to Priority: SX Adoption on the Language-Team (Language-2021-October-December) board.Oct 1 2021, 12:55 PM

Some notes on the required database changes:

Requirements

There should be a way to specify that a translation is a "section translation" and not the typical full article translation.
Section translation is available for anonymous users (by design), so the translator should be optional.
If we can associate a section translation with a logged-in user, we can use that information for stats dashboards.
Saving/restoring section translations is not in the plans, but keeping the user id with the translated content will keep that option open for future consideration.
Publishing currently involves the publish API accepting the current translated content. The parallel corpus table will also require the source content, MT information and progress information to be passed to the API.
Should we keep a record when the translation starts for section translation? Or is there only a one-time database insertion with status "published"?

Existing tables

cx_translations	cx_corpora

The current cx_corpora table needs a translation_id, which refers to the cx_translations table. The quick way to support just saving the parallel corpus is to make it nullable. But that is not recommended, as we would not be associating important metadata (languages, titles, translator).

Database changes

Currently the cx_translations table has the translation_started_by and translation_last_update_by fields nullable. So no database changes would be needed to support anonymous translations. But there may be assumptions in the database access layer (PHP classes) about the logged-in status of the user. That will require review.
A new field translation_type may need to be added to the cx_translations table so that we can mark a particular translation as a section translation: translation_type enum('article', 'section') default null
The cx_corpora table does not need any changes.

Pginer-WMF raised the priority of this task from Medium to High.Nov 25 2021, 11:39 AM

Pginer-WMF moved this task from Priority: SX Adoption to Quarter Backlog on the Language-Team (Language-2021-October-December) board.Dec 7 2021, 8:53 AM

Pginer-WMF edited projects, added Language-Team (Language-2022-January-March); removed Language-Team (Language-2021-October-December).Jan 10 2022, 3:40 PM

Pginer-WMF moved this task from Quarter Backlog to Priority: SX Adoption on the Language-Team (Language-2022-January-March) board.

Additional thoughts:

Since section translations are done sentence by sentence, it is possible for one sentence to be translated by Google MT and another by Apertium. Since the cx_corpora table holds section-level information (the cx_origin column captures the MT engine), we cannot capture multiple MT engines for a single section.
The cx_translations table assumes that one article is translated by one person. But in Section Translation, 10 sections can be translated by 10 translators at different times. The translation_status column will get a very confusing meaning too. There is a unique index on the combination of the translation_source_title, translation_source_language, translation_target_language and translation_started_by columns. That won't be unique for section translation, since an additional piece of information, the section title, will decide the uniqueness.
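The uniqueness problem can be illustrated with a small sketch. This is not the extension's code: the `ArticleKey` and `SectionKey` types and the sample values are hypothetical, and only mirror the four columns of the unique index described above plus the section title.

```python
from typing import NamedTuple, Optional


class ArticleKey(NamedTuple):
    """Mirrors the columns of the existing unique index (hypothetical type)."""
    source_title: str
    source_language: str
    target_language: str
    started_by: Optional[int]  # None would stand for an anonymous translator


class SectionKey(NamedTuple):
    """Same key, extended with the section title (hypothetical type)."""
    article: ArticleKey
    source_section_title: str


def has_collision(keys) -> bool:
    """Return True if any key appears more than once."""
    keys = list(keys)
    return len(set(keys)) != len(keys)


same_article = ArticleKey("Circle", "en", "es", 42)

# The same user translates two sections of the same article.
article_keys = [same_article, same_article]
section_keys = [
    SectionKey(same_article, "History"),
    SectionKey(same_article, "Terminology"),
]

print(has_collision(article_keys))  # True: the article-level key repeats
print(has_collision(section_keys))  # False: the section title disambiguates
```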

I am starting to think that a new table for capturing section translation may be better than reusing current cx_translations table for section translation.

Table design(draft):

Table name: cx_sx_translations

Column name	Type	description
translation_id	int autoincrement	Autogenerated primary key. The cx_corpora table will have a pointer to this. But how does that table decide whether it is a section translation id or a translation id?
translation_source_title	varbinary(512) not null	Source article title
translation_target_title	varbinary(512) not null	Target article title
translation_source_language	varbinary(36) not null	Source language code
translation_target_language	varbinary(36) not null	Target language code
translation_source_section_title	varbinary(512) not null	Source section title. What happens for lead section?
translation_target_section_title	varbinary(512) not null	Target section title. What happens for lead section?
translation_source_revision_id	int unsigned	Revision id of source article
translation_target_revision_id	int unsigned	Revision id of target article after publishing the section
translation_source_url	text binary not null	Source article URL
translation_target_url	text binary default null	Target article URL. May not exist when doing lead section translation
translation_status	enum('draft', 'published', 'deleted') default null	Are these enums needed? Futuristic?
translation_start_timestamp	varchar(14) binary not null	Start date of this translation
translation_last_updated_timestamp	varchar(14) binary not null	Last updated date of this translation
translation_started_by	int	Who started translation. Allows anon translation by making it nullable. Qn: IP masking?
translation_last_updated_by	int	Who did the last translation? It need not be the translator who started, but will we allow this feature?
translation_progress	TINYBLOB not null	MT usage information as JSON

Good points, Santhosh. Some comments below:

In T270499#7647593, @santhosh wrote:

Additional thoughts:

Since the section translations are done by "sentence by sentence", there is a possible situation of one sentence done by Google MT, and another by Apertium. Since the cx_corpora table has section level information(the cx_origin column captures MT engine), we cannot capture the multiple MT engine for single section information.

This is a good point. Just to make sure we consider different options, would any of the following make sense:

a) If all sentences of a paragraph are translated using Apertium, mark the paragraph as being translated with Apertium. If some sentences are translated by OpusMT and others by Apertium, then mark the paragraph as "multiple MTs". This makes the data less precise in some cases, but depending on how often the "mixed" MT case happens it may be an acceptable compromise (depending on the cons of the alternatives).
b) For the mobile experience register each sentence as a separate unit. This may prevent from capturing broader paragraph rewrites that break sentence boundaries.

The cx_translations table has an assumption that one article is translated by one person. But in Section translation, 10 sections can be translated by 10 translators in different times. The translation_status column will get very confusing meaning too. There is a unique index for the combination of translation_source_title, translation_source_language, translation_target_language and translation_started_by columns. That wont be unique for section translation since additional info-section title will decide the uniqueness.

This assumption has been problematic for Content Translation causing people to be blocked from translating. So I see this more as a reason to fix the current schema (T298244) to benefit both Content and Section translation rather than creating a separate schema without the issue.

Also, more generally: Content and Section Translation will converge over time making it possible to (a) create and (b) expand articles on (c) desktop (paragraph by paragraph) and (d) mobile (sentence by sentence). If two schemas are created we need to consider which combinations will use which. It is not clear whether the CX one would in that case include only new article creation on desktop (a+c) or will be expanded to include sections translated on desktop (b+c). The latter will be done paragraph by paragraph but capturing the section translated will still be needed.
My point is that for a proposal involving splitting the schema we need to check which usecases will be captured in each one.

In T270499#7647792, @Pginer-WMF wrote:

Since the section translations are done by "sentence by sentence", there is a possible situation of one sentence done by Google MT, and another by Apertium. Since the cx_corpora table has section level information(the cx_origin column captures MT engine), we cannot capture the multiple MT engine for single section information.

This is a good point. Just to make sure we consider different options, would it make sense any of the following:

a) If all sentences of a paragraph are translated using Apertium, mark the paragraph as being translated with Apertium. If some sentences are translated by OpusMT and others by Apertium, then mark the paragraph as "multiple MTs". This makes the data less precise in some cases, but depending on how often the "mixed" MT case happens it may be an acceptable compromise (depending on the cons of the alternatives).

(I discussed this with engineers in the team today)

I think this problem may not be as important as I thought. The corpus table captures up to three rows per section:

1. The original untranslated source section content (cxc_origin is "source"). This gets saved to the database when the placeholder is clicked in CX. In SX, it will get saved later, when publishing happens, because there is no auto-save in SX.
2. The unmodified machine translation (cxc_origin is "Google", "Apertium", etc.). This gets saved as part of auto-save, when the MT arrives to the user.
3. The final translation edited by the user (cxc_origin is "user"). This gets saved as part of auto-save in CX. In SX this happens when publishing.

As we can see, if there is no auto-save feature in SX, step 2 won't happen. This means that in the parallel corpus table there will be only 2 records per section: one for the original content and one for the final translation. There won't be information about the particular MT engines used in between. But please confirm once again that we are not going to save unmodified MT for section translation. This difference is important.
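A minimal sketch of the "up to three rows per section" idea follows. The `CorpusRow` type and the `corpus_rows` helper are hypothetical names used only for illustration; the real table columns and save API may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CorpusRow:
    section_id: str
    origin: str   # "source", an MT engine name, or "user"
    content: str


def corpus_rows(
    section_id: str,
    source_html: str,
    user_html: str,
    mt_engine: Optional[str] = None,
    mt_html: Optional[str] = None,
) -> List[CorpusRow]:
    """Build the corpus rows for one section.

    The MT row is only produced when the unmodified MT output was kept,
    which is the step that depends on auto-save in the discussion above.
    """
    rows = [CorpusRow(section_id, "source", source_html)]
    if mt_engine is not None and mt_html is not None:
        rows.append(CorpusRow(section_id, mt_engine, mt_html))
    rows.append(CorpusRow(section_id, "user", user_html))
    return rows


# Without the unmodified MT: two rows (source and user).
print([r.origin for r in corpus_rows("20", "<p>Hello</p>", "<p>Hola</p>")])

# With the unmodified MT: three rows.
print([r.origin for r in corpus_rows("20", "<p>Hello</p>", "<p>Hola</p>", "Google", "<p>Hola</p>")])
```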

b) For the mobile experience register each sentence as a separate unit. This may prevent from capturing broader paragraph rewrites that break sentence boundaries.

This was discussed but found a bit problematic. Saving sentences to the corpus needs an extra step, since sentences are not segmented when publishing. We also provide the whole-paragraph edit before publishing. The final cleaned-up HTML is used at the publishing step.

The second reason is that a sentence-level corpus is less useful in machine learning, because sentences usually refer to previous sentences in the paragraph. For example: "X is a former official in Y country. He is also winner of Z award". The "He" in the second sentence needs the first sentence for context (anaphora resolution). A parallel corpus is better if it has more context.

This may prevent from capturing broader paragraph rewrites that break sentence boundaries.

I did not understand this issue. Please elaborate.

The cx_translations table has an assumption that one article is translated by one person. But in Section translation, 10 sections can be translated by 10 translators in different times. The translation_status column will get very confusing meaning too. There is a unique index for the combination of translation_source_title, translation_source_language, translation_target_language and translation_started_by columns. That wont be unique for section translation since additional info-section title will decide the uniqueness.

This assumption has been problematic for Content Translation causing people to be blocked from translating. So I see this more as a reason to fix the current schema (T298244) to benefit both Content and Section translation rather than creating a separate schema without the issue.

The current schema supports multiple people translating the same article. A unique translation is the combination of languages + source title + translator. So another translator means another valid entry in the database. This is how we allow translation of an article even if some other translator started earlier and deleted it. T298244: Remove technical limit that prevents different users to translate the same topic is mostly about coordination, communication and effort-conflict issues rather than table design. Due to these issues, we had custom logic in the application to prevent parallel translation.
(Also, not to be confused with collaborative translation: two people working, simultaneously or not, on the same translation content.)

In the case of SX, the languages + source title + translator combination won't work, since the same translator can translate another section with the same language and title combination.

Also, more generally: Content and Section Translation will converge over time making it possible to (a) create and (b) expand articles on (c) desktop (paragraph by paragraph) and (d) mobile (sentence by sentence). If two schemas are created we need to consider which combinations will use which. It is not clear whether the CX one would in that case include only new article creation on desktop (a+c) or will be expanded to include sections translated on desktop (b+c). The later will be done paragraph by paragraph but capturing the section translated will still be needed.
My point is that for a proposal involving splitting the schema we need to check which usecases will be captured in each one.

(Terminology clarification: what is being discussed is an extra database table in the same database schema, not two schemas.)

The use case you mentioned has been taken into consideration and is possible with the design. To elaborate, article creation (by CX or SX) and later expansion (by SX or CX) by the same or multiple users is possible with this design. We are just capturing the section title information in a new table, since it is a one-to-many relation for a languages + title combination in cx_translations records.

The draft table in the previous comment needs further modification per our meeting today. I will post a new version here.

Thanks for the additional details and clarifications, @santhosh. Some more comments below:

As we can see, if there is no auto save feature in SX, step 2 won't happen. This means in parallel corpus table there will be only 2 records per section. One for orignal content, One for final translation. There won't be information about particual MT engines used in between. But please confirm once again that we are not going to save unmodified MT for section translation. This difference is important.

The lack of "auto-save" was intended as temporary. The plan is to support persistence for in-progress translations in Section Translation. The goal is to support users to translate a section in multiple sessions which will allow them to better deal with interruptions that are common on mobile and provide a more consistent experience across browsers.

Given that section translation breaks the process into smaller steps, saving may be triggered as part of existing actions (e.g., when applying a proposed translation or completing the editing of a sentence). So it may work differently, but it should support returning to an in-progress translation to continue the work before publishing.

This may prevent from capturing broader paragraph rewrites that break sentence boundaries.

I did not understand this issue. Please elaborate.

I was referring to the issue you described as " We also provide the whole paragraph edit before publishing" and complications of supporting it with the approach (b).

The usecase you meantioned is taken in to the consideration and possible with the design. To elaborate, Article creation(by CX or SX) and later expansion(by SX or CX) by same or muliple users is possible with this design. We are just capturing the section titles information in a new table since it is a one-to-many relation for a languages+title combination in cx_translations records.

Perfect. Thanks for confirming. I'm happy with any internal organization of the databases that you recommend. Just wanted to make sure that future scenarios that we don't support yet do not create more issues than needed.

Pginer-WMF mentioned this in T298244: Remove technical limit that prevents different users to translate the same topic.Jan 25 2022, 11:46 AM

Pau pointed out the unified version of CX+SX, which is an important input to the design. I was also not happy with the duplication of article and language metadata in my previous draft. So I re-approached the issue and modeled this as a table to capture the one-to-many relation of article -> sections.

Also, the lack of auto-save is not permanent and we should not base our assumptions on it.

Here is a modified proposal based on that.

New table: cx_section_translations

Column	Type	Description
section_translation_id	int autoincrement	Autogenerated primary key.
translation_id	int	translation_id from cx_translations table.
section_id	varbinary(512)	Same id will be used in cx_corpus table
source_section_title	varbinary(512) not null	Source section title.
target_section_title	varbinary(512) not null	Target section title.

This table avoids the duplication of language, article title, revision in new table. It also avoids the need to modify cx_corpus table.
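The one-to-many relation can be tried out with an in-memory SQLite sketch. This is only an approximation under stated assumptions: SQLite column types stand in for the MySQL types in the proposal, cx_translations is reduced to a few columns, and the sample section ids and titles are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced stand-in for the existing table (most columns omitted).
    CREATE TABLE cx_translations (
        translation_id INTEGER PRIMARY KEY AUTOINCREMENT,
        translation_source_title TEXT NOT NULL,
        translation_source_language TEXT NOT NULL,
        translation_target_language TEXT NOT NULL,
        translation_started_by INTEGER  -- nullable: anonymous translations
    );
    -- Proposed table: one row per translated section.
    CREATE TABLE cx_section_translations (
        section_translation_id INTEGER PRIMARY KEY AUTOINCREMENT,
        translation_id INTEGER REFERENCES cx_translations (translation_id),
        section_id TEXT,
        source_section_title TEXT NOT NULL,
        target_section_title TEXT NOT NULL
    );
    """
)

cur = conn.execute(
    "INSERT INTO cx_translations (translation_source_title,"
    " translation_source_language, translation_target_language,"
    " translation_started_by) VALUES (?, ?, ?, ?)",
    ("Circle", "en", "es", 42),
)
translation_id = cur.lastrowid

# Two sections of the same article share one cx_translations row.
conn.executemany(
    "INSERT INTO cx_section_translations (translation_id, section_id,"
    " source_section_title, target_section_title) VALUES (?, ?, ?, ?)",
    [
        (translation_id, "18", "History", "Historia"),
        (translation_id, "25", "Terminology", "Terminología"),
    ],
)

rows = conn.execute(
    "SELECT t.translation_source_title, s.source_section_title"
    " FROM cx_translations t"
    " JOIN cx_section_translations s ON s.translation_id = t.translation_id"
    " ORDER BY s.section_translation_id"
).fetchall()
print(rows)  # [('Circle', 'History'), ('Circle', 'Terminology')]
```

Language, article title and revision live only in cx_translations; the new table carries nothing but the section-specific fields.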

Scenario 1: A translator uses content translation to create a new article by using paragraph-by-paragraph translation

In this case a new row is added to cx_translations table as we already do for cx2. No data goes to cx_section_translations table.
Parallel corpus goes to cx_corpus table as usual

Scenario 2: A translator uses section translation to create a new article by lead article translation.

It does not matter whether the user uses mobile or desktop. It does not matter whether the user uses a future integrated SX+CX version of the tool to do this. The user may even use the existing CX2 for this.

The steps from Scenario 1 happen. No data goes to the cx_section_translations table.

Scenario 3: A translator uses section translation to expand an already existing article by translating a single section

The existing article could be translated earlier by another translator or freshly created by other editors.

In this case a new row is added to cx_translations table as we already do for cx2. A new translation_id is generated.
A new row is added to cx_section_translations table with the above translation_id, auto generated section_translation_id, section_id(same as we use in cx_corpus table), section titles for source and target languages.
Parallel corpus goes to cx_corpus table as usual

Scenario 4: A translator uses paragraph-by-paragraph translation to create an article. The same translator expands it using SX later

Database records exist for this translator as per Scenario 3. Here we add a new row to cx_section_translations table with the existing translation_id, auto generated section_translation_id, section_id(same as we use in cx_corpus table), section titles for source and target languages.

When the new section is published, translation_target_revision_id in the cx_translations table is updated.

Scenario 5: A translator uses SX to add a section. After publishing it, decides to add another section using SX

Database records exist for this translator as per Scenario 3.
Here we add a new row to cx_section_translations table with the existing translation_id, auto generated section_translation_id, section_id(same as we use in cx_corpus table), section titles. So, there are 2 rows for the same translation_id in the cx_section_translations table, with different section_id values.

Scenario 6: A translator uses SX to add a section. Another translator uses SX to add a new section

In this case a new row is added to cx_translations table for new translator as we already do for cx2. A new translation_id is generated.
We add a new row to cx_section_translations table with this new translation_id, auto generated section_translation_id, section_id(same as we use in cx_corpus table), section titles.

Scenario 7: A translator uses SX+CX unified version to add a section

Same as Scenario 3

Scenario 8: A translator uses SX to add a section. After publishing it, decides to add another section using SX+CX unified version

Same as Scenario 5
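Scenarios 3, 5 and 6 can be walked through with a small in-memory model. The `TranslationStore` class is hypothetical and only models logged-in users; it also assumes that in Scenario 6 the section row points at the new translator's translation_id.

```python
from itertools import count


class TranslationStore:
    """Toy model of cx_translations and cx_section_translations (hypothetical)."""

    def __init__(self):
        # (source_title, source_lang, target_lang, user) -> translation_id
        self.translations = {}
        # (section_translation_id, translation_id, section_id)
        self.section_rows = []
        self._translation_ids = count(1)
        self._section_translation_ids = count(1)

    def publish_section(self, source_title, source_lang, target_lang, user, section_id):
        key = (source_title, source_lang, target_lang, user)
        if key not in self.translations:
            # No record yet for this translator and article: new translation_id.
            self.translations[key] = next(self._translation_ids)
        translation_id = self.translations[key]
        self.section_rows.append(
            (next(self._section_translation_ids), translation_id, section_id)
        )
        return translation_id


store = TranslationStore()
first = store.publish_section("Circle", "en", "es", 42, "18")   # Scenario 3
second = store.publish_section("Circle", "en", "es", 42, "25")  # Scenario 5
other = store.publish_section("Circle", "en", "es", 99, "30")   # Scenario 6

print(first == second)          # True: same translator reuses the translation_id
print(other != first)           # True: another translator gets a new translation_id
print(len(store.section_rows))  # 3: one section row per published section
```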

Anonymous translations

translation_started_by and translation_updated_by are nullable

Sections, Autosave and section_id

It is important to recollect the definitions of sections here to identify the section_id that goes to the tables.

A MediaWiki section is a logical unit in the article. For example, History is a section, Personal Life is another etc. Such a section will have one or more paragraphs. Let us call this mw-section for brevity.
A CX section is the minimal translatable unit. It is a paragraph, a block image, a table, a list, an infobox etc. Let us refer to it as cx-section.

Cxserver, when parsing the article and doing section wrapping for the purpose of CX, clearly marks the above definitions in the HTML nodes, as illustrated in the example below. You can see that the whole content here is from the "History" section of the article "Circle". It is the 4th section in the article and hence cxserver adds data-mw-section-number as 4. Under this section there are several paragraphs and images. Each of these is wrapped by cxserver in a <section> tag with id attribute cxSourceSection18, cxSourceSection19 and so on.

<section data-mw-section-number="4" id="cxSourceSection18" rel="cx:Section">
    <h2 id="0e769600933790607b2a13b33ddfad"><span class="cx-segment" data-segmentid="730">History</span></h2>
</section>
<section data-mw-section-number="4" id="cxSourceSection19" rel="cx:Section">
    <figure class="mw-halign-right" id="mwcA" rel="cx:Figure" typeof="mw:Image/Thumb"><a
            href="./File:God_the_Geometer.jpg" id="mwcQ"><img data-file-height="1705" data-file-type="bitmap"
                data-file-width="1244" decoding="async" height="274" id="mwcg" resource="./File:God_the_Geometer.jpg"
                src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/God_the_Geometer.jpg/200px-God_the_Geometer.jpg"
                srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/God_the_Geometer.jpg/300px-God_the_Geometer.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/4d/God_the_Geometer.jpg/400px-God_the_Geometer.jpg 2x"
                width="200" /></a>
        <figcaption id="mwcw"><span class="cx-segment" data-segmentid="731">\nThe <a class="mw-redirect cx-link"
                    data-linkid="732" href="./Compass_(drafting)" id="mwdA" rel="mw:WikiLink"
                    title="Compass (drafting)">compass</a> in this 13th-century manuscript is a symbol of God's act of
                <a class="cx-link" data-linkid="733" href="./Creation_myth" id="mwdQ" rel="mw:WikiLink"
                    title="Creation myth">Creation</a>. </span><span class="cx-segment" data-segmentid="734">Notice also
                the circular shape of the <a class="cx-link" data-linkid="735" href="./Halo_(religious_iconography)"
                    id="mwdg" rel="mw:WikiLink" title="Halo (religious iconography)">halo</a>.</span></figcaption>
    </figure>
</section>
<section data-mw-section-number="4" id="cxSourceSection20" rel="cx:Section">
    <p id="mwdw">
         <span class="cx-segment" data-segmentid="736">The word <i id="mweA">circle</i> derives from the <a    class="cx-link" data-linkid="737" href="./Greek_language" id="mweQ" rel="mw:WikiLink" title="Greek language">Greek</a> κίρκος/κύκλος (<i id="mweg">kirkos/kuklos</i>), itself a <a
                class="cx-link" data-linkid="738" href="./Metathesis_(linguistics)" id="mwew" rel="mw:WikiLink"
                title="Metathesis (linguistics)">metathesis</a> of the <a class="cx-link" data-linkid="739"
                href="./Homeric_Greek" id="mwfA" rel="mw:WikiLink" title="Homeric Greek">Homeric Greek</a> κρίκος (<i
                id="mwfQ">krikos</i>), meaning "hoop" or "ring".<sup about="#mwt50" class="mw-ref reference"
                data-mw="{&#34;name&#34;:&#34;ref&#34;,&#34;attrs&#34;:{},&#34;body&#34;:{&#34;id&#34;:&#34;mw-reference-text-cite_note-3&#34;}}"
                id="cite_ref-3" rel="dc:references" typeof="mw:Extension/ref"><a href="./Circle#cite_note-3" id="mwfg"
                    style="counter-reset: mw-Ref 3;"><span class="mw-reflink-text" id="mwfw">[3]</span></a></sup>
        </span>

      <span class="cx-segment" data-segmentid="740">The origins of the words <i id="mwgA"><a class="cx-link"
                    data-linkid="741" href="./Circus" id="mwgQ" rel="mw:WikiLink" title="Circus">circus</a></i> and <i
                id="mwgg"><a class="extiw" href="https://en.wiktionary.org/wiki/circuit" id="mwgw"
                    rel="mw:WikiLink/Interwiki" title="wikt:circuit">circuit</a></i> are closely related.</span>
</p>
</section>
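The relation between the two kinds of section in the markup above can be recovered with a short parser sketch. It uses only Python's standard `html.parser`; the `CxSectionCollector` class is a hypothetical helper, and the sample input is a heavily trimmed version of the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CxSectionCollector(HTMLParser):
    """Group cx-section ids by the mw-section number they belong to."""

    def __init__(self):
        super().__init__()
        self.by_mw_section = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "section" and attributes.get("rel") == "cx:Section":
            mw_section = attributes.get("data-mw-section-number")
            self.by_mw_section[mw_section].append(attributes.get("id"))


sample = """
<section data-mw-section-number="4" id="cxSourceSection18" rel="cx:Section"><h2>History</h2></section>
<section data-mw-section-number="4" id="cxSourceSection19" rel="cx:Section"><figure></figure></section>
<section data-mw-section-number="4" id="cxSourceSection20" rel="cx:Section"><p>...</p></section>
"""

collector = CxSectionCollector()
collector.feed(sample)
print(dict(collector.by_mw_section))
# {'4': ['cxSourceSection18', 'cxSourceSection19', 'cxSourceSection20']}
```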

In the paragraph-by-paragraph model of translation (current CX2/desktop), the translator translates cxSourceSection18, cxSourceSection19, etc. as separate sections.

In sentence-by-sentence translation (SX/mobile), we present the entire "History" section to the user. Multiple paragraphs are presented and the translator then translates one sentence at a time. In other words, a section in SX is the same as an mw-section.

When autosave is present, in CX2, changed cx-sections are saved in cx_corpus. The same records are used for restoring too.

But for sentence-by-sentence model, when we save the section to cx_corpus table as part of publishing or in planned autosave, what should get saved at what granularity?

At present it may look like an mw-section as a whole can be saved to a single record in the cx_corpus table. That is more convenient in SX too. But if we think about the upcoming unification of CX and SX, we will need to translate a single mw-section using sentence-by-sentence mode or paragraph-by-paragraph mode, depending on how a translator uses the unified interface on desktop or mobile. If the "History" section is translated in paragraph-by-paragraph mode, we will save one cx-section at a time. If the "History" section is translated in sentence-by-sentence mode, it is better to do the same there too. That is, whatever we are saving to the cx_corpus table, irrespective of interface and translation mode, we always save cx-sections. This allows seamless save and restore between mobile and desktop versions and translation modes.

In that sense, the section_id values in the cx_corpus table are, as happens now, numbers like 18, 19, 20, etc. (Historically these used to be Parsoid ids, but in CX2 we use these numbers because Parsoid ids are very unreliable for section restore.)

Section translation in sentence-by-sentence mode may involve multiple MT engines because of the UX it has. In such cases the question of the cx_origin value arises. A "multiple sources" value makes sense to me as a simple and pragmatic solution. I don't think the usage of the parallel corpus will be hindered by this choice. This can even be considered equivalent to a "user" translation.
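The "multiple sources" rule could look roughly like this. The function name and the exact stored string are assumptions for illustration; the input is the MT provider that was applied to each sentence of one cx-section.

```python
from typing import Iterable

MULTIPLE_SOURCES = "multiple sources"  # hypothetical stored value


def section_origin(sentence_origins: Iterable[str]) -> str:
    """Collapse per-sentence MT providers into a single origin value."""
    distinct = set(sentence_origins)
    if not distinct:
        raise ValueError("at least one sentence origin is required")
    if len(distinct) == 1:
        # Every sentence came from the same provider: keep its name.
        return distinct.pop()
    return MULTIPLE_SOURCES


print(section_origin(["Apertium", "Apertium"]))  # Apertium
print(section_origin(["Google", "Apertium"]))    # multiple sources
```

This trades precision for simplicity: the row no longer says which engines were mixed, only that more than one was involved.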

Pginer-WMF moved this task from Priority: SX Adoption to Quarter Backlog on the Language-Team (Language-2022-January-March) board.Jan 31 2022, 11:21 AM

ngkountas mentioned this in T287236: Encourage review of recently translated articles.Feb 1 2022, 1:28 AM

SWakiyama mentioned this in T301573: Translation Parallel Text.Feb 11 2022, 4:59 PM

Pginer-WMF mentioned this in T302061: Encourage review of recently translated articles with Section Translation.Feb 18 2022, 10:40 AM

But for sentence-by-sentence model, when we save the section to cx_corpus table as part of publishing or in planned autosave, what should get saved at what granularity?

As per the discussions, this is our understanding:

The cx_corpus table will get entries corresponding to cx-sections. Even though the current Section Translation tool presents mw-sections to users, internally we keep track of the cx-sections within each mw-section. In other words, if the History section of a Wikipedia article is translated and it has 3 paragraphs, those 3 paragraphs will produce 3 entries in the cx_corpus table. The autosave mechanism of the Section Translation tool needs to prepare the content accordingly (for the save API).
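As a rough sketch of that granularity, the following Python snippet turns the paragraphs of one mw-section into one entry per cx-section. The `CorpusEntry` fields and the sequential numbering of section IDs are illustrative assumptions, not the real cx_corpus columns or the tool's ID scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorpusEntry:
    # Illustrative fields only; not the actual cx_corpus schema.
    translation_id: int
    section_id: str
    content: str


def corpus_entries(
    translation_id: int, first_section_id: int, paragraphs: List[str]
) -> List[CorpusEntry]:
    """Build one corpus entry per cx-section (here: per paragraph).

    Sequential numeric section IDs are an assumption for illustration.
    """
    return [
        CorpusEntry(translation_id, str(first_section_id + offset), paragraph)
        for offset, paragraph in enumerate(paragraphs)
    ]
```

A "History" section with three paragraphs therefore yields three entries, regardless of whether the translator worked sentence by sentence or paragraph by paragraph.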

In the case of section translation (section-by-section translation), the following table explains the meaning of each field in the cx_translations table and the updates made to it:

Column name	Type	Description
translation_id	int autoincrement	Autogenerated primary key.
translation_source_title	varbinary(512) not null	Source article title
translation_target_title	varbinary(512) not null	Target article title
translation_source_language	varbinary(36) not null	Source language code
translation_target_language	varbinary(36) not null	Target language code
translation_source_revision_id	int unsigned	Revision id of source article. Used for restoring the translation against this revision of source article
translation_target_revision_id	int unsigned	Revision id of target article after publishing the section, or after publishing the target article. Not really used anywhere.
translation_source_url	text binary not null	Source article URL
translation_target_url	text binary default null	Target article URL. May not exist when doing lead section translation, but needs to be updated whenever publishing happens.
translation_status	enum('draft', 'published', 'deleted') default null	For section translation, draft means the content was saved but not published yet. Published means published. If the translator decides to delete, only the status in the CX database changes: the unpublished work is deleted, and nothing happens to the published section or article.
translation_start_timestamp	varchar(14) binary not null	Start date of this translation
translation_last_updated_timestamp	varchar(14) binary not null	Last updated date of this translation
translation_started_by	int	Who started the translation. Allows anonymous translation since this column is already nullable.
translation_last_updated_by	int	Same as translation_started_by for now
translation_progress	TINYBLOB not null	This is a JSON blob. It currently has this structure: `{"any":0.010362694300518135,"mt":1,"human":0}`, meaning about 1% of the article is translated and 100% of that is MT. For SX, leave it untouched if a value already exists; otherwise save `{}`.
translation_cx_version	tinyint unsigned default 1	Currently we insert 2 as the value. Let us use 3 for SX.
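The translation_progress rule from the table can be sketched like this. The helper names `progress_for_sx` and `describe_progress` are hypothetical, and the snippet only mirrors the rule as stated above rather than the extension's implementation.

```python
import json
from typing import Dict, Optional

SX_CX_VERSION = 3  # value proposed in the table above for SX


def progress_for_sx(existing: Optional[str]) -> str:
    """Return the translation_progress blob to store for an SX save.

    An existing value is left untouched; otherwise an empty JSON object is saved.
    """
    if existing:
        return existing
    return json.dumps({})


def describe_progress(blob: str) -> Dict[str, int]:
    """Convert the fractional values of a progress blob to rounded percentages."""
    return {key: round(value * 100) for key, value in json.loads(blob).items()}
```

Applied to the example blob from the table, `describe_progress` gives 1 for "any", 100 for "mt" and 0 for "human", which matches the reading "1% of the article translated, 100% of that MT".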

Change 764724 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] Add tables for section translation

https://gerrit.wikimedia.org/r/764724

gerritbot added a project: Patch-For-Review.Feb 22 2022, 9:23 AM

In T270499#7727244, @gerritbot wrote:

Change 764724 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] Add tables for section translation

https://gerrit.wikimedia.org/r/764724

New table creation task: https://phabricator.wikimedia.org/T302371. We can deploy this once the patch is merged. (Since it isn't a schema change, we can do it ourselves, but we still need to schedule it.)

Change 764724 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Add table for section translation

https://gerrit.wikimedia.org/r/764724

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.26; 2022-03-14).Mar 8 2022, 11:00 AM

Maintenance_bot removed a project: Patch-For-Review.Mar 8 2022, 11:10 AM

Pginer-WMF edited projects, added Language-Team (Language-2022-April-June); removed Language-Team (Language-2022-January-March).Mar 24 2022, 11:46 AM

Pginer-WMF moved this task from Quarter Backlog to Priority: SX Adoption on the Language-Team (Language-2022-April-June) board.

santhosh set the point value for this task to 8.Apr 5 2022, 5:57 AM

ngkountas claimed this task.Apr 13 2022, 4:18 PM

ngkountas moved this task from Priority: SX Adoption to In Progress on the Language-Team (Language-2022-April-June) board.

Change 793024 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX edit before publishing: Nest subsections into <section> elements

https://gerrit.wikimedia.org/r/793024

gerritbot added a project: Patch-For-Review.May 18 2022, 11:19 AM

Change 799289 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX: Store the MT provider used for the applied translation

https://gerrit.wikimedia.org/r/799289

Change 799291 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX subSection model: Add translationOrigin getter

https://gerrit.wikimedia.org/r/799291

Change 799292 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX: Add translationUnitPayload DTO and parallelCorporaUnits getter

https://gerrit.wikimedia.org/r/799292

Change 799293 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX: Add "sectionid" and "content" properties to the publish payload

https://gerrit.wikimedia.org/r/799293

Change 799295 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] CX: Add parallel corpora integration and SX storage in SXPublish API

https://gerrit.wikimedia.org/r/799295

ngkountas moved this task from In Progress to In Review on the Language-Team (Language-2022-April-June) board.May 26 2022, 8:12 AM

Change 799291 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX subSection model: Add translationOrigin getter

https://gerrit.wikimedia.org/r/799291

ReleaseTaggerBot added a project: MW-1.39-notes (1.39.0-wmf.15; 2022-06-06).May 31 2022, 5:00 PM

Change 799289 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX: Store the MT provider used for the applied translation

https://gerrit.wikimedia.org/r/799289

Change 803933 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX: Add sxsave API action

https://gerrit.wikimedia.org/r/803933

Change 803934 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] Add saveTranslation method inside translatorAPI module

https://gerrit.wikimedia.org/r/803934

Change 793024 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX edit before publishing: Nest subsections into <section> elements

https://gerrit.wikimedia.org/r/793024

Change 799292 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX: Add translationUnitPayload DTO and parallelCorporaUnits getter

https://gerrit.wikimedia.org/r/799292

Change 803933 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX: Add sxsave API action

https://gerrit.wikimedia.org/r/803933

Change 803934 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Add saveTranslation method inside translatorAPI module

https://gerrit.wikimedia.org/r/803934

Change 799293 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX: Add request to "sxsave" api inside publishTranslation action

https://gerrit.wikimedia.org/r/799293

Change 809523 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20220629

https://gerrit.wikimedia.org/r/809523

Change 809523 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20220629

https://gerrit.wikimedia.org/r/809523

Pginer-WMF edited projects, added Language-Team (Language-2022-July-September); removed Language-Team (Language-2022-April-June).Jul 4 2022, 10:55 AM

Pginer-WMF moved this task from Quarter Backlog to In Review on the Language-Team (Language-2022-July-September) board.

santhosh moved this task from In Review to Check after deployment on the Language-Team (Language-2022-July-September) board.Jul 5 2022, 5:52 AM

Pginer-WMF updated the task description.Jul 5 2022, 10:56 AM

This is in production on test.wikipedia.org. I did a test translation there and published it successfully. I queried the database and the records are there.

wikiadmin@10.64.48.58(testwiki)> select * from cx_section_translations; 
+---------+---------------------+-----------------+---------------------------+-----------------------------+
| cxsx_id | cxsx_translation_id | cxsx_section_id | cxsx_source_section_title | cxsx_target_section_title   |
+---------+---------------------+-----------------+---------------------------+-----------------------------+
|       1 |                 175 | 1089843849_3    | Effects of surface runoff | Effekte van oppervlakafloop |
+---------+---------------------+-----------------+---------------------------+-----------------------------+
1 row in set (0.001 sec)

It appears among my published translations in the CX dashboard and is also counted in my translation stats for the month. (Test wiki has its own cx database, separate from the shared database used for the actual production wikis.)

If I click on the edit button for the published translation, this is what I get. This is an accurate section restore.

Parallel corpus captured from this translation - https://test.wikipedia.org/w/api.php?action=query&list=contenttranslationcorpora&translationid=175&striphtml=true
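For anyone who wants to check other translations the same way, here is a small Python sketch that rebuilds the query URL above. It uses only the parameters visible in that URL; the function name `corpora_query_url` is hypothetical, and the snippet builds the URL without sending any request.

```python
from urllib.parse import urlencode


def corpora_query_url(
    translation_id: int,
    strip_html: bool = True,
    api: str = "https://test.wikipedia.org/w/api.php",
) -> str:
    """Build a contenttranslationcorpora query URL for one translation."""
    params = {
        "action": "query",
        "list": "contenttranslationcorpora",
        "translationid": translation_id,
    }
    if strip_html:
        # Same flag as in the URL quoted above.
        params["striphtml"] = "true"
    return f"{api}?{urlencode(params)}"


print(corpora_query_url(175))
```

Passing a different `api` value would point the same query at another wiki's endpoint.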

santhosh moved this task from Check after deployment to Needs QA on the Language-Team (Language-2022-July-September) board.Jul 6 2022, 4:38 AM

Change 811660 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX parallel corpora: Fix parallelCorporaMTContent calculation

https://gerrit.wikimedia.org/r/811660

Change 811660 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX parallel corpora: Fix parallelCorporaMTContent calculation

https://gerrit.wikimedia.org/r/811660

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.21; 2022-07-18); removed MW-1.39-notes (1.39.0-wmf.15; 2022-06-06).Jul 11 2022, 8:02 PM

Change 814545 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20220718

https://gerrit.wikimedia.org/r/814545

Change 814545 merged by jenkins-bot: