Incorporate translated sections into the parallel corpora when published
Open, Medium, Public

Description

When an article is published with Content Translation, its contents are also publicly exposed as part of the parallel corpora. This way, anyone can use the APIs or data dumps to get information about the translated paragraphs (original content, initial MT, user modifications, etc.).

We want sections published with SectionTranslation to also contribute to this useful data resource.

Metadata changes

As part of the work in this ticket, we need to define any changes in the metadata needed to distinguish section translation from article translation, and how they impact external services using the data (e.g., the Opus project). In particular, the current data schema assumes that each translation has an ID for the translation and an ID for the translator. This does not align with the current support for Section Translation, where translations are not persisted yet and anonymous translation may be supported in the future.
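
To make the schema question concrete, here is a minimal sketch of what a corpus record could look like if the two IDs became optional for section translations. Everything in it is an assumption for discussion: the `CorpusRecord` class, the `kind` discriminator, and all field names are hypothetical and do not reflect the current schema or any agreed design.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CorpusRecord:
    """Hypothetical metadata for one published translation (illustration only)."""

    kind: str  # hypothetical discriminator: "article" or "section"
    source_language: str
    target_language: str
    source_title: str
    # Optional because section translations are not persisted yet and
    # anonymous translation may be supported in the future.
    translation_id: Optional[int] = None
    translator_id: Optional[int] = None
    section_title: Optional[str] = None

    def validate(self) -> None:
        if self.kind not in ("article", "section"):
            raise ValueError(f"unknown kind: {self.kind!r}")
        # Article translations keep the existing assumption that both IDs exist.
        if self.kind == "article" and (
            self.translation_id is None or self.translator_id is None
        ):
            raise ValueError("article records need translation and translator IDs")
        if self.kind == "section" and not self.section_title:
            raise ValueError("section records need a section title")


def to_json(record: CorpusRecord) -> str:
    """Validate the record and serialize it; missing IDs become JSON null."""
    record.validate()
    return json.dumps(asdict(record), ensure_ascii=False)
```

With a shape like this, external consumers that rely on the IDs could filter on `kind` and keep processing article records unchanged, while section records would carry explicit nulls instead of missing keys.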

Event Timeline

As we need to adjust the parallel corpora produced for this task, we may also want to consider fixing T245607: CX Published parallel corpus is invalid json.