
Remove very old translation drafts from CX database
Open, Normal · Public

Description

The CX database contains in-progress translations that have not been updated for a long time. Resuming these old drafts is less successful than resuming fresh ones. Because the source article changes a lot over time, restoring an old translation is difficult and sometimes ends up using an old source revision as the source article. This is suboptimal. The translator may also never come back to continue the translation. In that case, the draft prevents other translators from translating the same article (since we don't support collaborative translation).

Also, as noted in T183485: Please consider purging/moving the cx_corpora table at x1, the cx_corpora table grows as more translations happen, and its size is becoming a concern.

To address this issue, one suggestion is to remove the entry from the cx_translations table and the corresponding data from the cx_corpora table.

Proposed approach

We want to use notifications at different points to let users know what is going on and encourage them to complete their translations. These are the notifications to support:

The diagram below illustrates how these notifications work together:

Backend tools

Some other scripts and tools are needed to properly support this process:

Event Timeline

santhosh created this task. Jan 2 2018, 8:45 AM
Restricted Application added a subscriber: Aklapper. Jan 2 2018, 8:45 AM
santhosh triaged this task as Normal priority.

Data about the current ages of drafts would be nice to have, to give more insight into the numbers. Without numbers, I would go for a notification after age > 60 days and a purge after age > 180 days.
I should set up a maintenance script to run weekly that purges the old articles.
Notifications can be done with Echo.
We can amend cx_translations to have a new state, purged, and update the UI accordingly.

We should not delete translations that are published. Hence I don't expect this to solve the parent task, as the cx_corpora table will keep growing. For that we need a long-term solution that is different from purging old drafts. We should still proceed with purging old drafts because of the problems with resuming them and because they prevent other users from translating the same articles.

When we start, we should also consider sending a general notification to all CX users (tech news or another medium) that we are going to start purging old drafts. We can start by purging only the very oldest ones and gradually decrease the maximum age until we reach the target value.
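The gradual rollout described above can be sketched as a sequence of age cutoffs, one per purge run. This is only an illustration: the function name `purge_cutoffs`, the starting age of 720 days, and the 180-day step are hypothetical; only the 180-day target comes from the suggestion earlier in this comment.

```python
def purge_cutoffs(start_max_age_days, target_max_age_days, step_days):
    """Yield the maximum allowed draft age (in days) for each successive purge run.

    The first run only purges the very oldest drafts; each later run lowers
    the cutoff by step_days until the target age is reached.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    if start_max_age_days < target_max_age_days:
        raise ValueError("start age must not be below the target age")

    age = start_max_age_days
    while age > target_max_age_days:
        yield age
        age -= step_days
    # Always finish exactly on the target, even if the step overshoots it.
    yield target_max_age_days


# Hypothetical rollout: start at 720 days and tighten by 180 days per run.
for cutoff in purge_cutoffs(720, 180, 180):
    print(f"purge drafts older than {cutoff} days")
```

Running the purge in steps like this would spread the user impact over several runs instead of removing every old draft at once.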

santhosh added a comment. Edited · Jan 2 2018, 11:00 AM

I collected some numbers about the drafts:

Draft translations that are more than 6 months old: 23797
Draft translations that are more than a year old: 13927
Translations never published so far: 73628
Translations once published but now in draft status: 30002
Total number of records in cx_translations so far: 378635
Total published translations: 239810
Total deleted translations: 35198
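
To put these counts in proportion, a small sketch like the following expresses each one as a share of all cx_translations records. The dictionary keys and the `share_of_total` helper are illustrative names, not part of CX; the numbers are the ones listed above.

```python
# Counts as reported in this comment (snapshot from Jan 2 2018).
counts = {
    "drafts older than 6 months": 23797,
    "drafts older than a year": 13927,
    "never published": 73628,
    "once published, now draft": 30002,
    "published": 239810,
    "deleted": 35198,
}
TOTAL_RECORDS = 378635  # total number of records in cx_translations


def share_of_total(count, total=TOTAL_RECORDS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


for label, count in counts.items():
    print(f"{label}: {count} ({share_of_total(count):.1f}% of all records)")
```

Re-running the same calculation on later snapshots would show how the proportions shift after any purge.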

We should not delete translations that are published.

@Nikerabbit, what problems do you see if we remove translations that are published? For parallel corpora, we keep them in dumps.

Dumps are re-created from scratch every week. Even if we assume old dumps are never deleted (which I believe is not true), people downloading the latest dumps would not have all the content that is available.

Arrbee added a comment. Jan 2 2018, 4:35 PM

Should we send notifications to translators? If so, what should the notification-sending mechanism be? Echo?

See also: T89707 and T106693

Pginer-WMF added a comment. Edited · Jan 3 2018, 12:40 PM

I think notifications can be useful in this context. We can use them to inform users as well as to encourage them to work on the topics they were once interested in but forgot to follow through on. I'd propose the following approach:

  • Immediate notification (T106693). Only for the first translation, when the user leaves the tool. We tell the user that the translation was saved and how to continue working on it. This can reduce the number of translators who never publish because of initial disorientation.
  • Short term notification (T89707). For translations in draft status for more than 3 months. We tell the user that they have not worked on the draft translation for a while, and that such drafts are automatically discarded after a year to avoid editing conflicts.
  • Long term notification and deletion. For translations in draft mode for more than a year without being published. We tell the user that the draft was deleted since it was based on content that is more than one year old.
    • For translations older than a year that have already been published, the drafts can be deleted without further notification. The user was warned after three months, and it makes little sense to encourage them to translate the content now.
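
The age-based part of this proposal can be sketched as a single decision function. This is a rough illustration under stated assumptions: the `Draft` fields, the action names, and the approximation of "3 months" as 90 days and "a year" as 365 days are all hypothetical, and the immediate notification is left out because it is triggered by the user leaving the tool rather than by draft age.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SHORT_TERM = timedelta(days=90)   # assumed stand-in for "more than 3 months"
LONG_TERM = timedelta(days=365)   # assumed stand-in for "more than a year"


@dataclass
class Draft:
    started: datetime            # when the translation was started
    published: bool              # whether it has been published at least once
    reminder_sent: bool = False  # whether the short term notification went out


def next_action(draft, now):
    """Return the action the proposal implies for a draft at time `now`."""
    age = now - draft.started
    if age > LONG_TERM:
        # Published translations are removed quietly; unpublished drafts are
        # removed and the user is told why.
        return "delete_silently" if draft.published else "delete_and_notify"
    if age > SHORT_TERM and not draft.reminder_sent:
        return "send_reminder"
    return "keep"


now = datetime(2018, 1, 3)
print(next_action(Draft(started=datetime(2017, 8, 1), published=False), now))
```

Age is measured from the start of the translation here, matching the approximation discussed in the considerations that follow.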

Further considerations:

  • I proposed periods of time that are easy to communicate (e.g., one year), but we can use different ones if there are some technical considerations.
  • I assumed that the "time since the user worked on the translation" is hard to obtain, and that "time since the translation was started" is a good approximation, but let me know if you think these assumptions don't seem right. For example, if it is common to find a user slowly working on a translation over the course of more than a month, we may want to adjust the thresholds.
  • For these notifications we want to use the bundling capabilities that Echo provides. That is, a user that started ten translations and forgot about them should receive a single expandable notification, instead of ten different ones.
  • The numbers provided by @santhosh in T183890#3866697 are very useful. It would be good to keep those queries at hand to check how much those numbers increase in a month, and how they change after an intervention like the one proposed.
  • When we talk about automatically deleting/purging translation drafts, if those are published I'd expect the published article to still be shown in the "published" list.
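
The bundling idea mentioned in the considerations above amounts to grouping stale drafts per user before notifying. The sketch below only illustrates that grouping step; it does not use Echo's actual API (Echo is a MediaWiki extension), and `bundle_by_user` and the sample data are hypothetical.

```python
from collections import defaultdict


def bundle_by_user(stale_drafts):
    """Group (user, draft_title) pairs so that each user gets one notification.

    Returns a dict mapping each user to the list of their stale draft titles,
    in the order the drafts were given.
    """
    bundles = defaultdict(list)
    for user, title in stale_drafts:
        bundles[user].append(title)
    return dict(bundles)


# Hypothetical input: two users, one of whom has several forgotten drafts.
stale = [("UserA", "Article 1"), ("UserB", "Article 2"), ("UserA", "Article 3")]
for user, titles in bundle_by_user(stale).items():
    print(f"{user}: 1 notification covering {len(titles)} draft(s)")
```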

If the approach makes sense, I can work on the details of the specific notifications (language, icons, and actions to use).

Based on today's discussions there seems to be agreement to start with published translations because of a smaller user impact and bigger reduction of size.

Nikerabbit renamed this task from Remove very old unpublished translation drafts from CX database to Remove very old translation drafts from CX database. Jan 4 2018, 11:01 AM

I illustrated how the different notifications play together below:

I'll update the description and sub-tickets

Pginer-WMF updated the task description. Jan 5 2018, 12:02 PM
Pginer-WMF updated the task description. Mar 7 2018, 8:55 AM
Pginer-WMF updated the task description. Feb 5 2019, 8:33 AM
Pginer-WMF updated the task description. Mar 11 2019, 1:27 PM