
Please consider purging/moving the cx_corpora table at x1
Open, High, Public


The cx_corpora table on wikishared has grown to 38GB in size (that is around one third of x1):

-rw-rw----   1 mysql mysql  38G Dec 21 18:01 cx_corpora.ibd

My guess is that it stores in-progress translations from contributors, but it doesn't seem to be purged regularly (the first edit seems to be from January 2016). Maybe a policy should be set on how long those are stored, as otherwise we would eventually be storing all content in that table. If purging is not OK, maybe ExternalStore should be used to store the chunks of content (ES is optimized for large objects, and we normally try to avoid mixing actual data and metadata).

This is a request for consideration; ping me back if you want ideas on how to purge them safely (purging lots of data at the same time can be dangerous) if you go down that route.
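
To illustrate why purging in small pieces is safer than one large DELETE, here is a minimal sketch of a batched purge. It is only an illustration under stated assumptions: it uses an in-memory SQLite database, and the table `corpora_demo`, its columns `id`, `ts` and `content`, and the helper names are all hypothetical, not the real cx_corpora schema. On a replicated MySQL setup the pause between batches would normally be replaced by a check that the replicas have caught up.

```python
import sqlite3
import time


def build_demo_db():
    """Create a tiny in-memory table standing in for the real one (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE corpora_demo (id INTEGER PRIMARY KEY, ts TEXT, content TEXT)"
    )
    # Odd ids get an old timestamp, even ids a recent one.
    rows = [(i, "2016-01-01" if i % 2 else "2017-12-01", "x") for i in range(1, 11)]
    conn.executemany("INSERT INTO corpora_demo VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def purge_in_batches(conn, cutoff, batch_size=500, pause_seconds=0.0):
    """Delete rows with ts < cutoff a few at a time; return the number of rows deleted."""
    total = 0
    while True:
        # Select a bounded set of primary keys so each transaction stays small.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM corpora_demo WHERE ts < ? ORDER BY id LIMIT ?",
                (cutoff, batch_size),
            )
        ]
        if not ids:
            return total
        placeholders = ",".join("?" * len(ids))
        conn.execute(f"DELETE FROM corpora_demo WHERE id IN ({placeholders})", ids)
        conn.commit()  # one short transaction per batch
        total += len(ids)
        time.sleep(pause_seconds)  # placeholder for waiting on replication lag


if __name__ == "__main__":
    demo = build_demo_db()
    print(purge_in_batches(demo, "2017-06-21", batch_size=2))
```

Deleting by primary key in bounded batches keeps each transaction short, which is the property that matters here; the batch size and pause are tuning knobs, not recommendations.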


Event Timeline

jcrespo created this task. Dec 21 2017, 6:09 PM
Restricted Application added a subscriber: Aklapper. Dec 21 2017, 6:09 PM
jcrespo renamed this task from "Please consider purging the cx_translation table" to "Please consider purging the cx_corpora table". Dec 21 2017, 6:09 PM
Nikerabbit triaged this task as High priority. Dec 21 2017, 7:13 PM

I have 2 suggestions for candidates to delete from the table:

  1. Translations that are 6 months old AND in Draft state - Remove the entry from the cx_translations table and the data from the cx_corpora table. We can consider some notifications to translators - I think we (the CX team) have considered this and even created a ticket, but I can't locate it immediately.
  2. Translations that are 6 months old AND in Published state - We allow editing published translations, but I don't think it makes any sense to edit one after 6 months, since the original source article and the published article will have changed a lot. 6 months here is an example; it may be shorter.
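
The two rules above could be sketched roughly as follows. This is only an illustration: the record fields `id`, `status` and `last_updated`, the function name, and the reading of "6 months old" as "not updated for about 182 days" are all assumptions, not the actual cx_translations schema. The function only groups candidate ids; it deletes nothing.

```python
from datetime import datetime, timedelta


def purge_candidates(translations, now, max_age_days=182):
    """Group ids of stale translations by state ('draft' or 'published')."""
    cutoff = now - timedelta(days=max_age_days)
    groups = {"draft": [], "published": []}
    for t in translations:
        # Only rows older than the cutoff and in one of the two states qualify.
        if t["last_updated"] < cutoff and t["status"] in groups:
            groups[t["status"]].append(t["id"])
    return groups


if __name__ == "__main__":
    sample = [
        {"id": 1, "status": "draft", "last_updated": datetime(2016, 1, 15)},
        {"id": 2, "status": "published", "last_updated": datetime(2017, 1, 1)},
        {"id": 3, "status": "draft", "last_updated": datetime(2017, 12, 1)},
    ]
    print(purge_candidates(sample, now=datetime(2017, 12, 21)))
```

Keeping the two groups separate makes it easy to apply a different policy to each, for example notifying translators before touching drafts.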

We don't want to delete published articles, as we require those for the corpora dumps.

@jcrespo, we are planning some development effort to remove the data from this table, but mostly it requires notifying users and some wait period. We may have an opportunity to delete very old data without notification or wait time. We are discussing that too.

Just to help our prioritization, what is the urgency of this issue? This 38GB accumulated over a period of roughly 2 years. So if we take some 3 months longer to reduce the size, do you see anything serious from a database performance or storage perspective?
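
For a rough sense of scale, the question above can be turned into a back-of-the-envelope estimate. This sketch assumes linear growth and takes the accumulation period as 24 months (January 2016 to December 2017); both are simplifying assumptions, and the function name is hypothetical.

```python
def projected_growth_gb(current_gb, months_accumulated, delay_months):
    """Linear estimate of how much the table grows during a delay."""
    monthly_rate = current_gb / months_accumulated
    return monthly_rate * delay_months


if __name__ == "__main__":
    # 38 GB over an assumed 24 months, projected across a 3-month delay.
    print(round(projected_growth_gb(38, 24, 3), 2))
```

Under those assumptions, a 3-month delay adds roughly 4.75 GB on top of the existing 38GB.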

This is not urgent, which is why I filed it as "Please consider purging". The issue is for the long term: I do not think it is a good idea to keep accumulating content in a metadata table, but it is not an "unbreak now". Note that I also mentioned that if purging is not desirable, we can store content on External Stores, just not on metadata hosts (s[1-8], x1).

I want to stress that I do not really need purging if that is not desirable, as long as chunks of content are not on s*/x* hosts and are moved to es* (content hosts).

jcrespo renamed this task from "Please consider purging the cx_corpora table" to "Please consider purging/moving the cx_corpora table at x1". Feb 13 2018, 8:57 AM

We have created tickets to capture the short term (T183890) and long term (T189093) solutions in more detail.