Please consider purging/moving the cx_corpora table at x1
Open, High, Public

Description

The cx_corpora table on wikishared has grown to 38GB in size (around one third of x1):

-rw-rw----   1 mysql mysql  38G Dec 21 18:01 cx_corpora.ibd

My guess is that it stores in-progress translations from contributors, but it doesn't seem to be purged regularly (the first edit seems to be from January 2016). Maybe a policy should be set for how long those are stored, as otherwise we would eventually be storing all content in that table. If that is not ok, maybe ExternalStore should be used to store chunks of content (ES is optimized for large objects, and we normally try to avoid mixing actual data and metadata).

This is a request for consideration. Ping me back if you want ideas on how to purge them safely (purging lots of data at the same time can be dangerous) if you go down that route.
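
To illustrate why batching matters when purging, here is a minimal sketch of deleting old rows in small chunks with a pause between them, so that no single transaction removes a large amount of data. It is not the procedure used on x1: the table `corpora_demo`, its columns, and the helper names are hypothetical, and it runs against an in-memory SQLite database only so that it is self-contained.

```python
import sqlite3
import time


def make_demo_db(old_rows, new_rows):
    """Build a throwaway in-memory table (hypothetical schema, not cx_corpora)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE corpora_demo ("
        " id INTEGER PRIMARY KEY, last_updated TEXT, content TEXT)"
    )
    rows = [("2016-01-15", "old")] * old_rows + [("2017-12-01", "new")] * new_rows
    conn.executemany(
        "INSERT INTO corpora_demo (last_updated, content) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def purge_in_batches(conn, cutoff, batch_size=1000, pause_seconds=0.0):
    """Delete rows older than `cutoff` a batch at a time; return rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM corpora_demo WHERE id IN ("
            " SELECT id FROM corpora_demo"
            " WHERE last_updated < ? ORDER BY id LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        time.sleep(pause_seconds)  # give replication and other queries room
    return total


if __name__ == "__main__":
    demo = make_demo_db(old_rows=25, new_rows=5)
    print(purge_in_batches(demo, cutoff="2017-06-01", batch_size=10))
```

On a real replicated setup the batch size and pause would have to be tuned against replication lag; the values here are placeholders.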

Event Timeline

jcrespo renamed this task from Please consider purging the cx_translation table to Please consider purging the cx_corpora table.Dec 21 2017, 6:09 PM

I have 2 suggestions for candidates to delete from the table:

  1. Translations that are 6 months old AND in Draft state: remove the entry from the cx_translations table and the data from the cx_corpora table. We can consider sending notifications to translators. I think we (the CX team) have considered this and even created a ticket, but I can't locate it right now.
  2. Translations that are 6 months old AND in Published state: we allow editing published translations, but I don't think it makes any sense to edit them after 6 months, since the original source article and the published article will have changed a lot. 6 months is only an example here; the threshold may be shorter.

We don't want to delete published articles, as we require those for the corpora dumps.
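
The first suggestion can be sketched as a small predicate. This is only an illustration: the status string `"draft"`, the function name, and the 182-day default are assumptions, and it deliberately covers drafts only, since how to treat published translations is still open in this discussion.

```python
from datetime import datetime, timedelta


def is_purge_candidate(status, last_updated, now, max_age=timedelta(days=182)):
    """Return True for drafts untouched for longer than `max_age`.

    `status` values are hypothetical labels, not the real column values.
    """
    return status == "draft" and (now - last_updated) > max_age


if __name__ == "__main__":
    now = datetime(2017, 12, 21)
    print(is_purge_candidate("draft", datetime(2016, 1, 10), now))
    print(is_purge_candidate("published", datetime(2016, 1, 10), now))
```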

@jcrespo, we are planning some development effort to remove the data from this table, but mostly it requires notifying users and some waiting period. For very old data we may have an opportunity to delete without notification or waiting time. We are discussing that too.

Just to help our prioritization: what is the urgency of this issue? This 38GB accumulated over a period of 2+ years. If reducing the size is delayed by some 3 months, do you see any serious consequences from a database performance or storage perspective?

This is not urgent, which is why I filed it as "Please consider purging". The issue is long term: I do not think it is a good idea to keep accumulating content in a metadata table, but it is not an "unbreak now". Note that I also mentioned that if purging is not desirable, we can store the content on External Stores, just not on metadata hosts (s[1-8], x1).

I want to stress that I do not really need purging if that is not desirable, as long as chunks of content are not on s*/x* hosts and are moved to es* (content hosts).

jcrespo renamed this task from Please consider purging the cx_corpora table to Please consider purging/moving the cx_corpora table at x1.Feb 13 2018, 8:57 AM

We have created tickets to capture the short term (T183890) and long term (T189093) solutions in more detail.

@jcrespo We have set up a regular purge of unused content, but most of what remains cannot be purged because it is primary content. How is the table size now? You said this task is not urgent, but I understand that sooner or later this issue needs to be dealt with. Do you have any estimate at which point the table may start causing problems, or other comments that would help to put a priority and/or deadline on this task?

FWD: @Marostegui You may want to defragment the named table before answering the question.

Definitely, thanks for making me aware of this conversation.
I will defragment the table on one host and come back with some numbers.
As of now on a COMPRESSED host:

root@db1127:/srv/sqldata# find . -name cx_corpora.ibd | xargs ls -lh
-rw-r----- 1 mysql mysql 32G Dec 10 09:35 ./wikishared/cx_corpora.ibd

And on a non-compressed host (the master):

root@db1120:/srv# find . -name cx_corpora.ibd | xargs ls -lh
-rw-rw---- 1 mysql mysql 91G Dec 10 09:34 ./sqldata/wikishared/cx_corpora.ibd

We should compress it everywhere anyway; only a few hosts are still pending compression.

Change 556143 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1127

https://gerrit.wikimedia.org/r/556143

Change 556143 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1127

https://gerrit.wikimedia.org/r/556143

Mentioned in SAL (#wikimedia-operations) [2019-12-10T09:51:28Z] <marostegui> Optimize wikishared. cx_corpora on db1127 - T183485

There is not much gain from defragmenting on a compressed host, just 1GB:

root@db1127:/srv/sqldata/wikishared# ls -lh cx_corpora.ibd
-rw-rw---- 1 mysql mysql 31G Dec 10 10:31 cx_corpora.ibd

Mentioned in SAL (#wikimedia-operations) [2019-12-10T10:36:46Z] <marostegui> Optimize wikishared.cx_corpora on db2115 (non compressed table) - T183485

Testing on a non-compressed host before compressing, to see how it goes. Current status:

root@db2115:/srv/sqldata/wikishared# ls -lh cx_corpora.ibd
-rw-rw---- 1 mysql mysql 92G Dec 10 10:36 cx_corpora.ibd
root@db2115:/srv/sqldata/wikishared#

Defragmenting doesn't make much difference on a non-compressed host either:

root@db2115:/srv/sqldata/wikishared# ls -lh cx_corpora.ibd
-rw-rw---- 1 mysql mysql 89G Dec 10 13:20 cx_corpora.ibd

And compressing the table:

root@db2115:/srv/sqldata/wikishared# ls -lh cx_corpora.ibd
-rw-rw---- 1 mysql mysql 31G Dec 10 14:53 cx_corpora.ibd

We definitely need to compress the table: T240325: Compress wikisahred.cx_corpora on x1 hosts
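
For reference, the relative savings implied by the file sizes quoted above can be worked out with a few lines of arithmetic. The sizes are the rounded values printed by `ls -lh`, so the percentages are approximate; the function name and labels are illustrative only.

```python
def percent_saved(before_gb, after_gb):
    """Relative size reduction, in percent."""
    return 100.0 * (before_gb - after_gb) / before_gb


# (before, after) in GB, taken from the ls -lh output in this thread
measurements = {
    "db1127 defragment (already compressed)": (32, 31),
    "db2115 defragment (not compressed)": (92, 89),
    "db2115 compress": (89, 31),
}

if __name__ == "__main__":
    for label, (before, after) in measurements.items():
        saved = percent_saved(before, after)
        print(f"{label}: {before}G -> {after}G, {saved:.1f}% smaller")
```

Defragmentation alone recovers only about 3% on either host, while compression shrinks the table by roughly two thirds.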

@Nikerabbit has the purge run and purged as much as possible, or is it still running? Will it run on a regular basis?

Yes, the "purge backlog" is empty and we regularly purge new purgeable content that is older than our threshold. The script runs twice a month by cron: https://gerrit.wikimedia.org/g/operations/puppet/+/7ca6231c1d50e7fdb701ec27e01e1e7933b99689/modules/mediawiki/manifests/maintenance/purge_old_cx_drafts.pp

OK, so after compression and defragmentation the table is now 31GB. Let's see how it grows once it gets fragmented again. So far it has remained roughly the same size (it was 38GB when this task was created 2 years ago). Let's check again in 1-2 months.