
Compress cx_corpora.cxc_content
Closed, Resolved · Public · 2 Estimated Story Points

Description

This field is HTML and while we were investigating something else (T351871), this table showed up as the biggest table of x1 (and by far, more than the next top ten tables combined).

grafik.png (table size graph, 346 KB)

Compressing the HTML is extremely low-hanging fruit, improving not just storage but also network usage and performance, since the compression ratio of HTML is quite good and we already do this with ParserCache and some other raw blobs. You can simply call gzdeflate on store and gzinflate on retrieval. Keeping backwards compatibility for the existing entries might be a bit tricky, though.
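As an illustration of that store/retrieve path (a sketch only, not the merged patch; Python's zlib stands in for PHP's gzdeflate()/gzinflate(), which emit raw deflate streams, and the try-inflate fallback for old entries is just one possible backwards-compatibility scheme):

```python
import zlib

# PHP's gzdeflate()/gzinflate() use *raw* deflate streams, which zlib
# matches with a negative window-bits value.
RAW_DEFLATE = -zlib.MAX_WBITS  # -15

def compress_on_store(html: str) -> bytes:
    """Deflate the HTML blob before writing it to cxc_content."""
    deflater = zlib.compressobj(9, zlib.DEFLATED, RAW_DEFLATE)
    return deflater.compress(html.encode("utf-8")) + deflater.flush()

def inflate_on_retrieval(blob: bytes) -> str:
    """Inflate on read; fall back to plain text for legacy rows.

    Treating "fails to inflate" as "legacy uncompressed entry" is a
    heuristic; a flag column or a marker prefix would be more robust.
    """
    try:
        return zlib.decompress(blob, RAW_DEFLATE).decode("utf-8")
    except zlib.error:
        return blob.decode("utf-8")

html = "<p>" + "translated content " * 100 + "</p>"
stored = compress_on_store(html)
assert inflate_on_retrieval(stored) == html         # new, compressed row
assert inflate_on_retrieval(html.encode()) == html  # legacy, plain row
print(f"{len(html)} -> {len(stored)} bytes")
```

Repetitive HTML like the demo string deflates very well, which is why the ratio is so attractive here.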

Event Timeline

Change #1153569 had a related patch set uploaded (by Huei Tan; author: Huei Tan):

[mediawiki/extensions/ContentTranslation@master] CX: Compress cx_corpora.cxc_content

https://gerrit.wikimedia.org/r/1153569

Change #1153569 abandoned by Huei Tan:

[mediawiki/extensions/ContentTranslation@master] CX: Compress cx_corpora.cxc_content

Reason:

deflate/inflate happens on the translation controller

https://gerrit.wikimedia.org/r/1153569

PWaigi-WMF changed the task status from Open to In Progress. (Jun 5 2025, 1:14 PM)

Change #1154263 had a related patch set uploaded (by Huei Tan; author: Huei Tan):

[mediawiki/extensions/ContentTranslation@master] CX: Do not inflate the html sent to database during saving/publish

https://gerrit.wikimedia.org/r/1154263

Change #1154263 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] TranslationCorporaStore: Compress the html sent to database during draft saving

https://gerrit.wikimedia.org/r/1154263

This is great to see happening! Thank you!

abi_ moved this task from Need QA to Done on the LPL Essential (LPL Essential 2025 Apr-Jun: CX) board.
abi_ subscribed.

I'm able to see the compressed data in the cx_corpora table. I was able to save drafts and publish translations.

Thanks. Stupid question: when will most of the rows become compressed? The reason I'm asking is that MariaDB doesn't shrink tables automatically, so I need to run an OPTIMIZE TABLE on them, but it'd be better to wait for the rows to switch first. Even doing it after 70% or 80% are done would be good enough. This table is quite large right now:

root@db2186:/srv/sqldata# ls -Ssh */cx_corpo*
179G wikishared/cx_corpora.ibd	4.0K wikishared/cx_corpora.frm
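Whether a given row has switched can be detected by trying to inflate it, so one way to answer this without guessing would be to sample cxc_content and measure the compressed share (a sketch; the sampling SELECT is omitted, and "fails to inflate means still plain HTML" is only a heuristic):

```python
import zlib

def is_raw_deflate(blob: bytes) -> bool:
    """True if the blob inflates as a raw deflate stream (gzdeflate() output)."""
    try:
        zlib.decompress(blob, -zlib.MAX_WBITS)
        return True
    except zlib.error:
        return False

def compressed_fraction(sampled_blobs) -> float:
    """Estimate the share of already-compressed rows from a random sample."""
    blobs = list(sampled_blobs)
    return sum(is_raw_deflate(b) for b in blobs) / len(blobs)
```

Once the sampled fraction crosses the 70-80% mark mentioned above, the OPTIMIZE TABLE rebuild should reclaim most of the space.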

The most reliable way to compress the old content would be to go through the old records and compress them. We'll need to write a script to do that.

I mean, old translations get deleted after X (one year?) days, so there should be a distribution of them. If it'll take a long time, then yeah, we should do a script.

Inactive unpublished drafts are purged approximately after 455 days. Published translations are not purged. We probably need to revisit T183485: Please consider purging/moving the cx_corpora table at x1.

The table is now around half of all of x1. That in itself wouldn't be an issue, but from what you're saying it means this table is growing quite fast without bound, and that is worrying. If the published translation is immutable, it should move to ES, which is designed to hold large immutable blobs forever. If you need to change it from time to time, then it can't go to ES and needs some other solution. For example, you could delete anything that is in a dump and basically provide incremental dumps instead. Or the simplest would be to drop published ones after some longer time too, e.g. after five years. I fail to see much use in really old published translations; the models change, ...