
Enable transparent compression of Airflow task logs in S3
Open, Medium, Public

Description

The dumps task logs are highly compressible (example). We could probably save a lot of space (traded for a bit of CPU) by enabling on-the-fly compression in Ceph.

NOTE: This would benefit all our S3 users, as this configuration is set at the pool level, and not at the bucket level.

Example:

$ radosgw-admin zone placement modify \
      --rgw-zone default \
      --placement-id default-placement \
      --storage-class STANDARD \
      --compression zlib

See https://docs.ceph.com/en/latest/radosgw/compression/
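If we enable it, we could sanity-check the savings by comparing a bucket's logical size against the bytes actually stored. A sketch, assuming a bucket named airflow-task-logs (hypothetical name) and that jq is available:

$ radosgw-admin bucket stats --bucket airflow-task-logs | \
      jq '.usage."rgw.main" | {size, size_utilized}'

With compression on, size_utilized (bytes actually stored) should come out noticeably smaller than size (logical object size). The exact JSON layout varies a bit between Ceph releases.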

Event Timeline

brouberol triaged this task as Medium priority.

With regard to your point about:

This would benefit all our S3 users, as this configuration is set at the pool level, and not at the bucket level.

The only issue here is that there are some potential use cases that deal with pre-compressed files.

For example, in T381416: Do performance testing of a big Hadoop Table hosted by Ceph we are discussing storing Parquet files, as used by the Hive metastore.
Unless we override it, Parquet will apply Snappy compression by default, on top of whatever we configure for the Ceph pool.
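For reference, a sketch of opting out of Parquet's own compression so that only the pool-level compression applies, assuming a Spark writer (my_job.py is a hypothetical job):

$ spark-submit \
      --conf spark.sql.parquet.compression.codec=uncompressed \
      my_job.py

Whether trading Snappy-in-Parquet for zlib-in-Ceph helps or hurts scan performance is the kind of thing the T381416 testing could surface.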

So I'm not sure that we want to enable compression globally for S3 at the moment.
But maybe we should look into whether we can offer different placement rules, with both compressed and uncompressed pools available.
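For illustration, a rough sketch of what a second, uncompressed placement target could look like (the uncompressed placement id and pool names are hypothetical; see https://docs.ceph.com/en/latest/radosgw/placement/):

$ radosgw-admin zonegroup placement add \
      --rgw-zonegroup default \
      --placement-id uncompressed

$ radosgw-admin zone placement add \
      --rgw-zone default \
      --placement-id uncompressed \
      --data-pool default.rgw.uncompressed.data \
      --index-pool default.rgw.uncompressed.index \
      --data-extra-pool default.rgw.uncompressed.non-ec

Buckets would then pick a target at creation time via the S3 LocationConstraint (e.g. default:uncompressed), while default-placement keeps compression on.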

Let's take this out of the current milestone, as I am not sure that we need to do it yet.

Removing the parent task, as I don't think that we need this to be in the critical path for getting dumps migrated.