We have (and have had for some time) abnormally wide partitions in Cassandra. They are the source of a number of problems, not least of which are fatally large heap allocations that result in [[ https://phabricator.wikimedia.org/T140946 | OOMs ]] when the partitions are read.
We should a) find those that currently exist and clean them up, and b) put in place the means to proactively identify them moving forward.
The lowest-hanging fruit can be found by grepping the logs created by compaction. Here are the top 48 according to logs I collected today:
{P3843}
{P3844}
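For anyone wanting to repeat the log search, here is a minimal sketch of how such a ranking could be pulled out of the compaction logs. It is not the procedure used to produce the pastes above. It assumes the warning takes the form `Compacting large partition <keyspace>/<table>:<key> (<N> bytes)` (or `Writing large partition ...` on other Cassandra versions); the exact wording varies by version, so check it against a real log line first. The names `LARGE_PARTITION`, `WidePartition` and `widest_partitions` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple

# ASSUMPTION: the log message looks like
#   "... large partition <keyspace>/<table>:<key> (<N> bytes)"
# The exact format depends on the Cassandra version; adjust the pattern to
# match what the logs actually contain.
LARGE_PARTITION = re.compile(
    r"(?:Compacting|Writing) large partition "
    r"(?P<keyspace>[^/\s]+)/(?P<table>[^:\s]+):(?P<key>.+?) "
    r"\((?P<size>\d+) bytes\)"
)


class WidePartition(NamedTuple):
    size: int  # bytes, as reported in the log line
    keyspace: str
    table: str
    key: str


def widest_partitions(lines: Iterable[str], limit: int = 48) -> List[WidePartition]:
    """Return the `limit` largest partitions mentioned in the given log lines.

    A partition that is logged more than once (for example, compacted on
    several occasions or on several nodes) is reported once, at the largest
    size seen.
    """
    largest = {}
    for line in lines:
        match = LARGE_PARTITION.search(line)
        if match is None:
            continue  # not a large-partition warning
        ident = (match["keyspace"], match["table"], match["key"])
        largest[ident] = max(largest.get(ident, 0), int(match["size"]))

    ranked = sorted(
        (WidePartition(size, *ident) for ident, size in largest.items()),
        reverse=True,  # size is the first field, so this sorts largest first
    )
    return ranked[:limit]
```

Passing an open log file (or the concatenated logs from several nodes) as `lines` returns the ranked list. The default `limit=48` simply mirrors the count above.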
----
== First pass ==
| File | Count | Description |
|----------------|---------|-----------------------------------------|
| {F4385579} | 18 | Partitions larger than 10G in size |
| {F4385581} | 30 | > 5G and <= 10G in size |
| {F4385582} | 653 | > 1G and <= 5G in size |
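The counts in the table can be re-derived from a list of partition sizes with something like the sketch below. It assumes that "G" means GiB (2^30 bytes) and that sizes are in bytes; if the thresholds were meant as decimal gigabytes, change `GIB` accordingly. `size_bucket` and `bucket_counts` are illustrative names, not part of any existing tooling.

```python
from collections import Counter
from typing import Iterable, Optional

# ASSUMPTION: "G" in the table means GiB; use 10**9 instead for decimal GB.
GIB = 2**30


def size_bucket(size_bytes: int) -> Optional[str]:
    """Map a partition size in bytes to its row in the table, or None if <= 1G."""
    if size_bytes > 10 * GIB:
        return "> 10G"
    if size_bytes > 5 * GIB:
        return "> 5G and <= 10G"
    if size_bytes > 1 * GIB:
        return "> 1G and <= 5G"
    return None


def bucket_counts(sizes: Iterable[int]) -> Counter:
    """Count how many partition sizes fall into each bucket."""
    return Counter(
        bucket for bucket in map(size_bucket, sizes) if bucket is not None
    )
```

The upper bounds are inclusive, matching the `<=` in the descriptions, so a partition of exactly 10G lands in the middle row rather than the top one.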
Working files:
{F4385595}
{F4385597}
{F4385598} (raw log entries)
{F4385667}
{F4385666}
{F4385668}