Prior to the outage on 2025-03-31, we were under the impression that sessionstore was wildly over-provisioned. What that incident demonstrated, though, was that an aberrant workload has the potential to create rapid, unsustainable growth. Worse, every indication is that the workload in question was accidental; a bad actor with an understanding of the circumstances could probably do much worse. We should increase storage capacity to extend our runway and buy more time in such situations.
The current disk configuration uses two 480GB SSDs in a software RAID1. LVM is used to create volumes for swap, /, and /srv. The latter is used exclusively by Cassandra, and is ~370GB in size. What I propose is to use a smaller RAID1 for swap and /, and leave the remaining space on each drive for a JBOD configuration in Cassandra (w/ Cassandra system tables stored on the RAID). This would double the space available to Cassandra. Obviously this will require a reimage of each host.
With this configuration in place, we can later add SSDs to the JBOD if we determine more space is needed.
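As a rough sanity check on the "double the space" claim, here is a minimal sketch of the arithmetic. It assumes each drive keeps the same ~110GB reserved outside of Cassandra's data that the current layout implies (480GB minus the ~370GB /srv volume), and that sizes are approximate; the function names are hypothetical and only for illustration.

```python
# Back-of-the-envelope capacity comparison (all sizes approximate, in GB).
# Assumption: under JBOD, every drive gives up the same reserved space
# (swap, /, etc.) that the current RAID1 layout implies: 480 - 370 = 110.

def raid1_data_capacity_gb(drive_gb, reserved_gb):
    """Usable Cassandra space when the drives mirror each other."""
    return drive_gb - reserved_gb


def jbod_data_capacity_gb(drive_sizes_gb, reserved_gb):
    """Usable Cassandra space when each drive is a separate data directory."""
    return sum(size - reserved_gb for size in drive_sizes_gb)


current = raid1_data_capacity_gb(480, 110)            # 370
proposed = jbod_data_capacity_gb([480, 480], 110)     # 740
with_third = jbod_data_capacity_gb([480] * 3, 110)    # 1110

print(f"current RAID1: {current}GB")
print(f"2-drive JBOD:  {proposed}GB ({proposed / current:.1f}x)")
print(f"3-drive JBOD:  {with_third}GB ({with_third / current:.1f}x)")
```

Under those assumptions, each additional SSD of the same size adds roughly another 370GB, which is what makes growing the JBOD later attractive.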
Finally, there is some impetus to increase storage density in all our clusters (where possible), and the use of JBOD configurations in Cassandra is something being considered more widely (see: T380416: modernize cassandra deployments). The type of configuration discussed here seems as though it could be standardized and utilized for other clusters as well (read: we should keep that in mind).
Edit (2025-05-12):
The updated proposal looks something like this (see also: r1142635):
This sets aside ~60G for swap, /, and /srv/cassandra/instance-data on every drive (critically, including those not needed for the RAID1). That's about 13% of the 480G SSDs used here (which seems like quite a lot), or ~3% of the 1.9T SSDs (which we probably ought to standardize on).
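For anyone who wants to check the overhead figures, a quick sketch of the calculation follows. It treats 1.9T as 1900G and the ~60G reservation as exact, both of which are simplifying assumptions; the function name is hypothetical.

```python
# Fraction of each drive lost to the ~60G per-drive reservation.
# Assumptions: the reservation is exactly 60G and 1.9T is taken as 1900G.

def reserved_fraction(reserved_gb, drive_gb):
    """Return the share of a drive consumed by the reserved space."""
    return reserved_gb / drive_gb


for drive_gb in (480, 1900):
    share = reserved_fraction(60, drive_gb)
    print(f"{drive_gb}G drive: {share:.1%} reserved")
# 480G drive: 12.5% reserved
# 1900G drive: 3.2% reserved
```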
Maybe worth noting: the RAID1 could be extended over more than 2 drives, which would provide a bit more redundancy and read throughput (even if neither is really needed).
See also: T390630: Alert when disk space utilization on sessionstore nodes is trending high

