RESTBase Cassandra high utilization alarms (instance-data)
Closed, Resolved, Public

Description

During planned PDU maintenance in codfw, Cassandra hosts in the eqiad datacenter experienced abnormally high storage utilization of the /srv/cassandra/instance-data volumes.

See also: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-10_cassandra_disk_space

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-08-10T14:38:23Z] <urandom> disabling reserved space on eqiad nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941

Eevans updated the task description.
Eevans triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2022-08-10T15:37:30Z] <urandom> (ephemerally) increasing hinted hand-off delivery rate limit to 16KB, RESTBase eqiad nodes -- T314941

Mentioned in SAL (#wikimedia-operations) [2022-08-10T15:46:35Z] <urandom> flushing tables in row A (RESTBase Cassandra cluster) -- T314941

Change 822110 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassanrdra: Increase hint delivery throughput

https://gerrit.wikimedia.org/r/822110

Mentioned in SAL (#wikimedia-operations) [2022-08-10T15:51:56Z] <urandom> flushing tables in row B (RESTBase Cassandra cluster) -- T314941

Mentioned in SAL (#wikimedia-operations) [2022-08-10T16:09:11Z] <urandom> flushing tables in row D (RESTBase Cassandra cluster) -- T314941

Change 822110 merged by Hnowlan:

[operations/puppet@production] cassanrdra: Increase hint delivery throughput

https://gerrit.wikimedia.org/r/822110

Mentioned in SAL (#wikimedia-operations) [2022-08-10T16:29:28Z] <urandom> restarting Cassandra (RESTBase) -row A- to apply r822110 -- T314941

Mentioned in SAL (#wikimedia-operations) [2022-08-10T17:06:49Z] <urandom> flushing RESTBase Cassandra tables -row B- to (temporarily) free instance-data space -- T314941

Mentioned in SAL (#wikimedia-operations) [2022-08-10T18:13:27Z] <urandom> truncating codfw Cassandra hints (eqiad datacenter) -- T314941

Mentioned in SAL (#wikimedia-operations) [2022-08-10T18:22:25Z] <urandom> truncating Cassandra hints (eqiad datacenter) -- T314941

Mentioned in SAL (#wikimedia-operations) [2022-08-17T18:40:57Z] <urandom> disabling reserved space on codfw nodes (RESTBase), /dev/md2 (aka /srv/cassandra/instance-data) -- T314941