Reporting of wide Cassandra partitions
Open, Low, Public

Description

Currently, the only visibility into partition width comes via a graph in the summary dashboard, which is keyspace-specific. This means that for someone to become aware of an issue (T187255: Investigate abnormally wide partitions, for example), they must expand this row in the dashboard and cycle through the various keyspace templates, looking at each in turn. This is not realistic.

Some ideas:

  1. Create a dashboard (a compact, single stat panel, ideally) of topk max partition sizes for all keyspaces and all nodes
  2. Create a check_prometheus-based alert for Icinga for the highest-max for all keyspaces and all nodes
  3. Resurrect the [[ https://github.com/wikimedia/services-adhoc-reports/blob/master/report-topk-partion-size | report-topk-partion-size ]] script
  4. All of the above
NOTE: #1 and #2 above will probably require the use of a server-side Prometheus aggregation (aka a recording rule)
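
As a rough illustration of ideas #1 and #2, here is a minimal Python sketch of the aggregation involved: collapse per-table samples to the widest partition per (keyspace, instance), rank the top k, and map the single highest value to an Icinga-style status. The metric name `cassandra_table_max_partition_size_bytes`, the `keyspace`/`instance` label names, the `Sample` type, and the thresholds are all assumptions made for illustration; none of them come from this task or from the actual exporter configuration.

```python
from typing import NamedTuple

# Hypothetical metric name; the real exporter metric is not given in this task.
METRIC = "cassandra_table_max_partition_size_bytes"


class Sample(NamedTuple):
    keyspace: str
    instance: str
    max_partition_bytes: float


def topk_query(k: int, metric: str = METRIC) -> str:
    """Build a PromQL expression for the k widest (keyspace, instance) pairs."""
    return f"topk({k}, max by (keyspace, instance) ({metric}))"


def topk_partitions(samples, k):
    """Offline equivalent: widest partition per (keyspace, instance), top k."""
    widest = {}
    for s in samples:
        key = (s.keyspace, s.instance)
        widest[key] = max(widest.get(key, 0.0), s.max_partition_bytes)
    ranked = sorted(widest.items(), key=lambda kv: kv[1], reverse=True)
    return [Sample(ks, inst, size) for (ks, inst), size in ranked[:k]]


def check_status(samples, warning_bytes, critical_bytes):
    """Map the single highest max to an Icinga-style status string."""
    highest = max((s.max_partition_bytes for s in samples), default=0.0)
    if highest >= critical_bytes:
        return "CRITICAL"
    if highest >= warning_bytes:
        return "WARNING"
    return "OK"
```

The expression returned by `topk_query` is the kind of aggregation that could be precomputed by a rule and then consumed by both a stat panel and a check_prometheus-based alert; the Python functions only mirror that logic for experimenting with thresholds offline.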

Event Timeline

Eevans triaged this task as Low priority. Jun 7 2021, 8:11 PM

Is this work still relevant? https://phabricator.wikimedia.org/T187255#5066229 implies not but I'm not certain. Are wide partitions not a greatly reduced concern under 3.11?
The relevant panel is now missing from the dashboard and may need updating/recreating.

> Is this work still relevant? https://phabricator.wikimedia.org/T187255#5066229 implies not but I'm not certain. Are wide partitions not a greatly reduced concern under 3.11?
> The relevant panel is now missing from the dashboard and may need updating/recreating.

That comment alludes to past practices that certainly made partition width contentious. We're not committing the sin of those particular anti-patterns anymore, but it is still possible for partitions to be too wide. Given our direction toward platforms (and multi-tenancy generally), I think the possibility of aberrant workloads will only increase, as will the potential for collateral damage.

TL;DR yes, this is still relevant.