Reporting of wide Cassandra partitions
Open, Low, Public

Assigned To

None

Authored By

	Eevans
	Feb 13 2018, 9:46 PM

Description

Currently, the only visibility into partition width comes via a graph in the summary dashboard, which is keyspace-specific. This means that, for someone to become aware of an issue (T187255: Investigate abnormally wide partitions, for example), they must expand this row in the dashboard and cycle through the various keyspace templates, looking at each in turn. This is not realistic.

Some ideas:

1. Create a dashboard (a compact, single stat panel, ideally) of topk max partition sizes for all keyspaces and all nodes
2. Create a check_prometheus-based alert for Icinga for the highest-max for all keyspaces and all nodes
3. Resurrect the [[ https://github.com/wikimedia/services-adhoc-reports/blob/master/report-topk-partion-size | report-topk-partion-size ]] script
4. All of the above

NOTE: #1 and #2 above will probably require the use of a server-side Prometheus aggregation (aka a recording rule)
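
A rough sketch of what the first two ideas above might look like when driven from the Prometheus HTTP API (`GET /api/v1/query`). Everything specific here is an assumption rather than something that exists today: the metric name `cassandra_table_maxpartitionsize`, the `instance`/`keyspace`/`table` labels, and the warning/critical thresholds are placeholders that would need to match whatever the exporter actually emits and whatever limits we agree on.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical metric name; the real one depends on the exporter configuration.
METRIC = "cassandra_table_maxpartitionsize"


def topk_query(k, metric=METRIC):
    """Build a PromQL expression for the k largest max-partition-size series."""
    # The label names (instance, keyspace, table) are assumptions as well.
    return f"topk({k}, max by (instance, keyspace, table) ({metric}))"


def parse_vector(payload):
    """Turn an instant-query response into (labels, bytes) pairs, largest first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_topk(base_url, k=10):
    """Run the topk query against a Prometheus server and parse the result."""
    params = urllib.parse.urlencode({"query": topk_query(k)})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))


def icinga_status(max_bytes, warn=100 * 2**20, crit=1 * 2**30):
    """Map the widest partition to a Nagios/Icinga exit code.

    The default thresholds (100 MiB / 1 GiB) are placeholders, not agreed limits.
    """
    if max_bytes >= crit:
        return 2  # CRITICAL
    if max_bytes >= warn:
        return 1  # WARNING
    return 0  # OK
```

Something like `fetch_topk("http://prometheus.example:9090", k=5)` (hypothetical host) would give the report for idea 1, and feeding the first row's value to `icinga_status` approximates the alert in idea 2. In practice the `max by (...)` aggregation would more likely live in a rule on the Prometheus side, per the note above.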

Related Objects

Mentioned Here: T187255: Investigate abnormally wide partitions

Event Timeline

Eevans created this task. Feb 13 2018, 9:46 PM

Restricted Application added a subscriber: Aklapper. Feb 13 2018, 9:46 PM

mobrovac edited projects, added Services (next); removed Services. Mar 1 2018, 8:39 PM

mobrovac added a project: Platform Team Legacy (Later). Dec 20 2018, 12:16 PM

Eevans triaged this task as Low priority. Jun 7 2021, 8:11 PM

Eevans removed a project: User-Eevans. Jun 9 2021, 4:44 PM

Is this work still relevant? https://phabricator.wikimedia.org/T187255#5066229 implies not but I'm not certain. Are wide partitions not a greatly reduced concern under 3.11?
The relevant panel is now missing from the dashboard and may need updating/recreating.

In T187260#7358761, @hnowlan wrote:

> Is this work still relevant? https://phabricator.wikimedia.org/T187255#5066229 implies not but I'm not certain. Are wide partitions not a greatly reduced concern under 3.11?
> The relevant panel is now missing from the dashboard and may need updating/recreating.

That comment alludes to past practices that certainly made partition width contentious. We're not committing the sin of those particular anti-patterns anymore, but it is still possible for partitions to be too wide. Given our direction toward platforms (and multi-tenancy generally), I think the possibility of aberrant workloads will only increase, as will the potential for collateral damage.

TL;DR yes, this is still relevant.
