Currently, the only visibility into partition width comes via a graph in the summary dashboard, which is keyspace-specific. This means that for someone to become aware of an issue (T187255: Investigate abnormally wide partitions, for example), they must expand this row in the dashboard and cycle through the various keyspace templates, looking at each in turn. This is not realistic.
Some ideas:
- Create a dashboard (a compact, single stat panel, ideally) of topk max partition sizes for all keyspaces and all nodes
- Create a check_prometheus-based alert for Icinga on the highest max partition size across all keyspaces and all nodes
- Resurrect the [[ https://github.com/wikimedia/services-adhoc-reports/blob/master/report-topk-partion-size | report-topk-partion-size ]] script
- All of the above
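
To make the first two ideas more concrete, here is a rough sketch of what a topk query and threshold check could look like against the Prometheus HTTP API. It is only an illustration: the metric name `cassandra_table_maxpartitionsize`, the `keyspace` and `instance` label names, and the thresholds are all assumptions, and would need to be replaced with whatever our exporter actually emits. The helper names are hypothetical too; this is not how check_prometheus itself is implemented.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder metric name; substitute the real exported metric.
METRIC = "cassandra_table_maxpartitionsize"


def build_topk_query(k, metric=METRIC):
    """Build a PromQL expression for the k largest max partition sizes.

    ASSUMPTION: the series carry `keyspace` and `instance` labels.
    """
    return f"topk({k}, max by (keyspace, instance) ({metric}))"


def fetch(base_url, query):
    """Run an instant query via the Prometheus HTTP API (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": query})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def parse_vector(payload):
    """Turn an instant-vector response into (keyspace, instance, bytes) rows,
    sorted from largest to smallest."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rows = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        rows.append((labels.get("keyspace", "?"),
                     labels.get("instance", "?"),
                     float(value)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def check(rows, warn_bytes, crit_bytes):
    """Map the largest value to a Nagios/Icinga-style exit code.

    0 = OK, 1 = WARNING, 2 = CRITICAL. Thresholds are illustrative only.
    """
    if not rows:
        return 0, "OK: no partition size samples returned"
    keyspace, instance, size = rows[0]
    detail = f"{keyspace} on {instance}: {size:.0f} bytes"
    if size >= crit_bytes:
        return 2, "CRITICAL: " + detail
    if size >= warn_bytes:
        return 1, "WARNING: " + detail
    return 0, "OK: " + detail
```

Something like `check(parse_vector(fetch(base_url, build_topk_query(5))), warn, crit)` would then cover all keyspaces and nodes in one pass, and the same query string could back a single stat panel on a dashboard.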