When issues arise with a Cassandra node, it is often most expedient to simply restart it to restore normal operation. However, doing so can destroy valuable information needed to track down the root cause. Since it is not realistic to assume that everyone responding to an alert will know what to look for, we should create a script that automates collecting and archiving the relevant data for later examination.
Some ideas:
- Heap dumps
- Stack dump (or capture)
- Logs (debug)
- nodetool (if possible)
  - status
  - gcstats
  - getcompactionthroughput
  - getstreamthroughput
  - gossipinfo
  - proxyhistograms
  - toppartitions
  - tpstats
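
As a starting point for discussion, here is a rough sketch of what such a script could look like in Python. It is only an illustration under several assumptions: the function names (`run_to_file`, `collect`), the output layout, and the timeouts are made up for this example, and `nodetool`, `jstack`, and `jmap` are assumed to be on the `PATH` and run as the same user as the Cassandra process. `toppartitions` is left out of the loop because, in at least some Cassandra versions, it requires keyspace, table, and duration arguments. Each step is allowed to fail independently, so a hung or half-dead node still yields whatever could be gathered.

```python
import shutil
import subprocess
import tarfile
import time
from pathlib import Path

# nodetool subcommands from the list above (toppartitions omitted: it may
# need keyspace/table/duration arguments, which this sketch does not guess).
NODETOOL_COMMANDS = [
    "status", "gcstats", "getcompactionthroughput", "getstreamthroughput",
    "gossipinfo", "proxyhistograms", "tpstats",
]


def run_to_file(argv, dest, timeout=60):
    """Run argv, writing stdout and stderr to dest.

    Never raises for a missing binary or a timeout; returns True only if
    the command exited with status 0.
    """
    try:
        with open(dest, "wb") as fh:
            proc = subprocess.run(
                argv, stdout=fh, stderr=subprocess.STDOUT, timeout=timeout
            )
        return proc.returncode == 0
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure in place of the output so the gap is visible.
        Path(dest).write_text(f"failed to run {argv!r}: {exc}\n")
        return False


def collect(out_root, pid=None, log_dir=None):
    """Gather diagnostics into a timestamped directory and archive it.

    pid is the Cassandra JVM process id (hypothetical parameter); when it
    is None, the thread and heap dumps are skipped.
    Returns (archive_path, {step_name: succeeded}).
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    work = Path(out_root) / f"cassandra-diag-{stamp}"
    work.mkdir(parents=True)
    results = {}

    for cmd in NODETOOL_COMMANDS:
        results[cmd] = run_to_file(
            ["nodetool", cmd], work / f"nodetool-{cmd}.txt"
        )

    if pid is not None:
        # Thread (stack) dump first: it is cheap and quick.
        results["jstack"] = run_to_file(
            ["jstack", "-l", str(pid)], work / "threads.txt"
        )
        # Heap dump last: it can be very large and pauses the JVM.
        heap_file = work / "heap.hprof"
        results["jmap"] = run_to_file(
            ["jmap", f"-dump:format=b,file={heap_file}", str(pid)],
            work / "jmap.txt",
            timeout=600,
        )

    if log_dir is not None and Path(log_dir).is_dir():
        shutil.copytree(log_dir, work / "logs")

    archive = Path(out_root) / f"{work.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(work, arcname=work.name)
    return archive, results
```

A responder would then run something like `collect("/var/tmp", pid=12345, log_dir="/var/log/cassandra")` before restarting the node; the paths and PID here are placeholders. Because the heap dump pauses the JVM and can fill a disk, it probably makes sense to keep it optional.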