Redis monitoring and alarming could be better:
- We collect Redis metrics with Diamond, but apart from an occasional Grafana dashboard we do nothing with them. We should review them for trends and hotspots, and start alarming on those.
- Our current replication monitoring is unreliable because it falls victim to the Great Puppet Monitoring Race Condition. For Redis instances in a multi-DC setup, when we switch the replication flow from one site to the other, the flow gets inverted by the puppet run on the hosts. Until those runs have completed AND puppet has also run on the monitoring host, there is a discrepancy between what we're testing for and what is actually configured. This results in a ton of false positives that we want to avoid.
- We probably want to alarm not just on trends recorded to Grafana, but also on other conditions such as I/O starvation, swarms of connections, etc.
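
One possible way around the replication race described above is a check that does not hardcode the expected direction at all, and instead asks each instance for its own role and validates whatever it reports. The following is only a minimal sketch of that idea, not a proposal for the final check: the function names, the Nagios-style status strings and the `max_io_age` threshold are assumptions made for illustration, and the input is assumed to be the raw text returned by the Redis `INFO replication` command.

```python
def parse_info(text):
    """Parse the raw text of a Redis INFO reply into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


def check_replication(info, max_io_age=60):
    """Return (status, message) based on the role the instance reports.

    max_io_age is a hypothetical threshold, in seconds, for how long a
    replica may go without hearing from its master.
    """
    role = info.get("role")
    if role == "master":
        replicas = int(info.get("connected_slaves", 0))
        return ("OK", f"master with {replicas} connected replica(s)")
    if role == "slave":
        if info.get("master_link_status") != "up":
            return ("CRITICAL", "replica: link to master is down")
        age = int(info.get("master_last_io_seconds_ago", -1))
        if age < 0 or age > max_io_age:
            return ("WARNING", f"replica: last master I/O {age}s ago")
        return ("OK", f"replica of {info.get('master_host')}, link up")
    return ("UNKNOWN", f"unexpected role: {role!r}")
```

Because the check only looks at what the instance itself reports, it keeps passing while puppet is still catching up on the monitoring host. The trade-off is that it no longer verifies that the master sits in the intended site, so that would need a separate, less time-sensitive check.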
See also: