
Redis monitoring needs to be improved
Open, Medium, Public

Description

Redis monitoring and alarming could be better:

  • We collect metrics with Diamond, but apart from an occasional Grafana dashboard we do nothing with them. We should look at them for trends and hotspots, and start alarming on them.
  • Our current replication monitoring is pretty lame, as it falls victim to the Great Puppet Monitoring Race Condition. For Redis instances in a multi-DC setup, when we switch the replication flow from one site to the other, the puppet run on the hosts inverts the replication flow. Until puppet has completed running on those hosts AND on the monitoring host, there is a discrepancy between what we're testing for and what we're actually configuring. This results in a ton of false positives that we want to avoid.
  • We probably want to alarm not just on trends recorded in Grafana, but also on other conditions such as I/O starvation, swarms of connections, etc.
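
One possible way around the race condition in the replication check is to stop comparing each instance against a master/replica role rendered by puppet, and instead check only that the instance is consistent with the role it reports itself. The following is a minimal sketch of that idea, not what the patch on this task does. It assumes the check receives the text of a Redis `INFO replication` reply; the function names `parse_info` and `check_replication` are hypothetical, while the field names (`role`, `connected_slaves`, `master_host`, `master_link_status`) are those Redis reports.

```python
def parse_info(text):
    """Parse the key:value lines of a Redis INFO reply into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def check_replication(info_text):
    """Return (status, message), where status is 'OK' or 'CRITICAL'.

    Hypothetical check: it does not assume which site is the master.
    It only verifies that a replica has a working link to its master.
    """
    info = parse_info(info_text)
    role = info.get("role")
    if role == "master":
        replicas = info.get("connected_slaves", "0")
        return "OK", "master with %s connected replicas" % replicas
    if role == "slave":
        master = info.get("master_host", "unknown")
        if info.get("master_link_status") == "up":
            return "OK", "replica of %s, link up" % master
        return "CRITICAL", "replica link to %s is down" % master
    return "CRITICAL", "could not determine role"
```

The input could come from `redis-cli INFO replication` run against each instance. Because the expected topology is never hard-coded on the monitoring host, a check along these lines would not flap while puppet is still converging after a site switch. The trade-off is that it would not notice an instance that is healthy but replicating from the wrong master.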

See also:

Event Timeline

Joe triaged this task as Medium priority.

Change 284489 had a related patch set uploaded (by Giuseppe Lavagetto):
redis::monitoring::instance: partially disable replication checks

https://gerrit.wikimedia.org/r/284489

Change 284489 merged by Giuseppe Lavagetto:
redis::monitoring::instance: partially disable replication checks

https://gerrit.wikimedia.org/r/284489

Joe removed Joe as the assignee of this task. Apr 27 2016, 10:58 AM