Couple of notes:
- We'd need to write a meaningful runbook telling people which metrics to check (mcrouter, redis, etc.)
- Refactor https://grafana.wikimedia.org/dashboard/db/redis to show per-host usage metrics (rather than only aggregated results). In https://phabricator.wikimedia.org/T223310 it was clear that the aggregated metrics for Redis ops usage are too coarse-grained to show any outlier.
@fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestions on how best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?
I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case redis was involved, so alarming on at least redis metrics makes more sense to me, or maybe even higher level, like mediawiki? My two cents though, it is possible these options have been explored and discarded already!
re: swift bandwidth metrics, those are referred to in the grafana dashboard, not in an alert, so I'm not sure I understand.
re: bandwidth itself, I believe we do have port utilization alerts based on librenms (cc @ayounsi), though I don't know at what threshold, etc.
The difficult bit is that we don't have good visibility into how "expensive" commands to redis/memcached are in terms of tx bandwidth. For example, in this case there was a huge increase in requests to Redis, but it might also happen that a particular low-rate GET triggers a huge response that fills the tx bandwidth. What I'd like to have is a generic alarm for bandwidth usage, very coarse-grained but effective enough to, say, spot a regression after a mediawiki deployment or similar. Didn't think about librenms, could be something to investigate!
@ayounsi thoughts? :)
I have something in LibreNMS: "Access port utilization over 80% for 1h". It's not set to alert though, it's mostly used as an FYI, so I have visibility on hosts that could become problematic in the future.
I don't think LibreNMS is the proper tool for those specific mc* alerts:
- It only has a 5min granularity
- Doesn't integrate with Icinga
- Only matches servers using the switch port description
- Can't easily display all the target servers' bandwidth (e.g. aggregate view)
Services behave differently when there is congestion. I think they all should alert, but with different time windows.
For example, one service might need an emergency response after 30min of saturating its uplink, while another might only need a notification after a few hours.
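To make the "different time windows" idea concrete, here is a rough Python sketch, not a proposal for the actual check. It assumes we can get (timestamp, counter) samples of node_network_transmit_bytes_total for a host and that we know the link speed. The policy names (service_a, service_b), the Policy class and the helper functions are all hypothetical; the 80% threshold is borrowed from the LibreNMS rule above and the windows from the 30min / few hours example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Policy:
    threshold: float  # fraction of link capacity, e.g. 0.8 for 80%
    window_s: int     # how long utilization must stay above the threshold
    severity: str     # what to do once the window is exceeded


# Hypothetical per-service policies, for illustration only.
POLICIES = {
    "service_a": Policy(threshold=0.8, window_s=30 * 60, severity="page"),
    "service_b": Policy(threshold=0.8, window_s=3 * 3600, severity="notify"),
}


def tx_utilization(samples: List[Tuple[float, int]],
                   link_bps: float) -> List[Tuple[float, float]]:
    """Turn ordered (timestamp_s, tx_bytes_total) counter samples into
    (interval_end_timestamp_s, fraction_of_link_used) pairs."""
    out = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or b1 < b0:
            # Skip out-of-order samples and counter resets (e.g. reboot).
            continue
        out.append((t1, (b1 - b0) * 8 / dt / link_bps))
    return out


def sustained_over(util: List[Tuple[float, float]],
                   threshold: float, window_s: int) -> bool:
    """True if every interval up to the most recent one has been at or
    above the threshold for at least window_s seconds (measured between
    interval end timestamps, so slightly conservative)."""
    if not util:
        return False
    end = util[-1][0]
    start = None
    for t, u in reversed(util):
        if u < threshold:
            break
        start = t
    return start is not None and end - start >= window_s


def evaluate(policy: Policy, samples: List[Tuple[float, int]],
             link_bps: float) -> Optional[str]:
    """Return the policy's severity if it should fire, else None."""
    util = tx_utilization(samples, link_bps)
    if sustained_over(util, policy.threshold, policy.window_s):
        return policy.severity
    return None


# 40 minutes at 90% of a 1 Gbps link, sampled every 60s.
samples = [(t, 112_500_000 * t) for t in range(0, 2401, 60)]
for name, policy in POLICIES.items():
    print(name, evaluate(policy, samples, link_bps=1e9))
```

With the same 40 minutes of saturation, the 30min policy fires and the few-hours one stays quiet. In a real setup the same logic would more likely live in the alerting rule itself rather than in a script.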