
Create an alert for high memcached bw usage
Open, NormalPublic

Description

After fixing T223310, @elukey proposed that we create a bandwidth alert for all mc* hosts. This will help us identify, early enough, changes in keys that could exhaust a host's bandwidth.

Event Timeline

jijiki created this task.May 28 2019, 8:44 AM
Restricted Application added a subscriber: Aklapper.May 28 2019, 8:44 AM
jijiki triaged this task as Normal priority.May 28 2019, 9:11 AM
jijiki updated the task description. (Show Details)


kchapman moved this task from Inbox to Radar on the Performance-Team board.May 28 2019, 7:54 PM
kchapman edited projects, added Performance-Team (Radar); removed Performance-Team.
elukey moved this task from Backlog to Mcrouter/Memcached on the User-Elukey board.Jul 5 2019, 6:53 AM

elukey added a comment.

@fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestions about how best to proceed? I'd like to create this alarm sooner rather than later, since it could prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

fgiunchedi added a comment.

> @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestions about how best to proceed? I'd like to create this alarm sooner rather than later, since it could prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case Redis was involved, so alarming on at least Redis metrics makes more sense to me, or maybe even higher level, like MediaWiki. My two cents though, it is possible these options have been explored and discarded already! Re: swift bandwidth metrics, those are referenced in the Grafana dashboard, not in an alert, so I'm not sure I understand.

Re: bandwidth itself, I believe we do have port utilization alerts based on LibreNMS (cc @ayounsi), though I don't know at what threshold, etc.

elukey added a comment.Jul 5 2019, 1:03 PM

> @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestions about how best to proceed? I'd like to create this alarm sooner rather than later, since it could prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

> I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case Redis was involved, so alarming on at least Redis metrics makes more sense to me, or maybe even higher level, like MediaWiki. My two cents though, it is possible these options have been explored and discarded already! Re: swift bandwidth metrics, those are referenced in the Grafana dashboard, not in an alert, so I'm not sure I understand.

The difficult bit is that we don't have good visibility into how "expensive", in terms of tx bandwidth, commands to Redis/memcached are. For example, in this case there was a huge increase in requests to Redis, but it might also happen that a particular low-rate GET triggers a huge response that fills the tx bandwidth. What I'd like to have is a generic alarm for bandwidth usage, very coarse-grained but effective for, say, spotting a regression after a MediaWiki deployment or similar. I hadn't thought about LibreNMS, that could be something to investigate!

@ayounsi thoughts? :)
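
As a rough illustration of the coarse-grained bandwidth alarm described above, a Prometheus alerting rule on node_network_transmit_bytes_total could look something like the sketch below. The instance regex, interface filter, assumed 1 Gbps link speed, thresholds, and alert name are all illustrative assumptions, not the actual production configuration, and whether such a check would live in a Prometheus rule file or behind an Icinga wrapper is a separate question.

```yaml
groups:
  - name: memcached_bandwidth
    rules:
      - alert: MemcachedHighTxBandwidth
        # Average transmit rate over the last 5 minutes, converted to bits/s,
        # compared against 80% of an assumed 1 Gbps uplink.
        expr: >
          rate(node_network_transmit_bytes_total{instance=~"mc.*", device!="lo"}[5m]) * 8
          > 0.8 * 1e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High TX bandwidth on {{ $labels.instance }} ({{ $labels.device }})"
```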

fgiunchedi moved this task from Backlog to Radar on the observability board.Jul 8 2019, 1:08 PM
ayounsi added a comment (edited).Jul 8 2019, 9:59 PM

> Re: bandwidth itself, I believe we do have port utilization alerts based on LibreNMS (cc @ayounsi), though I don't know at what threshold, etc.

I have something in LibreNMS: "Access port utilization over 80% for 1h". But it is not set to alert; it's mostly used as an FYI, so I have visibility into hosts that could become problematic in the future.

I don't think LibreNMS is the proper tool for these specific mc* alerts:

  • It only has 5-minute granularity
  • It doesn't integrate with Icinga
  • It only matches servers using the switch port description
  • It can't easily display all the target servers' bandwidth (e.g. an aggregate view)

Services behave differently when there is congestion. I think they all should alert, but with different time windows.
For example, one service might need an emergency response after 30 minutes of saturating its uplink, while another only needs a notification after a few hours.
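
To make the "same alert, different time windows" idea concrete, one hedged sketch (again assuming Prometheus-style rules; the host regexes, thresholds, durations, and alert names are illustrative only) is to reuse the same saturation expression with a different `for:` duration per service:

```yaml
- alert: MemcachedUplinkSaturated
  expr: rate(node_network_transmit_bytes_total{instance=~"mc.*", device!="lo"}[5m]) * 8 > 0.9 * 1e9
  for: 30m    # emergency response after ~30 minutes of sustained saturation
  labels:
    severity: critical
- alert: SwiftUplinkSaturated
  expr: rate(node_network_transmit_bytes_total{instance=~"ms-.*", device!="lo"}[5m]) * 8 > 0.9 * 1e9
  for: 3h     # lower urgency: notify after a few hours
  labels:
    severity: warning
```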

elukey added a comment.Jul 9 2019, 2:47 PM

Makes sense, I am now wondering whether we should create a generic and configurable alarm or not :)
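
If the rule were made generic rather than mc*-specific, one possible direction (a sketch only; it assumes the node_exporter netclass collector exposes the node_network_speed_bytes gauge on these hosts, which may not hold everywhere) is to compare each interface's transmit rate against a fraction of its own reported link speed, so the same rule could cover any cluster:

```yaml
- alert: NodeHighTxBandwidth
  # Fires when an interface sustains more than 80% of its advertised speed.
  # Both sides are in bytes/s; node_network_speed_bytes is an assumption
  # about which metrics are available, not a confirmed production metric.
  expr: >
    rate(node_network_transmit_bytes_total{device!="lo"}[5m])
    > 0.8 * node_network_speed_bytes
  for: 15m
  labels:
    severity: warning
```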