Create an alert for high memcached bw usage
Closed, ResolvedPublic

Description

After fixing T223310, @elukey proposed we create a bandwidth alert for all mc* hosts. This will help us identify, early enough, changes in key usage that could exhaust a host's bandwidth.

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 28 2019, 8:44 AM
jijiki triaged this task as Medium priority. May 28 2019, 9:11 AM
jijiki updated the task description.

Couple of notes:

@fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet; do you have any suggestion about how best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

> @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet; do you have any suggestion about how best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case redis was involved, so alerting on at least redis metrics makes more sense to me, or maybe even something higher level like mediawiki? My two cents though; it is possible these options have been explored and discarded already! Re: swift bandwidth metrics, those are referred to in the grafana dashboard, not in an alert, so I'm not sure I understand.

Re: bandwidth itself, I believe we do have port utilization alerts based on librenms (cc @ayounsi), though I don't know at what threshold they fire, for example.

> @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet; do you have any suggestion about how best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

> I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case redis was involved, so alerting on at least redis metrics makes more sense to me, or maybe even something higher level like mediawiki? My two cents though; it is possible these options have been explored and discarded already! Re: swift bandwidth metrics, those are referred to in the grafana dashboard, not in an alert, so I'm not sure I understand.

The difficult bit is that we don't have good visibility into how "expensive", in terms of tx bandwidth, commands to redis/memcached are. For example, in this case there was a huge increase in requests to Redis, but it might also happen that a particular low-rate GET triggers a huge response that fills the tx bandwidth. What I'd like to have is a generic alarm for bandwidth usage, very coarse-grained but effective enough to, say, spot a regression after a mediawiki deployment or similar. I didn't think about librenms; that could be something to investigate!

@ayounsi thoughts? :)
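
As an illustration of the generic, coarse-grained alarm being discussed, a minimal sketch of a check built on node_network_transmit_bytes_total could look like the following. The Prometheus endpoint, the cluster/device labels, the 5-minute window, and the 80% threshold are all placeholder assumptions for the sketch, not anything that exists in puppet:

```
#!/usr/bin/env python3
# Hypothetical sketch only: flag hosts whose 5-minute average transmit rate
# exceeds 80% of an assumed 1 Gbit/s link, using the metric mentioned above.
import requests

PROMETHEUS = 'http://prometheus.example.org/api/v1/query'  # placeholder endpoint
QUERY = 'rate(node_network_transmit_bytes_total{cluster="memcached",device="eth0"}[5m])'
LINK_BYTES_PER_SEC = 1e9 / 8           # assumed 1 Gbit/s NIC, in bytes/s
THRESHOLD = 0.8 * LINK_BYTES_PER_SEC   # assumed 80% utilization threshold

def hosts_over_threshold():
    """Yield (instance, tx bytes/s) for hosts above the threshold."""
    reply = requests.get(PROMETHEUS, params={'query': QUERY}).json()
    for series in reply['data']['result']:
        rate = float(series['value'][1])
        if rate > THRESHOLD:
            yield series['metric'].get('instance', 'unknown'), rate

if __name__ == '__main__':
    for instance, rate in hosts_over_threshold():
        print(f'{instance}: tx at {rate / LINK_BYTES_PER_SEC:.0%} of link capacity')
```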

> Re: bandwidth itself, I believe we do have port utilization alerts based on librenms (cc @ayounsi), though I don't know at what threshold they fire, for example.

I have something in LibreNMS: "Access port utilization over 80% for 1h". But it is not set to alert; it's mostly used as an FYI, so I have visibility on hosts that could become problematic in the future.

I don't think LibreNMS is the proper tool for these specific mc* alerts:

  • It only has 5-minute granularity
  • It doesn't integrate with Icinga
  • It only matches servers using the switch port description
  • It can't easily display all the target servers' bandwidth (e.g. an aggregate view)

Services behave differently when there is congestion. I think they all should alert, but with different time windows.
For example, one service might need an emergency response after saturating its uplink for 30 minutes, while another only needs a notification after a few hours.
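
One hedged way to picture this "same check, different urgency per service" idea is a small per-cluster policy table; the cluster names, thresholds, and windows below are invented examples, not anything configured in puppet:

```
# Invented example values; only the shape of the configuration is the point.
SATURATION_POLICY = {
    # cluster       (utilization threshold, sustained window, action)
    'memcached':    (0.90, '30m', 'page'),    # emergency response after 30 minutes
    'appserver':    (0.90, '30m', 'page'),
    'analytics':    (0.80, '4h',  'notify'),  # an FYI after a few hours is enough
}
```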

Makes sense, I am now wondering if we should create a generic and configurable alarm or not :)

@CDanis this is an old task that I opened, do you think that we could revamp it and use what you have in mind to detect bursts in bandwidth usage? It would make a big difference in managing memcached. I can offer my time/help in case you don't have much; even a prototype to see how it works would be a great start.

Change 588431 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Add NIC saturation exporter (Python implementation)

https://gerrit.wikimedia.org/r/588431

The patch now posted here is a reasonably-clean Python implementation of the same idea described in my now-long-ago comment at T239983#5719681:

This is generated by running ifstat 1 (so it polls the NIC kernel stats every second) and incrementing a counter whenever we see a second-long interval where NIC utilization was >=90% in either direction.

I think this approach -- running high-frequency sampling locally, looking for saturation of some resource, incrementing a counter when it happens, and having Prometheus scrape that counter -- is an interesting and useful thing to do in the general case, something we could think about for e.g. LVS CPU0 saturation.

I intend to start running this exporter on all memcache hosts at first, with the eventual goal of the entire fleet. (But I think we'd probably restrict alerting to just certain clusters, via either whitelist or blacklist -- we care about NIC saturation on lvs hosts and appservers, but probably not on analytics batch job worker hosts, for instance.)
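
For readers who want the shape of the exporter, here is a rough, self-contained sketch of the approach described above: sample the NIC once per second and bump a Prometheus counter for every second that was >= 90% utilized. This is not the code from the patch (which wraps ifstat 1); the metric name, the port, and the /proc and /sys reads are assumptions made for the sketch.

```
#!/usr/bin/env python3
# Rough approximation of the described approach, not the actual exporter.
import time
from prometheus_client import Counter, start_http_server

DEVICE = 'eth0'              # assumed interface name
SATURATION_FRACTION = 0.90   # threshold quoted in the comment above

SATURATED_SECONDS = Counter(
    'nic_saturated_seconds_total',  # hypothetical metric name
    'Seconds in which NIC utilization was >= 90% of link speed',
    ['device', 'direction'],
)

def read_bytes(device):
    """Return (rx_bytes, tx_bytes) for one device from /proc/net/dev."""
    with open('/proc/net/dev') as f:
        for line in f:
            if line.strip().startswith(device + ':'):
                fields = line.split(':', 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f'device {device} not found')

def link_bytes_per_sec(device):
    """Link speed in bytes/s, from the kernel's Mbit/s figure."""
    with open(f'/sys/class/net/{device}/speed') as f:
        return int(f.read()) * 1_000_000 / 8

if __name__ == '__main__':
    start_http_server(9876)  # arbitrary port for this sketch
    threshold = link_bytes_per_sec(DEVICE) * SATURATION_FRACTION
    prev_rx, prev_tx = read_bytes(DEVICE)
    while True:
        time.sleep(1)
        rx, tx = read_bytes(DEVICE)
        if rx - prev_rx >= threshold:
            SATURATED_SECONDS.labels(device=DEVICE, direction='rx').inc()
        if tx - prev_tx >= threshold:
            SATURATED_SECONDS.labels(device=DEVICE, direction='tx').inc()
        prev_rx, prev_tx = rx, tx
```

A real exporter would also need to handle multiple interfaces and counter resets; the point here is only the pattern of sampling locally at high frequency, counting saturated intervals, and letting Prometheus scrape the counter.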

Change 588431 merged by CDanis:
[operations/puppet@production] Add NIC saturation exporter (Python implementation)

https://gerrit.wikimedia.org/r/588431

Change 588760 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] puppetize nic_saturation_exporter & run on memcache hosts

https://gerrit.wikimedia.org/r/588760

Change 588760 merged by CDanis:
[operations/puppet@production] puppetize nic_saturation_exporter & run on memcache hosts

https://gerrit.wikimedia.org/r/588760

Change 589067 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] fix NIC saturation exporter to be jessie-compatible 😖

https://gerrit.wikimedia.org/r/589067

Change 589067 merged by CDanis:
[operations/puppet@production] fix NIC saturation exporter to be jessie-compatible 😖

https://gerrit.wikimedia.org/r/589067

Change 589070 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] jessie-ified nic_saturation_exporter on memcache hosts

https://gerrit.wikimedia.org/r/589070

Change 589070 merged by CDanis:
[operations/puppet@production] jessie-ified nic_saturation_exporter on memcache hosts

https://gerrit.wikimedia.org/r/589070

Change 589085 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] run nic_saturation_exporter on all hosts

https://gerrit.wikimedia.org/r/589085

Change 589277 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::prometheus::nic_saturation_exporter: pass through ensure param

https://gerrit.wikimedia.org/r/589277

Change 589277 merged by CDanis:
[operations/puppet@production] profile::prometheus::nic_saturation_exporter: pass through ensure param

https://gerrit.wikimedia.org/r/589277

Change 589085 merged by CDanis:
[operations/puppet@production] nic_saturation_exporter on all physical hosts w/ hiera enabled

https://gerrit.wikimedia.org/r/589085

I've been abusing this task for the rollout of nic_saturation_exporter to other hosts; I'm moving tracking of that to T250401

There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that right?)

We should probably make aggregated alerts for each, implemented as a single check_prometheus rule, so that they aren't too spammy.

An overall alert for NIC saturation for a few 'critical' clusters (memc, appserver/api, databases, cache_text, ?) is probably a good idea too.

Then I think we can call this done.
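
To make the "aggregated, not spammy" idea concrete, a sketch of a single per-cluster query (rather than one alert per host) might look like this. The metric name follows the hypothetical exporter sketch earlier, and the clusters, the window, and the wiring into a check_prometheus rule are placeholders:

```
# Placeholder clusters and metric name; one evaluated query per cluster,
# so a saturated cluster produces a single alert instead of one per host.
CRITICAL_CLUSTERS = ['memcached', 'appserver', 'api_appserver', 'cache_text']

def cluster_saturation_query(cluster):
    # Total saturated seconds accumulated by any host in the cluster
    # over the last 10 minutes.
    return (
        f'sum(increase(nic_saturated_seconds_total{{cluster="{cluster}"}}[10m]))'
    )

for cluster in CRITICAL_CLUSTERS:
    print(cluster, cluster_saturation_query(cluster))
```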

> There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that right?)

Yep correct, we didn't add one yet!

> We should probably make aggregated alerts for each, implemented as a single check_prometheus rule, so that they aren't too spammy.

> An overall alert for NIC saturation for a few 'critical' clusters (memc, appserver/api, databases, cache_text, ?) is probably a good idea too.

+1; what I'd love to have is an alarm that fires only if sustained saturation is reached (as opposed to a temporary spike).
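
One way to express "sustained, not a spike" on top of a saturated-seconds counter would be to require that most of a longer window was saturated; the numbers and metric name below are illustrative only:

```
# Illustrative only: a host counts as in sustained saturation if it spent at
# least 20 of the last 30 minutes at >= 90% NIC utilization; a single
# saturated second (a temporary spike) never crosses this threshold.
WINDOW = '30m'
MIN_SATURATED_SECONDS = 20 * 60

SUSTAINED_SATURATION_QUERY = (
    f'increase(nic_saturated_seconds_total[{WINDOW}]) > {MIN_SATURATED_SECONDS}'
)
print(SUSTAINED_SATURATION_QUERY)
```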

Change 691216 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Revert "fix NIC saturation exporter to be jessie-compatible 😖"

https://gerrit.wikimedia.org/r/691216

@CDanis: Only https://gerrit.wikimedia.org/r/c/operations/puppet/+/691216 is still open on this ticket, should that be merged or abandoned? Thanks.

We haven't had any issues caused by high memcached traffic for quite a long time. Our measures (gutter pool, on-host memcached, and multi-DC) appear to be helping so far :)

Closing this task, will reopen if needed

An optional (but in my opinion useful) alert could be related to prolonged usage of the gutter pool, which is not something we wish for. From a quick glance at the metrics it has never really happened, but if we introduce big keys/values it may very well happen without anything else breaking.

> An optional (but in my opinion useful) alert could be related to prolonged usage of the gutter pool, which is not something we wish for. From a quick glance at the metrics it has never really happened, but if we introduce big keys/values it may very well happen without anything else breaking.

Yes, that makes sense! We should keep it in mind when we do so (cc @Krinkle @aaron)