Create an alert for high memcached bw usage
Closed, ResolvedPublic

Assigned To

Authored By

	jijiki
	May 28 2019, 8:44 AM

Description

After fixing T223310, @elukey proposed that we create a bandwidth alert for all mc* hosts. This will help us spot, early enough, changes to keys that could exhaust our hosts' bandwidth.

Details

Subject	Repo	Branch	Lines +/-
Revert "fix NIC saturation exporter to be jessie-compatible 😖"	operations/puppet	production	+22 -14
profile::prometheus::nic_saturation_exporter: pass through ensure param	operations/puppet	production	+22 -7
nic_saturation_exporter on all physical hosts w/ hiera enabled	operations/puppet	production	+4 -3
jessie-ified nic_saturation_exporter on memcache hosts	operations/puppet	production	+58 -0
fix NIC saturation exporter to be jessie-compatible 😖	operations/puppet	production	+13 -21
puppetize nic_saturation_exporter & run on memcache hosts	operations/puppet	production	+58 -0
Add NIC saturation exporter (Python implementation)	operations/puppet	production	+182 -0

Related Objects

Mentioned In: T250401: run nic_saturation_exporter on all physical hosts
Mentioned Here: T250401: run nic_saturation_exporter on all physical hosts
T223310: Investigate increase in tx bandwidth usage for mc1033

Event Timeline

jijiki created this task. May 28 2019, 8:44 AM

Restricted Application added a subscriber: Aklapper. May 28 2019, 8:44 AM

jijiki added subscribers: Anomie, Krinkle, aaron. May 28 2019, 8:44 AM

jijiki triaged this task as Medium priority. May 28 2019, 9:11 AM

jijiki updated the task description.

Couple of notes:

We'd need to write a meaningful runbook telling people which metrics to check (mcrouter, redis, etc.).
Refactor https://grafana.wikimedia.org/dashboard/db/redis to show per-host usage metrics (rather than only aggregated results). In https://phabricator.wikimedia.org/T223310 it was clear that the aggregated metrics for Redis ops usage are too coarse-grained to show any outlier.

elukey added a project: User-Elukey. May 28 2019, 9:24 AM

• kchapman moved this task from Inbox, needs triage to Radar on the Performance-Team board. May 28 2019, 7:54 PM

• kchapman edited projects, added Performance-Team (Radar); removed Performance-Team.

Dzahn added a project: observability. May 28 2019, 7:57 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. May 30 2019, 6:52 PM

elukey moved this task from Backlog to Mcrouter/Memcached on the User-Elukey board. Jul 5 2019, 6:53 AM

@fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestion about how it is best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

In T224454#5307968, @elukey wrote:

@fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestion about how it is best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case redis was involved, so alarming on at least redis metrics makes more sense to me, or maybe even higher level like mediawiki? My two cents though, it is possible these options have been explored and discarded already! re: swift bandwidth metrics, those are referred to in the grafana dashboard, not in an alert; I'm not sure I understand

re: bandwidth itself, I believe we do have port utilization alerts based on librenms (cc @ayounsi), though I don't know, e.g., at what threshold.

In T224454#5308149, @fgiunchedi wrote:

In T224454#5307968, @elukey wrote:

@fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any suggestion about how it is best to proceed? I'd like to create this alarm sooner rather than later, since it can prevent outages :) Should we create something generic that multiple hosts/clusters could reuse?

I'm not very familiar with the problem, but my suggestion would be to alert either on symptoms (ideally as experienced by users) or at as high a level as reasonably possible. In this case redis was involved, so alarming on at least redis metrics makes more sense to me, or maybe even higher level like mediawiki? My two cents though, it is possible these options have been explored and discarded already! re: swift bandwidth metrics, those are referred to in the grafana dashboard, not in an alert; I'm not sure I understand

The difficult bit is that we don't have good visibility into how "expensive" commands to redis/memcached are in terms of tx bandwidth. For example, in this case there was a huge increase in requests to Redis, but it might also happen that a particular low-rate GET triggers a huge response that fills the tx bandwidth. What I'd like to have is a generic alarm for bandwidth usage, very coarse-grained but effective enough to, say, spot a regression after a mediawiki deployment or similar. Didn't think about librenms, could be something to investigate!

@ayounsi thoughts? :)

fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 8 2019, 1:08 PM

In T224454#5308149, @fgiunchedi wrote:

re: bandwidth itself, I believe we do have port utilization alerts based on librenms (cc @ayounsi), though I don't know, e.g., at what threshold.

I have something in LibreNMS: "Access port utilization over 80% for 1h". It is not set to alert, though; it's mostly used as an FYI, so that I have visibility on hosts that could become problematic in the future.

I don't think LibreNMS is the proper tool for these specific mc* alerts:

It only has a 5min granularity
It doesn't integrate with Icinga
It only matches servers using the switch port description
It can't easily display the bandwidth of all the target servers (e.g. an aggregate view)

Services behave differently when there is congestion. I think they all should alert, but with different time windows.
For example, one service might need an emergency response after saturating its uplink for 30min, while another might only need a notification after a few hours.

Makes sense, I am now wondering if we should create a generic and configurable alarm or not :)

@CDanis this is an old task that I opened, do you think that we could revamp it and use what you have in mind to detect bursts in bandwidth usage? It would make a big difference in managing memcached. I can offer my time/help in case you don't have much; even a prototype to see how it works would be a great start.

CDanis claimed this task. Apr 3 2020, 12:12 PM

Change 588431 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Add NIC saturation exporter (Python implementation)

https://gerrit.wikimedia.org/r/588431

gerritbot added a project: Patch-For-Review. Apr 14 2020, 4:05 PM

The patch now posted here is a reasonably-clean Python implementation of the same idea described in my now-long-ago comment at T239983#5719681:

This is generated by running ifstat 1 (so it polls the NIC kernel stats every second) and incrementing a counter whenever we see a second-long interval where NIC utilization was >=90% in either direction.

I think this approach -- running high-frequency sampling locally, looking for saturation of some resource, incrementing a counter when it happens, and having Prometheus scrape that counter -- is an interesting and useful thing to do in the general case, something we could think about for e.g. LVS CPU0 saturation.

I intend to start running this exporter on all memcache hosts at first, with the eventual goal of the entire fleet. (But I think we'd probably restrict alerting to just certain clusters, via either whitelist or blacklist -- we care about NIC saturation on lvs hosts and appservers, but probably not on analytics batch job worker hosts, for instance.)
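For readers who want to see the shape of the idea, here is a minimal sketch of the counting logic described above. It is not the code from the merged patch: the class name `SaturationCounter`, the helper `deltas`, and the assumption that link speed is known up front are all hypothetical, and it works on already-collected byte counters rather than parsing `ifstat` output or exposing anything to Prometheus.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class SaturationCounter:
    """Counts sampling intervals in which a NIC ran at or above a threshold.

    Hypothetical illustration only; names and defaults are assumptions.
    """
    link_speed_bps: float        # link capacity in bits per second (assumed known)
    threshold: float = 0.90      # fraction of capacity treated as "saturated"
    saturated_seconds: int = 0   # only ever increases, so it can be scraped as a counter

    def observe(self, rx_bytes: int, tx_bytes: int, interval_s: float = 1.0) -> bool:
        """Record one interval; return True if either direction was saturated."""
        limit_bits = self.link_speed_bps * self.threshold * interval_s
        saturated = rx_bytes * 8 >= limit_bits or tx_bytes * 8 >= limit_bits
        if saturated:
            self.saturated_seconds += 1
        return saturated


def deltas(samples: Iterable[Tuple[int, int]]) -> Iterator[Tuple[int, int]]:
    """Turn cumulative (rx_bytes, tx_bytes) readings into per-interval deltas."""
    previous = None
    for rx, tx in samples:
        if previous is not None:
            yield rx - previous[0], tx - previous[1]
        previous = (rx, tx)
```

Feeding one cumulative reading per second through `deltas` and into `observe` leaves `saturated_seconds` holding the number of seconds spent at or above 90% in either direction, which is the value a scraper would then read.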

Change 588431 merged by CDanis:
[operations/puppet@production] Add NIC saturation exporter (Python implementation)

https://gerrit.wikimedia.org/r/588431

Maintenance_bot removed a project: Patch-For-Review. Apr 14 2020, 5:11 PM

Change 588760 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] puppetize nic_saturation_exporter & run on memcache hosts

https://gerrit.wikimedia.org/r/588760

gerritbot added a project: Patch-For-Review. Apr 14 2020, 5:37 PM

Change 588760 merged by CDanis:
[operations/puppet@production] puppetize nic_saturation_exporter & run on memcache hosts

https://gerrit.wikimedia.org/r/588760

Maintenance_bot removed a project: Patch-For-Review. Apr 15 2020, 5:10 PM

Change 589067 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] fix NIC saturation exporter to be jessie-compatible 😖

https://gerrit.wikimedia.org/r/589067

gerritbot added a project: Patch-For-Review. Apr 15 2020, 5:19 PM

Change 589067 merged by CDanis:
[operations/puppet@production] fix NIC saturation exporter to be jessie-compatible 😖

https://gerrit.wikimedia.org/r/589067

Change 589070 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] jessie-ified nic_saturation_exporter on memcache hosts

https://gerrit.wikimedia.org/r/589070

Change 589070 merged by CDanis:
[operations/puppet@production] jessie-ified nic_saturation_exporter on memcache hosts

https://gerrit.wikimedia.org/r/589070

Maintenance_bot removed a project: Patch-For-Review. Apr 15 2020, 6:11 PM

Change 589085 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] run nic_saturation_exporter on all hosts

https://gerrit.wikimedia.org/r/589085

gerritbot added a project: Patch-For-Review. Apr 15 2020, 6:33 PM

Change 589277 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::prometheus::nic_saturation_exporter: pass through ensure param

https://gerrit.wikimedia.org/r/589277

Change 589277 merged by CDanis:
[operations/puppet@production] profile::prometheus::nic_saturation_exporter: pass through ensure param

https://gerrit.wikimedia.org/r/589277

Change 589085 merged by CDanis:
[operations/puppet@production] nic_saturation_exporter on all physical hosts w/ hiera enabled

https://gerrit.wikimedia.org/r/589085

I've been abusing this task for the rollout of nic_saturation_exporter to other hosts; I'm moving the tracking of that to T250401.

Maintenance_bot removed a project: Patch-For-Review. Apr 16 2020, 4:10 PM

There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that right?)

We should probably make aggregated alerts for each, implemented as a single check_prometheus rule, so that they aren't too spammy.

An overall alert for NIC saturation for a few 'critical' clusters (memc, appserver/api, databases, cache_text, ?) is probably a good idea too.

Then I think we can call this done.
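To make the "aggregated, not spammy" idea above concrete, here is a small sketch of folding per-host results into a single alert message. It is an assumption-laden illustration, not a check_prometheus rule: the function name `aggregate_alert`, its parameters, and the example host names other than mc1033 are hypothetical.

```python
from typing import Dict, Optional


def aggregate_alert(
    fraction_by_host: Dict[str, float],
    threshold: float = 0.5,
    min_hosts: int = 1,
) -> Optional[str]:
    """Return one summary message for the whole cluster, or None if all is well.

    fraction_by_host maps a host name to the fraction of recent seconds its
    NIC spent saturated (how that fraction is obtained is out of scope here).
    """
    offenders = sorted(
        host for host, fraction in fraction_by_host.items() if fraction >= threshold
    )
    if len(offenders) < min_hosts:
        return None
    return f"{len(offenders)} host(s) with saturated NICs: {', '.join(offenders)}"
```

With this shape, a cluster where several hosts saturate at once produces one notification listing all of them, rather than one notification per host.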

In T224454#6269950, @CDanis wrote:

There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that right?)

Yep correct, we didn't add one yet!

We should probably make aggregated alerts for each, implemented as a single check_prometheus rule, so that they aren't too spammy.

An overall alert for NIC saturation for a few 'critical' clusters (memc, appserver/api, databases, cache_text, ?) is probably a good idea too.

+1, what I'd love to have is an alarm that fires only if sustained saturation is reached (as opposed to a temporary spike).
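One way to tell sustained saturation from a short spike, sketched here under stated assumptions, is to look at how much a saturated-seconds counter grew over a window and compare that to the elapsed time. The function names, the 60-second scrape interval, the 30-scrape window, and the 50% cutoff are all hypothetical choices for illustration, not values anyone agreed on in this task.

```python
from typing import Sequence


def saturated_fraction(counter: Sequence[int], scrape_interval_s: float) -> float:
    """Fraction of elapsed seconds spent saturated between first and last scrape."""
    if len(counter) < 2:
        return 0.0
    elapsed = (len(counter) - 1) * scrape_interval_s
    increase = counter[-1] - counter[0]
    # A negative increase means the counter was reset; treat it as no saturation.
    return max(increase, 0) / elapsed


def is_sustained(
    counter: Sequence[int],
    scrape_interval_s: float = 60.0,
    window_scrapes: int = 30,
    min_fraction: float = 0.5,
) -> bool:
    """True only if a full recent window exists and was mostly saturated."""
    if len(counter) < window_scrapes + 1:
        return False
    recent = counter[-(window_scrapes + 1):]
    return saturated_fraction(recent, scrape_interval_s) >= min_fraction
```

A 45-second burst inside a 30-minute window yields a fraction of 0.025 and stays silent, while a host saturated for most of every minute crosses the cutoff. Different services could use different `window_scrapes` values, in line with the earlier point that time windows should vary per service.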

jijiki moved this task from Incoming 🐫 to 🔦Unused2 on the serviceops board. Aug 17 2020, 11:48 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:02 PM

Change 691216 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Revert "fix NIC saturation exporter to be jessie-compatible 😖"

https://gerrit.wikimedia.org/r/691216

gerritbot added a project: Patch-For-Review. May 14 2021, 2:11 PM

@CDanis: Only https://gerrit.wikimedia.org/r/c/operations/puppet/+/691216 is still open on this ticket, should that be merged or abandoned? Thanks.

Joe moved this task from 🔦Unused2 to 💾 Datastores on the serviceops board. Oct 6 2022, 5:52 AM

We haven't had any issues caused by high memcached traffic for quite a long time. Our measures (gutter pool, on-host memcached, and multi-DC) appear to be helping so far :)

Closing this task, will reopen if needed

An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutter pool, that is not something we wish for. It never really happened from a quick glance in metrics, but if we introduce big key/values it may very well happen without anything else breaking apart.

In T224454#8411988, @elukey wrote:

An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutter pool, that is not something we wish for. It never really happened from a quick glance in metrics, but if we introduce big key/values it may very well happen without anything else breaking apart.

Yes, that makes sense! We should keep it in mind when we do so (cc @Krinkle @aaron)
