
Create an Icinga check to alert on dropped packets
Closed, Declined · Public

Description

While investigating T200563, I realized that some wdqs servers were dropping a significant number of packets. A quick look at prometheus node_network_receive_drop shows that a number of other servers also see significant packet drops (ganeti, elasticsearch, restbase, thumbor, ...).

After a quick chat with @fgiunchedi, a proposed check could be:

  • check that the number of packets dropped over a period P is below a threshold T
  • run that check at a relatively low frequency F
  • P = 24h, T = 1k, F = 1/6h

Since this is a fleet-wide check, feedback is welcome before we implement it.
It would also make sense to add node_network_transmit_drop to check for transmit drops; see the sketch below.
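For illustration only, a minimal sketch (not the actual patch) of the expressions such a check could evaluate, using the metric names above and the P = 24h / T = 1k values from the list; the instance matcher is an assumption:

  # receive side: more than 1000 packets dropped over the last 24h
  increase(node_network_receive_drop{instance=~".*:9100"}[24h]) > 1000
  # transmit side, per the note above
  increase(node_network_transmit_drop{instance=~".*:9100"}[24h]) > 1000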

Event Timeline

Gehel created this task. · Oct 3 2018, 9:45 AM
Restricted Application added a subscriber: Aklapper. · Oct 3 2018, 9:45 AM
faidon added a subscriber: faidon. · Oct 3 2018, 12:02 PM

Why wasn't this caught by check_ping? Is it actual packet loss?

Gehel added a comment. · Oct 3 2018, 8:18 PM

I can only speculate at this point, but the packet loss seems to be happening in bursts; depending on the ping check interval, we might miss it. I'm not sure if Icinga alerts on the first lost ping or not.

Volans added a subscriber: Volans. · Oct 4 2018, 9:02 AM

As a reminder, be careful when adding and merging fleet-wide checks. I'm not sure how many more we can add without increasing the Icinga load too much, as 1 fleet-wide check => 1300 checks ;)

Gehel added a comment. · Oct 4 2018, 9:37 AM

As a reminder, be careful when adding and merging fleet-wide checks. I'm not sure how many more we can add without increasing the Icinga load too much, as 1 fleet-wide check => 1300 checks ;)

Yeah! We'll definitely keep this in mind! The mitigation is that we can run this check at a pretty low frequency (at least for a start).

Change 465450 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] base::monitoring::host: added prometheus check for network receive drops

https://gerrit.wikimedia.org/r/465450

Mathew.onipe triaged this task as Normal priority. · Oct 10 2018, 2:05 PM
Mathew.onipe raised the priority of this task from Normal to High.
Mathew.onipe updated the task description. (Show Details)
Restricted Application added a project: Product-Analytics. · Oct 10 2018, 2:08 PM
Gehel added a comment. · Oct 11 2018, 7:50 PM

General question on how to deploy this kind of change:

This will most probably trip on a number of nodes (I know that at least wdqs, elasticsearch and restbase are dropping packets, and that is not being addressed at the moment). How do we manage this check until the situation is under control? While it is probably hiding real issues in some cases, it does not seem to be blocking anything too badly (or we would have addressed the problem already). How do we ensure that a check like this does not generate too much noise when we enable it?

How do we ensure that a check like this does not generate too much noise when we enable it?

My understanding is that the check looks only at the last 24h of data, so I don't expect too many hosts to trigger it. Of course, if you want to verify, you could run the same query on prometheus for all hosts and check the result against the defined threshold.

Another option is to set a very high critical value and use what we think should be the critical value as the warning value, so hosts will show up in Icinga as warnings without spamming the IRC channel.

How do these packet losses manifest? Are we talking about packets being lost in flight, error counters in interfaces, or something else?

(I'm not super convinced that a fleet-wide Prometheus check makes sense for this, but I don't have the full picture yet so it's too early to say!)

Gehel added a comment. · Oct 12 2018, 8:48 AM

The ones I have seen are relatively short bursts of errors in interface error counters on WDQS (node_network_receive_drop in prometheus). In the case of WDQS, it seems related to CPU starvation and seems to be actionable (T206105). It looks to me like something we need to address more generally, but that's sufficiently out of my comfort zone that feedback is definitely welcome!

I did a quick audit in eqiad (for starters) to preview how we'd be affected by the alert, in this way:

Just in eqiad there's 839 matches, so likely we'll need some filtering/tuning first
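(For illustration, a sketch of what such a fleet-wide audit could look like in PromQL, assuming the 1k/24h threshold from the description; the query actually used is not shown above, and the instance matcher is an assumption:)

  # count the interfaces that dropped more than 1000 received packets in 24h
  count(increase(node_network_receive_drop{instance=~".*:9100"}[24h]) > 1000)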

Just in eqiad there's 839 matches, so likely we'll need some filtering/tuning first

Some packet loss is expected, so we at least need to tune the thresholds. Playing with prometheus and raising the limit to 10K / 24h, the list is down to a more manageable 29 hosts. Of course, at some point we need to set a limit on what functionally makes sense, not what looks better in icinga :) I'm not sure where that limit would be. It seems to me that 200K packets lost / 24h is probably an issue, 10K probably not. And that threshold might be different for different services.

Overall, we should probably address some of that high packet loss before enabling this check.

For the record, the services with the highest packet loss are:

  • elasticsearch
  • restbase
  • swift
  • graphite
  • kafka_jumbo

I would consider also making the threshold a percentage of the normal traffic.

I would consider also making the threshold a percentage of the normal traffic.

So obvious. Of course!

What should be the runbook/actions when this alert goes off?

Gehel added a comment. · Oct 17 2018, 8:26 PM

What should be the runbook/actions when this alert goes off?

I don't think we can have a standard runbook to cover all cases, but packet loss over *some* threshold needs investigation. This should be a non-paging alert.

As an example, in the case of WDQS, we had strange connection failures on some HTTP requests. We traced them down to packet loss, related to CPU contention on IRQ handling for the NIC (competing with the application); the actions are enabling RPS and finding ways to bound the CPU usage of blazegraph.

So I did some analysis with prometheus on the elasticsearch nodes in eqiad, running the following queries to show packet drops against packets received:

  1. increase(node_network_receive_packets{instance=~"elastic10[0-9][0-9]:9100"}[24h])
  2. increase(node_network_receive_drop{instance=~"elastic10[0-9][0-9]:9100"}[24h])

and here are some screenshots:

For packets received: [screenshot not included]

and for packet drops: [screenshot not included]

From the screenshots above, we can easily say the packet drops are negligible; hence, I would go with @Volans' proposal of using a percentage-based threshold. For example, if we begin to see > 0.5% of traffic dropped, then Icinga can start throwing alerts.
I would start making patches for elasticsearch and maps if this proposal makes sense.

To further clarify the point above about using a percentage-based threshold, here is a screenshot showing the percentage:

query: increase(node_network_receive_drop{instance=~"elastic10[0-9][0-9]:9100"}[24h])/increase(node_network_receive_packets{instance=~"elastic10[0-9][0-9]:9100"}[24h]) * 100

So > 0.5% might even be too high. Thoughts, please.
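(As a sketch only, the percentage query above turned into a threshold expression; the 0.5 figure is the value under discussion, not a settled choice:)

  # alert if more than 0.5% of received packets were dropped over the last 24h
  (increase(node_network_receive_drop{instance=~"elastic10[0-9][0-9]:9100"}[24h])
    / increase(node_network_receive_packets{instance=~"elastic10[0-9][0-9]:9100"}[24h])) * 100 > 0.5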

Gehel added a comment. · Oct 25 2018, 8:33 AM

So this shows that we have less than 0.04% packet loss on the elasticsearch eqiad cluster? I would expect a loss rate that low not to be an issue (which matches the fact that we don't see a functional issue on that cluster). The goal is not to raise alerts for their own sake; it is to raise alerts if we reach a level that is problematic.

We had a strong suspicion that packet loss was causing issues on wdqs, so maybe we should go back in time and look at what the peaks were on the wdqs cluster. Or see if overall we have other nodes where packet loss ratio is higher.

Also note that loss might happen in spikes, which might be somewhat hidden by our 24h window.
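(Sketched as an assumption: one way to surface such spikes is to look at the worst short window within the day instead of the 24h total. This relies on PromQL subqueries, which need Prometheus 2.7 or later:)

  # highest number of received packets dropped in any 5-minute window over the last 24h
  max_over_time(increase(node_network_receive_drop{instance=~".*:9100"}[5m])[24h:5m])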

Gehel closed this task as Declined. · Nov 27 2018, 6:28 PM

The numbers above seem to indicate that we don't have a good signal/noise ratio, so an Icinga check does not make much sense.

Change 465450 abandoned by Mathew.onipe:
base::monitoring::host: added icinga prometheus check for network drops

Reason:
not needed any more

https://gerrit.wikimedia.org/r/465450