Page MenuHomePhabricator

Try out removing the nfs client probes from the node exporter on VMs
Closed, ResolvedPublic

Description

The up-down monitor on our new metricsinfra prometheus flaps often because the node exporter gets wedged. Based on repeated investigations, this seems to be caused by NFS hangs on the client side.

Based on https://github.com/prometheus/node_exporter/issues/578 and later on https://github.com/prometheus/node_exporter/pull/1166, it seems like the best upstream aims to do to fix this is cause the exporter to return 503s instead of blowing up the host during such events. That is a reasonable thing to do, but it also makes the host report down and stop reporting all of our metrics.

It seems like a good idea to monitor NFS at the host level instead of on the client to avoid wedging our entire monitoring setup when the somewhat frequent issue of NFS client hangs comes up.

Event Timeline

Bstorm renamed this task from Try out removing the nfs client probes from VM the node exporter on VMs to Try out removing the nfs client probes from the node exporter on VMs.Fri, May 8, 11:29 PM
Bstorm created this task.

Mentioned in SAL (#wikimedia-cloud) [2020-05-09T00:28:50Z] <bstorm_> added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera T252260

Bstorm claimed this task.Sat, May 9, 12:29 AM
Bstorm triaged this task as Medium priority.
Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

This seems to have stopped the bleeding for the flapping "instance down" notifications.

Change 596063 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud-node-exporter: ignore NFS on the cloud client side

https://gerrit.wikimedia.org/r/596063

Change 596063 merged by Bstorm:
[operations/puppet@production] cloud-node-exporter: ignore NFS on the cloud client side

https://gerrit.wikimedia.org/r/596063

Bstorm closed this task as Resolved.Wed, May 13, 11:47 PM

I think it is safe to say this fixed the flapping instance-down notification.