Add timeout for NFS monitoring tools (prometheus and hopefully not diamond)
Closed, Resolved, Public

Description

Ultimately, the high load seen on otherwise unused NFS shares on Toolforge appears to come from monitoring services that keep trying to run against the share on the clients. When those checks fail, they appear to hang, filling the process table. The issue appears to be more pronounced on Trusty than on Stretch, but it should still be fixed.
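To illustrate the kind of timeout the task title asks for, here is a minimal sketch of a filesystem probe that gives up instead of blocking forever. It is not how prometheus or diamond implement their checks; the function name `stat_with_timeout`, the 5-second default, and the injectable `probe` argument are all assumptions made for this example.

```python
import os
import threading


def stat_with_timeout(path, timeout_s=5.0, probe=os.statvfs):
    """Run probe(path) in a helper thread; return None on error or timeout.

    Hypothetical helper: a call against a hung NFS mount can block
    indefinitely, so the caller waits at most timeout_s seconds and
    then moves on instead of hanging with it.
    """
    result = {}

    def worker():
        try:
            result["value"] = probe(path)
        except OSError:
            # Missing path, stale handle, etc.: treat as "no data".
            pass

    # daemon=True so a stuck helper thread never prevents process exit.
    helper = threading.Thread(target=worker, daemon=True)
    helper.start()
    helper.join(timeout_s)
    if helper.is_alive():
        # Still blocked: report the mount as unavailable.
        return None
    return result.get("value")
```

Note that the blocked helper thread is abandoned, not killed, so a real check would also need to remember which mounts are stuck and skip them on later runs; otherwise stuck threads pile up the same way the hung processes do.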

Event Timeline

Bstorm triaged this task as High priority. Mar 5 2019, 6:14 PM
Bstorm created this task.

https://github.com/prometheus/node_exporter/issues/1259

Basically, this is fixed in a later version than the one we use (we use 0.14, while it is fixed for NFS in 0.17), but upstream doesn't recommend monitoring NFS on the node with Prometheus anyway. The other available fix is even later.

We can fix it quickly by filtering out NFS from the Prometheus checks (worth exploring) or by upgrading Wikimedia's version of node-exporter (also worth exploring).
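As a rough sketch of what "filtering out NFS" means in practice, the snippet below drops NFS entries from a mount table in `/proc/mounts` format before anything tries to stat them. It is an illustration only, not node-exporter's actual configuration; the function name `local_mount_points`, the `NFS_TYPES` set, and the sample mount lines are hypothetical.

```python
# Filesystem types to skip; assumed here to be the NFS client types.
NFS_TYPES = frozenset({"nfs", "nfs4"})


def local_mount_points(mounts_text, skip_types=NFS_TYPES):
    """Return mount points from /proc/mounts-style text, minus skipped types.

    Each line is expected to look like:
        <device> <mount point> <fs type> <options> <dump> <pass>
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        mount_point, fs_type = fields[1], fields[2]
        if fs_type in skip_types:
            continue  # never touch NFS mounts from the monitoring path
        points.append(mount_point)
    return points


# Hypothetical sample data, not taken from a real Toolforge host.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
nfs-server.example:/export /mnt/nfs/example nfs4 rw,hard 0 0
"""

print(local_mount_points(SAMPLE_MOUNTS))
```

On a live host the same function could be fed the contents of `/proc/mounts`, which can be read without touching the NFS server.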

It's already done! It just doesn't apply to Trusty: https://gerrit.wikimedia.org/r/489753

I have verified that this is pinned to 0.17 on Stretch nodes. That's why the Stretch grid didn't collapse during the problem.