
nfsiostat collector appears to be broken
Closed, Resolved, Public


While investigating intermittent issues with the admin tool, I noticed that the k8s node it runs on had a coinciding load spike:

Screen Shot 2017-10-25 at 12.57.54 PM.png (560×724 px, 65 KB)

Load is a well-known proxy for instance-local NFS overload, so I tried to pull up the client NFS stats in Graphite and noticed they are missing.

I looked on an older host and had to go back nearly a year to see actual data :)

Screen Shot 2017-10-25 at 12.58.28 PM.png (557×723 px, 35 KB)

We have a collector that is supposed to push stats per NFS backend.


This collector is a mess: I originally converted nfs-common's /usr/sbin/nfsiostat to a Diamond collector under duress, and discovered some weirdness in how nfsiostat reports ongoing stats (i.e. not the same as, say, iostat). There was a time when this was needed daily because the backend NFS setup was melting.

Based on the timeline I /think/ this dates back to a change in how NFS mounts are presented to clients.

It used to be that /data/project and other NFS mounts were the actual mount point and now we use symlinks and mount under /mnt/nfs:

/data/project -> /mnt/nfs/labstore-secondary-tools-project
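The consequence is that the collector watches the symlink alias rather than the real mount point. A minimal sketch of the mechanism (the Toolforge paths only exist on cluster hosts, so this demonstrates it with a temporary symlink):

```python
import os
import tempfile

# Build a stand-in for the Toolforge layout: a real directory and a
# symlink alias pointing at it (like /data/project -> /mnt/nfs/...).
tmp = tempfile.mkdtemp()
target = os.path.join(tmp, 'labstore-secondary-tools-project')
os.mkdir(target)
alias = os.path.join(tmp, 'project')
os.symlink(target, alias)

# os.path.realpath follows the symlink, giving the path the collector
# should actually monitor.
resolved = os.path.realpath(alias)
print(resolved == os.path.realpath(target))  # True
```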

Event Timeline

The collector's get_default_config() currently has:

    def get_default_config(self):
        """Returns the default collector settings"""
        config = super(NfsiostatCollector, self).get_default_config()
        config.update({
            'enabled':  False,
            'path':     'nfsiostat',
            'devices':  ['/home',
                         # ... remaining entries truncated in the original
                        ],
            'attributes': ['attr',
                           # ... remaining entries truncated in the original
                          ],
        })
        return config

The directories listed as devices are now all symlinks; they should probably be changed to the underlying mount points, which is all the mount_paths plus the project-specific logic.

We can override Diamond's default config by supplying an external config file, which may be cleaner than continuing to hardcode this in the script (see DirectorySizeCollector for an example of doing this).
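For reference, such an override would live in Diamond's per-collector config directory; a sketch, assuming the standard /etc/diamond/collectors/ layout (the path and values here are illustrative, not the deployed config):

```
# /etc/diamond/collectors/NfsiostatCollector.conf (illustrative)
enabled = True
devices = /mnt/nfs/labstore-secondary-tools-project, /mnt/nfs/labstore-secondary-tools-home
```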

I think ideally this would be smarter and figure out which local NFS mounts actually exist. Other collectors, for example the disk space usage collector, do that and have whitelists/blacklists to manage inclusion/exclusion of dynamically found resources.
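Dynamic discovery along those lines could be a small parse of /proc/mounts. A hedged sketch (the function name and filter semantics are illustrative, not the actual collector code):

```python
import re

def find_nfs_mounts(mounts_text, include=None, exclude=None):
    """Return NFS mount points found in /proc/mounts-style text,
    optionally filtered by whitelist/blacklist regex lists."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, ...
        if len(fields) < 3 or not fields[2].startswith('nfs'):
            continue
        mount_point = fields[1]
        if include and not any(re.search(p, mount_point) for p in include):
            continue
        if exclude and any(re.search(p, mount_point) for p in exclude):
            continue
        found.append(mount_point)
    return found

# Example with a fabricated /proc/mounts excerpt:
sample = (
    "proc /proc proc rw 0 0\n"
    "labstore:/srv/project /mnt/nfs/labstore-secondary-tools-project nfs4 rw 0 0\n"
    "labstore:/srv/home /mnt/nfs/labstore-secondary-tools-home nfs4 rw 0 0\n"
)
print(find_nfs_mounts(sample, exclude=[r'-home$']))
# ['/mnt/nfs/labstore-secondary-tools-project']
```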

+1 That sounds like the right thing to do

Mentioned in SAL (#wikimedia-operations) [2017-10-30T11:58:34Z] <arturo> depool tools-exec-1401 to test patch in T179024 --> aborrero@tools-bastion-03:~$ sudo exec-manage depool

Change 387243 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] diamond: nfsiostat: update collector to read from arbitrary NFS mount points

Mentioned in SAL (#wikimedia-operations) [2017-10-30T15:57:26Z] <arturo> depool again for more tests related to T179024

Change 387243 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] diamond: nfsiostat: update collector to read from arbitrary NFS mount points

We run puppet merge also in the tools cluster.

Small note: this was run on the production puppetmasters.