Page MenuHomePhabricator

nfsiostat collector appears to be broken
Closed, ResolvedPublic

Description

When investigating intermittent issues with the admin Tool I noticed that the k8s node it is running on had a load spike that coincided:

https://graphite-labs.wikimedia.org/render/?width=724&height=560&_salt=1508952516.828&target=tools.tools-worker-1007.nfsiostat.labstore.ops&target=tools.tools-worker-1007.nfsiostat.labstore.ops_per_sec&target=tools.tools-worker-1007.loadavg.15&target=tools.tools-worker-1007.loadavg.05&target=tools.tools-worker-1007.loadavg.01

Screen Shot 2017-10-25 at 12.57.54 PM.png (560×724 px, 65 KB)


Load is a pretty well known proxy for instance local NFS overload and so I tried to pull up the client NFS stats in graphite and noticed they are missing.

I looked on an older host and have to go back nearly a year to see actual data :)

https://graphite-labs.wikimedia.org/render/?width=724&height=560&_salt=1508953130.629&target=tools.tools-webgrid-lighttpd-1414.nfsiostat.labstore.ops&target=cactiStyle(tools.tools-webgrid-lighttpd-1414.nfsiostat.labstore.ops_per_sec)&from=-900d

Screen Shot 2017-10-25 at 12.58.28 PM.png (557×723 px, 35 KB)

We have a collector that would push stats per NFS backend?

modules/diamond/files/collector/nfsiostat.py

This collector is a mess in that I was first attempting to convert (nfs-common: /usr/sbin/nfsiostat to a Diamond collector under duress and discovered some weirdness in how nfsiostat translates ongoing stats (i.e. not the same as say iostat). There was a time when this was a needed daily because the backend NFS setup was melting.

Based on the timeline I /think/ this dates back to a change in how NFS mounts are presented to clients.

It used to be that /data/project and other NFS mounts were the actual mount point and now we use symlinks and mount under /mnt/nfs:

/data/project -> /mnt/nfs/labstore-secondary-tools-project

Event Timeline

nfsiostat.py has

def get_default_config(self):
    """
    Returns the default collector settings
    """
    config = super(NfsiostatCollector, self).get_default_config()
    config.update({
        'enabled':  False,
        'path':     'nfsiostat',
        'devices': ['/home',
                    '/public/dumps',
                    '/data/scratch',
                    '/data/project'],
        'attributes': ['attr',
                       'read',
                       'write',
                       'getattr',
                       'lookup',
                       'dir_cache',
                       'page_stats']
    })

The directories listed as devices are all now symlinks, and should probably be changed to the underlying devices, which is all the mount_paths + project specific logic in https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/labs/nfsclient.pp.

We can override Diamond default config by supplying an external config file, which may be cleaner than continuing to hardcode this in the script (see DirectorySizeCollector - https://github.com/wikimedia/puppet/blob/9197b9c91ebb7e531705c18cded253363723266f/modules/role/manifests/labs/nfs/secondary.pp#L153 for example on doing this.

I think ideally this is smarter and figures out what local NFS mounts exist potentially. Other collector, for example disk space usage, do that and have whitelists/blacklists to manage including/excluding dynamically found resources.

+1 That sounds like the right thing to do

Mentioned in SAL (#wikimedia-operations) [2017-10-30T11:58:34Z] <arturo> depool tools-exec-1401 to test patch in T179024 --> aborrero@tools-bastion-03:~$ sudo exec-manage depool tools-exec-1401.tools.eqiad.wmflabs

Change 387243 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] diamond: nfsiostat: update collector to read from arbitrary NFS mount points

https://gerrit.wikimedia.org/r/387243

Mentioned in SAL (#wikimedia-operations) [2017-10-30T15:57:26Z] <arturo> depool again tools-exec-1401.tools.eqiad.wmflabs for more tests related to T179024

Change 387243 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] diamond: nfsiostat update collector to read from arbitrary NFS mount points

https://gerrit.wikimedia.org/r/387243

We run puppet merge also in the tools cluster.

We run puppet merge also in the tools cluster.

small note: this was run on the production puppetmasters