
tools-manifest: WebServiceMonitor load on NFS
Closed, Resolved, Public

Description

/usr/bin/collector-runner WebServiceMonitor wakes up every 10 seconds and searches for files matching /data/project/*/service.manifest.

tools-services-02 generates around 400-500 NFS ops/sec. We'd rather avoid that load on our NFS server, given the performance problems we are experiencing.

The 10-second reaction time could probably be relaxed to better match what our environment can actually sustain.

There's additional load on the LDAP services, but I haven't measured it. Network throughput from LDAP requests is higher than from NFS, so increasing the wake-up interval would also help with that.

Event Timeline

tools-services-02# nfsstat -c -l -Z60
nfs v4 client        total:    25995 
------------- ------------- --------
nfs v4 client         read:        4 
nfs v4 client        write:        6 
nfs v4 client         open:    10450 
nfs v4 client    open_conf:     1372 
nfs v4 client        close:     7300 
nfs v4 client      setattr:        3 
nfs v4 client        renew:        3 
nfs v4 client       access:     6686 
nfs v4 client      getattr:      117 
nfs v4 client       lookup:       41 
nfs v4 client       rename:        3 
nfs v4 client       statfs:       10
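As a rough cross-check of the 400-500 ops/sec figure, the counters above can be summed and divided by the 60-second sampling window requested with -Z60. The snippet below is only a sketch: the `parse_nfsstat` helper and the `WINDOW_SECONDS` constant are hypothetical names introduced for illustration, and the sample text is copied from the output above.

```python
# Counters copied from the nfsstat output above (60-second window via -Z60).
SAMPLE = """\
nfs v4 client        total:    25995
------------- ------------- --------
nfs v4 client         read:        4
nfs v4 client        write:        6
nfs v4 client         open:    10450
nfs v4 client    open_conf:     1372
nfs v4 client        close:     7300
nfs v4 client      setattr:        3
nfs v4 client        renew:        3
nfs v4 client       access:     6686
nfs v4 client      getattr:      117
nfs v4 client       lookup:       41
nfs v4 client       rename:        3
nfs v4 client       statfs:       10
"""

WINDOW_SECONDS = 60  # hypothetical constant: the interval passed to -Z


def parse_nfsstat(text):
    """Return {operation_name: count} for every 'name: count' line."""
    counts = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the dashed separator line
        name, _, value = line.rpartition(":")
        counts[name.split()[-1]] = int(value)
    return counts


counts = parse_nfsstat(SAMPLE)
ops_per_sec = counts["total"] / WINDOW_SECONDS

# Share of the traffic that is file open/close/permission-check work.
file_ops = sum(counts[k] for k in ("open", "open_conf", "close", "access"))
file_ops_share = file_ops / counts["total"]

print(f"{ops_per_sec:.0f} ops/sec, {file_ops_share:.0%} open/open_conf/close/access")
```

The total works out to a little over 430 ops/sec, and nearly all of it is open, open_conf, close and access calls, which is consistent with a process that repeatedly opens many small files rather than one that moves much data.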

Can we cache the result? Like bigbrother, the purpose of WebServiceMonitor is to restart one or more jobs when they exit unexpectedly. These unexpected exits are likely unattended, and the jobs have likely been running for a long time with the manifest file already in place. I think it would be fine to refresh the manifests at a rate of, say, once an hour or every few hours, and to check only the job status every 10 seconds.
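The caching idea in this comment could look roughly like the sketch below. It is not taken from the tools-manifest code: `ManifestCache`, `load_manifests` and `REFRESH_SECONDS` are hypothetical names, and the one-hour refresh is just the value floated in the comment. Only the glob pattern comes from the task description.

```python
import glob
import time

MANIFEST_GLOB = "/data/project/*/service.manifest"  # from the task description
REFRESH_SECONDS = 3600  # hypothetical: "once an hour"


def load_manifests(pattern=MANIFEST_GLOB):
    """Read every manifest from disk (the expensive NFS step)."""
    manifests = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                manifests[path] = handle.read()
        except OSError:
            continue  # file vanished or is unreadable; skip it
    return manifests


class ManifestCache:
    """Keep the last manifest snapshot and reload it only when it is stale."""

    def __init__(self, load=load_manifests, refresh_seconds=REFRESH_SECONDS,
                 clock=time.monotonic):
        self._load = load
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._snapshot = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        stale = (self._loaded_at is None
                 or now - self._loaded_at >= self._refresh_seconds)
        if stale:
            self._snapshot = self._load()
            self._loaded_at = now
        return self._snapshot
```

A monitor loop would call `cache.get()` on every 10-second wake-up and check job status against the cached snapshot, touching NFS only when the snapshot is older than the refresh interval. The trade-off is that a newly created or edited service.manifest would go unnoticed for up to one refresh interval.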

Change 475355 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/software/tools-manifest@master] toolforge: Increase WebServiceMonitor sleep time to 60s

https://gerrit.wikimedia.org/r/475355

Change 475355 merged by GTirloni:
[operations/software/tools-manifest@master] toolforge: Increase WebServiceMonitor sleep time to 60s

https://gerrit.wikimedia.org/r/475355

Mentioned in SAL (#wikimedia-cloud) [2018-11-26T17:39:21Z] <gtirloni> updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) (T210190)
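For a back-of-the-envelope idea of what the 10 -> 60 second change might buy, the measured rate can be scaled by the ratio of the sleep times. This is an estimate, not a measurement: it assumes the NFS load comes almost entirely from the periodic scan, that each scan costs the same number of operations, and that the scan itself takes negligible time compared with the sleep. `projected_ops_per_sec` is a hypothetical helper.

```python
def projected_ops_per_sec(measured_ops_per_sec, old_sleep, new_sleep):
    """Scale a measured rate assuming a fixed number of ops per wake-up."""
    return measured_ops_per_sec * old_sleep / new_sleep


# 25995 operations in a 60-second window, from the nfsstat output in this task.
measured = 25995 / 60
estimate = projected_ops_per_sec(measured, old_sleep=10, new_sleep=60)
print(f"estimated rate after the change: {estimate:.0f} ops/sec")
```

Under those assumptions the rate would drop to roughly a sixth of the measured value; confirming it would require re-running nfsstat on the host after the upgrade.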

GTirloni triaged this task as Medium priority.