We're trying to reduce the total number of NFS clients, and unmounting /data/project from all hosts except the ones that absolutely need it (which would be the mw* hosts and the varnishes, perhaps?) would help.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| beta: Move parsoid logs off NFS | operations/puppet | production | +2 -3 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | AlexMonk-WMF | T102953 Completely remove Beta Cluster dependency on NFS |
| Resolved | yuvipanda | T125624 Disable /data/project for instances in deployment-prep that do not need it |
Event Timeline
So:
- deployment-mediawiki01/02
- deployment-tmh01
- deployment-upload
- deployment-cache-upload04
These seem to be the only ones that need it. I'll await confirmation from someone in the releng team and then disable NFS on the rest.
Last time I checked, various backend services wrote their logs to /data/project so they could be read from anywhere. Though that might just be Parsoid:
manifests/role/parsoid.pp: $parsoid_log_file = '/data/project/parsoid/parsoid.log'
To be checked with the Parsoid folks. Maybe it runs on the shared instances nowadays and relays everything to syslog and logstash.
All instances running MediaWiki (app servers, job runners, etc.) hit NFS to grab images/thumbnails. I have lost track of which instances are running jobs, though. Just from operations/mediawiki-config.git:
```
wmf-config/CommonSettings-labs.php: $wgCaptchaDirectory = '/data/project/upload7/private/captcha/random';
wmf-config/CommonSettings-labs.php: $wgCaptchaDirectory = '/data/project/upload7/private/captcha';
wmf-config/CommonSettings-labs.php: $wgMathDirectory = '/data/project/upload7/math';
wmf-config/CommonSettings-labs.php: $wgScoreDirectory = '/data/project/upload7/score';
wmf-config/InitialiseSettings-labs.php: 'default' => '/data/project/upload7/$site/$lang',
wmf-config/InitialiseSettings-labs.php: 'private' => '/data/project/upload7/private/$lang',
wmf-config/filebackend-labs.php: 'deletedDir' => "/data/project/upload7/private/archive/$site/$lang",
wmf-config/filebackend-labs.php: 'directory' => '/data/project/upload7/wikipedia/commons',
wmf-config/filebackend-labs.php: 'basePath' => "/data/project/upload7/private/gwtoolset/$site/$lang"
```
All that crap will be gone once we have Swift on beta (T64835): those settings can be removed and deployment-upload nuked.
So in summary, to my knowledge the instances that still have NFS actually require it (besides Parsoid).
On the NFS server, is it possible to get a per-instance breakdown of NFS hits for the deployment-prep project? That might identify other instances that hit it but do not strictly need NFS.
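One way to approximate this from the server side, as a rough sketch (assumption: run as root on the labstore NFS server, and clients connect over TCP to the standard NFS port 2049): list the distinct peer addresses holding established connections to that port.

```shell
#!/bin/sh
# Sketch: approximate "which instances are hitting NFS" by listing unique
# client addresses with established TCP connections to port 2049.
# Assumes iproute2's `ss` is available; run on the NFS server itself.
ss -tn state established '( sport = :2049 )' \
  | awk 'NR > 1 { split($4, peer, ":"); print peer[1] }' \
  | sort -u
```

This shows which instances hold a connection, not actual request rates; a real per-client hit breakdown would need more instrumentation than `ss` provides.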
If anything is writing logs to NFS we must make sure it stops as soon as possible. I'll check with the parsoid people :)
ALL instances have NFS now (zotero, restbase, urldownloader, bastion, etc.) and most of them don't need it. Outside of Parsoid, I think I covered all the MW hosts you mentioned in my list (except jobrunners, which I'll add). Once I verify with the Parsoid team, I'll get rid of NFS from the non-MW instances.
Listed the NFS types via salt/df/magic:
```
root@deployment-salt:~# salt -v --out=txt '*' cmd.run "df -t nfs 2>/dev/null|grep -v ^Filesystem|cut -d\ -f7"
Executing job with jid 20160204193722579006
-------------------------------------------
deployment-bastion.deployment-prep.eqiad.wmflabs: /data/project
deployment-db1.deployment-prep.eqiad.wmflabs: /data/project
deployment-db2.deployment-prep.eqiad.wmflabs: /data/project
deployment-fluorine.deployment-prep.eqiad.wmflabs: /data/project
deployment-kafka02.deployment-prep.eqiad.wmflabs: /data/project
deployment-memc02.deployment-prep.eqiad.wmflabs: /data/project
deployment-memc03.deployment-prep.eqiad.wmflabs: /data/project
deployment-memc04.deployment-prep.eqiad.wmflabs: /data/project
deployment-poolcounter01.deployment-prep.eqiad.wmflabs: /data/project
deployment-salt.deployment-prep.eqiad.wmflabs: /data/project
deployment-upload.deployment-prep.eqiad.wmflabs: /data/project
```
Salt is totally lying, because it's totally mounted on more instances than in that list :) from deployment-restbase01:
```
labstore.svc.eqiad.wmnet:/project/deployment-prep/project on /data/project type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.68.16.128,lookupcache=none,local_lock=none,addr=10.64.37.10)
```
The way to disable this would be to set the mount_nfs hiera variable to false on the Hiera:deployment-prep page on wikitech, and turn it on only for the specific hosts that need it.
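Based on the description above, the hiera layout would look roughly like this (a sketch; the page names follow the Hiera:deployment-prep convention used elsewhere in this task, and deployment-mediawiki01 is just one example of a host that needs the mount):

```yaml
# Hiera:deployment-prep -- project-wide default: no NFS for new instances
mount_nfs: false
```

```yaml
# Hiera:deployment-prep/host/deployment-mediawiki01 -- per-host opt-in
mount_nfs: true
```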
Logs should be going via rsyslog forwarding to deployment-fluorine and into the beta logstash server as well using basically the same setup as we use in production. Both use local instance storage.
This is unrelated to swift, since this is only removing the mounts from instances that do not use them at all and is effectively a no-op.
Am going to do this now.
https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=311783&oldid=311781 turns off NFS for new instances! Am turning it on for the instances that need it now.
I've set mount_nfs: true on all the instances I listed earlier.
These are all still no-ops, since I actually need to unmount them now :)
I've unmounted /data/project on deployment-poolcounter01 and verified that puppet doesn't bring it back. I'll unmount it on all the other instances shortly.
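The per-instance step sketched as a script (my framing, not the exact commands used in this task; it is guarded so it is a no-op on hosts where /data/project is not mounted):

```shell
#!/bin/sh
# Sketch: unmount the NFS share if present, run puppet, and report whether
# the mount came back. Run as root on the instance being cleaned up.
if [ -d /data/project ] && mountpoint -q /data/project; then
    umount /data/project
    puppet agent --test
fi
if [ -d /data/project ] && mountpoint -q /data/project; then
    echo "still mounted"
else
    echo "not mounted"
fi
```

"not mounted" after the puppet run is the verification that the hiera change actually stuck.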
I've tried to unmount it from the following instances:
```
deployment-analytics03.eqiad.wmflabs
deployment-analytics02.eqiad.wmflabs
deployment-analytics01.eqiad.wmflabs
deployment-tin.eqiad.wmflabs
deployment-restbase01.eqiad.wmflabs
deployment-sca02.eqiad.wmflabs
deployment-sca01.eqiad.wmflabs
deployment-mathoid.eqiad.wmflabs
deployment-ms-be02.eqiad.wmflabs
deployment-ms-be01.eqiad.wmflabs
deployment-ms-fe01.eqiad.wmflabs
deployment-sentry01.eqiad.wmflabs
deployment-conftool.eqiad.wmflabs
deployment-kafka04.eqiad.wmflabs
deployment-aqs01.eqiad.wmflabs
deployment-eventlogging04.eqiad.wmflabs
deployment-cache-parsoid05.eqiad.wmflabs
deployment-conf03.eqiad.wmflabs
deployment-poolcounter01.eqiad.wmflabs
deployment-eventlogging03.eqiad.wmflabs
deployment-cache-mobile04.eqiad.wmflabs
deployment-cache-text04.eqiad.wmflabs
deployment-puppetmaster.eqiad.wmflabs
mira.eqiad.wmflabs
deployment-logstash2.eqiad.wmflabs
deployment-fluorine.eqiad.wmflabs
deployment-restbase02.eqiad.wmflabs
deployment-zookeeper01.eqiad.wmflabs
deployment-kafka02.eqiad.wmflabs
deployment-zotero01.eqiad.wmflabs
deployment-urldownloader.eqiad.wmflabs
deployment-elastic08.eqiad.wmflabs
deployment-elastic07.eqiad.wmflabs
deployment-elastic06.eqiad.wmflabs
deployment-elastic05.eqiad.wmflabs
deployment-parsoid05.eqiad.wmflabs
deployment-apertium01.eqiad.wmflabs
deployment-cxserver03.eqiad.wmflabs
deployment-mx.eqiad.wmflabs
deployment-redis02.eqiad.wmflabs
deployment-redis01.eqiad.wmflabs
deployment-pdf02.eqiad.wmflabs
deployment-sentry2.eqiad.wmflabs
deployment-pdf01.eqiad.wmflabs
deployment-stream.eqiad.wmflabs
deployment-db2.eqiad.wmflabs
deployment-memc04.eqiad.wmflabs
deployment-db1.eqiad.wmflabs
deployment-salt.eqiad.wmflabs
deployment-memc03.eqiad.wmflabs
deployment-memc02.eqiad.wmflabs
deployment-bastion.eqiad.wmflabs
```
Success except in:
```
deployment-parsoid05
deployment-pdf02
deployment-cache-text04
deployment-cache-mobile04
deployment-cache-parsoid05
deployment-ms-fe01
deployment-sca01
```
Everything except deployment-parsoid05 and deployment-sca01 has been handled. Parsoid seems to be writing logs to NFS >_>
Change 271183 had a related patch set uploaded (by Yuvipanda):
beta: Move parsoid logs off NFS
So all instances that do not need NFS do not have NFS anymore! Woo! :D
If someone is building a new server that does need NFS for whatever reason, set mount_nfs: true in Hiera:deployment-prep/host/<hostname>, run puppet, and it'll mount /data/project.
I marked this again as blocked on the root task T102953: Completely remove Beta Cluster dependency on NFS; it is part of it.
And Swift is T64835: Setup a Swift cluster on beta-cluster to match production, itself a child task of the root task T102953.
The positive side of this is this graph, showing the drop in NFS clients: http://graphite.wikimedia.org/render/?width=948&height=576&_salt=1455738188.847&target=servers.labstore1001.nfsd.clients
Bah: since I didn't delete the entries from /etc/fstab, the mounts come back on reboot, which has since happened to all instances. I will need to do this again.
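A sketch of the missing step (the FSTAB override and backup naming are my additions for safe dry-running; the /data/project path is the real one from this task): delete the matching fstab line so the mount stays gone across reboots.

```shell
#!/bin/sh
# Remove the /data/project entry from an fstab file, keeping a backup.
# FSTAB defaults to /etc/fstab; point it at a copy to dry-run first.
FSTAB="${FSTAB:-/etc/fstab}"
cp "$FSTAB" "$FSTAB.bak"
# Using '|' as the sed address delimiter avoids escaping the path's slashes.
sed -i '\|/data/project|d' "$FSTAB"
```

Run per-instance (e.g. via salt), after which the umount no longer needs repeating on reboot.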