Both user $home directories and tool home directories are stored on the bare-metal labstore1004 NFS server. Rather than refresh this hardware, we're moving NFS into the cloud realm, where the data can be served by a VM NFS server.
The new NFS volume is 'tools-nfs' in the tools project, 10TB. It is attached to tools-nfs-2.tools.eqiad1.wikimedia.cloud, which will serve as the NFS server. I've already rsync'd from labstore1004 to the new volume, but tools NFS is very busy, so the copy falls out of sync quickly.
Migration steps:
[x] switch the mount on labstore1004 to read-only *toolforge outage begins*
```
# puppet agent --disable "switching nfs exports to read-only, forever"
# systemctl stop nfs-exportd.service
# vi /etc/exports.d/tools.exports # and replace -rw with -ro
# /usr/sbin/exportfs -ra
```
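For anyone who prefers not to hand-edit the file, here is a minimal sketch of a non-interactive version of the vi step. It assumes the export options are written in the `-rw,...` form mentioned above and that GNU sed is available; `exports_to_readonly` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: rewrite the "-rw" option token to "-ro" in an exports file.
# Assumes options appear as "-rw,..." (as in the vi step above) and GNU sed (-i).
exports_to_readonly() {
    file="$1"
    # Only match "-rw" as a whole token: preceded by start-of-line or whitespace,
    # followed by a comma, whitespace, or end-of-line.
    sed -E -i 's/(^|[[:space:]])-rw(,|[[:space:]]|$)/\1-ro\2/g' "$file"
}
```

Something like `exports_to_readonly /etc/exports.d/tools.exports` followed by `/usr/sbin/exportfs -ra` would then stand in for the last two commands; it is worth eyeballing the file before re-exporting.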
[x] reduce our workload a bit by truncating log files on labstore1004
```
root@labstore1004:/srv/tools/shared/tools# find . -name '*.out' -size +100M -exec truncate --size=10M {} \;
root@labstore1004:/srv/tools/shared/tools# find . -name '*.err' -size +100M -exec truncate --size=10M {} \;
root@labstore1004:/srv/tools/shared/tools# find . -name '*.log' -size +100M -exec truncate --size=10M {} \;
```
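Since truncation is destructive, it may be worth listing the affected files first. The sketch below covers the same three name patterns in one pass and only prints matches; `list_large_logs` is a hypothetical helper, and the `-type f` filter is an addition that the commands above do not have.

```shell
# Hypothetical helper: list (without modifying) large *.out, *.err and *.log files.
# Usage: list_large_logs DIR [SIZE]   SIZE uses find's -size syntax (default +100M)
list_large_logs() {
    dir="$1"
    size="${2:-+100M}"
    find "$dir" -type f \( -name '*.out' -o -name '*.err' -o -name '*.log' \) \
        -size "$size" -print
}
```

For example, `list_large_logs /srv/tools/shared/tools | wc -l` gives a rough count of what the truncate commands would touch.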
[x] run an rsync refresh from labstore1004 on tools-nfs-2:
```
rsync -av --delete --sparse /mnt/nfs/labstore-secondary-tools-home/ /srv/tools/home/ # should take 30-40 minutes
rsync -av --delete --sparse /mnt/nfs/labstore-secondary-tools-project/ /srv/tools/project/ # should take 10-12 hours
```
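With the source now read-only, a dry run can confirm that nothing is left to copy. This is a sketch only: `rsync_pending` is a hypothetical wrapper that reuses the flags from the commands above and adds rsync's `--dry-run` and `--itemize-changes` options, so it prints one line per pending change and nothing when the trees match.

```shell
# Hypothetical helper: show what rsync would still change, without changing anything.
# Prints one itemized line per pending transfer or deletion; empty output means in sync.
rsync_pending() {
    src="$1"
    dst="$2"
    rsync -a --delete --sparse --dry-run --itemize-changes "$src" "$dst"
}
```

For instance, `rsync_pending /mnt/nfs/labstore-secondary-tools-home/ /srv/tools/home/ | wc -l` should report 0 after a complete refresh. Note that the dry run still has to walk both trees, so on the project share it will not be quick.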
[x] check the immutability of replica.my.cnf, fix if necessary
```
root@tools-nfs-2:/srv/tools/project# find . -name replica.my.cnf -exec chattr +i {} \;
```
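The command above sets the flag unconditionally. For the "check" half of this step, a small filter over `lsattr` output can list files that are still missing the immutable attribute. This is a sketch: `not_immutable` is a hypothetical helper, and it assumes the usual `lsattr` layout of an attribute field followed by the path.

```shell
# Hypothetical helper: read lsattr output on stdin and print the paths whose
# attribute field does not contain the lowercase 'i' (immutable) flag.
not_immutable() {
    awk '$1 !~ /i/ { sub(/^[^ ]+ +/, ""); print }'
}
```

Run from /srv/tools/project as `find . -name replica.my.cnf -exec lsattr {} + | not_immutable`; empty output would mean every replica.my.cnf already has the flag set.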
[x] update nfs export path https://gerrit.wikimedia.org/r/c/operations/puppet/+/904562
[x] update traffic-shaping setup with https://gerrit.wikimedia.org/r/c/operations/puppet/+/904627/1
[x] update server path for toolforge https://gerrit.wikimedia.org/r/c/operations/puppet/+/905229
[x] enable, force puppet run on new nfs server (gets exports set up)
[x] force puppet run and reboot on all toolforge VMs. *restores toolforge*
[x] update dns and take other steps (@dcaro please fill in) to reactivate maintain-dbusers
[x] ensure labstore1004 exports remain read-only https://gerrit.wikimedia.org/r/c/operations/puppet/+/904630
followup tasks:
[] redirect existing backup servers to back up from the new hosts
[x] ensure tool deletion still works (T170355)
[] prune out obsolete nfs-on-metal puppet code
[] move or replace nfs monitoring (read/write, disk space)