Improve unmount/relink setup for dumps (labstore1006/1007) failovers
Open, Needs Triage, Public

Description

labstore1006 and labstore1007 fail over by switching a symlink, primarily when things go badly. However, bad mounts are currently not reliably removed, so recovery requires manual intervention.
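
As a rough illustration of the symlink switch described above, here is a minimal sketch. The function name and both paths are hypothetical; the real link location and mount layout are not given in this task.

```shell
# Hypothetical helper: repoint a "current dumps" symlink at another mount.
# Both arguments are illustrative assumptions, not the real layout.
switch_dumps_link() {
    local target="$1" link="$2"
    # Refuse to switch to something that is not a directory.
    # Note: this check can itself block if the target is a hung hard mount.
    if [ ! -d "$target" ]; then
        echo "refusing to switch: $target is not a directory" >&2
        return 1
    fi
    # -s symbolic, -f replace an existing link,
    # -n treat an existing link to a directory as a plain file
    ln -sfn "$target" "$link"
}

# Example (paths are assumptions):
#   switch_dumps_link /mnt/nfs/dumps-labstore1007.wikimedia.org /tmp/dumps-current
```

This only covers repointing the link; it does nothing about the stale mount left behind, which is the part this task is about.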

This has been brought to our attention by T196651, where disk faults caused read-only remounts on both systems.
First, documenting clush commands and actions to take will handle most cases.
Then, look into any way to automate the process, if possible.

Bstorm created this task. Jun 28 2018, 8:22 PM
Restricted Application added a subscriber: Aklapper. Jun 28 2018, 8:22 PM

This also needs to remove the entry from fstab. I really suspect we should manage NFS mounts outside of the normal puppet mount provider and not automount them on boot, but instead mount them through some explicit puppet action; otherwise a missing NFS server is really tricky to recover from.

Andrew added a subscriber: Andrew. Jun 30 2018, 4:50 PM

We just noticed that a ton of VMs were still blocking on access to labstore1006 even though it had been removed from puppet. Cleanup involved:

  1. restarting the nfs daemon on labstore1006 so that clients could talk to it long enough to detach
  2. purging the labstore1006 line from fstab
  3. unmounting things everywhere
  • also, action item for me: look into mount options that might make this work better
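
Steps 2 and 3 above could be sketched roughly as follows. This is not what was actually run; the helper names are made up for illustration, and both helpers only read files and print, so nothing is changed until someone acts on the output.

```shell
# Print fstab content without entries whose device field is "<server>:...".
# Comments and blank lines pass through untouched.
fstab_without_server() {
    local server="$1" fstab="${2:-/etc/fstab}"
    awk -v srv="$server" 'index($1, srv ":") != 1' "$fstab"
}

# List mount points served by the given server, reading a
# /proc/mounts-style file (device in field 1, mount point in field 2).
mounts_from_server() {
    local server="$1" mounts="${2:-/proc/mounts}"
    awk -v srv="$server" 'index($1, srv ":") == 1 { print $2 }' "$mounts"
}

# Possible use on one client, after reviewing the output:
#   fstab_without_server labstore1006.wikimedia.org > /tmp/fstab.new
#   for m in $(mounts_from_server labstore1006.wikimedia.org); do
#       umount -f -l "$m"    # force + lazy unmount; both are standard umount flags
#   done
```

Printing the filtered fstab instead of editing in place keeps the destructive step explicit, which seems safer when this is fanned out to many hosts.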

Change 444914 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] dumps: add labstore1006 back for dumps serving to cloud

https://gerrit.wikimedia.org/r/444914

Change 444914 merged by Rush:
[operations/puppet@production] dumps: add labstore1006 back for dumps serving to cloud

https://gerrit.wikimedia.org/r/444914

root@wdqs-test:~# puppet agent --test
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for wdqs-test.wikidata-query.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1531241369'
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Mount[/mnt/nfs/dumps-labstore1006.wikimedia.org]/ensure: defined 'ensure' as 'defined'
Info: Computing checksum on file /etc/fstab
Info: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Mount[/mnt/nfs/dumps-labstore1006.wikimedia.org]: Scheduling refresh of Mount[/mnt/nfs/dumps-labstore1006.wikimedia.org]
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Mount[/mnt/nfs/dumps-labstore1006.wikimedia.org]: Triggered 'refresh' from 1 events
Info: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Mount[/mnt/nfs/dumps-labstore1006.wikimedia.org]: Scheduling refresh of Mount[/mnt/nfs/dumps-labstore1006.wikimedia.org]
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Exec[ensure-nfs-labstore1006.wikimedia.org]/returns: mounting /mnt/nfs/dumps-labstore1006.wikimedia.org
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Exec[ensure-nfs-labstore1006.wikimedia.org]/returns: mount.nfs: trying text-based options 'vers=4,bg,intr,sec=sys,proto=tcp,port=0,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=208.80.154.7,clientaddr=10.68.23.175'
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Exec[ensure-nfs-labstore1006.wikimedia.org]/returns: /mnt/nfs/dumps-labstore1006.wikimedia.org is mounted.
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Exec[ensure-nfs-labstore1006.wikimedia.org]/returns: /mnt/nfs/dumps-labstore1006.wikimedia.org seems healthy.
Notice: /Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[labstore1006.wikimedia.org]/Exec[ensure-nfs-labstore1006.wikimedia.org]/returns: executed successfully
Notice: /Stage[main]/Wdqs/Exec[/srv/wdqs/blazegraph/wikidata.jnl exists]/returns: executed successfully
Notice: Applied catalog in 9.25 seconds

Change 445408 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] dumps: point cloud vps hosts at labstore1006

https://gerrit.wikimedia.org/r/445408

Change 445408 merged by Bstorm:
[operations/puppet@production] dumps: point cloud vps hosts at labstore1006

https://gerrit.wikimedia.org/r/445408

On the topic of NFS mounts, as it relates to issues we've seen unmounting things, for reference: https://access.redhat.com/solutions/157873

intr is deprecated. That, coupled with the back-off behavior we have configured, has caused us some problems.
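
For what it's worth, nfs(5) describes intr/nointr as kept only for backward compatibility and ignored after kernel 2.6.25, so dropping it from the option string should be harmless. A minimal sketch of stripping one option from a comma-separated list follows; the function name is made up, and the option string is the one from the puppet output above, minus the address fields.

```shell
# Remove one exact option from a comma-separated mount option string.
strip_mount_option() {
    local opts="$1" drop="$2"
    # One option per line, drop exact whole-line matches, rejoin with commas.
    printf '%s\n' "$opts" | tr ',' '\n' | grep -vxF -- "$drop" | paste -sd, -
}

# Example:
#   strip_mount_option 'vers=4,bg,intr,sec=sys,proto=tcp,soft,timeo=300,retrans=3' intr
```

Matching whole entries (-x) with a fixed string (-F) avoids accidentally removing an option that merely contains the text "intr".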

Small note: we had major trouble today with 1007 going away. By all accounts instances should not have cared, but they did, and load was out of control. It sure seemed like instances were trying to mount 1007, but I don't understand from what directive: /etc/fstab had the entry removed and puppet was disabled. Rebooting a tools-worker and cheating to make it appear that no NFS mounts were at that IP seemed to settle things, but I don't feel like I know much about what is going on.

ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up
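
For reference, the same alias can be expressed with iproute2. The sketch below only prints the commands; the function name is made up, and the defaults just mirror the ifconfig line above (netmask 255.255.255.255 is a /32). Presumably the trick works because connections to that address then terminate on the instance itself instead of hanging, but that is a guess.

```shell
# Print (not run) iproute2 commands equivalent to the ifconfig alias above,
# plus the command to undo it. Dry run by design: review before running as root.
fake_nfs_alias_cmds() {
    local ip="$1" dev="${2:-eth0}" label="${3:-fakenfs}"
    echo "ip addr add ${ip}/32 dev ${dev} label ${dev}:${label}"
    echo "ip addr del ${ip}/32 dev ${dev}"
}

# Example:
#   fake_nfs_alias_cmds 208.80.155.106
```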

So beyond the crappy failover story for NFS on these, we also have a kind of sloppy failover for the web using DNS, since the certificate is only kept up to date via acme on the live web server (though it is less likely to break everything we love).
This is just a note.