
toolforge/paws k8s containers need to know about clouddumps100[12]
Closed, ResolvedPublic

Description

Puppet has an elaborate setup to allow graceful addition of new dumps servers and switching between them, but all of that is lost in k8s, which has hardcoded mount points.

Right now containers mount

/mnt/nfs/dumps-labstore1006.wikimedia.org
and
/mnt/nfs/dumps-labstore1007.wikimedia.org

They need to also mount

/mnt/nfs/dumps-clouddumps1001.wikimedia.org
and
/mnt/nfs/dumps-clouddumps1002.wikimedia.org

So that we can set

dumps_dist_active_vps: clouddumps1001.wikimedia.org

without pointing everything at a nonexistent mount.
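As a rough illustration of the failure mode described above, the following Python sketch checks whether the mount directory for each dumps server exists before that server is made active. The helper names (`dumps_mount_path`, `missing_mounts`) and the `root` parameter are hypothetical and not part of Puppet or the k8s configuration; only the `/mnt/nfs/dumps-<hostname>` naming and the hostnames come from this task.

```python
import os

# Hostnames taken from the task description.
OLD_SERVERS = ["labstore1006.wikimedia.org", "labstore1007.wikimedia.org"]
NEW_SERVERS = ["clouddumps1001.wikimedia.org", "clouddumps1002.wikimedia.org"]


def dumps_mount_path(server, root="/mnt/nfs"):
    """Return the mount point for a dumps server: <root>/dumps-<server>."""
    return os.path.join(root, "dumps-" + server)


def missing_mounts(servers, root="/mnt/nfs"):
    """Return the servers whose mount directory does not exist under root.

    This only checks that the directory is present; it does not verify
    that an NFS share is actually attached there.
    """
    return [s for s in servers if not os.path.isdir(dumps_mount_path(s, root))]


if __name__ == "__main__":
    for server in missing_mounts(OLD_SERVERS + NEW_SERVERS):
        print("not mounted:", dumps_mount_path(server))
```

Run inside a container, a check like this would be expected to list the clouddumps paths until the new mounts are added, which is the situation where switching dumps_dist_active_vps would point at a nonexistent mount.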

Event Timeline

Restricted Application added a subscriber: Aklapper.

PAWS containers should start mounting /mnt/nfs/dumps-clouddumps100[12].wikimedia.org alongside /mnt/nfs/dumps-labstore100[67].wikimedia.org
Then we'll remove /mnt/nfs/dumps-labstore100[67].wikimedia.org entirely once everything is set?

Change 830681 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[cloud/toolforge/volume-admission-controller@main] Add mounts for the new dumps servers

https://gerrit.wikimedia.org/r/830681

PAWS containers should start mounting /mnt/nfs/dumps-clouddumps100[12].wikimedia.org alongside /mnt/nfs/dumps-labstore100[67].wikimedia.org
Then we'll remove /mnt/nfs/dumps-labstore100[67].wikimedia.org entirely once everything is set?

That all sounds correct to me, although I'm not clear on why the containers don't just go straight to /data/dumps instead.

As far as I have found, PAWS mounts both labstores under /mnt/nfs but links only to various parts of labstore1007 (not labstore1006) from /public/dumps. This seems to be managed from:
dumps_share_root (modules/profile/manifests/wmcs/nfsclient.pp:211) -> dumps_active_server (modules/profile/manifests/wmcs/nfsclient.pp:13) -> dumps_dist_active_vps (hieradata/common.yaml:668), which is set to labstore1007.wikimedia.org

I would expect that once that is updated, /public/dumps in PAWS should be fine. I'll put in a patch for the other two direct links.

I see no /data/dumps.

Otherwise, the puppet changes do seem to have taken effect. I am seeing /mnt/nfs/dumps-clouddumps100[12].wikimedia.org mounted on the VMs.

That all sounds correct to me, although I'm not clear on why the containers don't just go straight to /data/dumps instead.

/public/dumps contains symlinks to various locations within the /mnt/nfs/ shares. The target directories that the symlinks point to need to be mounted in the container so that the files can actually be found by traversing the symbolic link to the target directory and then to its contained files. We previously tried mounting the active NFS share directly inside /public/dumps in the containers. This works great... until the first failover to a secondary NFS origin. When that happens, the containers lose connectivity to the share until they are recreated with the new share attached.
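The symlink behaviour described here can be reproduced locally. The sketch below is only an illustration using temporary directories: the directory names stand in for /mnt/nfs and /public/dumps, and the function name `demo_symlink_needs_target` is made up for this example. It shows that a file reached through a symlink is only visible while the link's target directory is present.

```python
import os
import tempfile


def demo_symlink_needs_target():
    """Show that a symlink resolves only while its target directory is present."""
    base = tempfile.mkdtemp()

    # Stand-in for an NFS share under /mnt/nfs (hypothetical layout).
    share = os.path.join(base, "mnt", "nfs", "dumps-example")
    os.makedirs(share)
    with open(os.path.join(share, "dump.txt"), "w") as f:
        f.write("data")

    # Stand-in for /public/dumps, which holds symlinks into the share.
    public = os.path.join(base, "public", "dumps")
    os.makedirs(public)
    link = os.path.join(public, "public")
    os.symlink(share, link)

    file_via_link = os.path.join(link, "dump.txt")
    with_target = os.path.exists(file_via_link)

    # Simulate a container where the share is not mounted: the link stays,
    # but its target is gone.
    os.rename(share, share + ".unmounted")
    without_target = os.path.exists(file_via_link)
    still_a_link = os.path.islink(link)

    return with_target, without_target, still_a_link


if __name__ == "__main__":
    print(demo_symlink_needs_target())
```

In the second state the symlink still exists but dangles, which is what a container would see if /public/dumps were present without the corresponding /mnt/nfs share mounted.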

https://github.com/toolforge/paws/pull/199 is approved and ready to go. Let me know when you are going to update puppet, as the links on the nodes in /public/ will need to be updated, which the patch on GitHub won't do. Ideally a week ahead, so that I can send out a note to cloud-announce.

Change 830681 merged by jenkins-bot:

[cloud/toolforge/volume-admission-controller@main] Add mounts for the new dumps servers

https://gerrit.wikimedia.org/r/830681

Mentioned in SAL (#wikimedia-cloud) [2022-09-26T13:56:30Z] <taavi> restart the 6 singleuser pods that don't have the new dumps mount points attached yet T317144