Dumps not accessible from container pods
Closed, Resolved (Public)

Description

The folder /public/dumps/public/eswiki is not accessible from the jdk8 or jdk11 container pods on the ToolLabs servers.

In particular, /public/dumps is accessible but not the subfolder /public/dumps/public.

However, the folder is accessible from the regular tool account, before starting the pod.

This folder was still accessible from these pods one month ago.

Thanks in advance,

Event Timeline

$ webservice jdk11 shell
$ ls -alh /public/dumps
total 16K
drwxr-xr-x 2 root root 4.0K Mar 13 20:36 .
drwxr-xr-x 3 root root 4.0K Apr 27 02:36 ..
lrwxrwxrwx 1 root root   52 Mar 13 20:36 incr -> /mnt/nfs/dumps-labstore1007.wikimedia.org/other/incr
lrwxrwxrwx 1 root root   68 Mar 13 20:36 pagecounts-all-sites -> /mnt/nfs/dumps-labstore1007.wikimedia.org/other/pagecounts-all-sites
lrwxrwxrwx 1 root root   62 Mar 13 20:36 pagecounts-raw -> /mnt/nfs/dumps-labstore1007.wikimedia.org/other/pagecounts-raw
lrwxrwxrwx 1 root root   57 Mar 13 20:36 pageviews -> /mnt/nfs/dumps-labstore1007.wikimedia.org/other/pageviews
lrwxrwxrwx 1 root root   42 Mar 13 20:36 public -> /mnt/nfs/dumps-labstore1007.wikimedia.org/
$ ls -lh /mnt
total 0

This is a bind mount problem, and almost certainly not specific to any particular Kubernetes container type. We are mounting the /public/dumps directory from the Kubernetes exec node into the container, but that directory is only a collection of symlinks to other mounts on the exec node. Because we are not also exposing those additional mounts to the container, the symlinks go nowhere.
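For illustration, the dangling symlink can be confirmed from inside a pod roughly like this (a sketch; exact output will vary):

$ webservice jdk11 shell
$ readlink /public/dumps/public
/mnt/nfs/dumps-labstore1007.wikimedia.org/
$ ls /public/dumps/public
ls: cannot access '/public/dumps/public': No such file or directory

The symlink target simply does not exist inside the container, because nothing is mounted under /mnt/nfs there.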

I'm not quite sure what the right fix is here. I can see three initial options:

  1. Mount /mnt/nfs/dumps-labstore1007.wikimedia.org instead of /public/dumps. This would be fragile because things would break badly when labstore1007 is taken offline for maintenance and labstore1006 is expected to take over.
  2. Mount each of the directories under /public/dumps individually into the container. I'm pretty sure this would work because the symlink would be resolved on the exec host side of things; it is how /data/project ends up being mounted. That is also a symlink on the exec node, but the symlinked volume shows up inside the container as expected. What I don't know is whether this would also have the NFS primary failover problem. (If not, does that mean that /data/project is also resistant to NFS primary failover?)
  3. Mount both /mnt/nfs/dumps-labstore1006.wikimedia.org and /mnt/nfs/dumps-labstore1007.wikimedia.org in those locations inside the container in addition to mounting the symlink farm in /public/dumps. This would make the containers most directly like the bastions and exec nodes. I think this would preserve operations inside a container in the event of symlink switching on the underlying exec node.

@Bstorm does any of these seem "best" to you? Or better yet do you have a more clever idea of how to fix this?

Ahhhh, I understand what's up now. What changed is the exports on the dumps servers.

The original solution was to mount the specific symlinks into the containers, just like we do for the /home and /data/project dirs. With the old exports, I believe this would have just worked because /public/dumps was also a symlink, so the symlinks under it were still valid.

All that said, there were problems when the symlinks changed because they were deleted and recreated. Let me test what happens on a cluster of this version when the container mounts a host dir that is a symlink that goes away and comes back. If that works now, let's revive solution #2.

If it doesn't work, #3 could have unintended consequences as well because of the kernel's difficulty in letting go of a mount. #1 would be way too fragile. On the last failovers it worked alright, but let's test quickly.

Ok, so when the symlink changes, the container retains the original definition of the symlink. That means it effectively will not fail over until the pod is deleted or restarted. This explains why shutting down a dumps server caused endlessly rising load on the Kubernetes cluster in the past: the symlinks cannot be forgotten by a running pod.

My comment in T247455#6086002 would seem to support #3 as the option to go with. That will require a bit more work than it first appears.
The steps are:

  1. Update maintain-kubeusers to modify new PSPs to allow mounting of the /mnt/nfs/dumps-labstore100?.wikimedia.org dirs in the pod
  2. Update the pod-preset for new tools in maintain-kubeusers to include those volumes
  3. Backfill all existing PSPs and pod-presets (one each per tool) to use that new volume as well with a script or something.
  4. Restart any webservices that need the dumps mounts (because the pod-preset change only applies to pods created after the change, regardless of the deployment, AFAIK); see the example below.
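For step 4, restarting a tool's webservice from the bastion would look roughly like this (a sketch; "mytool" is a placeholder tool name):

$ become mytool
$ webservice restart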

This may fix the last mystery about dumps failover, honestly.

Change 592747 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] tests: fix cassette generation and add some testing

https://gerrit.wikimedia.org/r/592747

Change 592747 merged by Bstorm:
[labs/tools/maintain-kubeusers@master] tests: fix cassette generation and add some testing

https://gerrit.wikimedia.org/r/592747

Ok, so I manually patched a PSP in toolsbeta to allow the hostPath /mnt/nfs for the test tool as a read-only volume. Then I manually patched the volumes and mounts in the deployment:

toolsbeta.test@toolsbeta-sgebastion-04:~$ kubectl exec -it test-86457849c7-h8ckj -- ls /public/dumps/public
10wikipedia		      mhwiki
404.html		      mhwiktionary
aawiki			      minwiki
aawikibooks		      minwiktionary
aawiktionary		      mirrors.html
abwiki			      miwiki
abwiktionary		      miwikibooks
acewiki			      miwiktionary
advisorywiki		      mkwiki
adywiki			      mkwikibooks
afwiki			      mkwikimedia
afwikibooks		      mkwikisource
afwikiquote		      mkwiktionary
afwiktionary		      mlwiki

So there's a PoC. I think enabling /mnt/nfs read-only as the allowed path prefix for this purpose only seems ok. Alternatively, we could hardcode the mounts of /mnt/nfs/dumps-labstore1006.wikimedia.org and /mnt/nfs/dumps-labstore1007.wikimedia.org in there, but I dislike the idea that this may lock us into those hostnames in the PSPs. As it is, they'll be hardcoded into the pod-presets. Doing so does not affect the read-write status of the tool project dir or other volume mounts, despite the symlinking, because the container is pretty specific about how it mounts each volume.
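For reference, the allowedHostPaths side of the PSP ends up with entries roughly like these (a sketch showing only the dumps-related entries; the full list with all the other Toolforge mounts is in the patch in the script further down):

allowedHostPaths:
  - pathPrefix: /public/dumps
    readOnly: true
  - pathPrefix: /mnt/nfs
    readOnly: true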

Rebuilding the cassettes for the patch before I put it up for review. The change to maintain-kubeusers will only affect new users, but it is a good place to discuss the change.

Changing everyone's pod-preset and psp is more of a shell script kinda deal, I think.

Change 592786 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] k8s-dumps: make the symlinks for dumps NFS work inside toolforge k8s

https://gerrit.wikimedia.org/r/592786

Going to test patching every psp and preset in toolsbeta that has certain labels to see if that works instead of scripting. It might be smoother and easier to replicate in the future.

So, patching PSPs works great as long as you either use fancy JSON patching or replace the whole array (the latter option is fine). You cannot update a pod preset in v1.15, apparently, come Hell or high water. It nicely replies with 200s, but the object will never change whether you use patch or kubectl apply -f. I suspect I'm going to have to delete and recreate each one.
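For the record, the "fancy JSON patching" route would be a JSON Patch append to the allowedHostPaths array, something like this (a sketch; the PSP name is a placeholder), while the script further down simply replaces the whole array:

$ kubectl patch psp tool-mytool-psp --type=json \
    -p='[{"op": "add", "path": "/spec/allowedHostPaths/-", "value": {"pathPrefix": "/mnt/nfs", "readOnly": true}}]'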

Even kubectl replace has no effect on a PodPreset in this version. That's maddening. Delete and create/apply does work.
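So the working sequence for an existing preset is delete-then-recreate, which is what the script further down does for each tool namespace (a sketch; preset.yaml stands in for the generated manifest):

$ kubectl -n "$ns" delete podpreset mount-toolforge-vols
$ kubectl -n "$ns" apply -f preset.yaml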

Change 592786 merged by jenkins-bot:
[labs/tools/maintain-kubeusers@master] k8s-dumps: make the symlinks for dumps NFS work inside toolforge k8s

https://gerrit.wikimedia.org/r/592786

Mentioned in SAL (#wikimedia-cloud) [2020-04-28T22:58:04Z] <bstorm_> rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta T247455

Mentioned in SAL (#wikimedia-cloud) [2020-04-29T16:52:14Z] <bstorm_> tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge T247455

Mentioned in SAL (#wikimedia-cloud) [2020-04-29T16:54:25Z] <bstorm_> deleted the maintain-kubeusers pod to start running the new image T247455

Ok, now I just need to apply the update to existing tools.

Ok, I think this should do it:

#!/bin/bash
# Run this script with your root/cluster admin account as appropriate.
# This will fix the dumps mounts for all existing tools.

set -Eeuo pipefail

function check-ns(){
    ns=$1
    preset=$(kubectl -n "$ns" get podpresets mount-toolforge-vols -o yaml)
    if [[ $preset =~ ^.*/mnt/nfs/.*$ ]]
    then
        return 1
    else
        return 0
    fi
}

declare -a namespaces
readarray -t namespaces < <(kubectl get ns -l tenancy=tool --no-headers=true -o custom-columns=:metadata.name)

for ns in "${namespaces[@]}"
do
    echo "Starting for $ns"
    if check-ns "$ns"; then
        echo "Deleting preset for $ns"
        kubectl -n "$ns" delete podpresets mount-toolforge-vols
        cat <<EOF | kubectl apply -f -
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: mount-toolforge-vols
  namespace: $ns
spec:
  env:
    - name: HOME
      value: /data/project/${ns:5}
  selector:
    matchLabels:
      toolforge: tool
  volumeMounts:
    - mountPath: /public/dumps
      name: dumps
      readOnly: true
    - mountPath: /mnt/nfs/dumps-labstore1007.wikimedia.org
      name: dumpsrc1
      readOnly: true
    - mountPath: /mnt/nfs/dumps-labstore1006.wikimedia.org
      name: dumpsrc2
      readOnly: true
    - mountPath: /data/project
      name: home
    - mountPath: /etc/wmcs-project
      name: wmcs-project
      readOnly: true
    - mountPath: /data/scratch
      name: scratch
    - mountPath: /etc/ldap.conf
      name: etcldap-conf
      readOnly: true
    - mountPath: /etc/ldap.yaml
      name: etcldap-yaml
      readOnly: true
    - mountPath: /etc/novaobserver.yaml
      name: etcnovaobserver-yaml
      readOnly: true
    - mountPath: /var/lib/sss/pipes
      name: sssd-pipes
  volumes:
    - hostPath:
        path: /public/dumps
        type: Directory
      name: dumps
    - hostPath:
        path: /mnt/nfs/dumps-labstore1007.wikimedia.org
        type: Directory
      name: dumpsrc1
    - hostPath:
        path: /mnt/nfs/dumps-labstore1006.wikimedia.org
        type: Directory
      name: dumpsrc2
    - hostPath:
        path: /data/project
        type: Directory
      name: home
    - hostPath:
        path: /etc/wmcs-project
        type: File
      name: wmcs-project
    - hostPath:
        path: /data/scratch
        type: Directory
      name: scratch
    - hostPath:
        path: /etc/ldap.conf
        type: File
      name: etcldap-conf
    - hostPath:
        path: /etc/ldap.yaml
        type: File
      name: etcldap-yaml
    - hostPath:
        path: /etc/novaobserver.yaml
        type: File
      name: etcnovaobserver-yaml
    - hostPath:
        path: /var/lib/sss/pipes
        type: Directory
      name: sssd-pipes
EOF
        echo "created new preset for $ns"
    else
        echo "skipping $ns preset -- already updated"
    fi
    kubectl patch psp "${ns}-psp" --patch '{"spec":{"allowedHostPaths":[{"pathPrefix":"/var/lib/sss/pipes"},{"pathPrefix":"/data/project"},{"pathPrefix":"/data/scratch"},{"pathPrefix":"/public/dumps","readOnly":true},{"pathPrefix":"/mnt/nfs","readOnly":true},{"pathPrefix":"/etc/wmcs-project","readOnly":true},{"pathPrefix":"/etc/ldap.yaml","readOnly":true},{"pathPrefix":"/etc/novaobserver.yaml","readOnly":true},{"pathPrefix":"/etc/ldap.conf","readOnly":true}]}}'
    echo "Finished $ns"
done

echo "*********************"
echo "Done!"

I'll run this in toolsbeta to find out.

Oops, forgot to sanitize the output from get ns.

Ok, after a couple more edits, that paste actually seems to do the correct thing 😅

Running a final test on a webservice now that it has run across toolsbeta.

Mentioned in SAL (#wikimedia-cloud) [2020-04-29T19:48:32Z] <bstorm_> ran the scary rewrite-psp-preset.sh script across toolsbeta T247455

Mentioned in SAL (#wikimedia-cloud) [2020-04-29T21:28:51Z] <bstorm_> running the rewrite-psp-preset.sh script across all tools T247455

Mentioned in SAL (#wikimedia-cloud) [2020-04-29T22:13:19Z] <bstorm_> running a fixup script after fixing a bug T247455

Fixed the error in the paste that caused the rework (for future reference).

Ok, at this point, you should be able to interact with dumps NFS in Toolforge Kubernetes in any restarted or newly started service.

I have restarted my tool in a jdk11 container and now the dumps are accessible again.

Thanks a lot!!

Bstorm claimed this task.