
Toolforge on Kubernetes: Broken symlink to dumps
Closed, Resolved · Public · BUG REPORT

Description

When running Toolforge tools on Kubernetes, the pod should have a symlink /public/dumps/public pointing to a mounted NFS directory. Currently, the target NFS directory does get mounted into the pod, but at a slightly different mount point than the symlink's target location.

Steps to reproduce:

$ ssh sascha@dev.toolforge.org
$ become qrank
$ cat nfs-test.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: qrank.nfs-test
  namespace: tool-qrank
  labels:
    name: qrank.nfs-test
    toolforge: tool
spec:
  template:
    metadata:
      labels:
        name: qrank.nfs-test
        toolforge: tool
    spec:
      restartPolicy: Never
      containers:
      - name: test
        workingDir: /data/project/qrank
        image: docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
        command: [ "ls", "-l", "/mnt/nfs", "/public/dumps/public" ]

$ kubectl apply --validate=true -f nfs-test.yaml
job.batch/qrank.nfs-test created

$ kubectl logs jobs/qrank.nfs-test -f
lrwxrwxrwx 1 root root   42 Mar 13  2020 /public/dumps/public -> /mnt/nfs/dumps-labstore1007.wikimedia.org/

/mnt/nfs:
total 72
drwxr-xr-x 1005 400 400 36864 Feb 22 10:30 dumps-labstore1006.wikimedia.org
drwxr-xr-x 1005 400 400 36864 Feb 22 10:31 dumps-labstore1007.wikimedia.orgs

Expected: /public/dumps/public should be a working symlink.

Observed: /public/dumps/public is a broken symlink because the mount point in /mnt/nfs on the pod is dumps-labstore1007.wikimedia.orgs instead of dumps-labstore1007.wikimedia.org. Note the final s in the directory name.

Temporary workaround: Configure custom tools to read dumps from /mnt/nfs/dumps-labstore1007.wikimedia.orgs/ (with final s) instead of /public/dumps/public.
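
A minimal sketch of that workaround in shell (the DUMPS_DIR variable name is only illustrative; prefer the canonical path whenever the symlink resolves):

#!/bin/bash
# Use the canonical dumps path if the symlink resolves; otherwise fall back
# to the misspelled mount point currently present in the pods (trailing "s").
if [ -d /public/dumps/public ]; then
    DUMPS_DIR=/public/dumps/public
else
    DUMPS_DIR=/mnt/nfs/dumps-labstore1007.wikimedia.orgs
fi
ls -l "$DUMPS_DIR"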

Event Timeline

Change 666453 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] pod-presets: correct the mount for the dumps on labstore1007

https://gerrit.wikimedia.org/r/666453

Change 666453 merged by jenkins-bot:
[labs/tools/maintain-kubeusers@master] pod-presets: correct the mount for the dumps on labstore1007

https://gerrit.wikimedia.org/r/666453

Ok, the problem is fixed going forward, but we still need to backport the fix into the old PodPresets.

Mentioned in SAL (#wikimedia-cloud) [2021-02-27T02:00:12Z] <bstorm> running a script to repair the dumps mount in all podpresets T275371

I put together a script and ran it in toolsbeta to make sure it behaved well first. I have this running in tools now:

#!/bin/bash
# Run this script with your root/cluster admin account as appropriate.
# This will fix the dumps mounts for all existing tools.

set -Eeuo pipefail

declare -a namespaces
readarray -t namespaces < <(kubectl get ns -l tenancy=tool --no-headers=true -o custom-columns=:metadata.name)

for ns in "${namespaces[@]}"
do
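    # Only touch namespaces whose PodPreset still mounts the misspelled
    # dumps-labstore1007.wikimedia.orgs path (note the trailing "s").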
    if [[ $(kubectl get podpreset -n "${ns}" --template='{{range .spec.volumeMounts}}{{ if eq .mountPath "/mnt/nfs/dumps-labstore1007.wikimedia.orgs" }}true{{end}}{{end}}' mount-toolforge-vols) == "true" ]]; then
        echo "Fixing ${ns}"
        kubectl -n "$ns" delete podpresets mount-toolforge-vols
        cat <<EOF | kubectl --namespace "$ns" apply -f -
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: mount-toolforge-vols
  namespace: ${ns}
spec:
  env:
  - name: HOME
    value: /data/project/${ns:5}
  selector:
    matchLabels:
      toolforge: tool
  volumeMounts:
  - mountPath: /public/dumps
    name: dumps
    readOnly: true
  - mountPath: /mnt/nfs/dumps-labstore1007.wikimedia.org
    name: dumpsrc1
    readOnly: true
  - mountPath: /mnt/nfs/dumps-labstore1006.wikimedia.org
    name: dumpsrc2
    readOnly: true
  - mountPath: /data/project
    name: home
  - mountPath: /etc/wmcs-project
    name: wmcs-project
    readOnly: true
  - mountPath: /data/scratch
    name: scratch
  - mountPath: /etc/ldap.conf
    name: etcldap-conf
    readOnly: true
  - mountPath: /etc/ldap.yaml
    name: etcldap-yaml
    readOnly: true
  - mountPath: /etc/novaobserver.yaml
    name: etcnovaobserver-yaml
    readOnly: true
  - mountPath: /var/lib/sss/pipes
    name: sssd-pipes
  volumes:
  - hostPath:
      path: /public/dumps
      type: Directory
    name: dumps
  - hostPath:
      path: /mnt/nfs/dumps-labstore1007.wikimedia.org
      type: Directory
    name: dumpsrc1
  - hostPath:
      path: /mnt/nfs/dumps-labstore1006.wikimedia.org
      type: Directory
    name: dumpsrc2
  - hostPath:
      path: /data/project
      type: Directory
    name: home
  - hostPath:
      path: /etc/wmcs-project
      type: File
    name: wmcs-project
  - hostPath:
      path: /data/scratch
      type: Directory
    name: scratch
  - hostPath:
      path: /etc/ldap.conf
      type: File
    name: etcldap-conf
  - hostPath:
      path: /etc/ldap.yaml
      type: File
    name: etcldap-yaml
  - hostPath:
      path: /etc/novaobserver.yaml
      type: File
    name: etcnovaobserver-yaml
  - hostPath:
      path: /var/lib/sss/pipes
      type: Directory
    name: sssd-pipes
EOF
        echo "created new preset for $ns"
        echo "Finished $ns"
    fi
done

echo "*********************"
echo "Done!"

I'm recording this here partly because I may need to do something very similar if there are any mistakes in there.
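
If another pass is ever needed, the leftover namespaces can be listed with the same label selector and go-template check as above (a sketch, to be run with cluster-admin credentials):

#!/bin/bash
# Print every tool namespace whose PodPreset still mounts the misspelled path.
for ns in $(kubectl get ns -l tenancy=tool --no-headers=true -o custom-columns=:metadata.name); do
    if [[ $(kubectl get podpreset -n "${ns}" --template='{{range .spec.volumeMounts}}{{ if eq .mountPath "/mnt/nfs/dumps-labstore1007.wikimedia.orgs" }}true{{end}}{{end}}' mount-toolforge-vols) == "true" ]]; then
        echo "still broken: ${ns}"
    fi
done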

Ok, @Sascha, please check whether that fixed your tool's setup. If it did, it also fixed things for everyone else.

I copied your test file so that I could use the same commands it had before, and ran:

tools.qrank@tools-sgebastion-08:~/prod$ kubectl apply --validate=true -f bstorm-nfs-test.yaml
job.batch/qrank.bstorm-nfs-test created
tools.qrank@tools-sgebastion-08:~/prod$ kubectl logs jobs/qrank.bstorm-nfs-test -f
lrwxrwxrwx 1 root root   42 Mar 13  2020 /public/dumps/public -> /mnt/nfs/dumps-labstore1007.wikimedia.org/

/mnt/nfs:
total 72
drwxr-xr-x 1008 400 400 36864 Mar  1 19:20 dumps-labstore1006.wikimedia.org
drwxr-xr-x 1008 400 400 36864 Mar  1 17:45 dumps-labstore1007.wikimedia.org

It appears that the PodPreset is working now. Closing.

Yes, it works now. Thank you!