
Error joining new worker node to Toolforge Kubernetes cluster
Closed, Resolved | Public | Security

Description

I built tools-k8s-worker-[6-14] and now I want to join them to the cluster.

$ ssh root@tools-k8s-control-1.tools.eqiad.wmflabs
$ kubeadm token create
54uehz.m8phs2y9tubxp92o
$ openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
1cbcba20a201006b0359d5884e94567a07a8d809adcc8ad4f8402a64f57ad45b
$ exit

$ ssh root@tools-k8s-worker-6.tools.eqiad.wmflabs
$ kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token 54uehz.m8phs2y9tubxp92o --discovery-token-ca-cert-hash sha256:1cbcba20a201006b0359d5884e94567a07a8d809adcc8ad4f8402a64f57ad45b
[preflight] Running pre-flight checks
        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.5. Latest validated version: 18.09
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to decode cluster configuration data: v1beta2.ClusterConfiguration.APIServer: v1beta2.APIServer.ControlPlaneComponent: ExtraVolumes: []v1beta2.HostPathMount: decode slice: expect [ or n, but found {, error found in #10 byte of ...|Volumes":{"hostPath"|..., bigger context ...|E_ECDSA_WITH_AES_256_GCM_SHA384"},"extraVolumes":{"hostPath":"/etc/kubernetes/admission","mountPath"|...
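For reference, the `--discovery-token-ca-cert-hash` value passed to `kubeadm join` above is, as I understand kubeadm's token discovery, `sha256:` followed by the hex SHA-256 digest of the CA certificate's DER-encoded public key (its Subject Public Key Info). That is what the openssl pipeline on the control node computes. Below is a minimal Python sketch of the hashing and argument assembly. It assumes you already have those DER bytes, for example from the `openssl rsa -pubin -outform der` step. `discovery_hash` and `join_args` are hypothetical helpers, not part of kubeadm.

```python
import hashlib
from typing import List


def discovery_hash(spki_der: bytes) -> str:
    """Format a --discovery-token-ca-cert-hash value from DER public key bytes.

    Assumes spki_der is the CA's Subject Public Key Info in DER form, i.e. the
    output of the `openssl rsa -pubin -outform der` step in the pipeline.
    """
    return "sha256:" + hashlib.sha256(spki_der).hexdigest()


def join_args(endpoint: str, token: str, spki_der: bytes) -> List[str]:
    """Assemble the argv for `kubeadm join` (hypothetical helper, not kubeadm code)."""
    return [
        "kubeadm", "join", endpoint,
        "--token", token,
        "--discovery-token-ca-cert-hash", discovery_hash(spki_der),
    ]


if __name__ == "__main__":
    # Placeholder bytes for illustration only; real bytes come from the CA cert.
    fake_der = b"not-a-real-key"
    print(" ".join(join_args("k8s.example.invalid:6443",
                             "abcdef.0123456789abcdef", fake_der)))
```

Building the argv as a list rather than one string avoids shell-quoting surprises if it is later handed to `subprocess.run`.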

Event Timeline

I wonder if this hunk of the config map (kubectl -n kube-system get cm kubeadm-config -oyaml):

extraVolumes:
  name: "/etc/kubernetes/admission"
  hostPath: "/etc/kubernetes/admission"
  mountPath: "/etc/kubernetes/admission"
  readOnly: true
  pathType: Directory

should really look like:

extraVolumes:
  - name: "/etc/kubernetes/admission"
    hostPath: "/etc/kubernetes/admission"
    mountPath: "/etc/kubernetes/admission"
    readOnly: true
    pathType: Directory
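The decode error in the description fits this reading: the decoder expected `extraVolumes` to be a list of `HostPathMount` entries and hit `{` where it wanted `[` (or `null`). Since the API server accepts any data in a ConfigMap, a check before hand-editing it could catch this earlier. The sketch below is hypothetical (kubeadm does not ship it). It works on the JSON form of the `ClusterConfiguration`, as seen in the error output, and uses only Python's standard `json` module. The ConfigMap itself stores YAML, so a real check would first need a YAML parser. Treating `name`, `hostPath` and `mountPath` as required is an assumption based on the hunk above.

```python
import json
from typing import List

# Keys each extraVolumes entry carries in the hunk above; treating them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = ("name", "hostPath", "mountPath")


def check_extra_volumes(cluster_config_json: str) -> List[str]:
    """Return problems with apiServer.extraVolumes (an empty list means OK)."""
    config = json.loads(cluster_config_json)
    volumes = (config.get("apiServer") or {}).get("extraVolumes")
    if volumes is None:
        return []  # absent or null is acceptable to the decoder
    if not isinstance(volumes, list):
        # The mistake from this task: a mapping where a list belongs.
        return ["apiServer.extraVolumes must be a list, got %s" % type(volumes).__name__]
    problems = []
    for i, vol in enumerate(volumes):
        if not isinstance(vol, dict):
            problems.append("extraVolumes[%d] must be a mapping" % i)
            continue
        for key in REQUIRED_KEYS:
            if key not in vol:
                problems.append("extraVolumes[%d] is missing %r" % (i, key))
    return problems


if __name__ == "__main__":
    mount = {
        "name": "/etc/kubernetes/admission",
        "hostPath": "/etc/kubernetes/admission",
        "mountPath": "/etc/kubernetes/admission",
        "readOnly": True,
        "pathType": "Directory",
    }
    broken = json.dumps({"apiServer": {"extraVolumes": mount}})
    fixed = json.dumps({"apiServer": {"extraVolumes": [mount]}})
    print(check_extra_volumes(broken))  # ['apiServer.extraVolumes must be a list, got dict']
    print(check_extra_volumes(fixed))   # []
```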

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I've never seen this error before, but yes, a syntax error would make sense. The next question is how that was accepted by the API in the first place, or how the API is producing invalid YAML.

Can this error be reproduced in toolsbeta?

That is not invalid YAML. That is invalid config. The API doesn't care what goes in a ConfigMap; the contents are only validated when kubeadm reads them. When I changed the cluster design, I had to update it by hand. I made a mistake, clearly.

I'm checking in the source documentation whether extraVolumes takes an array or a hash (it is only documented there).

Fixing in toolsbeta and puppet if it works :)

Change 562532 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-k8s: switch extraVolumes to an array

https://gerrit.wikimedia.org/r/562532

$ kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 --token 54uehz.m8phs2y9tubxp92o --discovery-token-ca-cert-hash sha256:1cbcba20a201006b0359d5884e94567a07a8d809adcc8ad4f8402a64f57ad45b
[preflight] Running pre-flight checks
        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.5. Latest validated version: 18.09
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Activating the kubelet service
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Mentioned in SAL (#wikimedia-cloud) [2020-01-07T15:35:35Z] <bstorm_> changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster T242067

bd808 raised the priority of this task from High to Needs Triage on Jan 7 2020, 3:42 PM.
bd808 set Security to Software security bug.
bd808 added a project: acl*security.
bd808 changed the visibility from "Public (No Login Required)" to "Custom Policy".
bd808 changed the subtype of this task from "Task" to "Security Issue".

Tokens are live.

bd808 assigned this task to Bstorm.
bd808 triaged this task as High priority.
bd808 removed a project: Patch-For-Review.

Turns out deleting a bootstrap token is dead easy:

root@tools-k8s-control-1:~# kubeadm token delete 54uehz.m8phs2y9tubxp92o
bootstrap token "54uehz" deleted
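The delete output names only `54uehz` because a kubeadm bootstrap token has the form `<token-id>.<token-secret>`: a 6-character ID and a 16-character secret, both drawn from `[a-z0-9]`. The ID is the public half (it also names the `bootstrap-token-<id>` Secret in `kube-system`), while the secret half is what made the pasted token sensitive. Below is a small Python sketch that splits a token and redacts the secret from a pasted log. `split_bootstrap_token` and `redact` are hypothetical helpers, not kubeadm code, and the redaction regex is a heuristic that could also match an unlucky hostname.

```python
import re
from typing import Tuple

# Bootstrap token format: 6-char ID, a dot, 16-char secret, lowercase alphanumerics.
TOKEN_RE = re.compile(r"^([a-z0-9]{6})\.([a-z0-9]{16})$")
# Heuristic for finding tokens inside free text such as a pasted terminal log.
TOKEN_IN_TEXT_RE = re.compile(r"\b([a-z0-9]{6})\.[a-z0-9]{16}\b")


def split_bootstrap_token(token: str) -> Tuple[str, str]:
    """Split a token into (token_id, token_secret); raise ValueError if malformed."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError("not a bootstrap token: %r" % token)
    return match.group(1), match.group(2)


def redact(text: str) -> str:
    """Replace the secret half of any token-like string, keeping the public ID."""
    return TOKEN_IN_TEXT_RE.sub(lambda m: m.group(1) + ".<redacted>", text)


if __name__ == "__main__":
    token_id, _secret = split_bootstrap_token("54uehz.m8phs2y9tubxp92o")
    print(token_id)  # 54uehz
    print(redact("kubeadm join host:6443 --token 54uehz.m8phs2y9tubxp92o"))
    # kubeadm join host:6443 --token 54uehz.<redacted>
```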

bd808 changed the visibility from "Custom Policy" to "Public (No Login Required)".

Change 562532 merged by Bstorm:
[operations/puppet@production] toolforge-k8s: switch extraVolumes to an array

https://gerrit.wikimedia.org/r/562532