
k8s nodes sometimes getting bad token value from hiera
Closed, Resolved · Public

Description

PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.309 second response time

root@tools-k8s-master-01:~# kubectl get nodes | grep -i not
tools-worker-1021.tools.eqiad.wmflabs   NotReady                      1y
tools-worker-1028.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   220d
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   220d

The same issue showed up with an alert on a NotReady node, and then on the node itself:

tools-worker-1007:~# tail /var/log/syslog
Oct 17 20:47:29 tools-worker-1007 kube-proxy[7153]: E1017 20:47:29.420941    7153 reflector.go:203] pkg/proxy/config/api.go:33: Failed to list *api.Endpoints: the server has asked for the client to provide credentials (get endpoints)
Oct 17 20:47:29 tools-worker-1007 kubelet[7241]: E1017 20:47:29.444322    7241 reflector.go:203] pkg/kubelet/kubelet.go:403: Failed to list *api.Node: the server has asked for the client to provide credentials (get nodes)
Oct 17 20:47:29 tools-worker-1007 kubelet[7241]: E1017 20:47:29.444390    7241 reflector.go:203] pkg/kubelet/config/apiserver.go:43: Failed to list *api.Pod: the server has asked for the client to provide credentials (get pods)

Puppet fixes this:

tools-worker-1007:~# puppet agent --test
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-worker-1007.tools.eqiad.wmflabs
Info: Applying configuration version '1508271953'
Notice: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]/content:
--- /etc/kubernetes/kubeconfig	2017-10-17 20:40:23.567365746 +0000
+++ /tmp/puppet-file20171017-7746-10ohqpn	2017-10-17 20:48:12.585317275 +0000
@@ -14,4 +14,4 @@
 users:
   - name: client-infrastructure
     user:
-      token: faketoken
+      token: <real token inserted here>

Info: Computing checksum on file /etc/kubernetes/kubeconfig
Info: FileBucket got a duplicate file {md5}97c5a61de4e04330c5cfa123d4408736
Info: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]: Filebucketed /etc/kubernetes/kubeconfig to puppet with sum 97c5a61de4e04330c5cfa123d4408736
Notice: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]/content: content changed '{md5}97c5a61de4e04330c5cfa123d4408736' to '{md5}8fff205d380602bf440d1e39960a5a8e'
Info: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]: Scheduling refresh of Service[kubelet]
Info: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]: Scheduling refresh of Service[kube-proxy]
Notice: /Stage[main]/K8s::Proxy/Service[kube-proxy]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/K8s::Kubelet/Service[kubelet]: Triggered 'refresh' from 1 events
Notice: Finished catalog run in 6.33 seconds

Event Timeline

chasemp added a project: Toolforge.

When I ran Puppet on the affected host, the run included this diff:

   - name: client-infrastructure
     user:
-      token: faketoken
+      token: <actual token>

So that explains why k8s broke there, but not why puppet suddenly changed the token and then changed it back on the next run. I can't reproduce, so closing this for now :/

chasemp renamed this task from "Alert for 'All k8s worker nodes are healthy on checker.tools.wmflabs.org'" to "k8s nodes sometimes getting bad token value from hiera". (Oct 17 2017, 8:50 PM)
chasemp updated the task description.

~/git/wmf/labs/private
grep -Ri faketoken *
hieradata/labs/tools/common.yaml: token: faketoken
hieradata/labs/tools/common.yaml: token: faketoken
hieradata/labs/tools/common.yaml: token: faketoken

Puppet is sometimes deciding to replace the real value with the labs/private placeholder value on a node...
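For context, labs/private is the public repo of placeholder secrets, and on the Toolforge puppetmaster the real values live in local commits stacked on top of it. A simplified illustration (file path as in the grep above, values redacted):

hieradata/labs/tools/common.yaml in upstream labs/private (public):
      token: faketoken

hieradata/labs/tools/common.yaml after the puppetmaster's local commits:
      token: <real token>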

I can see the flaps:

tools-puppetmaster-01:/var# grep -Ri faketoken *

lib/puppet/reports/tools-worker-1007.tools.eqiad.wmflabs/201710172048.yaml:      message: "\n--- /etc/kubernetes/kubeconfig\t2017-10-17 20:40:23.567365746 +0000\n+++ /tmp/puppet-file20171017-7746-10ohqpn\t2017-10-17 20:48:12.585317275 +0000\n@@ -14,4 +14,4 @@\n users:\n   - name: client-infrastructure\n     user:\n-      token: faketoken\n+      token: <real_token>\n"

lib/puppet/reports/tools-worker-1007.tools.eqiad.wmflabs/201710170739.yaml:      message: "\n--- /etc/kubernetes/kubeconfig\t2017-10-17 06:40:13.599731559 +0000\n+++ /tmp/puppet-file20171017-17027-7ft9sj\t2017-10-17 07:39:31.578691485 +0000\n@@ -14,4 +14,4 @@\n users:\n   - name: client-infrastructure\n     user:\n-      token: faketoken\n+      token: <real_token>\n"

lib/puppet/reports/tools-worker-1007.tools.eqiad.wmflabs/201710172040.yaml:      message: "\n--- /etc/kubernetes/kubeconfig\t2017-10-17 07:39:31.638691731 +0000\n+++ /tmp/puppet-file20171017-6650-1iqstje\t2017-10-17 20:40:23.519365547 +0000\n@@ -14,4 +14,4 @@\n users:\n   - name: client-infrastructure\n     user:\n-      token: <real_token>\n+      token: faketoken\n"

lib/puppet/reports/tools-worker-1007.tools.eqiad.wmflabs/201710170640.yaml:      message: "\n--- /etc/kubernetes/kubeconfig\t2017-08-09 20:01:49.089350979 +0000\n+++ /tmp/puppet-file20171017-10946-pwagwb\t2017-10-17 06:40:13.535731228 +0000\n@@ -14,4 +14,4 @@\n users:\n   - name: client-infrastructure\n     user:\n-      token: <real_token>\n+      token: faketoken\n"

lib/puppet/reports/tools-worker-1021.tools.eqiad.wmflabs/201710170250.yaml:      message: "\n--- /etc/kubernetes/kubeconfig\t2017-10-11 14:29:20.463127565 +0000\n+++ /tmp/puppet-file20171017-17248-74kiqq\t2017-10-17 02:50:13.532267352 +0000\n@@ -14,4 +14,4 @@\n users:\n   - name: client-infrastructure\n     user:\n-      token: <real_token>\n+      token: faketoken\n"

lib/puppet/reports/tools-worker-1021.tools.eqiad.wmflabs/201710170320.yaml:      message: "\n--- /etc/kubernetes/kubeconfig\t2017-10-17 02:50:13.596267619 +0000\n+++ /tmp/puppet-file20171017-19397-iu1d26\t2017-10-17 03:19:57.746670533 +0000\n@@ -14,4 +14,4 @@\n users:\n   - name: client-infrastructure\n     user:\n-      token: faketoken\n+      token: <real_token>\n"

I'm pretty sure this is a wider issue, as I've seen it before on deployment-prep with other hiera data.

The current theory is that this happens while the labs/private repo is in the process of being rebased.

tl;dr from IRC: when the every-30-minutes cron job fires to update the labs/private repo on the Toolforge puppetmaster (or probably any project-specific puppetmaster), the rebase sets aside the local patches on top in order to pull and rebase onto upstream. There is a race condition if a host runs Puppet and compiles its catalog in that window of time (sketched below). Any mechanism that is not an in-place rebase would solve this problem, but it requires some thinking about how to keep local patches without constantly fighting conflicts, etc.
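A minimal sketch of the racy sequence, assuming the cron does an in-place pull --rebase (the exact command sequence is an assumption based on the behavior described above, not the actual git-sync-upstream code):

import subprocess

# Hypothetical illustration of the in-place update run from cron.
def racy_update(live_repo, upstream_url, branch='master'):
    # 'git pull --rebase' first rewinds the checkout to the upstream tip,
    # temporarily dropping the local secret-bearing commits from the
    # working tree, then replays them on top. While the replay is in
    # progress the live tree contains only upstream's "faketoken"
    # placeholders, so any Puppet catalog compiled in that window ships
    # the fake token to the node.
    subprocess.check_call(
        ['git', 'pull', '--rebase', upstream_url, branch], cwd=live_repo)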

Just caught this again:

#>ssh tools-k8s-master-01.eqiad.wmflabs 'kubectl get nodes'
NAME                                    STATUS                        AGE
tools-worker-1001.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1002.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1003.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1004.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1005.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1006.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1007.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1008.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1009.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1010.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1011.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1012.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1013.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1014.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1015.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1016.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1017.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1018.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1019.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1020.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1021.tools.eqiad.wmflabs   NotReady                      1y
tools-worker-1022.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1023.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1025.tools.eqiad.wmflabs   Ready                         1y
tools-worker-1026.tools.eqiad.wmflabs   Ready                         262d
tools-worker-1027.tools.eqiad.wmflabs   Ready                         262d
tools-worker-1028.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   233d
tools-worker-1029.tools.eqiad.wmflabs   NotReady,SchedulingDisabled   233d

#>ssh tools-worker-1021.tools.eqiad.wmflabs 'sudo puppet agent --test'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-worker-1021.tools.eqiad.wmflabs
Info: Applying configuration version '1508883753'
Notice: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]/content:
--- /etc/kubernetes/kubeconfig    2017-10-24 22:20:10.365544457 +0000
+++ /tmp/puppet-file20171024-22885-1qfiyau    2017-10-24 22:25:06.590755549 +0000
@@ -14,4 +14,4 @@
 users:
   - name: client-infrastructure
     user:
-      token: faketoken
+      token: <real token>

Info: Computing checksum on file /etc/kubernetes/kubeconfig
Info: FileBucket got a duplicate file {md5}97c5a61de4e04330c5cfa123d4408736
Info: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]: Filebucketed /etc/kubernetes/kubeconfig to puppet with sum 97c5a61de4e04330c5cfa123d4408736
Notice: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]/content: content changed '{md5}97c5a61de4e04330c5cfa123d4408736' to '{md5}8fff205d380602bf440d1e39960a5a8e'
Info: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]: Scheduling refresh of Service[kubelet]
Info: /Stage[main]/K8s::Infrastructure_config/File[/etc/kubernetes/kubeconfig]: Scheduling refresh of Service[kube-proxy]
Notice: /Stage[main]/K8s::Proxy/Service[kube-proxy]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/K8s::Kubelet/Service[kubelet]: Triggered 'refresh' from 1 events
Notice: Finished catalog run in 6.25 seconds

Change 386318 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] git-sync-upstream: rewrite in python

https://gerrit.wikimedia.org/r/386318

Change 386331 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] git-sync-upstream: perform rebase in a separate, temporary workdir

https://gerrit.wikimedia.org/r/386331

Change 386318 merged by Andrew Bogott:
[operations/puppet@production] git-sync-upstream: rewrite in python

https://gerrit.wikimedia.org/r/386318

Change 386331 merged by Andrew Bogott:
[operations/puppet@production] git-sync-upstream: perform rebase in a separate, temporary workdir

https://gerrit.wikimedia.org/r/386331
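
For reference, a minimal sketch of the approach in the merged change, assuming a scratch clone plus a hand-over into the live checkout (illustrative only; see https://gerrit.wikimedia.org/r/386331 for the actual git-sync-upstream code):

import subprocess
import tempfile

def git(args, cwd=None):
    subprocess.check_call(['git'] + args, cwd=cwd)

def safe_update(live_repo, upstream_url, branch='master'):
    with tempfile.TemporaryDirectory() as tmp:
        # Do the risky work in a scratch clone that carries the local
        # secret-bearing commits, so the live checkout is never left
        # mid-rebase where a compiling Puppet agent could read it.
        git(['clone', live_repo, tmp])
        git(['pull', '--rebase', upstream_url, branch], cwd=tmp)
        # Only once the rebase has succeeded, point the live checkout at
        # the result. The live tree moves directly between two states that
        # both contain the real tokens; the secrets-stripped intermediate
        # state never exists where a catalog compile can see it.
        git(['fetch', tmp, branch], cwd=live_repo)
        git(['reset', '--hard', 'FETCH_HEAD'], cwd=live_repo)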