
Corrupt $HOME/.kube/config preventing use of Kubernetes for wikiquantos, wikiroupas, and possibly more tools
Closed, Resolved | Public | BUG REPORT

Description

What happens?:

  • Any terminal call makes this error (pastebin) appear. There's no error in the config.yaml file for Wikiquantos or Wikiroupas, as both were working until July 18, so I assume the problem is not related to that file.

Other information (browser name/version, screenshots, etc.):
Tools: https://wikiquantos.toolforge.org and https://wikiroupas.toolforge.org

Event Timeline

Even simpler reproduction:

$ sudo become wikiquantos
$ kubectl get po
error: error loading config file "/data/project/wikiquantos/.kube/config": yaml: line 21: could not find expected ':'
bd808 renamed this task from Webservice error in toolforge prevents some tools to work properly to Corrupt $HOME/.kube/config preventing use of Kuberenetes for wikiquantos, wikiroupas, and possibly more tools. Aug 15 2023, 8:31 PM
Ederporto renamed this task from Corrupt $HOME/.kube/config preventing use of Kuberenetes for wikiquantos, wikiroupas, and possibly more tools to Corrupt $HOME/.kube/config preventing use of Kubernetes for wikiquantos, wikiroupas, and possibly more tools. Aug 15 2023, 8:34 PM
$ ls -lh /data/project/wikiquantos/.kube/config
-rw------- 1 tools.wikiquantos tools.wikiquantos 1.9K Jul 28 23:27 /data/project/wikiquantos/.kube/config

line 21 is the last line in that file and reads:

ata/project/wikiquantos/.toolskube/client.key

/data/project/wikiroupas/.kube/config was last touched at a similar time (Jul 28 22:54) and has a matching end of file corruption (modulo the project name in the string).

I was able to fix /data/project/wikiquantos/.kube/config manually. The generated file had 3 obvious differences from a known working config:

  • relative path in the value for the users[0].user.client-certificate key
  • relative path in the value for the users[0].user.client-key key
  • corrupt line 21 data

The line 21 data was almost the desired value for the users[0].user.client-key item.

I have applied the same manual fixes to /data/project/wikiroupas/.kube/config as well.

I created a new "t344289-test" tool to test for ongoing $HOME/.kube/config corruption from the maintain-kubeusers process. The new tool has a working config file, as demonstrated by kubectl describe quota. The relative path values for users[0].user.client-{certificate,key} that I reported in T344289#9094523 are present in this config file, so they are not part of whatever corruption happened to the wikiquantos and wikiroupas tools' config files.

I have found 3 additional config files with the same corruption:

$ ssh tools-nfs.svc.tools.eqiad1.wikimedia.cloud
$ cd /srv/tools/project
$ sudo grep '^ata/project' */.kube/config
blockaround/.kube/config:ata/project/blockaround/.toolskube/client.key
entityshape/.kube/config:ata/project/entityshape/.toolskube/client.key
readability/.kube/config:ata/project/readability/.toolskube/client.key
$ ls -lh blockaround/.kube/config entityshape/.kube/config readability/.kube/config
-rw------- 1 tools.blockaround tools.blockaround 1.9K Jul 14 06:59 blockaround/.kube/config
-rw------- 1 tools.entityshape tools.entityshape 1.9K Jul 16 23:44 entityshape/.kube/config
-rw------- 1 tools.readability tools.readability 1.9K Aug 10 11:12 readability/.kube/config

All 3 files have now been corrected.

bd808 claimed this task.

This is really the same bug as T344289: Corrupt $HOME/.kube/config preventing use of Kubernetes for wikiquantos, wikiroupas, and possibly more tools. Apparently the corruption of those config files included different truncations of the extra line's path. Here the extra line starts with /project/, whereas in the prior set it started with ata/project. I think I can do another round of checking where I just look for any lines that are not key: value hash entries in the YAML.
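Since the extra line appears to be the client key path with a varying amount chopped off the front, one way to recognise every truncation at once is a suffix test. The following is only a sketch: the function name is_truncated_key_path is made up for illustration, and it assumes the expected path is always /data/project/<tool>/.toolskube/client.key, as seen in the files inspected so far.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Toolforge tooling).
# Succeeds when $2 is a proper, non-empty suffix of the client key path
# assumed for tool $1. Very short suffixes such as "key" also match, so a
# hit is a hint to inspect the file, not proof of corruption.
is_truncated_key_path() {
    expected="/data/project/$1/.toolskube/client.key"
    candidate="$2"
    # An empty line is not a truncation.
    [ -n "$candidate" ] || return 1
    # The complete path is not a truncation either.
    [ "$candidate" != "$expected" ] || return 1
    # Quoting "$candidate" makes the pattern match it literally.
    case "$expected" in
        *"$candidate") return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_truncated_key_path wikiquantos "$(sudo tail -n 1 wikiquantos/.kube/config)" would succeed for both the ata/project and the /project/ variants.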

I checked a known good $HOME/.kube/config and found that grep -v ':' $HOME/.kube/config does not match any lines. I used that to check for other corruptions as I did in T344289#9094586 with the other pattern:

$ ssh tools-nfs.svc.tools.eqiad1.wikimedia.cloud
$ cd /srv/tools/project
$ sudo grep -lv ':' */.kube/config | wc -l
1798

This is a much more pervasive problem than I originally assumed.

I started out fixing the files manually using vim: sudo vim $(sudo grep -lv ':' */.kube/config). I set up a search within vim to find lines that do not contain a colon: /\v^.+(^.*:.*$)@<!$. Then I could repeat the search with n to jump to the next likely corrupt line, dd to delete the line, and :w|bd to write the file to disk and move to the next buffer. I fixed about 150 files this way before deciding that I would die of boredom before finishing the queue.

The reason for manual editing was that it looked like the grep was finding a small number of files with a different cause of lines without an embedded :. I didn't want to blindly delete these lines with some sed or awk magic; instead I wanted to manually examine them.
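For that kind of manual triage it can help to see the offending lines together with their line numbers before opening any file. A minimal sketch, using a hypothetical helper named list_colonless_lines; unlike grep -v ':' it skips blank lines, which is an assumption about what counts as suspicious:

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# Prints "file:line: text" for every non-blank line that contains no colon
# in the files given as arguments. It only reads the files.
list_colonless_lines() {
    # !/:/ selects lines without a colon; NF is 0 for blank lines.
    awk '!/:/ && NF { print FILENAME ":" FNR ": " $0 }' "$@"
}
```

Run from /srv/tools/project as list_colonless_lines */.kube/config (with sufficient privileges to read the files), this prints one row per suspicious line, so the truncated client.key paths can be told apart from the other cases at a glance.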

After some thinking I decided to automate cleaning the general case: the last line of $HOME/.kube/config being a substring of $HOME/.toolskube/client.key. Experimentation convinced me that sudo grep -v ':' */.kube/config | grep 'client' | cut -d: -f 1 was matching the set of files I wanted. Deleting the last line of a file can be done with sed --in-place '$d' <path>.

$ sudo grep -v ':' */.kube/config | grep 'client' | cut -d: -f 1 | wc -l
1502
$ sudo grep -v ':' */.kube/config | grep 'client' | cut -d: -f 1 | sudo xargs sed --in-place '$d'
$ sudo grep -v ':' */.kube/config | grep 'client' | cut -d: -f 1 | wc -l
0
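A more defensive variant of the same cleanup is sketched below. It is not what was run here; the helper name trim_corrupt_last_line and the .bak suffix are made up for illustration. It re-checks the last line of each file immediately before deleting it and keeps a backup copy, so a file that is listed twice, or whose last line is already valid, is left alone.

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# For each file given, delete the last line only if that line has no colon
# and mentions "client" (the same filter as the grep pipeline). A copy of
# the original is kept next to the file as "<file>.bak".
trim_corrupt_last_line() {
    for f in "$@"; do
        last=$(tail -n 1 "$f")
        case "$last" in
            *:*)
                # Looks like a normal "key: value" line: leave the file alone.
                continue ;;
            *client*)
                # Back up first, then rewrite the file without its last line.
                cp -p "$f" "$f.bak" && sed '$d' "$f.bak" > "$f" ;;
        esac
    done
}
```

Because the check looks at the current last line, running the helper a second time over the same files changes nothing once the preceding client-key line has become the last line.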

After that I returned to manually editing the remaining files:

$ sudo vim $(sudo grep -lv ':' */.kube/config)
145 files to edit
$ sudo grep -lv ':' */.kube/config | wc -l
0

Most of these "extra" 145 files had a corrupt value for the auth token from the legacy (2016) Kubernetes cluster.

https://gerrit.wikimedia.org/r/c/labs/tools/maintain-kubeusers/+/951481 fixes the bug that was causing this to happen. I manually fixed the broken cases with:

root@tools-nfs-2:/srv/tools/project# grep -v ':' */.kube/config | grep 'client' | cut -d: -f 1 | xargs sed --in-place '$d'
root@tools-nfs-2:/srv/tools/project# grep '^:' */.kube/config | cut -d: -f 1 | xargs sed --in-place '$d'