
Weird issue with the wmcs-k8s-node-upgrade.py script
Closed, Declined · Public

Description

At the end of a k8s worker upgrade on a fast connection, the node occasionally replies with an unexpected version when the script checks the node config. This causes the script to stop processing at that point, so the node is never uncordoned during the upgrade.

We have only ever seen it when running the script relatively close to the datacenter on a higher-speed connection, rather than in the typical setup of running it from across the Atlantic Ocean or on a mobile connection. The error could be caused by something other than a fast-connection race condition, but that is the working theory. This needs to be resolved before the next tools k8s upgrade.

Event Timeline

We could add a small timeout and retry the version check several times before erroring out.
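
For illustration, a minimal sketch of what such a retry could look like. The get_node_version() helper, the kubectl jsonpath query, and the parameter defaults are all assumptions for the sketch, not code taken from wmcs-k8s-node-upgrade.py:

```python
import subprocess
import time

def get_node_version(node: str) -> str:
    # Hypothetical helper: ask the API server which kubelet version
    # the node currently reports.
    return subprocess.run(
        ["kubectl", "get", "node", node, "-o",
         "jsonpath={.status.nodeInfo.kubeletVersion}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def wait_for_expected_version(node: str, expected: str,
                              attempts: int = 5, delay: float = 10.0) -> bool:
    # Poll the reported version, tolerating the brief window where a
    # fast connection queries the node before it reports the new version.
    for attempt in range(1, attempts + 1):
        version = get_node_version(node)
        if version == expected:
            return True
        print(f"[{attempt}/{attempts}] {node} reports {version!r}, "
              f"expected {expected!r}")
        if attempt < attempts:
            time.sleep(delay)
    return False
```

With something like this in place, the script could call wait_for_expected_version() where it currently does a single check, and only bail out (leaving the node cordoned) after all attempts are exhausted.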

A retry function could be added. I recommend we shelve this until we decide what is happening with Magnum (which will possibly not be terribly applicable to our current Toolforge setup). Ultimately the issue is a minor one: the problem it introduces is the need to manually uncordon a node before instructing the script to move on to the next one. Since the script prompts to continue or not regardless, it doesn't add much time to an upgrade.

I'm going to close this. I believe I'm the only remaining person likely to hit this. Assuming the working theory explains the error, no one else is very likely to see it, and T326554 significantly reduces my exposure to it. It is also a fairly minor thing to work around (it involves running kubectl uncordon <node>).

rook updated the task description.