
Weird issue with the script
Closed, Declined
Public


At the end of a k8s worker upgrade run over a fast connection, the node occasionally replies with an unexpected version when the script checks the node config. The script then stops processing at that point and does not uncordon the node.

We have only ever seen this when running the script relatively close to the datacenter on a higher-speed connection, rather than the typical way of running it, namely from across the Atlantic Ocean or on a mobile connection. The error could have been caused by something other than a race condition exposed by the fast connection, but that is the working theory. This needs to be resolved before the next tools k8s upgrade.

Event Timeline

We could add a short delay and retry the version check several times before erroring out.
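A rough sketch of what that retry might look like, in Python. The upgrade script's language and the way it reads the node version are not stated in this task, so `wait_for_version`, `get_version`, and the default attempt count and delay are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable


def wait_for_version(
    get_version: Callable[[], str],
    expected: str,
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_version() until it returns `expected` or attempts run out.

    get_version is a hypothetical callable standing in for however the
    script currently checks the node config.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_seen = ""
    for attempt in range(1, attempts + 1):
        last_seen = get_version()
        if last_seen == expected:
            return last_seen
        # Only pause when another check is still to come.
        if attempt < attempts:
            sleep(delay_seconds)
    raise RuntimeError(
        f"node reported {last_seen!r} after {attempts} checks, "
        f"expected {expected!r}"
    )
```

If the working theory is right and the node simply has not caught up yet, a few checks spaced a couple of seconds apart should be enough for the script to carry on and uncordon the node itself. If the version never matches, the script still errors out as it does today.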

A retry function could be added. I recommend we shelve this until we decide what is happening with magnum (which will possibly not be terribly applicable to our current toolforge setup). Ultimately the issue is a minor one: the only problem it introduces is the need to manually uncordon a node before instructing the script to move on to the next node. Since the script prompts whether to continue regardless, it doesn't add much time to an upgrade.

I'm going to close this. I believe I'm the only remaining person who can see this. Assuming the working theory is the reason behind the error, no one else is very likely to see this, and T326554 significantly reduces my exposure to it. It is a fairly minor thing to overcome (it involved running kubectl uncordon <node>).

rook updated the task description.