
Weird issue with the wmcs-k8s-node-upgrade.py script
Open, Lowest, Public

Description

At the end of a k8s worker upgrade on a fast connection, the node occasionally replies with an unexpected version when the script checks the node config. That causes the script to stop processing there, so the node is never uncordoned during the upgrade.

We have only ever seen it when @mdipietro ran the script, quite likely because of a much faster connection than anyone else who has ever run it (the rest of us are generally either across the Atlantic Ocean or on a mobile connection). The error could have been caused by something other than a race condition exposed by a fast connection, but that is the working theory. This needs to be resolved before the next tools k8s upgrade.

Event Timeline

We could add a short delay and retry the version check several times before erroring out.
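A minimal sketch of what such a retry could look like, assuming the version check can be wrapped in a callable. The names `wait_for_version`, `get_version`, `attempts` and `delay` are hypothetical and not taken from wmcs-k8s-node-upgrade.py; the defaults are placeholders rather than tuned values.

```python
import time
from typing import Callable


def wait_for_version(
    get_version: Callable[[], str],
    expected: str,
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_version() until it returns `expected` or attempts run out.

    get_version is a hypothetical stand-in for however the script reads
    the node's reported version; it is not an existing function there.
    """
    for attempt in range(1, attempts + 1):
        if get_version() == expected:
            return True
        # Only wait if another attempt is coming.
        if attempt < attempts:
            sleep(delay)
    return False
```

The script would then uncordon the node only when this returns `True`, and error out as it does today otherwise. Passing `sleep` as a parameter is just a convenience so the helper can be exercised without real waiting.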

A retry function could be added. I recommend we shelve this until we decide what is happening with magnum (which will possibly not be terribly applicable to our current toolforge setup). Ultimately the issue is a minor one: the only problem it introduces is the need to manually uncordon a node before instructing the script to move on to the next node. Since the script prompts whether to continue regardless, it doesn't add much time to an upgrade.