
toolforge k8s control plane freezing and other stability issues
Open, High, Public

Description

Promoting wikis to a new version on the Wikimedia cluster is done with scap deploy-promote. That command depends on the Toolforge service train-blockers to find the current Phabricator task for this week's train. Running the command yields a 500 from Toolforge:

09:18:37 deploy-promote failed: <HTTPError> 500 Server Error: Internal Server Error for url: https://train-blockers.toolforge.org/api.php

Also stashbot has vanished from #wikimedia-operations:

09:17:36 ⇐ •stashbot quit (~stashbot@wikimedia/bot/stashbot) Ping timeout: 255 seconds

This sounds like an issue with Toolforge.

Event Timeline

hashar triaged this task as Unbreak Now! priority. Apr 4 2023, 9:24 AM

This is blocking the MediaWiki train (T330209) and is thus an Unbreak Now! priority.

The toolforge bot at https://train-blockers.toolforge.org/api.php gave me an error after several seconds about a DNS failure:

<br />
<b>Warning</b>:  mysqli::__construct(): php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution in <b>/data/project/train-blockers/tool-train-blockers/utils.php</b> on line <b>43</b><br />
<br />
<b>Warning</b>:  mysqli::__construct(): (HY000/2002): php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution in <b>/data/project/train-blockers/tool-train-blockers/utils.php</b> on line <b>43</b><br />
{"error":"Connection failed: php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution"}

I don't know about stashbot.

We probably shouldn't have a dependency on Toolforge in scap either, right? Maybe a try/catch with an abort/continue prompt should be added around those functions.
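
To make the try/catch suggestion above concrete, here is a minimal sketch of what such a wrapper could look like. It is not scap's actual code: the function name call_with_abort_prompt and its lookup and ask parameters are hypothetical, chosen only for illustration. The idea is that a failure of the external lookup no longer kills the run outright; the operator is asked whether to continue without the result or abort.

```python
from typing import Callable, Optional


def call_with_abort_prompt(
    lookup: Callable[[], str],
    ask: Callable[[str], str] = input,
) -> Optional[str]:
    """Run lookup(); if it fails, ask the operator whether to continue.

    Hypothetical helper, not part of scap. Returns the lookup result, or
    None when the lookup failed and the operator chose to continue
    without it. Raises SystemExit when the operator chooses to abort.
    """
    try:
        return lookup()
    except Exception as exc:
        # Broad on purpose: any failure of the external service (HTTP 500,
        # DNS error, timeout) should lead to the same operator prompt.
        answer = ask(f"Task lookup failed ({exc}). Continue without it? [y/N] ")
        if answer.strip().lower() in ("y", "yes"):
            return None
        raise SystemExit("Aborted by operator") from exc
```

A caller would pass the real lookup as a zero-argument callable and treat a None result as "no task found, proceed without it". The ask parameter defaults to the built-in input and exists mainly so the prompt can be replaced in tests or non-interactive runs.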

hashar lowered the priority of this task from Unbreak Now! to High. Apr 4 2023, 9:40 AM
hashar added subscribers: taavi, aborrero.

We probably shouldn't have a dependency on Toolforge in scap either, right? Maybe a try/catch with an abort/continue prompt should be added around those functions.

Yes, definitely, and I have filed it as T333924.

Meanwhile stashbot came back:

09:24:53 → stashbot joined (~stashbot@wikimedia/bot/stashbot)

And https://train-blockers.toolforge.org/api.php returns results again, so I am lowering the priority; this is no longer a train blocker.

@aborrero and @taavi mentioned on IRC that it is related to Kubernetes. I will let them report the actions taken and the debugging details so this task can be marked resolved.

I confirm that something happened in the Toolforge Kubernetes cluster that required us to reboot the control plane nodes.

We are still investigating the root cause.

aborrero renamed this task from toolforge k8s control plane freezing to toolforge k8s control plane freezing and other stability issues. Apr 4 2023, 11:48 AM