2024-11-26 Toolforge DNS incident
Closed, Resolved · Public

Description

There are currently widespread, intermittent DNS resolution issues within the Toolforge Kubernetes cluster that might have begun as early as Sunday. These issues are causing some jobs and deployments to fail, particularly on NFS worker nodes.

Impact:

  • Some tools may experience failed deployments or crashes
  • Job execution may be inconsistent
  • Image pulls may fail intermittently
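To check whether a given tool is seeing the intermittent failures described above, a repeated-lookup probe can help. The following is only a minimal sketch, not the tooling used in this investigation: the function name `probe_dns`, its parameters, and the hostname in the usage note are hypothetical, and it relies solely on Python's standard `socket` module.

```python
import socket
from typing import Callable, Dict, Sequence


def probe_dns(
    hostnames: Sequence[str],
    attempts: int = 5,
    resolver: Callable[[str], object] = socket.gethostbyname,
) -> Dict[str, float]:
    """Return the fraction of failed lookups for each hostname.

    The resolver is injectable so the probe can be exercised without
    touching the network (hypothetical design choice for this sketch).
    """
    failure_rates: Dict[str, float] = {}
    for name in hostnames:
        failures = 0
        for _ in range(attempts):
            try:
                resolver(name)
            except OSError:
                # socket.gaierror (resolution failure) is a subclass of OSError.
                failures += 1
        failure_rates[name] = failures / attempts
    return failure_rates
```

Run from inside an affected pod, for example as `probe_dns(["example.org"], attempts=20)`, a failure rate between 0 and 1 would suggest intermittent rather than total resolution failure. Note that resolver caching on the host may mask some failures.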

Event Timeline

Restricted Application added a subscriber: Aklapper.

The cluster is stable again and up and running. We are still investigating the root causes.

Let us know if any new issues arise.

dcaro triaged this task as High priority. Nov 26 2024, 11:09 AM

Change #1097991 had a related patch set uploaded (by David Caro; author: David Caro):

[cloud/wmcs-cookbooks@main] toolforge.k8s.reboot: swap the control node if it's the one to reboot

https://gerrit.wikimedia.org/r/1097991

Change #1097991 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge.k8s.reboot: swap the control node if it's the one to reboot

https://gerrit.wikimedia.org/r/1097991

aborrero claimed this task.
aborrero subscribed.

I think we can declare this as resolved, and work on the parent/sibling tickets.