Use conftool to build scap's kubernetes_workers host list
Closed, Resolved (Public)

Description

We currently build this list using a PuppetDB query that returns every host with Profile::Kubernetes::Mediawiki_runner declared, including nodes in the Failed state in Netbox.

This is not an operational issue, since the node would download the image automatically once it is back in a ready state and scheduled for a mediawiki pod. It does, however, confuse deployers, who see an error at the docker_pull_k8s stage.

Switching to the cluster=kubernetes,service=kubesvc conftool query would let us exclude nodes from the list by depooling them, removing the confusing errors. To avoid pulling the image on sessionstore nodes, we can add a check in mediawiki-image-download.sh.
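As an illustration of the selection logic only, here is a minimal Python sketch of building the pull list from pool state. It does not call conftool or scap: NodeRecord, image_pull_targets, and the example hostnames are hypothetical names made up for this sketch. The pooled values "yes", "no", and "inactive" are the states conftool uses; which of them should exclude a node is an assumption here (the default excludes only "inactive") and is left as a parameter. The sessionstore check proposed for mediawiki-image-download.sh runs on the node and is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRecord:
    """Hypothetical, simplified view of one conftool node entry."""
    hostname: str
    cluster: str
    service: str
    pooled: str  # one of "yes", "no", "inactive"


def image_pull_targets(records, cluster="kubernetes", service="kubesvc",
                       excluded_states=("inactive",)):
    """Return the sorted hostnames that should receive the image pull.

    Keeps records matching the cluster/service selector and drops any
    whose pooled state is listed in excluded_states (an assumption).
    """
    targets = []
    for record in records:
        if record.cluster != cluster or record.service != service:
            continue  # outside the cluster=...,service=... selector
        if record.pooled in excluded_states:
            continue  # node was taken out of the pool; skip it
        targets.append(record.hostname)
    return sorted(targets)


if __name__ == "__main__":
    # Made-up example data, not real inventory.
    example = [
        NodeRecord("node-b.example", "kubernetes", "kubesvc", "yes"),
        NodeRecord("node-a.example", "kubernetes", "kubesvc", "inactive"),
        NodeRecord("node-c.example", "other", "kubesvc", "yes"),
    ]
    print(image_pull_targets(example))
```

With the made-up data above, only the pooled node matching the selector is kept; a node marked inactive drops out of the list without any change to the Puppet profile it declares.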

Event Timeline

Change #1047031 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] wikikube: Use conftool for scap docker_pull_k8s

https://gerrit.wikimedia.org/r/1047031

Clement_Goubert changed the task status from Open to In Progress. Jun 18 2024, 12:00 PM
Clement_Goubert triaged this task as Medium priority.

Change #1047031 merged by Clément Goubert:

[operations/puppet@production] wikikube: Use conftool for scap docker_pull_k8s

https://gerrit.wikimedia.org/r/1047031

Mentioned in SAL (#wikimedia-operations) [2024-06-20T10:14:47Z] <claime> Draining and depooling mw2321.codfw.wmnet to test 1047031 - T367862

Icinga downtime and Alertmanager silence (ID=168530de-ae67-4629-9a37-1afa12ced9b6) set by cgoubert@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Test scap with host unavailable

mw2321.codfw.wmnet

Pooling mw2321 as inactive and shutting it down resulted in no errors during a mediawiki scap deployment. This should eliminate the confusion caused by unavailable hosts during deployments.

Mentioned in SAL (#wikimedia-operations) [2024-06-20T10:31:20Z] <claime> repooling and uncordoning mw2321.codfw.wmnet - T367862