PAWS down
Closed, Resolved · Public

Description

PAWS went down along with other things in T329535. Magnum still cannot deploy (T329212 and T328560), so PAWS has not come back up.

Event Timeline

Scaling the cluster to 0 nodes and back, followed by a redeploy of PAWS on the current cluster, seems to have got PAWS working. (If anyone chooses to try this route: it took several tries to shrink and grow the cluster, plus force deletes of the ingress-nginx and prod namespaces.) However, puppet is not running on haproxy; puppet agent -tv fails with:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node paws-k8s-haproxy-1.paws.eqiad.wmflabs: Failed to find paws-k8s-haproxy-1.paws.eqiad.wmflabs via exec: Execution of '/usr/local/bin/puppet-enc paws-k8s-haproxy-1.paws.eqiad.wmflabs' returned 1: 
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Editing the haproxy config file manually to add the new nodes is not working either, as haproxy fails to reload.

I restarted paws-puppetmaster-2, which got puppet 'working' on paws-k8s-haproxy-1.paws.eqiad.wmflabs. Puppet runs still don't succeed, though, because of references to VMs that no longer exist:

Feb 14 03:37:42 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:17] : 'server paws-k8s-control-1.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-1.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:18] : 'server paws-k8s-control-2.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-2.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:19] : 'server paws-k8s-control-3.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-3.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : Failed to initialize server(s) addr.
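For anyone hitting this again, a minimal sketch of pulling the unresolvable hostnames out of haproxy alerts like the ones above. The function name unresolved_hosts is hypothetical, and the regex assumes the alert wording stays exactly as in this log:

```python
import re

# Assumes haproxy's alert wording matches the log above:
#   could not resolve address '<hostname>'
UNRESOLVED_RE = re.compile(r"could not resolve address '([^']+)'")


def unresolved_hosts(log_lines):
    """Return unresolvable hostnames in first-seen order, without duplicates."""
    hosts = []
    for line in log_lines:
        match = UNRESOLVED_RE.search(line)
        if match and match.group(1) not in hosts:
            hosts.append(match.group(1))
    return hosts


# Two lines copied from the log above.
SAMPLE = [
    "Feb 14 03:37:42 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : "
    "parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:17] : "
    "'server paws-k8s-control-1.paws.eqiad.wmflabs' : could not resolve address "
    "'paws-k8s-control-1.paws.eqiad.wmflabs'.",
    "Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : "
    "Failed to initialize server(s) addr.",
]

for host in unresolved_hosts(SAMPLE):
    print(host)
```

Run against the full log above, this should list the three paws-k8s-control-N hosts, which gives a quick list of names to look for in hiera.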

I'm guessing those hostnames are in hiera and need updating for the new build. @rook can you take it from here?

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T08:15:42Z] <taavi> empty profile::wmcs::paws::control_nodes hiera key to bring PAWS back up (T329581), it contained the hostnames of the old kubeadm backed cluster which should be cleaned up properly in T327674
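As a rough aid for the cleanup, here is a sketch of checking which hostnames in a list no longer resolve in DNS, which is how stale entries in a key like profile::wmcs::paws::control_nodes show up in haproxy. The function name stale_hosts and its resolve parameter are hypothetical; it only tests DNS resolution, not whether the host is actually part of the cluster:

```python
import socket


def stale_hosts(hostnames, resolve=socket.getaddrinfo):
    """Return the hostnames that fail DNS resolution.

    `resolve` is injectable so the check can be exercised without network access.
    """
    stale = []
    for name in hostnames:
        try:
            resolve(name, None)
        except socket.gaierror:
            # The resolver could not find an address for this name.
            stale.append(name)
    return stale
```

Calling stale_hosts with the hostnames copied out of the hiera key would return the ones that should be removed or replaced before the next puppet run.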

taavi claimed this task.

The hub pod is periodically failing liveness probes, but PAWS does seem to be running otherwise. It should still get a rebuild, as it is not in the most trustworthy of states.