Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | taavi | T329581 PAWS down
Resolved | BUG REPORT | dcaro | T329535 Cloud Ceph outage 2023-02-13
Resolved | | dcaro | T329709 [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner
Resolved | | dcaro | T329711 [ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity
Open | | dcaro | T329778 [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network
Resolved | Request | Papaul | T330754 hw troubleshooting: Link hard down (probably cable) for cloudcephosd2002-dev.codfw.wmnet
Resolved | | cmooney | T329799 Add network-layer protections to avoid inadvertently lowering IRB MTU
Event Timeline
Scaling the cluster to 0 nodes and back, followed by a redeploy of PAWS on the current cluster, seems to have got PAWS working (if anyone chooses to try this route, it took several tries to shrink and grow the cluster, plus force deletes of the ingress-nginx and prod namespaces; a rough sketch of those commands is below). However, haproxy is not running puppet, and `puppet agent -tv` fails with:

```
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node paws-k8s-haproxy-1.paws.eqiad.wmflabs: Failed to find paws-k8s-haproxy-1.paws.eqiad.wmflabs via exec: Execution of '/usr/local/bin/puppet-enc paws-k8s-haproxy-1.paws.eqiad.wmflabs' returned 1
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
```
Editing the haproxy config file manually to point it at the new nodes is not working either, as haproxy fails to reload.
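For anyone retracing these steps, here is a rough sketch of the commands involved; the cluster name, namespace handling, and config paths are assumptions based on this comment and the haproxy errors, not the exact commands that were run.

```
# Hedged sketch, not the exact commands used.

# Force-delete the namespaces that would not terminate cleanly
# (stuck namespaces may additionally need their finalizers cleared).
kubectl delete namespace ingress-nginx prod --grace-period=0 --force

# Shrink the cluster to 0 nodes and grow it back, assuming an OpenStack
# Magnum-managed cluster named "paws" (this reportedly took several attempts).
openstack coe cluster resize paws 0
openstack coe cluster resize paws 2

# On the haproxy VM: validate the rendered config (including conf.d) before
# asking systemd to reload, so a bad backend list is caught up front.
sudo haproxy -c -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/conf.d/
sudo systemctl reload haproxy
```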
I restarted paws-puppetmaster-2, which got the puppet agent 'working' again on paws-k8s-haproxy-1.paws.eqiad.wmflabs. Puppet runs still fail, though, because of references to no-longer-existing VMs:
```
Feb 14 03:37:42 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:17] : 'server paws-k8s-control-1.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-1.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:18] : 'server paws-k8s-control-2.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-2.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:19] : 'server paws-k8s-control-3.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-3.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy[637]: [ALERT] 044/033742 (637) : Failed to initialize server(s) addr.
```
I'm guessing those hostnames are in hiera and need updating for the new build. @rook can you take it from here?
Mentioned in SAL (#wikimedia-cloud) [2023-02-14T08:15:42Z] <taavi> empty profile::wmcs::paws::control_nodes hiera key to bring PAWS back up (T329581), it contained the hostnames of the old kubeadm backed cluster which should be cleaned up properly in T327674
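For reference, the workaround described in that SAL entry presumably amounts to something like the following hiera change; the exact file location and value format in the paws project hiera are assumptions.

```
# Hedged sketch of the SAL workaround: empty the key so the haproxy config no
# longer renders server lines for the deleted kubeadm control nodes.
profile::wmcs::paws::control_nodes: []
```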
The hub pod is periodically failing liveness probes, but PAWS does seem to be running otherwise. It should still get a rebuild, as it is not in the most trustworthy of states.
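If anyone wants to keep an eye on the hub pod in the meantime, something like the following should surface the probe failures; the namespace and label are assumptions based on a standard JupyterHub deployment, not verified here.

```
# Hedged sketch: inspect the hub pod and its recent liveness-probe failures.
# Namespace "prod" and label "component=hub" are assumptions.
kubectl -n prod get pods -l component=hub
kubectl -n prod describe pod -l component=hub | grep -i -A3 liveness
kubectl -n prod get events --field-selector reason=Unhealthy
```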