| Status | Type | Author | Task |
|---|---|---|---|
| Resolved | | taavi | T329581 PAWS down |
| Resolved | Bug Report | dcaro | T329535 Cloud Ceph outage 2023-02-13 |
| In Progress | | dcaro | T329709 [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner |
| Resolved | | dcaro | T329711 [ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity |
| Open | | dcaro | T329778 [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network |
| Resolved | Request | Papaul | T330754 hw troubleshooting: Link hard down (probably cable) for cloudcephosd2002-dev.codfw.wmnet |
| Resolved | | cmooney | T329799 Add network-layer protections to avoid inadvertently lowering IRB MTU |
Mentioned In
- T329696: Spawn failed on PAWS
- T327674: Remove puppet code related to paws kubeadmin cluster

Mentioned Here
- T327674: Remove puppet code related to paws kubeadmin cluster
- T328560: Cannot create magnum cluster template
- T329212: New cluster to manage dns change
- T329535: Cloud Ceph outage 2023-02-13
Scaling the cluster to 0 nodes and back, followed by a redeploy of PAWS on the current cluster, seems to have got PAWS working. (If anyone chooses to try this route: it took several tries to shrink and grow the cluster, plus force-deleting the ingress-nginx and prod namespaces.) However, haproxy is not running puppet; `puppet agent -tv` fails with:
```
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node paws-k8s-haproxy-1.paws.eqiad.wmflabs: Failed to find paws-k8s-haproxy-1.paws.eqiad.wmflabs via exec: Execution of '/usr/local/bin/puppet-enc paws-k8s-haproxy-1.paws.eqiad.wmflabs' returned 1
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
```
Editing the haproxy file manually to point at the new nodes is not working either, as haproxy fails to reload.
I restarted paws-puppetmaster-2 which got puppet 'working' on paws-k8s-haproxy-1.paws.eqiad.wmflabs. Puppet runs don't work, though, because of references to no-longer-existing VMs:
```
Feb 14 03:37:42 paws-k8s-haproxy-1 haproxy: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:17] : 'server paws-k8s-control-1.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-1.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:18] : 'server paws-k8s-control-2.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-2.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy: [ALERT] 044/033742 (637) : parsing [/etc/haproxy/conf.d//k8s-api-servers.cfg:19] : 'server paws-k8s-control-3.paws.eqiad.wmflabs' : could not resolve address 'paws-k8s-control-3.paws.eqiad.wmflabs'.
Feb 14 03:37:43 paws-k8s-haproxy-1 haproxy: [ALERT] 044/033742 (637) : Failed to initialize server(s) addr.
```
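For reference, judging from the alert lines above, the puppet-generated backend file presumably looks roughly like this; the structure below is a sketch reconstructed from the log, not the actual file contents:

```
# /etc/haproxy/conf.d/k8s-api-servers.cfg (sketch; only the hostnames are taken from the log)
backend k8s-api-servers
    mode tcp
    # Lines 17-19 per the alerts -- these hostnames no longer resolve:
    server paws-k8s-control-1.paws.eqiad.wmflabs paws-k8s-control-1.paws.eqiad.wmflabs:6443 check
    server paws-k8s-control-2.paws.eqiad.wmflabs paws-k8s-control-2.paws.eqiad.wmflabs:6443 check
    server paws-k8s-control-3.paws.eqiad.wmflabs paws-k8s-control-3.paws.eqiad.wmflabs:6443 check
```

A config with unresolvable `server` addresses can be checked without touching the running process via `haproxy -c -f /etc/haproxy/haproxy.cfg`, which parses and validates only.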
I'm guessing those hostnames are in hiera and need updating for the new build. @rook can you take it from here?
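If it is indeed hiera-driven, the fix would presumably be along these lines; the key name and replacement hostnames below are hypothetical placeholders, the real key lives in the paws puppet profile:

```yaml
# Hypothetical hiera key and values -- check the paws puppet profile for the real name.
profile::paws::k8s_control_nodes:
  - paws-k8s-control-4.paws.eqiad.wmflabs  # placeholder: the newly built control-plane VMs
  - paws-k8s-control-5.paws.eqiad.wmflabs
  - paws-k8s-control-6.paws.eqiad.wmflabs
```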