We had some alerts relating to ganeti master node ganeti4008 today:
PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
Looking on the host itself, it seems to be failing to connect to other nodes in the cluster because their TLS certs are reported as expired, with errors going back to at least Dec 24th. Here is one example, but the logs are full of these for multiple hosts:
Dec 24 01:19:40 ganeti4008 ganeti[3756419]: Error in the RPC HTTP reply from 'Node {nodeName = "ganeti4007.ulsfo.wmnet", nodePrimaryIp = "10.128.0.15", nodeSecondaryIp = "10.128.0.15", nodeMasterCandidate = True, nodeOffline = False, nodeDrained = False, nodeGroup = "c26ee5b5-14f7-4782-9696-1a15046ffe98", nodeMasterCapable = True, nodeVmCapable = True, nodeNdparams = PartialNDParams {ndpOobProgramP = Nothing, ndpSpindleCountP = Nothing, ndpExclusiveStorageP = Nothing, ndpOvsP = Nothing, ndpOvsNameP = Nothing, ndpOvsLinkP = Nothing, ndpSshPortP = Nothing, ndpCpuSpeedP = Nothing}, nodePowered = True, nodeHvStateStatic = GenericContainer {fromContainer = fromList []}, nodeDiskStateStatic = GenericContainer {fromContainer = fromList []}, nodeCtime = Tue Dec 20 09:48:01 UTC 2022, nodeMtime = Tue Dec 20 09:48:01 UTC 2022, nodeUuid = "a6d252a0-b940-48ee-a590-5e8b6cda681f", nodeSerial = 1, nodeTags = TagSet {unTagSet = fromList []}}': CurlLayerError "code: CurlSSLCACert, explanation: SSL certificate problem: certificate has expired"
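Since the logs contain many of these entries for multiple hosts, a quick tally per node can show which peers are affected. The following is a minimal sketch, not a vetted tool: the regular expression is inferred only from the single log line above, `tally_rpc_errors` is a hypothetical helper name, and `SAMPLE` is that same log line abbreviated with `...` for readability.

```python
import re
from collections import Counter

# Pattern inferred from the one example log line in this task; other ganeti
# error formats may not match it.
RPC_ERROR = re.compile(
    r'Error in the RPC HTTP reply from .*?nodeName = "(?P<node>[^"]+)"'
    r'.*explanation: (?P<reason>[^"]+)"'
)


def tally_rpc_errors(lines):
    """Count (node name, curl explanation) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RPC_ERROR.search(line)
        if match:
            counts[(match.group("node"), match.group("reason"))] += 1
    return counts


# Abbreviated copy of the log line above (the node attributes are elided).
SAMPLE = (
    "Dec 24 01:19:40 ganeti4008 ganeti[3756419]: Error in the RPC HTTP reply from "
    "'Node {nodeName = \"ganeti4007.ulsfo.wmnet\", ...}': "
    "CurlLayerError \"code: CurlSSLCACert, explanation: "
    "SSL certificate problem: certificate has expired\""
)

for (node, reason), count in tally_rpc_errors([SAMPLE]).items():
    print(f"{count:5d}  {node}  {reason}")
```

In practice the list passed to `tally_rpc_errors` would be the lines of the relevant log on ganeti4008, for example an open file object or `sys.stdin`.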
Looking at the host using the IP from the above log as an example, however, its cert seems to be ok:
cmooney@ganeti4007:~$ sudo openssl x509 -in /etc/ganeti/ssl/discovery__ganeti01_svc_ulsfo_wmnet.pem -noout -dates
notBefore=Dec 24 01:34:00 2024 GMT
notAfter=Jan 21 01:34:00 2025 GMT
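To compare a certificate's validity window against a specific moment (the current time, or the timestamp of a log entry), the `-dates` output can be parsed mechanically. This is a rough sketch under a few assumptions: the date format is the one shown in the output above, month names are parsed in an English/C locale, and `parse_dates` and `validity_at` are hypothetical helper names. It only describes whichever certificate file was inspected.

```python
from datetime import datetime, timezone

# Date layout as printed by `openssl x509 -noout -dates` in the output above.
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_dates(output):
    """Return (not_before, not_after) as UTC datetimes from `-dates` output."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("notBefore", "notAfter"):
            parsed = datetime.strptime(value, OPENSSL_DATE_FORMAT)
            fields[key] = parsed.replace(tzinfo=timezone.utc)
    return fields["notBefore"], fields["notAfter"]


def validity_at(output, when):
    """Classify the certificate as 'not yet valid', 'expired' or 'valid' at `when`."""
    not_before, not_after = parse_dates(output)
    if when < not_before:
        return "not yet valid"
    if when > not_after:
        return "expired"
    return "valid"


# The two date lines from the openssl output above.
DATES = "notBefore=Dec 24 01:34:00 2024 GMT\nnotAfter=Jan 21 01:34:00 2025 GMT"

print(validity_at(DATES, datetime.now(timezone.utc)))
```

Passing a log entry's timestamp as `when` instead of the current time shows whether the inspected certificate was inside its validity window when that error was logged.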
All systems also seem to have the correct system time. The issue is definitely causing problems, as normal ganeti operations are not working on the master node:
cmooney@ganeti4008:~$ sudo gnt-node list
Timeout while talking to the master daemon. Jobs might have been submitted and will continue to run even if the call timed out. Useful commands in this situation are "gnt-job list", "gnt-job cancel" and "gnt-job watch". Error: Connect timed out
cmooney@ganeti4008:~$ sudo gnt-job list
Timeout while talking to the master daemon. Jobs might have been submitted and will continue to run even if the call timed out. Useful commands in this situation are "gnt-job list", "gnt-job cancel" and "gnt-job watch". Error: Connect timed out
I've checked a few VMs in the cluster and they appear to be online and working, so thankfully this is probably just causing problems for ganeti admin tasks and not causing user-facing or other issues.