Ganeti expired certificate errors in ulsfo
Closed, ResolvedPublic

Description

We had some alerts relating to the Ganeti master node ganeti4008 today:

PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti

Looking on the host itself it seems it is failing to connect to other nodes in the cluster due to expired TLS certs on them, going back to at least Dec 24th. Here is one example but the logs are full of these and for multiple hosts:

Dec 24 01:19:40 ganeti4008 ganeti[3756419]: Error in the RPC HTTP reply from 'Node {nodeName = "ganeti4007.ulsfo.wmnet", nodePrimaryIp = "10.128.0.15", nodeSecondaryIp = "10.128.0.15", nodeMasterCandidate = True, nodeOffline = False, nodeDrained = False, nodeGroup = "c26ee5b5-14f7-4782-9696-1a15046ffe98", nodeMasterCapable = True, nodeVmCapable = True, nodeNdparams = PartialNDParams {ndpOobProgramP = Nothing, ndpSpindleCountP = Nothing, ndpExclusiveStorageP = Nothing, ndpOvsP = Nothing, ndpOvsNameP = Nothing, ndpOvsLinkP = Nothing, ndpSshPortP = Nothing, ndpCpuSpeedP = Nothing}, nodePowered = True, nodeHvStateStatic = GenericContainer {fromContainer = fromList []}, nodeDiskStateStatic = GenericContainer {fromContainer = fromList []}, nodeCtime = Tue Dec 20 09:48:01 UTC 2022, nodeMtime = Tue Dec 20 09:48:01 UTC 2022, nodeUuid = "a6d252a0-b940-48ee-a590-5e8b6cda681f", nodeSerial = 1, nodeTags = TagSet {unTagSet = fromList []}}': CurlLayerError "code: CurlSSLCACert, explanation: SSL certificate problem: certificate has expired"

Looking at the host using the IP from the above log as an example, however, its cert seems to be OK:

cmooney@ganeti4007:~$ sudo openssl x509 -in /etc/ganeti/ssl/discovery__ganeti01_svc_ulsfo_wmnet.pem -noout -dates
notBefore=Dec 24 01:34:00 2024 GMT
notAfter=Jan 21 01:34:00 2025 GMT

All systems also seem to have the correct system time. The issue is definitely causing problems, as normal Ganeti operations are not working on the master node:

cmooney@ganeti4008:~$ sudo gnt-node list 
Timeout while talking to the master daemon. Jobs might have been submitted and will continue to run even if the call timed out. Useful commands in this situation are "gnt-job list", "gnt-job cancel" and "gnt-job watch". Error:
Connect timed out
cmooney@ganeti4008:~$ sudo gnt-job list 
Timeout while talking to the master daemon. Jobs might have been submitted and will continue to run even if the call timed out. Useful commands in this situation are "gnt-job list", "gnt-job cancel" and "gnt-job watch". Error:
Connect timed out

I've checked a few VMs in the cluster and they appear to be online and working, so thankfully this is probably just causing problems for ganeti admin tasks and not causing user-facing or other issues.

Event Timeline

cmooney triaged this task as Medium priority. Jan 2 2025, 9:48 AM
cmooney created this task.

Checking with curl from ganeti4008 to ganeti4006 on TCP 1811, this is reported:

* Server certificate:
*  subject: CN=ganeti.example.com
*  start date: Dec 17 19:12:26 2019 GMT
*  expire date: Dec 15 19:12:26 2024 GMT
*  issuer: CN=ganeti.example.com

Not sure if that is relevant or not - it's just "ganeti.example.com" and self-signed, so perhaps a red herring. However, comparing to ganeti3006 in esams, it shows the same subject but the cert dates are still valid:

* Server certificate:
*  subject: CN=ganeti.example.com
*  start date: Aug 16 12:19:03 2023 GMT
*  expire date: Aug 14 12:19:03 2028 GMT
*  issuer: CN=ganeti.example.com
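
To compare hosts quickly, the "expire date" line in that curl output can be turned into a number of days. This is only a rough sketch: the function name is made up for illustration, the exact curl invocation in the comment is an assumption (the command used above was not recorded), and it relies on GNU date for the -d option.

```shell
# curl_cert_days_left [NOW_EPOCH]
# Reads `curl -v` output on stdin and prints the whole number of days until
# the first "expire date" it finds (negative if the cert has already expired).
# NOW_EPOCH is optional and defaults to the current time.
curl_cert_days_left() {
    now=${1:-$(date -u +%s)}
    # Keep only the date part of the "*  expire date: ..." line.
    expiry=$(sed -n 's/^\* *expire date: *//p' | head -n 1)
    [ -n "$expiry" ] || return 1
    # GNU date parses strings such as "Dec 15 19:12:26 2024 GMT".
    end=$(date -u -d "$expiry" +%s) || return 1
    echo $(( (end - now) / 86400 ))
}

# Assumed usage (-k skips verification so an expired cert can still be read):
#   curl -vk https://ganeti4006.ulsfo.wmnet:1811/ 2>&1 | curl_cert_days_left
```

A negative result for any node would point at the internal cluster certificate rather than the RAPI one.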

It seems we lost monitoring for the internal Ganeti cert expiry in the whole Icinga->AM work? This was definitely alerted for in the past. I'll open a separate task for this.

The /etc/ganeti/ssl/discovery__ganeti01_svc_ulsfo_wmnet.pem certificate is issued by cfssl, but the RAPI endpoint can't operate without the certs of the underlying Ganeti cluster. This doesn't impact running VMs.
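
Since there are two separate certificates in play, a local expiry check needs to be pointed at both files. Below is a minimal sketch using "openssl x509 -checkend"; the function name, the 30-day default and the Nagios-style return codes are my own assumptions and not the check we had in Icinga.

```shell
# check_cert_expiry FILE [WARN_DAYS]
# Return codes follow the Nagios plugin convention:
#   0 OK, 1 WARNING (expires within WARN_DAYS), 2 CRITICAL (expired), 3 UNKNOWN
check_cert_expiry() {
    file=$1
    warn_days=${2:-30}
    # With no output option, this only tests that the certificate parses.
    if ! openssl x509 -in "$file" -noout >/dev/null 2>&1; then
        echo "UNKNOWN: cannot read certificate $file"
        return 3
    fi
    # -checkend N exits non-zero if the cert has expired N seconds from now.
    if ! openssl x509 -in "$file" -noout -checkend 0 >/dev/null; then
        echo "CRITICAL: $file has expired"
        return 2
    fi
    if ! openssl x509 -in "$file" -noout -checkend $((warn_days * 86400)) >/dev/null; then
        echo "WARNING: $file expires within $warn_days days"
        return 1
    fi
    echo "OK: $file is valid for more than $warn_days days"
    return 0
}

# Example (run as root on a node):
#   check_cert_expiry /etc/ganeti/ssl/discovery__ganeti01_svc_ulsfo_wmnet.pem
#   check_cert_expiry /var/lib/ganeti/server.pem
```

The second path is the internal server certificate that is named further down in this task.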

This task made me check the expiry of the other clusters, and it turned out the eqsin cluster was also only three days away from expiry! I tried to renew the certificates following https://wikitech.wikimedia.org/wiki/Ganeti#Renew_cluster_certificates, but that aborted because the gnt-cluster command was unable to run commands on the non-master nodes. @akosiaris and I looked into this, and it turned out that the permissions of the Ganeti SSH key were too wide for some reason (644), so the incoming SSH call was rejected by OpenSSH. After fixing the permissions we re-ran "gnt-cluster renew-crypto", but the master node itself then failed to start with certificate errors (even though the server and client-side certs had apparently been rotated correctly). One possible remaining factor is that the SSH key permissions on the master node were not initially updated to 600 as well (that was only realised later). Eventually we rolled back to the original certs (which renew-crypto helpfully makes backup copies of). The root cause is still unclear (debugging this involved significant head-scratching...); hopefully re-running renew-crypto with fixed SSH permissions across the cluster resolves this for good.

It's unclear why the permissions for the Ganeti SSH key were too wide. The file was created on 01-12-2022 with these permissions, which matches the time in SAL when the server was added to the Ganeti cluster (as a refresh of older servers), so maybe there's a bug in the underlying Ganeti code that adds a node (or at least in the Ganeti version used in 2022; we're using a newer one these days). I'll try to repro this in ganeti-test next week.
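
Before re-running renew-crypto it is worth checking the key mode on every node. A small sketch of such a check follows; the function name is made up, the key path in the usage comment is a placeholder (the actual path is not recorded in this task), and "stat -c" assumes GNU coreutils.

```shell
# key_mode_ok FILE
# Succeeds if FILE is readable by its owner only (mode 600 or 400).
# Returns 1 if the mode is wider than that, 2 if the file cannot be stat'ed.
key_mode_ok() {
    mode=$(stat -c '%a' "$1" 2>/dev/null) || return 2
    case $mode in
        600|400)
            return 0
            ;;
        *)
            echo "$1: mode $mode is too wide, expected 600" >&2
            return 1
            ;;
    esac
}

# Hypothetical usage on each node (replace the placeholder path):
#   key_mode_ok /path/to/ganeti_ssh_key || chmod 600 /path/to/ganeti_ssh_key
```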

Mentioned in SAL (#wikimedia-operations) [2025-01-03T12:10:25Z] <moritzm> renewed internal Ganeti certs in eqsin (would have expired in two days) T382873

Icinga downtime and Alertmanager silence (ID=d605aa1f-a578-4f3f-b71b-febdf98df1fb) set by jmm@cumin2002 for 3:00:00 on 4 host(s) and their services with reason: renew certs

ganeti[4005-4008].ulsfo.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-01-07T11:25:11Z] <moritzm> refreshed Ganeti internal cert for ulsfo (after adding a manual temp cert to unblock itself) T382873

MoritzMuehlenhoff claimed this task.

I created a temporary /var/lib/ganeti/server.pem certificate to unblock gnt-cluster (following https://wikitech.wikimedia.org/wiki/Ganeti#Expired_cluster_certificates) and eventually refreshed the full server and client cert stack using "gnt-cluster renew-crypto".

Running gnt-cluster verify still flagged some issues, notably that the servers were installed at a time when we were adding a "this file was autogenerated by Puppet on $DATE" header to /etc/hosts, which makes "gnt-cluster verify" complain about mismatched files. I simply removed the header (in line with current installations) and now all is working fine:

jmm@ganeti4008:~$ sudo gnt-cluster verify
Submitted jobs 528363, 528364
Waiting for job 528363 ...
Tue Jan  7 11:53:31 2025 * Verifying cluster config
Tue Jan  7 11:53:31 2025 * Verifying cluster certificate files
Tue Jan  7 11:53:31 2025 * Verifying hypervisor parameters
Tue Jan  7 11:53:31 2025 * Verifying all nodes belong to an existing group
Waiting for job 528364 ...
Tue Jan  7 11:53:32 2025 * Verifying group '1'
Tue Jan  7 11:53:32 2025 * Gathering data (4 nodes)
Tue Jan  7 11:53:32 2025 * Gathering information about nodes (4 nodes)
Tue Jan  7 11:53:36 2025 * Gathering disk information (4 nodes)
Tue Jan  7 11:53:36 2025 * Verifying configuration file consistency
Tue Jan  7 11:53:36 2025 * Verifying node status
Tue Jan  7 11:53:36 2025 * Verifying instance status
Tue Jan  7 11:53:36 2025 * Verifying orphan volumes
Tue Jan  7 11:53:36 2025 * Verifying N+1 Memory redundancy
Tue Jan  7 11:53:36 2025 * Other Notes
Tue Jan  7 11:53:37 2025 * Hooks Results
jmm@ganeti4008:~$ echo $?
0
jmm@ganeti4008:~$