SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update
Open, Medium, Public

Description

sudo gnt-cluster verify fails in the ulsfo and eqsin clusters after the update to Bullseye, in the "Verifying node status" step, where it connects between nodes via SSH in a loop. Apart from that, all cluster operations are working fine.

I obtained the exact command from the node-daemon.log:

jmm@ganeti5003:~$ sudo ssh -oEscapeChar=none -oHashKnownHosts=no -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts -oUserKnownHostsFile=/dev/null -oCheckHostIp=no -oConnectTimeout=10 -o\HostKeyAlias=ganeti01.svc.eqsin.wmnet -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 root@ganeti5002.eqsin.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:SvgqUu5xa8VQDLoLxIyOmrLn8MeUT9pxnen80BQKSfY.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending RSA key in /var/lib/ganeti/known_hosts:1
  remove with:
  ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqsin.wmnet"
ECDSA host key for ganeti01.svc.eqsin.wmnet has changed and you have requested strict checking.
Host key verification failed.
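
The warning names an RSA entry on line 1 of /var/lib/ganeti/known_hosts, while the remote host offered an ECDSA key. To check which key types are recorded for the cluster alias, something like the following minimal Python sketch could be used. It assumes plain (unhashed) entries, which matches -oHashKnownHosts=no in the command above, and it ignores wildcard and negated host patterns. The function name and the sample key data are placeholders, not taken from the cluster.

```python
def key_types_for_alias(known_hosts_text, alias):
    """Return the key types stored for `alias` in known_hosts content.

    Assumption: entries are unhashed and use literal host names only.
    """
    types = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0].startswith("@"):
            # Drop an optional marker such as @cert-authority or @revoked.
            fields = fields[1:]
        if len(fields) < 3:
            continue  # not a "hosts keytype key" entry
        hosts, key_type = fields[0].split(","), fields[1]
        if alias in hosts:
            types.append(key_type)
    return types


# Placeholder entry; the key material is not real.
sample = "ganeti01.svc.eqsin.wmnet ssh-rsa AAAA-placeholder\n"
print(key_types_for_alias(sample, "ganeti01.svc.eqsin.wmnet"))
```

If the list contains only ssh-rsa while the nodes present ECDSA or Ed25519 keys, the stored entry and the offered key cannot match, which is consistent with the message above.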

Surprisingly, this doesn't happen on ganeti/drmrs (which was installed with Bullseye), nor on ganeti/esams or ganeti/test (both of which were upgraded from Buster to Bullseye). The same command on ganeti-test works fine:

jmm@ganeti-test2002:~$ sudo ssh -oEscapeChar=none -oHashKnownHosts=no -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts -oUserKnownHostsFile=/dev/null -oCheckHostIp=no -oConnectTimeout=10 -oHostKeyAlias=ganeti-test01.svc.codfw.wmnet -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 root@ganeti-test2001.codfw.wmnet

When digging around I also found https://github.com/ganeti/ganeti/issues/1608 which seems to be the exact issue.

Event Timeline

Should we have /var/lib/ganeti/known_hosts be managed by Puppet?

Yeah, I think that's the best fix for us. The impact still isn't fully understood as far as the upstream code base is concerned; I think it's caused by newer OpenSSH releases being more stringent in validating

Mentioned in SAL (#wikimedia-operations) [2024-02-26T20:44:55Z] <mutante> T358237 used the next hostname number,1004, to avoid the duplicate IP issue. makevm cookbook is at attempt 103/240 to detect a reboot of the VM and uptime just keeps going up. used the "gnt-instance console --show-cmd " trick to get a console despite https://phabricator.wikimedia.org/T309724 - was missing partman config

Change #1021896 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Install a Puppet generator to create a known hosts file for Ganeti

https://gerrit.wikimedia.org/r/1021896

Change #1023486 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] WIP: add function to generate ganeti known hosts

https://gerrit.wikimedia.org/r/1023486

Change #1023486 merged by JHathaway:

[operations/puppet@production] ganeti: function to generate ganeti known hosts

https://gerrit.wikimedia.org/r/1023486
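
For illustration only, here is a rough idea of what generating such a known hosts file could involve. This is not the Puppet function from the changes above. It is a minimal Python sketch that assumes each node has its own host keys and that the file should list all of them under the cluster alias passed via HostKeyAlias. The function name, the input structure and the key data are hypothetical.

```python
def render_known_hosts(alias, node_keys):
    """Build known_hosts content mapping one alias to every node's host keys.

    node_keys: dict of node name -> list of (key_type, base64_key) tuples.
    Nodes are processed in sorted order so the output is stable between runs.
    """
    lines = []
    for node in sorted(node_keys):
        for key_type, key in node_keys[node]:
            line = f"{alias} {key_type} {key}"
            if line not in lines:  # nodes may share a key; emit it once
                lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical input; the key material is a placeholder.
keys = {
    "ganeti5002.eqsin.wmnet": [("ecdsa-sha2-nistp256", "AAAA-placeholder-2")],
    "ganeti5003.eqsin.wmnet": [("ecdsa-sha2-nistp256", "AAAA-placeholder-3")],
}
print(render_known_hosts("ganeti01.svc.eqsin.wmnet", keys), end="")
```

Generating the file from configuration management rather than leaving it to the cluster tooling keeps the recorded key types in step with what the nodes actually present, under the assumptions stated above.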