
mw2295.codfw.wmnet returned [255]: Host key verification failed.
Closed, Resolved (Public)

Description

The host got reimaged:

2021-01-27
01:25 	<legoktm@cumin1001> 	conftool action : set/pooled=yes; selector: name=mw2295.codfw.wmnet 	[production]
01:20 	<legoktm@cumin1001> 	conftool action : set/pooled=no; selector: name=mw2295.codfw.wmnet 	[production]
00:52 	<legoktm@cumin1001> 	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE 	[production]
00:49 	<legoktm@cumin1001> 	START - Cookbook sre.hosts.downtime for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE 	[production]

Running scap today complains about an erroneous ssh host key:

deploy1001$
10:03:59 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw2295.codfw.wmnet returned [255]: Host key verification failed.

10:04:00 1 hosts had failures restarting php-fpm
mw2295
The last Puppet run was at Wed Jan 27 09:24:45 UTC 2021 (10122 minutes ago). Puppet is disabled. 

$ last
hashar   pts/0        208.80.153.54    Wed Feb  3 10:07   still logged in
legoktm  pts/0        208.80.153.54    Wed Jan 27 01:22 - 01:22  (00:00)
reboot   system boot  4.19.0-13-amd64  Wed Jan 27 01:11   still running
reboot   system boot  4.19.0-13-amd64  Wed Jan 27 00:46 - 01:08  (00:21)

But it is still in the dsh groups:

grep -R mw2295 /etc/dsh
/etc/dsh/group/mediawiki-installation:mw2295.codfw.wmnet
/etc/dsh/group/api_appserver:mw2295.codfw.wmnet

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-02-03T10:16:43Z] <legoktm> re-enabled puppet on mw2295 (T273726)

hashar assigned this task to Legoktm.

SSH host keys are collected by Puppet on the hosts and written to /etc/ssh/ssh_known_hosts. Since Puppet was disabled on mw2295, its new key was never collected. That is solved now!
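To spot other hosts in the same state, the dsh group files can be compared against the known-hosts file. A minimal sketch, assuming the plain (unhashed) `names keytype key` layout visible in the Puppet diff in this task and one host per line in the dsh group file. The function names `known_host_names` and `missing_hosts` are made up for illustration; they are not part of scap or Puppet.

```python
def known_host_names(known_hosts_text):
    """Return every host name or address listed in plain known_hosts text."""
    names = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blanks and comments. Hashed ("|1|...") and marker ("@...")
        # entries are ignored in this sketch.
        if not line or line[0] in "#|@":
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "names keytype key" line
        # The first field is a comma-separated list of names and addresses.
        names.update(fields[0].split(","))
    return names


def missing_hosts(group_text, known_hosts_text):
    """List hosts from a dsh group file that have no known_hosts entry."""
    known = known_host_names(known_hosts_text)
    hosts = [h.strip() for h in group_text.splitlines()]
    return [h for h in hosts if h and not h.startswith("#") and h not in known]
```

Fed the contents of /etc/dsh/group/mediawiki-installation and /etc/ssh/ssh_known_hosts on the deployment host, `missing_hosts` would presumably have returned mw2295.codfw.wmnet before the Puppet run shown further down in this task.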

Puppet hadn't run since the reimaging, which is a problem. I would've logged into it after the reimaging to run scap pull before repooling it, but it's possible I didn't read the MOTD properly.

After re-enabling puppet and then forcing a puppet run on mw2295 and then on deploy1001:

Notice: /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: 
--- /etc/ssh/ssh_known_hosts	2021-02-03 09:55:51.564327004 +0000
+++ /tmp/puppet-file20210203-8385-tyjj4o	2021-02-03 10:23:13.156410659 +0000
@@ -1287,6 +1287,7 @@
 mw2292.codfw.wmnet,mw2292,10.192.0.162,2620:0:860:101:10:192:0:162 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIkWc6yIcR0e/kQX24yfcvukVcj0SYlsqMrNy9Q4qgE+MUZpwRg+q/+2wqatYDQlFuy5tlFgqaxZ1FmowpEKBTk=
 mw2293.codfw.wmnet,mw2293,10.192.0.163,2620:0:860:101:10:192:0:163 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBLOnHvHuIWYLXYULjFDIdRNzvajYDdup2U+eynjZ9H1mlBfFr91xEioL6Uaz9B5oEh8ZSFs4ArqpfSWZAAEb5rA=
 mw2294.codfw.wmnet,mw2294,10.192.0.164,2620:0:860:101:10:192:0:164 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBKIJmWCk9eLpLnLr41/eLfPGqmsnMrdPC6KH88KOmZ3UqSa3tPDzLrhL6UmBpEgOlDsVusDPu3o6+QuW/vLZx1I=
+mw2295.codfw.wmnet,mw2295,10.192.0.165,2620:0:860:101:10:192:0:165 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOdJ3WhOvOB/k01lnZynn+0rpV+cWylCU3quVOezb69ZiDLvZnDJUhFVEDVQ7g1yjX0EV8stu8MKUMX9uR9Y3vY=
 mw2296.codfw.wmnet,mw2296,10.192.0.166,2620:0:860:101:10:192:0:166 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCrgQ8o04cgGzxDJM4i/CDgtx1YM/hXxt85s+lpp4nREPtL833JgohK2dl0ghbMJSAzEKWH9eeqyk1AK1bFP0Xg=
 mw2297.codfw.wmnet,mw2297,10.192.0.167,2620:0:860:101:10:192:0:167 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBMOYNvrNyoa8ZXY64hLh6FCnrHNM1dncU/NnqkboK3Tq0fqMb4xyFWHnXrvpe1Dgg7WzKYYObql93wNLd7tsDHY=
 mw2298.codfw.wmnet,mw2298,10.192.0.168,2620:0:860:101:10:192:0:168 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBJH8gR2CSjJk3ORl/LfORUBcXkLJ0x3ABgzT6hXxTSTTunf6c4zTznIiWkw+K9/McEBNS7K7ZZu5hffkvDTtnY8=

Notice: /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: content changed '{md5}36b43193b4282f7f4528e86241a8c4d1' to '{md5}0935ea7e99a5001a850fe4764de8b559'

So now it should work properly. What I don't understand is how no other deployer has complained about this since I reimaged it.

Some kind of fluke? meh. I now check that puppet is enabled after I finish reimaging each host.
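That post-reimage check could be scripted. A rough sketch, assuming the agent was disabled via `puppet agent --disable`, which leaves a JSON lock file behind. The path below is an assumed default for Debian-packaged Puppet and may differ on these hosts (`puppet config print agent_disabled_lockfile` shows the real location); the helper name `puppet_disabled_reason` is hypothetical.

```python
import json
from pathlib import Path

# Assumed default location; verify on the host before relying on it.
DEFAULT_LOCKFILE = Path("/var/lib/puppet/state/agent_disabled.lock")


def puppet_disabled_reason(lockfile=DEFAULT_LOCKFILE):
    """Return None if the agent is enabled, else the recorded reason."""
    lockfile = Path(lockfile)
    if not lockfile.exists():
        return None  # no lock file: the agent is not disabled
    try:
        # The lock file is expected to hold {"disabled_message": "..."}.
        return json.loads(lockfile.read_text())["disabled_message"]
    except (ValueError, KeyError, TypeError):
        # Lock file exists but is not the expected JSON object.
        return "disabled (no reason recorded)"
```

Callers should compare the result with `is not None` rather than testing truthiness, since an empty reason string still means the agent is disabled. Reading the lock file may require root.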