Beta cluster scap job ( beta-scap-eqiad ) fails due to puppet erasing /etc/ssh/ssh_known_hosts
Open, Stalled, Normal, Public

Description

We have random failures of the Jenkins job that runs scap on beta ( https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ ). Looking at deployment-tin.deployment-prep.eqiad.wmflabs, the cause seems to be puppet emptying /etc/ssh/ssh_known_hosts (only the header line is left):

Info: Caching catalog for deployment-tin.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '1488369728'
Notice: /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: 
--- /etc/ssh/ssh_known_hosts    2017-03-01 12:23:44.684803618 +0000
+++ /tmp/puppet-file20170301-26119-1x0zk44  2017-03-01 13:54:03.737377130 +0000
@@ -1,62 +1 @@
 # This file is managed by puppet
-deployment-apertium02.deployment-prep.eqiad.wmflabs,deployment-apertium02,10.68.22.254 ecdsa-sha2-nistp256
-deployment-aqs01.deployment-prep.eqiad.wmflabs,deployment-aqs01,10.68.18.237 ecdsa-sha2-nistp256
<snip>
-deployment-urldownloader.deployment-prep.eqiad.wmflabs,deployment-urldownloader,10.68.16.135 ecdsa-sha2-nistp256
-deployment-zookeeper01.deployment-prep.eqiad.wmflabs,deployment-zookeeper01,10.68.17.157 ecdsa-sha2-nistp256
-deployment-zotero01.deployment-prep.eqiad.wmflabs,deployment-zotero01,10.68.17.102 ecdsa-sha2-nistp256

Notice: /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: content changed '{md5}6391f0f800c5fba5000111eeb73304ed' to '{md5}3e04e1a906c0d2df1d4eea34e4bd0f4e'
Notice: /Stage[main]/Deployment::Deployment_server/Salt::Grain[deployment_server]/Exec[ensure_deployment_server_true]/returns: executed successfully
Info: Salt::Grain[deployment_server]: Scheduling refresh of Exec[deployment_server_sync_all]
Notice: /Stage[main]/Deployment::Deployment_server/Exec[deployment_server_sync_all]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Deployment::Deployment_server/Exec[eventual_consistency_deployment_server_init]/returns: executed successfully
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root.d]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/root.d]/ensure: removed
Notice: Finished catalog run in 164.36 seconds
hashar created this task.Wed, Mar 1, 2:12 PM
Restricted Application added a subscriber: Aklapper.Wed, Mar 1, 2:12 PM
hashar added a comment.Thu, Mar 2, 8:41 PM

$ md5sum <( echo "# This file is managed by puppet" )
3e04e1a906c0d2df1d4eea34e4bd0f4e /dev/fd/63

zgrep -c 3e04e1a906c0d2df1d4eea34e4bd0f4e /var/log/puppet.log*
/var/log/puppet.log:8
/var/log/puppet.log.1:13
/var/log/puppet.log.2.gz:4
/var/log/puppet.log.3.gz:0
/var/log/puppet.log.4.gz:0
/var/log/puppet.log.5.gz:0
/var/log/puppet.log.6.gz:0
/var/log/puppet.log.7.gz:0

The first occurrence had configuration version 1488328294 Wed, 01 Mar 2017 00:31:34 GMT

The previous one was 45 minutes before at 1488325610 Tue, 28 Feb 2017 23:46:50 GMT
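For reference, a minimal Python sketch of the two checks above. It assumes the puppet configuration version is a Unix epoch (which is how it is read in this comment) and that the emptied file consists of the header line plus a trailing newline, as `echo` produces. The name `config_version_to_utc` is just for illustration.

```python
import hashlib
from datetime import datetime, timezone

# Content of the emptied file, assuming a single header line with a newline.
HEADER = b"# This file is managed by puppet\n"

# Checksum to grep for in the puppet logs; the task reports
# 3e04e1a906c0d2df1d4eea34e4bd0f4e for this content.
EMPTY_MD5 = hashlib.md5(HEADER).hexdigest()


def config_version_to_utc(version: str) -> str:
    """Render a puppet configuration version (assumed Unix epoch) as UTC."""
    moment = datetime.fromtimestamp(int(version), tz=timezone.utc)
    return moment.strftime("%a, %d %b %Y %H:%M:%S GMT")


print(EMPTY_MD5)
print(config_version_to_utc("1488328294"))  # Wed, 01 Mar 2017 00:31:34 GMT
print(config_version_to_utc("1488325610"))  # Tue, 28 Feb 2017 23:46:50 GMT
```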

Mentioned in SAL (#wikimedia-releng) [2017-03-02T20:47:54Z] <hashar> deployment-prep: restarted apache/puppet master. Maybe that will fix ssh_known_hosts being emptied from time to time T159332

Paladox added a subscriber: Paladox.Fri, Mar 3, 9:12 AM
Zppix added a subscriber: Zppix.Fri, Mar 3, 2:35 PM

So this is happening continually, but the root cause is somewhat confounding.

According to the log, this is happening in /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content, but it shouldn't be happening there, since that class should *only* be modifying ssh_known_hosts for production: https://github.com/wikimedia/puppet/blob/production/modules/ssh/manifests/client.pp#L7

thcipriani triaged this task as "Normal" priority.Fri, Mar 3, 4:52 PM
thcipriani moved this task from To Triage to In-progress on the Release-Engineering-Team board.

It seems this is the result of a patch on the beta puppetmaster related to T153163: Set up and use exported resources for Tool Labs's shared knowledge. Specifically, I8cfba01da13191384a607ac75917c55a28ecfda1

hashar added a comment.Fri, Mar 3, 4:55 PM

Ahhh I forgot about cherry-picks. So yeah that seems to enable collection of host keys to generate the file. However we have:

modules/role/manifests/ci/slave/labs.pp
# The sshkey resource seems to modify file permissions and make it
# unreadable - this is a known bug (https://tickets.puppetlabs.com/browse/PUP-2900)
# Trying to define this file resource, and notify the resource to be ensured
# from the sshkey resource, to see if it fixes the problem
file { '/etc/ssh/ssh_known_hosts':
    ensure => file,
    mode   => '0644',
}

And I guess puppet has two different resources in a race condition for /etc/ssh/ssh_known_hosts. The one above is possibly causing an empty file.

Maybe try to delete that snippet, cherry-pick, and see what happens? It seems it was added solely to fix the file permissions, which left the file unreadable by group and others. That might have been fixed in puppet since then.

And I guess puppet has two different resources in a race condition for /etc/ssh/ssh_known_hosts. The one above is possibly causing an empty file.

I have my doubts that this is what's happening. It seems that the puppet file resource would just change permissions without modifying the contents of the file. Also, according to puppet's output, ssh::client is responsible for both adding and removing the keys.

Removal Run

Notice: /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: 
--- /etc/ssh/ssh_known_hosts	2017-03-03 15:24:13.027867703 +0000
+++ /tmp/puppet-file20170303-5268-1uylndk	2017-03-03 19:53:45.895441494 +0000
@@ -1,62 +1 @@
 # This file is managed by puppet
-deployment-apertium02.deployment-prep.eqiad.wmflabs,deployment-apertium02,10.68.22.254 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBJV8KDbwY/thBxkZMXtkuQDSdvwW1immdEZx5Jm/GCehuI8dsneF4gFXRzNDWggSVqQG1DmTSE7uGNm8tMTCoKU=

etc.

Next puppet run

thcipriani@deployment-tin:~$ sudo puppet agent -t
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-tin.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '1488571223'
Notice: /Stage[main]/Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: 
--- /etc/ssh/ssh_known_hosts	2017-03-03 19:53:45.955441703 +0000
+++ /tmp/puppet-file20170303-13172-kjdhgg	2017-03-03 20:01:54.373160546 +0000
@@ -1 +1,62 @@
 # This file is managed by puppet
+deployment-apertium02.deployment-prep.eqiad.wmflabs,deployment-apertium02,10.68.22.254 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBJV8KDbwY/thBxkZMXtkuQDSdvwW1immdEZx5Jm/GCehuI8dsneF4gFXRzNDWggSVqQG1DmTSE7uGNm8tMTCoKU=

etc.

It seems like the puppet query resources for @SshKey are flapping?
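To tell the two kinds of run apart when scanning logs, the hunk header of the diff that puppet prints is enough. Below is a rough Python sketch; `classify_hunk` and its return labels are made up for illustration, and it only relies on the standard unified diff convention that an omitted line count means 1.

```python
import re

# Unified diff hunk header: "@@ -start[,count] +start[,count] @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def classify_hunk(header: str) -> str:
    """Classify a puppet run by the hunk header of its ssh_known_hosts diff."""
    match = HUNK_RE.match(header)
    if match is None:
        raise ValueError("not a unified diff hunk header: %r" % header)
    # An omitted count in a hunk header means one line.
    old_len = int(match.group(2)) if match.group(2) is not None else 1
    new_len = int(match.group(4)) if match.group(4) is not None else 1
    if new_len < old_len:
        return "removal"
    if new_len > old_len:
        return "restore"
    return "same-length"


print(classify_hunk("@@ -1,62 +1 @@"))  # removal
print(classify_hunk("@@ -1 +1,62 @@"))  # restore
```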

hashar added a comment.Fri, Mar 3, 8:51 PM

I thought that maybe the File resource would mess with the content somehow.

Apparently beta uses PuppetDB 413cc5a7a4e8043ecce2fb40fb24cbed627a4094 so maybe that is the part flapping. I don't know anything about PuppetDB though :(

I think this means these resources exist. I found them in the PostgreSQL database that backs PuppetDB. Documenting here:

thcipriani@deployment-puppetdb01:~$ psql -U puppetdb -h localhost
Password for user puppetdb:
psql (9.4.10)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

puppetdb=> \c puppetdb
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
You are now connected to database "puppetdb" as user "puppetdb".
puppetdb=> select title from catalog_resources where type = 'Sshkey' and exported = True limit 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]-------------------------------------------
title | deployment-ms-be02.deployment-prep.eqiad.wmflabs

Expanded display is off.
puppetdb=> select count(*) from catalog_resources where type = 'Sshkey' and exported = True limit 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]
count | 61

Expanded display is off.

Good query to note

puppetdb=> select title, parameters from catalog_resources cr join resource_params rp on cr.resource = rp.resource join resource_params_cache rpc on rpc.resource = cr.resource where cr.type = 'Sshkey' and exported = True limit 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]-------------------------------------------------------------------
title      | deployment-mediawiki06.deployment-prep.eqiad.wmflabs
parameters | {"ensure":"present","host_aliases":["deployment-mediawiki06","10.68.19.241"],"key":"AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHNqtybQsR/Gv8ODuClW0u0xZrRLN9zkE6r3CqPGd7EzgQrWxmveL9/qaMtanxv/41gj/4zk0YUXcAT7xIhSZCM=","type":"ecdsa-sha2-nistp256"}
thcipriani changed the task status from "Open" to "Stalled".Tue, Mar 7, 6:44 PM
thcipriani added subscribers: akosiaris, Joe.

So I put a dumb fix in place: I copied a good version of the known_hosts file into ~/.ssh/known_hosts for the jenkins-deploy user. The beta-scap-eqiad job will continue to work, but the global known_hosts file will continue to flap.
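If this workaround needs to be repeated, something like the following Python sketch could guard against copying an already emptied file. It is not what was actually run here; the function names and the "at least one non-comment line" check are assumptions for illustration.

```python
import shutil
from pathlib import Path


def has_host_entries(path: Path) -> bool:
    """True if the file has at least one non-blank, non-comment line."""
    with open(path, encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        return any(s and not s.startswith("#") for s in stripped)


def install_user_known_hosts(good_copy: Path, home: Path) -> bool:
    """Copy a known-good file to <home>/.ssh/known_hosts.

    Returns False without touching anything if the source only
    contains the puppet header, i.e. it was already emptied.
    """
    if not has_host_entries(good_copy):
        return False
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    target = ssh_dir / "known_hosts"
    shutil.copyfile(good_copy, target)
    target.chmod(0o644)
    return True
```

This would have to run as the target user (or be followed by a chown), since the sketch does not change ownership.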

I can't account for why this is flapping. The parameters for each host seem correct in the database; however, it is not clear to me whether those values are being changed in the database, or whether PuppetDB is intermittently failing to answer queries correctly, and which of the two is causing /etc/ssh/ssh_known_hosts to be emptied.

@Joe or @akosiaris, given your familiarity with puppetdb, can you provide any guidance?
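One way to narrow this down would be to sample, at the same moments, the count of exported Sshkey rows in the database and the number of host entries in /etc/ssh/ssh_known_hosts, then compare them. The Python sketch below only does the comparison; how the samples are collected is left open, and the `diagnose` function and its labels are hypothetical.

```python
from typing import List, Tuple


def diagnose(samples: List[Tuple[int, int]]) -> List[str]:
    """Label each (db_count, file_count) sample.

    db_count:   exported Sshkey rows seen in the PuppetDB database
    file_count: host entries present in /etc/ssh/ssh_known_hosts
    """
    verdicts = []
    for db_count, file_count in samples:
        if db_count == 0:
            # The rows themselves are gone, so the data is changing.
            verdicts.append("database")
        elif file_count < db_count:
            # Rows exist but did not reach the file: suspect the query path.
            verdicts.append("query-path")
        else:
            verdicts.append("consistent")
    return verdicts


print(diagnose([(61, 61), (61, 0), (0, 0)]))
# ['consistent', 'query-path', 'database']
```

Since the file reflects the last puppet run rather than the moment of sampling, a single mismatch proves little; a pattern over several runs would be more telling.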