Rebuild beta cluster saltmaster on Jessie
Closed, Resolved (Public)

Description

From Andrew: the production salt master (neodymium) has been rebuilt on Jessie.

We therefore want to rebuild deployment-salt, the beta cluster saltmaster, on Jessie as well.

Event Timeline

Instances should be moving over to the new saltmaster as puppet runs across the cluster.

These instances are still stuck on the old saltmaster, mostly due to puppet failures:

krenair@deployment-salt:~$ sudo salt '*' cmd.run echo
deployment-puppetmaster.deployment-prep.eqiad.wmflabs:
deployment-aqs01.deployment-prep.eqiad.wmflabs:
deployment-cache-upload04.deployment-prep.eqiad.wmflabs:
deployment-cache-text04.deployment-prep.eqiad.wmflabs:

deployment-puppetmaster seems to be responding to both salt masters?!?

I killed an old salt-minion process on -puppetmaster and that appears to have fixed the weirdness there. The others have these puppet errors:

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item cache::cluster in any Hiera data file and no default supplied at /etc/puppet/modules/role/manifests/cache/base.pp:17 on node deployment-cache-text04.deployment-prep.eqiad.wmflabs
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item cache::cluster in any Hiera data file and no default supplied at /etc/puppet/modules/role/manifests/cache/base.pp:17 on node deployment-cache-upload04.deployment-prep.eqiad.wmflabs
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item aqs_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/aqs.pp:59 on node deployment-aqs01.deployment-prep.eqiad.wmflabs
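When several hosts fail this way, it can help to reduce the errors to a list of node and missing Hiera key. The following is a minimal sketch, assuming the puppet agent output has been saved to a file or is piped in on stdin; the function name `missing_hiera_keys` is made up for illustration and only matches the error wording shown above.

```shell
# Sketch: summarize missing Hiera keys from puppet agent error output.
# Reads the files given as arguments, or stdin when none are given.
# Prints one "<node> <missing-key>" pair per distinct error.
missing_hiera_keys() {
    sed -nE 's/.*Could not find data item ([^ ]+) in any Hiera data file.* on node ([^ ]+).*/\2 \1/p' "$@" \
        | LC_ALL=C sort -u
}
```

Lines that do not contain this particular Hiera error are ignored, so the whole puppet log can be passed in unfiltered.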

And deployment-tin has unhappy puppet as well, likely trebuchet/salt related:

Error: /Stage[main]/Deployment::Deployment_server/Exec[eventual_consistency_deployment_server_init]/returns: change from notrun to 0 failed: salt-call deploy.deployment_server_init returned 255 instead of one of [0]

I was futzing on deployment-tin today and noticed this issue. From the looks of it, deployment-salt02 is just missing a few roles:

thcipriani@deployment-puppetmaster:~$ ldapsearch -LLL -x -D 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org' -w $(grep -Po "(?<=bindpw).*" /etc/ldap.conf) -b 'ou=hosts,dc=wikimedia,dc=org' -z 1 "associatedDomain=deployment-salt02.eqiad.wmflabs"
dn: dc=deployment-salt02.deployment-prep.eqiad.wmflabs,ou=hosts,dc=wikimedia,d
 c=org
aRecord: 10.68.17.58
objectClass: domainRelatedObject
objectClass: dNSDomain
objectClass: puppetClient
objectClass: domain
objectClass: dcObject
objectClass: top
associatedDomain: deployment-salt02.deployment-prep.eqiad.wmflabs
associatedDomain: deployment-salt02.eqiad.wmflabs
l: eqiad
dc: deployment-salt02.deployment-prep.eqiad.wmflabs
puppetVar: instanceproject=deployment-prep
puppetVar: instancename=deployment-salt02

versus the old deployment-salt:

thcipriani@deployment-puppetmaster:~$ ldapsearch -LLL -x -D 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org' -w $(grep -Po "(?<=bindpw).*" /etc/ldap.conf) -b 'ou=hosts,dc=wikimedia,dc=org' -z 1 "associatedDomain=deployment-salt.eqiad.wmflabs"
dn: dc=deployment-salt.deployment-prep.eqiad.wmflabs,ou=hosts,dc=wikimedia,dc=
 org
objectClass: domainrelatedobject
objectClass: dnsdomain
objectClass: domain
objectClass: puppetclient
objectClass: dcobject
objectClass: top
l: eqiad
associatedDomain: i-0000015c.eqiad.wmflabs
associatedDomain: deployment-salt.eqiad.wmflabs
associatedDomain: i-0000015c.deployment-prep.eqiad.wmflabs
associatedDomain: deployment-salt.deployment-prep.eqiad.wmflabs
dc: deployment-salt.deployment-prep.eqiad.wmflabs
aRecord: 10.68.16.99
puppetClass: beta::saltmaster::tools
puppetClass: role::deployment::salt_masters
puppetClass: role::labs::lvm::srv
puppetClass: role::salt::masters::labs::project_master
puppetVar: deployment_server_override=deployment-bastion.eqiad.wmflabs
puppetVar: instancename=deployment-salt
puppetVar: instanceproject=deployment-prep
puppetVar: salt_master_finger_override=dd:d8:68:70:8c:65:a3:af:46:5c:3f:4f:d4:
 be:6c:71
puppetVar: salt_master_override=deployment-salt.eqiad.wmflabs

As a result, the deploy.py salt-module isn't on the new machine:

thcipriani@deployment-salt02:/srv/salt/_modules$ ls -l
total 0

This is probably what's causing the puppet error on deployment-tin.
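The comparison of the two LDAP entries above can also be done mechanically. This is a rough sketch, assuming each ldapsearch result has been saved to its own file; the function names are hypothetical, and folded LDIF lines (continuation lines starting with a space) are not handled. Classes assigned through hiera rather than LDAP will not appear in either dump, so they would show up here as missing.

```shell
# Print the puppetClass values found in an LDIF dump, one per line, sorted.
puppet_classes() {
    sed -n 's/^puppetClass: //p' "$1" | LC_ALL=C sort -u
}

# Print classes present in the first LDIF dump but absent from the second.
missing_puppet_classes() {
    old_classes=$(mktemp)
    new_classes=$(mktemp)
    puppet_classes "$1" > "$old_classes"
    puppet_classes "$2" > "$new_classes"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    LC_ALL=C comm -23 "$old_classes" "$new_classes"
    rm -f "$old_classes" "$new_classes"
}
```

Called as `missing_puppet_classes old.ldif new.ldif` (file names are placeholders), it lists the roles that still need to be applied to the new host.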

I was adding classes to the new host through hiera instead of LDAP, but missed role::deployment::salt_masters. I don't know about role::labs::lvm::srv...

Looking at deployment-salt I suspect the extra space provided by role::labs::lvm::srv is unnecessary. Shall we shut down the old host and close this?

Sounds good to me :)

Thanks for the continued work on deployment-prep. Very much appreciated by all of Release-Engineering-Team.

I've shut down deployment-salt. It still exists but I'll delete it at a later date. At that point salt will stop working entirely for those hosts which have broken puppet (see "blocked by" task).

Mentioned in SAL [2016-06-01T10:14:16Z] <hashar> beta: salt-key -d deployment-salt.deployment-prep.eqiad.wmflabs T136411

Mentioned in SAL [2016-06-01T10:29:42Z] <hashar> Upgraded Linux kernel on deployment-salt02 T136411

I compared the list of instances connected to the new server against the output of nova list on silver. Everything looks correct, so I've deleted deployment-salt.
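A comparison like this one can be scripted. The sketch below assumes two plain-text files with one name per line: the project's instance names and the minion IDs connected to the new saltmaster. Extracting those lists from nova and salt is left out because the exact output is not shown in this task, and the function names are hypothetical. Minion IDs are assumed to be FQDNs whose first label is the instance name.

```shell
# Reduce each name to its first DNS label, drop blank lines, and sort.
short_names() {
    sed -e 's/\..*//' -e '/^$/d' "$1" | LC_ALL=C sort -u
}

# Print instances from the first file that have no matching minion in the second.
unconnected_instances() {
    instances=$(mktemp)
    minions=$(mktemp)
    short_names "$1" > "$instances"
    short_names "$2" > "$minions"
    # comm -23 keeps only the lines unique to the first (sorted) file.
    LC_ALL=C comm -23 "$instances" "$minions"
    rm -f "$instances" "$minions"
}
```

Empty output means every listed instance has a minion connected to the new master.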