
a lot of beta cluster instances are not reachable over SSH
Closed, ResolvedPublic


A lot of instances in the beta cluster are no longer reachable over SSH because the LDAP server is no longer trusted:

2017-07-20T07:48:13.225792+00:00 castor nslcd[604]: [d89a32] <group/member="puppet"> no available LDAP server found: Server is unavailable
[09:48:58]  <hashar>	failed to bind to LDAP server ldap:// Connect error: (unknown error code)

The exact same issue happened on CI instances, and I had to rebuild them from scratch:

  • castor.integration.eqiad.wmflabs T171148
  • integration-slave-docker-1000 (which I have rebuilt from scratch)

Additionally, the beta cluster puppet master was broken for most of the day. I eventually fixed it by rewriting puppet.conf from scratch and removing the PuppetDB config. So at least puppet works now, which may or may not fix SSH access to the instances.

I believe all instances that had puppet broken are deadlocked now. We can try to reach them via salt and attempt to salvage them, e.g.:

root@deployment-salt02:~# salt -v 'deployment-sca01*' cmd.run 'puppet agent -tv'

If puppet works, a project admin can add their SSH key to the list of extra root keys on Hiera:Deployment-prep. Example: might offer a good perspective.

Event Timeline

Now that puppet is fixed, you can either wait a few hours for puppet to run on all the instances or run the salt command. You also have to restart nscd and nslcd, or restart the instances through the Horizon web UI.


Mentioned in SAL (#wikimedia-releng) [2017-07-20T15:08:10Z] <hashar> removed profile::recommendation_api from deployment-sca01 to try to fix the ssh access for mobrovac T171173 T171174

So the state as I understand it right now:

The puppet master was broken; I fixed it by removing the PuppetDB configuration and rebuilding puppet.conf manually.

Any instance that had puppet failing did not receive the new WMF CA certificate. Thus SSH -> pam_ldap cannot connect to the labs LDAP, and SSH access is rejected.

A way to fix them is to use salt to connect, then check why puppet fails:

$ ssh deployment-saltmaster02.deployment-prep.eqiad.wmflabs
$ sudo salt -v <YOUR INSTANCE FQDN> cmd.run 'tail -n 50 /var/log/puppet.log'

From there, the easiest approach is to:

  • remove the puppet classes from the instance (note them somewhere in order to restore them later).
  • run puppet from the salt master:
    • sudo salt -v <YOUR INSTANCE FQDN> cmd.run 'puppet agent -tv'

That should populate the proper WMF CA certificate.

Restart nslcd:
sudo salt -v <YOUR INSTANCE FQDN> cmd.run 'systemctl restart nslcd'

Access should work again.
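
The puppet run and the nslcd restart above could be chained in a small wrapper run as root on the salt master. This is only a sketch, not the procedure actually used here: the function name `recover_instance` and the `SALT` override variable are hypothetical, and it assumes salt's `cmd.run` execution function is used to run the shell commands on the instance. It also assumes the failing puppet classes have already been removed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run puppet on one instance via salt, then restart nslcd.
# SALT may be overridden (assumption, for previewing or testing the calls).
SALT="${SALT:-salt}"

recover_instance() {
	local fqdn="$1"
	if [ -z "$fqdn" ]; then
		echo "usage: recover_instance <instance FQDN>" >&2
		return 2
	fi
	# The puppet run should populate the proper WMF CA certificate.
	# Stop here if the salt call itself fails.
	"$SALT" -v "$fqdn" cmd.run 'puppet agent -tv' || return 1
	# Restart nslcd so that LDAP lookups use the new certificate.
	"$SALT" -v "$fqdn" cmd.run 'systemctl restart nslcd'
}
```

Usage would be `recover_instance <YOUR INSTANCE FQDN>`, followed by reapplying the puppet classes that were removed.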

If the instance is not reachable via salt, file a subtask and CC Cloud-VPS; they should be able to reach the instance via the KVM console.

hashar triaged this task as High priority. Jul 20 2017, 4:35 PM

Announced on the QA list, pointing back to this task.

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:02:56Z] <hashar> Removing role::mobileapps from deployment-mcs01 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:06:32Z] <hashar> Removing role::ocg from deployment-mcs01 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:09:10Z] <hashar> Removing role::changeprop from deployment-changeprop to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:12:09Z] <hashar> Removing role::mathoid from deployment-mathoid to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:32:06Z] <hashar> Removing profile::etcd from deployment-conf03 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:54:46Z] <hashar> Removing classes from deployment-sca02 and deployment-sca03 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:59:50Z] <hashar> Removing class from deployment-trending01 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T11:02:02Z] <hashar> Removing profile::swift::storage::labs class from deployment-ms-be03 and deployment-ms-be04 to let puppet run. Reapplying it after. - T171174 T171454

hashar claimed this task.

I have removed the faulty puppet classes, run puppet, restarted nslcd, and reapplied the puppet classes.

Validation done against all the beta cluster instances with:

for instance in "${INSTANCES[@]}"; do
	echo -n "$instance: "
	ssh "$instance" echo OK
done

deployment-apertium02.deployment-prep.eqiad.wmflabs: OK
deployment-aqs01.deployment-prep.eqiad.wmflabs: OK
deployment-aqs02.deployment-prep.eqiad.wmflabs: OK
deployment-aqs03.deployment-prep.eqiad.wmflabs: OK
deployment-cache-text04.deployment-prep.eqiad.wmflabs: OK
deployment-cache-upload04.deployment-prep.eqiad.wmflabs: OK
deployment-changeprop.deployment-prep.eqiad.wmflabs: OK
deployment-conf03.deployment-prep.eqiad.wmflabs: OK
deployment-db03.deployment-prep.eqiad.wmflabs: OK
deployment-db04.deployment-prep.eqiad.wmflabs: OK
deployment-elastic05.deployment-prep.eqiad.wmflabs: OK
deployment-elastic06.deployment-prep.eqiad.wmflabs: OK
deployment-elastic07.deployment-prep.eqiad.wmflabs: OK
deployment-etcd-01.deployment-prep.eqiad.wmflabs: OK
deployment-eventlog02.deployment-prep.eqiad.wmflabs: OK
deployment-eventlogging04.deployment-prep.eqiad.wmflabs: OK
deployment-fluorine02.deployment-prep.eqiad.wmflabs: OK
deployment-imagescaler01.deployment-prep.eqiad.wmflabs: OK
deployment-ircd.deployment-prep.eqiad.wmflabs: OK
deployment-jobrunner02.deployment-prep.eqiad.wmflabs: OK
deployment-kafka01.deployment-prep.eqiad.wmflabs: OK
deployment-kafka03.deployment-prep.eqiad.wmflabs: OK
deployment-kafka04.deployment-prep.eqiad.wmflabs: OK
deployment-kafka05.deployment-prep.eqiad.wmflabs: OK
deployment-logstash2.deployment-prep.eqiad.wmflabs: OK
deployment-mathoid.deployment-prep.eqiad.wmflabs: OK
deployment-mcs01.deployment-prep.eqiad.wmflabs: OK
deployment-mediawiki04.deployment-prep.eqiad.wmflabs: OK
deployment-mediawiki05.deployment-prep.eqiad.wmflabs: OK
deployment-mediawiki06.deployment-prep.eqiad.wmflabs: OK
deployment-memc04.deployment-prep.eqiad.wmflabs: OK
deployment-memc05.deployment-prep.eqiad.wmflabs: OK
deployment-mira.deployment-prep.eqiad.wmflabs: OK
deployment-ms-be03.deployment-prep.eqiad.wmflabs: OK
deployment-ms-be04.deployment-prep.eqiad.wmflabs: OK
deployment-ms-fe02.deployment-prep.eqiad.wmflabs: OK
deployment-mx.deployment-prep.eqiad.wmflabs: OK
deployment-ores-redis-01.deployment-prep.eqiad.wmflabs: OK
deployment-parsoid09.deployment-prep.eqiad.wmflabs: OK
deployment-pdf01.deployment-prep.eqiad.wmflabs: OK
deployment-pdfrender02.deployment-prep.eqiad.wmflabs: OK
deployment-poolcounter04.deployment-prep.eqiad.wmflabs: OK
deployment-prometheus01.deployment-prep.eqiad.wmflabs: OK
deployment-puppetdb01.deployment-prep.eqiad.wmflabs: OK
deployment-puppetmaster02.deployment-prep.eqiad.wmflabs: OK
deployment-redis01.deployment-prep.eqiad.wmflabs: OK
deployment-redis02.deployment-prep.eqiad.wmflabs: OK
deployment-restbase01.deployment-prep.eqiad.wmflabs: OK
deployment-restbase02.deployment-prep.eqiad.wmflabs: OK
deployment-salt02.deployment-prep.eqiad.wmflabs: OK
deployment-sca01.deployment-prep.eqiad.wmflabs: OK
deployment-sca02.deployment-prep.eqiad.wmflabs: OK
deployment-sca03.deployment-prep.eqiad.wmflabs: OK
deployment-sca04.deployment-prep.eqiad.wmflabs: OK
deployment-secureredirexperiment.deployment-prep.eqiad.wmflabs: OK
deployment-sentry01.deployment-prep.eqiad.wmflabs: OK
deployment-stream.deployment-prep.eqiad.wmflabs: OK
deployment-tin.deployment-prep.eqiad.wmflabs: OK
deployment-tmh01.deployment-prep.eqiad.wmflabs: OK
deployment-trending01.deployment-prep.eqiad.wmflabs: OK
deployment-urldownloader.deployment-prep.eqiad.wmflabs: OK
deployment-zookeeper02.deployment-prep.eqiad.wmflabs: OK
deployment-zotero01.deployment-prep.eqiad.wmflabs: OK
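
The validation loop used for the listing above only prints `OK` on success. A variant that also flags unreachable instances might look like the following. It is a sketch under assumptions: `check_instances` and the `SSH` override variable are hypothetical names, and the `BatchMode` and `ConnectTimeout` OpenSSH client options are added on the assumption that a host which hangs or prompts for a password should count as a failure.

```shell
#!/usr/bin/env bash
# SSH may be overridden (assumption, for testing without real hosts).
SSH="${SSH:-ssh}"

check_instances() {
	local instance failed=0
	for instance in "$@"; do
		# BatchMode disables password prompts; ConnectTimeout bounds the wait.
		# </dev/null keeps ssh from consuming the caller's stdin.
		if "$SSH" -o BatchMode=yes -o ConnectTimeout=10 "$instance" echo OK \
			</dev/null >/dev/null 2>&1; then
			echo "$instance: OK"
		else
			echo "$instance: FAILED"
			failed=$((failed + 1))
		fi
	done
	# Succeed only if every instance answered.
	[ "$failed" -eq 0 ]
}
```

It would be called as `check_instances "${INSTANCES[@]}"`; the exit status is non-zero if any instance could not be reached.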