Many instances in the beta cluster are no longer reachable over SSH because the LDAP server is no longer trusted:
2017-07-20T07:48:13.225792+00:00 castor nslcd: [d89a32] <group/member="puppet"> no available LDAP server found: Server is unavailable
[09:48:58] <hashar> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
The exact same issue happened on CI instances, and I had to rebuild them from scratch:
- castor.integration.eqiad.wmflabs T171148
- integration-slave-docker-1000 (which I have rebuilt from scratch).
Additionally, the beta cluster puppet master was broken for most of the day. I eventually managed to fix it by rewriting puppet.conf from scratch and removing the puppet db config. So at least puppet works now, which might or might not magically fix SSH to the instances.
I believe all instances that had puppet broken are deadlocked now. We can try to reach them via salt and attempt to salvage them, e.g.:
root@deployment-salt02:~# salt -v 'deployment-sca01*' cmd.run 'puppet agent -tv' ....
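To try the same salvage on several instances at once, the command above can be wrapped in a small loop. This is only a sketch: the function names `salvage_cmd` and `salvage` and the `DRY_RUN` variable are made up for illustration, and it assumes it is run as root on the salt master. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper around the salt command shown above.
# salvage_cmd, salvage and DRY_RUN are illustrative names, not existing tooling.

# Print the salt command for one minion glob, e.g. 'deployment-sca01*'.
salvage_cmd() {
    printf "salt -v '%s' cmd.run 'puppet agent -tv'\n" "$1"
}

# For each minion glob given as argument, either print the command
# (DRY_RUN=1, the default) or actually run puppet through salt (DRY_RUN=0).
salvage() {
    for target in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            salvage_cmd "$target"
        else
            salt -v "$target" cmd.run 'puppet agent -tv'
        fi
    done
}
```

For example, `salvage 'deployment-sca01*'` prints the command for that instance, and `DRY_RUN=0 salvage 'deployment-sca01*'` runs it. Instances that salt cannot reach either will still need another way in.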
If puppet works, a project admin can add their SSH key to the list of extra root keys on Hiera:Deployment-prep. Example: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1765755&oldid=1763078
http://shinken.wmflabs.org/problems?search=deployment might offer a good overview.