Change Details

Many instances in the beta cluster are no longer reachable over SSH because the LDAP server is no longer trusted:

```
2017-07-20T07:48:13.225792+00:00 castor nslcd[604]: [d89a32] <group/member="puppet"> no available LDAP server found: Server is unavailable
[09:48:58] <hashar> failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
```

The exact same issue happened on CI instances and I had to rebuild them from scratch:

* castor.integration.eqiad.wmflabs T171148
* integration-slave-docker-1000 (which I have rebuilt from scratch)

Additionally, the beta cluster puppet master was broken for most of the day. I eventually managed to fix it by rewriting `puppet.conf` from scratch and removing the puppet db config. So at least puppet works now, which might or **might not** magically fix SSH to the instances.

I believe all instances that had puppet broken are deadlocked now. We can try to reach them via salt and attempt to salvage them, e.g.:

  root@deployment-salt02:~# salt -v 'deployment-sca01*' cmd.run 'puppet agent -tv'
  ....

If puppet works, a project admin can add their SSH key to the list of extra root keys on [[ https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep | Hiera:Deployment-prep ]]. Example: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1765755&oldid=1763078

http://shinken.wmflabs.org/problems?search=deployment might offer a good overview of the affected instances.
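
To attempt the salt-based salvage on several instances in one go, the command above could be wrapped in a small loop. This is only a sketch, assuming it is run as root on deployment-salt02; the `salvage_instance` function name and the `DRY_RUN` variable are made up for illustration and are not part of any existing tooling. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run 'puppet agent -tv' via salt on each target glob given as argument.
# DRY_RUN (hypothetical knob) defaults to 1, so commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

salvage_instance() {
    # $1 is a salt target glob, e.g. 'deployment-sca01*'
    if [ "$DRY_RUN" = "1" ]; then
        printf "salt -v '%s' cmd.run 'puppet agent -tv'\n" "$1"
    else
        salt -v "$1" cmd.run 'puppet agent -tv'
    fi
}

# Targets come from the command line; nothing is hardcoded here.
for target in "$@"; do
    salvage_instance "$target"
done
```

Saved under a hypothetical name such as `salvage.sh`, running `sh salvage.sh 'deployment-sca01*'` prints the command for review, and `DRY_RUN=0 sh salvage.sh 'deployment-sca01*'` actually runs it.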