a lot of beta cluster instances are not reachable over SSH
Closed, ResolvedPublic
Authored By

	hashar
	Jul 20 2017, 2:53 PM

Description

A lot of instances in the beta cluster are no longer reachable over SSH because the LDAP server is no longer trusted:

2017-07-20T07:48:13.225792+00:00 castor nslcd[604]: [d89a32] <group/member="puppet"> no available LDAP server found: Server is unavailable
[09:48:58]  <hashar>	failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)
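To check whether an instance shows the same symptom, one option is to count the matching nslcd errors in its syslog. This is only a minimal sketch: the function name `count_ldap_errors` is made up for illustration, and the log file location is left to the caller since it is not stated in this task.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nslcd "no available LDAP server" errors
# in a syslog-style file. The path is passed in by the caller; no
# default location is assumed.
count_ldap_errors() {
    local logfile="$1"
    # grep -c prints the number of matching lines. It exits non-zero
    # when nothing matches, hence the "|| true".
    grep -c 'nslcd.*no available LDAP server found' "$logfile" || true
}
```

A non-zero count suggests the instance is affected by the LDAP trust problem described above.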

The exact same issue happened on CI instances and I had to rebuild them from scratch:

castor.integration.eqiad.wmflabs T171148
integration-slave-docker-1000 (which I have rebuilt from scratch).

Additionally the beta cluster puppet master was broken for most of the day. I eventually managed to fix it up by rewriting the puppet.conf from scratch and I have removed the puppet db config. So at least puppet works now which might or might not magically fix SSH to the instances.

For all instances that had puppet broken, I believe they are deadlocked now. We can try to reach them via salt and attempt to salvage them, e.g.:

root@deployment-salt02:~# salt -v 'deployment-sca01*' cmd.run 'puppet agent -tv'
....

If puppet works, a project admin can add their SSH key to the list of extra root keys on Hiera:Deployment-prep. Example: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1765755&oldid=1763078

http://shinken.wmflabs.org/problems?search=deployment might offer a good overview.

Related Objects

Status	Assigned	Task
Resolved	hashar	T171174 a lot of beta cluster instances are not reachable over SSH
Resolved	Ottomata	T171177 New instance in deployment prep can't run puppet for the first time
Resolved	fgiunchedi	T171454 deployment-ms-beXX Duplicate declaration: Exec[swift_udev_reload]

Event Timeline

hashar created this task. Jul 20 2017, 2:53 PM

Restricted Application added a subscriber: Aklapper. Jul 20 2017, 2:53 PM

hashar mentioned this in T171148: CI jobs are blocked because castor is unreachable. Jul 20 2017, 2:53 PM

Now that puppet is fixed, you can either wait a few hours for puppet to run on all the instances or run the salt command. You also have to restart nscd and nslcd, or restart the instances through the Horizon web UI.
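The daemon restart mentioned above could be done through salt as well. A minimal sketch, assuming it is run as root on the salt master and that the targeted instances use systemd; the function name `restart_name_services` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart the name-service daemons on one or
# more instances through salt.
# "$1" is a salt target (minion ID or glob), e.g. 'deployment-sca01*'.
restart_name_services() {
    local target="$1"
    # Assumes systemd on the instance; both units are restarted in
    # a single systemctl call.
    salt -v "$target" cmd.run 'systemctl restart nscd nslcd'
}
```

For example, `restart_name_services 'deployment-sca01*'` would target the same instance as the salt command in the task description.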

hashar updated the task description. Jul 20 2017, 2:55 PM

Mentioned in SAL (#wikimedia-releng) [2017-07-20T15:08:10Z] <hashar> removed profile::recommendation_api from deployment-sca01 to try to fix the ssh access for mobrovac T171173 T171174

bd808 edited projects, added VPS-Projects; removed Cloud-Services. Jul 20 2017, 3:13 PM

Ottomata created subtask T171177: New instance in deployment prep can't run puppet for the first time. Jul 20 2017, 3:15 PM

https://wikitech.wikimedia.org/wiki/Incident_documentation/20170719-ldap#CI.2Fbeta

So the state, as I understand it right now:

The puppet master was broken. I fixed it by removing the PuppetDB configuration and rebuilding the puppet.conf manually.

Any instance that had puppet failing did not receive the new WMF CA certificate. Thus SSH (via pam_ldap) cannot connect to the labs LDAP, and SSH access is rejected.

A way to fix them is to use salt to connect, then check why puppet fails:

$ ssh deployment-saltmaster02.deployment-prep.eqiad.wmflabs
$ sudo salt -v <YOUR INSTANCE FQDN> cmd.run 'tail -n 50 /var/log/puppet.log'

From there, the easiest is to:

- Remove the puppet classes from the instance (note them somewhere in order to restore them later).
- Run puppet from the salt master:
  sudo salt -v <YOUR INSTANCE FQDN> cmd.run 'puppet agent -tv'
  That should populate the proper WMF CA certificate.
- Restart nslcd:
  sudo salt -v <YOUR INSTANCE FQDN> cmd.run 'systemctl restart nslcd'

Access should work again.

If the instance is not reachable via salt, file a subtask and CC Cloud-VPS; they should be able to reach the instance via the KVM console.
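The salt steps above can be chained into one helper. This is only a sketch, assuming it is run as root on the salt master and that the faulty puppet classes have already been removed from the instance by hand; the function name `recover_instance` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the recovery steps for a single instance.
# "$1" is the instance FQDN as known to salt.
recover_instance() {
    local fqdn="$1"
    # 1. Show the end of the puppet log to see why puppet was failing.
    salt -v "$fqdn" cmd.run 'tail -n 50 /var/log/puppet.log'
    # 2. Run puppet, which should deploy the new WMF CA certificate.
    salt -v "$fqdn" cmd.run 'puppet agent -tv'
    # 3. Restart nslcd so it uses the new certificate.
    salt -v "$fqdn" cmd.run 'systemctl restart nslcd'
}
```

The steps run unconditionally, so the output of the puppet run should be read before trusting that the nslcd restart restored access.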

hashar triaged this task as High priority. Jul 20 2017, 4:35 PM

Announced on the QA list pointing back to this task

Mentioned in SAL (#wikimedia-releng) [2017-07-20T16:42:47Z] <hashar> How to fix ssh access on beta cluster instances: https://phabricator.wikimedia.org/T171174#3456966

mobrovac moved this task from Backlog to watching on the Services board. Jul 20 2017, 6:34 PM

mobrovac edited projects, added Services (watching); removed Services.

hashar closed subtask T171177: New instance in deployment prep can't run puppet for the first time as Resolved. Jul 20 2017, 8:19 PM

zeljkofilipin subscribed. Jul 21 2017, 9:27 AM

hashar added a subtask: T171454: deployment-ms-beXX Duplicate declaration: Exec[swift_udev_reload]. Jul 24 2017, 9:59 AM

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:02:56Z] <hashar> Removing role::mobileapps from deployment-mcs01 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:06:32Z] <hashar> Removing role::ocg from deployment-mcs01 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:09:10Z] <hashar> Removing role::changeprop from deployment-changeprop to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:12:09Z] <hashar> Removing role::mathoid from deployment-mathoid to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:32:06Z] <hashar> Removing profile::etcd from deployment-conf03 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:54:46Z] <hashar> Removing classes from deployment-sca02 and deployment-sca03 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T10:59:50Z] <hashar> Removing class from deployment-trending01 to let puppet run. Reapplying it after. - T171174

Mentioned in SAL (#wikimedia-releng) [2017-07-24T11:02:02Z] <hashar> Removing profile::swift::storage::labs class from deployment-ms-be03 and deployment-ms-be04 to let puppet run. Reapplying it after. - T171174 T171454

Stashbot mentioned this in T171454: deployment-ms-beXX Duplicate declaration: Exec[swift_udev_reload]. Jul 24 2017, 11:02 AM

I have removed the faulty puppet classes, run puppet, restarted nslcd and reapplied the puppet classes.

Validation done against all the beta cluster instances with:

# INSTANCES is a bash array holding the FQDNs of the instances to check.
for instance in "${INSTANCES[@]}"
do
	echo -n "$instance: "
	ssh "$instance" echo OK
done

deployment-apertium02.deployment-prep.eqiad.wmflabs: OK
deployment-aqs01.deployment-prep.eqiad.wmflabs: OK
deployment-aqs02.deployment-prep.eqiad.wmflabs: OK
deployment-aqs03.deployment-prep.eqiad.wmflabs: OK
deployment-cache-text04.deployment-prep.eqiad.wmflabs: OK
deployment-cache-upload04.deployment-prep.eqiad.wmflabs: OK
deployment-changeprop.deployment-prep.eqiad.wmflabs: OK
deployment-conf03.deployment-prep.eqiad.wmflabs: OK
deployment-db03.deployment-prep.eqiad.wmflabs: OK
deployment-db04.deployment-prep.eqiad.wmflabs: OK
deployment-elastic05.deployment-prep.eqiad.wmflabs: OK
deployment-elastic06.deployment-prep.eqiad.wmflabs: OK
deployment-elastic07.deployment-prep.eqiad.wmflabs: OK
deployment-etcd-01.deployment-prep.eqiad.wmflabs: OK
deployment-eventlog02.deployment-prep.eqiad.wmflabs: OK
deployment-eventlogging04.deployment-prep.eqiad.wmflabs: OK
deployment-fluorine02.deployment-prep.eqiad.wmflabs: OK
deployment-imagescaler01.deployment-prep.eqiad.wmflabs: OK
deployment-ircd.deployment-prep.eqiad.wmflabs: OK
deployment-jobrunner02.deployment-prep.eqiad.wmflabs: OK
deployment-kafka01.deployment-prep.eqiad.wmflabs: OK
deployment-kafka03.deployment-prep.eqiad.wmflabs: OK
deployment-kafka04.deployment-prep.eqiad.wmflabs: OK
deployment-kafka05.deployment-prep.eqiad.wmflabs: OK
deployment-logstash2.deployment-prep.eqiad.wmflabs: OK
deployment-mathoid.deployment-prep.eqiad.wmflabs: OK
deployment-mcs01.deployment-prep.eqiad.wmflabs: OK
deployment-mediawiki04.deployment-prep.eqiad.wmflabs: OK
deployment-mediawiki05.deployment-prep.eqiad.wmflabs: OK
deployment-mediawiki06.deployment-prep.eqiad.wmflabs: OK
deployment-memc04.deployment-prep.eqiad.wmflabs: OK
deployment-memc05.deployment-prep.eqiad.wmflabs: OK
deployment-mira.deployment-prep.eqiad.wmflabs: OK
deployment-ms-be03.deployment-prep.eqiad.wmflabs: OK
deployment-ms-be04.deployment-prep.eqiad.wmflabs: OK
deployment-ms-fe02.deployment-prep.eqiad.wmflabs: OK
deployment-mx.deployment-prep.eqiad.wmflabs: OK
deployment-ores-redis-01.deployment-prep.eqiad.wmflabs: OK
deployment-parsoid09.deployment-prep.eqiad.wmflabs: OK
deployment-pdf01.deployment-prep.eqiad.wmflabs: OK
deployment-pdfrender02.deployment-prep.eqiad.wmflabs: OK
deployment-poolcounter04.deployment-prep.eqiad.wmflabs: OK
deployment-prometheus01.deployment-prep.eqiad.wmflabs: OK
deployment-puppetdb01.deployment-prep.eqiad.wmflabs: OK
deployment-puppetmaster02.deployment-prep.eqiad.wmflabs: OK
deployment-redis01.deployment-prep.eqiad.wmflabs: OK
deployment-redis02.deployment-prep.eqiad.wmflabs: OK
deployment-restbase01.deployment-prep.eqiad.wmflabs: OK
deployment-restbase02.deployment-prep.eqiad.wmflabs: OK
deployment-salt02.deployment-prep.eqiad.wmflabs: OK
deployment-sca01.deployment-prep.eqiad.wmflabs: OK
deployment-sca02.deployment-prep.eqiad.wmflabs: OK
deployment-sca03.deployment-prep.eqiad.wmflabs: OK
deployment-sca04.deployment-prep.eqiad.wmflabs: OK
deployment-secureredirexperiment.deployment-prep.eqiad.wmflabs: OK
deployment-sentry01.deployment-prep.eqiad.wmflabs: OK
deployment-stream.deployment-prep.eqiad.wmflabs: OK
deployment-tin.deployment-prep.eqiad.wmflabs: OK
deployment-tmh01.deployment-prep.eqiad.wmflabs: OK
deployment-trending01.deployment-prep.eqiad.wmflabs: OK
deployment-urldownloader.deployment-prep.eqiad.wmflabs: OK
deployment-zookeeper02.deployment-prep.eqiad.wmflabs: OK
deployment-zotero01.deployment-prep.eqiad.wmflabs: OK
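A variant of the validation loop that also reports failures and does not hang on an unreachable instance might look like the sketch below. The function name `check_ssh` and the 10 second timeout are assumptions for illustration; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<instance>: OK" or "<instance>: FAIL"
# for each FQDN given as an argument. Returns non-zero if at least
# one instance could not be reached.
check_ssh() {
    local instance failed=0
    for instance in "$@"; do
        # BatchMode disables password prompts, ConnectTimeout bounds
        # the wait, and -n keeps ssh from reading stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$instance" true; then
            echo "$instance: OK"
        else
            echo "$instance: FAIL"
            failed=1
        fi
    done
    return "$failed"
}
```

With the array used above, it would be called as `check_ssh "${INSTANCES[@]}"`.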

fgiunchedi closed subtask T171454: deployment-ms-beXX Duplicate declaration: Exec[swift_udev_reload] as Resolved. Aug 2 2017, 2:45 PM

Phabricator_maintenance edited projects, added RelEng-Archive-FY201718-Q1; removed Release-Engineering-Team (Kanban). Sep 26 2017, 11:44 PM
