CI jobs are blocked because castor is unreachable
Closed, ResolvedPublic

Description

castor.integration.eqiad.wmflabs is no longer reachable via SSH. Puppet has not run for the last 10 days, so the instance never received the new Puppet CA, which causes LDAP to fail:

2017-07-20T07:48:13.225792+00:00 castor nslcd[604]: [d89a32] <group/member="puppet"> no available LDAP server found: Server is unavailable
[09:48:58]  <hashar>	failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)

salt is broken as well, so the instance cannot be reached that way either.

hashar created this task. Jul 20 2017, 7:53 AM
Restricted Application added a subscriber: Aklapper. Jul 20 2017, 7:53 AM

Mentioned in SAL (#wikimedia-releng) [2017-07-20T07:55:05Z] <hashar> Refreshing all Jenkins jobs defined in JJB in order to then disable castor entirely for T171148

Change 366520 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Disable castor entirely

https://gerrit.wikimedia.org/r/366520

Mentioned in SAL (#wikimedia-releng) [2017-07-20T08:00:30Z] <hashar> Disabled castor entirely via https://gerrit.wikimedia.org/r/366520 . The instance is broken - T171148

Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:25:34Z] <hashar> CI is restored albeit in degraded mode (lack of Castor cache) - T171148

From the console log, puppet-agent on boot reports:

SSL_connect returned=1 errno=0 state=error: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: virt1000.wikimedia.org]
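This class of failure can be reproduced locally: a host certificate issued by one self-signed CA will not verify against a different CA bundle, which is what the agent hit after the CA changed. A minimal sketch with openssl (the CA names, CNs, and file names are all invented for illustration):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Two independent self-signed CAs: the one the instance still trusts,
# and the one the rest of the fleet moved to (CNs are illustrative).
openssl req -x509 -newkey rsa:2048 -nodes -keyout old-ca.key -out old-ca.pem \
    -subj "/CN=Puppet CA: virt1000.wikimedia.org" -days 2 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout new-ca.key -out new-ca.pem \
    -subj "/CN=Puppet CA: rotated" -days 2 2>/dev/null

# A host certificate issued by the old CA.
openssl req -newkey rsa:2048 -nodes -keyout host.key -out host.csr \
    -subj "/CN=castor.integration.eqiad.wmflabs" 2>/dev/null
openssl x509 -req -in host.csr -CA old-ca.pem -CAkey old-ca.key \
    -CAcreateserial -out host.pem -days 1 2>/dev/null

# Verifying against the new CA fails with a "self signed certificate in
# certificate chain" error (hyphenation varies by OpenSSL version),
# matching the agent's complaint above.
result=$(openssl verify -CAfile new-ca.pem -untrusted old-ca.pem host.pem 2>&1 || true)
echo "$result"
```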

And salt does not work because it is apparently connected to the labs salt master instead of the CI salt master:

2017-07-20T08:28:16.753242+00:00 castor rc.local[533]: + '[' eqiad.wmflabs == eqiad.wmflabs ']'
2017-07-20T08:28:16.753460+00:00 castor rc.local[533]: + master=labs-puppetmaster-eqiad.wikimedia.org
2017-07-20T08:28:16.753595+00:00 castor rc.local[533]: + master_secondary=labs-puppetmaster-codfw.wikimedia.org
2017-07-20T08:28:16.753753+00:00 castor rc.local[533]: + '[' eqiad.wmflabs == codfw.wmflabs ']
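Read as a sketch, the rc.local trace above appears to select the puppetmaster from the instance's domain. A minimal reconstruction of that logic (the codfw branch is an assumption, mirroring the eqiad one; only the eqiad branch is visible in the log):

```shell
# Reconstruction of the apparent rc.local master-selection logic.
# The codfw branch is assumed to mirror the eqiad one.
select_master() {
    domain=$1
    if [ "$domain" = "eqiad.wmflabs" ]; then
        master=labs-puppetmaster-eqiad.wikimedia.org
        master_secondary=labs-puppetmaster-codfw.wikimedia.org
    elif [ "$domain" = "codfw.wmflabs" ]; then
        master=labs-puppetmaster-codfw.wikimedia.org
        master_secondary=labs-puppetmaster-eqiad.wikimedia.org
    fi
    echo "$master $master_secondary"
}

select_master eqiad.wmflabs
```

Since the selection keys only on the domain, every instance in eqiad.wmflabs ends up pointed at the labs-wide master rather than any project-specific one.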

In short, the instance is completely broken; it is easier to just recreate it.

Change 366520 merged by jenkins-bot:
[integration/config@master] Disable castor entirely

https://gerrit.wikimedia.org/r/366520

Mentioned in SAL (#wikimedia-releng) [2017-07-20T08:53:56Z] <hashar> Created castor02.integration.eqiad.wmflabs with puppet role role::ci::castor::server and adding it to Jenkins. Will then update the Jenkins jobs to point to it - T171148

Change 366523 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "Disable castor entirely"

https://gerrit.wikimedia.org/r/366523

Change 366524 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Point Castor to castor02.integration.eqiad.wmflabs

https://gerrit.wikimedia.org/r/366524

hashar triaged this task as Unbreak Now! priority. Jul 20 2017, 8:59 AM
hashar claimed this task.
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Jul 20 2017, 8:59 AM

Mentioned in SAL (#wikimedia-releng) [2017-07-20T09:03:12Z] <hashar> Restoring castor by updating all jobs to point to castor02 ( https://gerrit.wikimedia.org/r/366524 ). Starts with a cold cache :( - T171148

Change 366523 merged by jenkins-bot:
[integration/config@master] Revert "Disable castor entirely"

https://gerrit.wikimedia.org/r/366523

Change 366524 merged by jenkins-bot:
[integration/config@master] Point Castor to castor02.integration.eqiad.wmflabs

https://gerrit.wikimedia.org/r/366524

Mentioned in SAL (#wikimedia-operations) [2017-07-20T09:04:11Z] <hashar> Restored CI cache storage (castor) on a fresh new instance. Cache is empty though so jobs will be a bit slower until the cache is populated - T171148

I have manually repopulated the cache for operations/puppet.git by triggering https://integration.wikimedia.org/ci/job/operations-puppet-cache-update-jessie/

hashar closed this task as Resolved. Jul 20 2017, 9:13 AM

Beta cluster instances have the exact same issue. Filed as T171174.