CI jobs are blocked because castor is unreachable
Closed, ResolvedPublic

Description

castor.integration.eqiad.wmflabs is no longer reachable via SSH. Puppet has not run for the last 10 days, so the instance never received the new Puppet CA, which causes LDAP to fail:

2017-07-20T07:48:13.225792+00:00 castor nslcd[604]: [d89a32] <group/member="puppet"> no available LDAP server found: Server is unavailable
[09:48:58]  <hashar>	failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Connect error: (unknown error code)

salt is broken as well, so the instance cannot be reached that way either.

hashar created this task. Jul 20 2017, 7:53 AM
Restricted Application added a subscriber: Aklapper. Jul 20 2017, 7:53 AM

Mentioned in SAL (#wikimedia-releng) [2017-07-20T07:55:05Z] <hashar> Refreshing all Jenkins jobs defined in JJB in order to then disable castor entirely for T171148

Change 366520 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Disable castor entirely

https://gerrit.wikimedia.org/r/366520

Mentioned in SAL (#wikimedia-releng) [2017-07-20T08:00:30Z] <hashar> Disabled castor entirely via https://gerrit.wikimedia.org/r/366520 . The instance is broken - T171148

Mentioned in SAL (#wikimedia-operations) [2017-07-20T08:25:34Z] <hashar> CI is restored albeit in degraded mode (lack of Castor cache) - T171148

From the console log, puppet-agent on boot reports:

SSL_connect returned=1 errno=0 state=error: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: virt1000.wikimedia.org]
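This class of failure can be reproduced locally: a host certificate issued by one self-signed CA will not verify against a different CA bundle, which is what the agent hit after the CA changed. A minimal sketch with openssl (the CA names, CNs, and file names are all invented for illustration):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Two independent self-signed CAs: the one the instance still trusts,
# and the one the rest of the fleet moved to (CNs are illustrative).
openssl req -x509 -newkey rsa:2048 -nodes -keyout old-ca.key -out old-ca.pem \
    -subj "/CN=Puppet CA: virt1000.wikimedia.org" -days 2 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout new-ca.key -out new-ca.pem \
    -subj "/CN=Puppet CA: rotated" -days 2 2>/dev/null

# A host certificate issued by the old CA.
openssl req -newkey rsa:2048 -nodes -keyout host.key -out host.csr \
    -subj "/CN=castor.integration.eqiad.wmflabs" 2>/dev/null
openssl x509 -req -in host.csr -CA old-ca.pem -CAkey old-ca.key \
    -CAcreateserial -out host.pem -days 1 2>/dev/null

# Verifying against the new CA fails with a "self signed certificate in
# certificate chain" error (hyphenation varies by OpenSSL version),
# matching the agent's complaint above.
result=$(openssl verify -CAfile new-ca.pem -untrusted old-ca.pem host.pem 2>&1 || true)
echo "$result"
```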

And salt does not work because it is apparently connected to the labs salt master instead of the CI salt master:

2017-07-20T08:28:16.753242+00:00 castor rc.local[533]: + '[' eqiad.wmflabs == eqiad.wmflabs ']'
2017-07-20T08:28:16.753460+00:00 castor rc.local[533]: + master=labs-puppetmaster-eqiad.wikimedia.org
2017-07-20T08:28:16.753595+00:00 castor rc.local[533]: + master_secondary=labs-puppetmaster-codfw.wikimedia.org
2017-07-20T08:28:16.753753+00:00 castor rc.local[533]: + '[' eqiad.wmflabs == codfw.wmflabs ']
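Read as a sketch, the rc.local trace above appears to select the puppetmaster from the instance's domain. A minimal reconstruction of that logic (the codfw branch is an assumption, mirroring the eqiad one; only the eqiad branch is visible in the log):

```shell
# Reconstruction of the apparent rc.local master-selection logic.
# The codfw branch is assumed to mirror the eqiad one.
select_master() {
    domain=$1
    if [ "$domain" = "eqiad.wmflabs" ]; then
        master=labs-puppetmaster-eqiad.wikimedia.org
        master_secondary=labs-puppetmaster-codfw.wikimedia.org
    elif [ "$domain" = "codfw.wmflabs" ]; then
        master=labs-puppetmaster-codfw.wikimedia.org
        master_secondary=labs-puppetmaster-eqiad.wikimedia.org
    fi
    echo "$master $master_secondary"
}

select_master eqiad.wmflabs
```

Since the selection keys only on the domain, every instance in eqiad.wmflabs ends up pointed at the labs-wide master rather than any project-specific one.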

In short, the instance is completely broken; it is easier to just recreate it.

Change 366520 merged by jenkins-bot:
[integration/config@master] Disable castor entirely

https://gerrit.wikimedia.org/r/366520

Mentioned in SAL (#wikimedia-releng) [2017-07-20T08:53:56Z] <hashar> Created castor02.integration.eqiad.wmflabs with puppet role role::ci::castor::server and adding it to Jenkins. Will then update the Jenkins jobs to point to it - T171148

Change 366523 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "Disable castor entirely"

https://gerrit.wikimedia.org/r/366523

Change 366524 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Point Castor to castor02.integration.eqiad.wmflabs

https://gerrit.wikimedia.org/r/366524

hashar triaged this task as Unbreak Now! priority. Jul 20 2017, 8:59 AM
hashar claimed this task.
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Jul 20 2017, 8:59 AM

Mentioned in SAL (#wikimedia-releng) [2017-07-20T09:03:12Z] <hashar> Restoring castor by updating all jobs to point to castor02 ( https://gerrit.wikimedia.org/r/366524 ). Starts with a cold cache :( - T171148

Change 366523 merged by jenkins-bot:
[integration/config@master] Revert "Disable castor entirely"

https://gerrit.wikimedia.org/r/366523

Change 366524 merged by jenkins-bot:
[integration/config@master] Point Castor to castor02.integration.eqiad.wmflabs

https://gerrit.wikimedia.org/r/366524

Mentioned in SAL (#wikimedia-operations) [2017-07-20T09:04:11Z] <hashar> Restored CI cache storage (castor) on a fresh new instance. Cache is empty though so jobs will be a bit slower until the cache is populated - T171148

I have manually repopulated the cache for operations/puppet.git by triggering https://integration.wikimedia.org/ci/job/operations-puppet-cache-update-jessie/

hashar closed this task as Resolved. Jul 20 2017, 9:13 AM

Beta cluster instances have the exact same issue. Filed as T171174.