Migrate Restbase-dev cluster to Stretch
Closed, ResolvedPublic

Description

Most production Restbase servers are running Stretch, but Restbase-dev is still on Jessie:

  • restbase-dev1004.eqiad.wmnet
  • restbase-dev1005.eqiad.wmnet
  • restbase-dev1006.eqiad.wmnet

Event Timeline

restbase-dev1006 is blocked on T224260 (restbase-dev1006 has a broken disk), which needs to be resolved first (and AFAIK we'll likely end up re-imaging the machine anyway).

ArielGlenn triaged this task as Normal priority. Jun 11 2019, 7:59 AM
Eevans added a project: User-Eevans.
Eevans moved this task from Backlog to In-Progress on the User-Eevans board.

Mentioned in SAL (#wikimedia-operations) [2019-09-04T20:14:48Z] <urandom> decommission restbase-dev1004-a (Cassandra) -- T224554

Mentioned in SAL (#wikimedia-operations) [2019-09-04T23:05:27Z] <urandom> decommission restbase-dev1004-b (Cassandra) -- T224554

Eevans added a comment. Thu, Sep 5, 2:10 AM

restbase-dev1004 has been decommissioned and can come down for a re-image at any time.

/cc @MoritzMuehlenhoff

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:16:45Z] <moritzm> reimage restbase-dev1004 to Stretch T224554

restbase-dev1004 has been reinstalled as Stretch. @Eevans, you can bootstrap 1004 in Cassandra and decom 1005 in Cassandra, then I'll proceed with reimaging 1005.

Eevans added a comment. Thu, Sep 5, 2:19 PM

On it; Thanks!

Eevans added a comment. Thu, Sep 5, 2:45 PM

Trouble: -dev1004 won't bootstrap because the keys/certs have expired.

restbase-dev1004:/var/log/cassandra/system-a.log
ERROR [MessagingService-Outgoing-restbase-dev1005-b.eqiad.wmnet/10.64.16.98-Gossip] 2019-09-05 14:21:55,467 OutboundTcpConnection.java:537 - SSL handshake error for outbound connection to 14709965[SSL_NULL_WITH_NULL_NULL: Socket[addr=restbase-dev1005-b.eqiad.wmnet/10.64.16.98,port=7001,localport=38794]]
javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateExpiredException: NotAfter: Tue Aug 13 22:16:15 UTC 2019
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[na:1.8.0_222]
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946) ~[na:1.8.0_222]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316) ~[na:1.8.0_222]
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310) ~[na:1.8.0_222]
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639) ~[na:1.8.0_222]
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223) ~[na:1.8.0_222]
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037) ~[na:1.8.0_222]
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:965) ~[na:1.8.0_222]
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064) ~[na:1.8.0_222]
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367) ~[na:1.8.0_222]
	at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:750) ~[na:1.8.0_222]
	at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123) ~[na:1.8.0_222]
	at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458) ~[na:1.8.0_222]
	at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.doFlush(BufferedDataOutputStreamPlus.java:323) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.flush(BufferedDataOutputStreamPlus.java:331) ~[apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:461) [apache-cassandra-3.11.2.jar:3.11.2]
	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:262) [apache-cassandra-3.11.2.jar:3.11.2]
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Tue Aug 13 22:16:15 UTC 2019
	at sun.security.x509.CertificateValidity.valid(CertificateValidity.java:274) ~[na:1.8.0_222]
	at sun.security.x509.X509CertImpl.checkValidity(X509CertImpl.java:629) ~[na:1.8.0_222]
	at sun.security.validator.SimpleValidator.engineValidate(SimpleValidator.java:201) ~[na:1.8.0_222]
	at sun.security.validator.Validator.validate(Validator.java:262) ~[na:1.8.0_222]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330) ~[na:1.8.0_222]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237) ~[na:1.8.0_222]
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132) ~[na:1.8.0_222]
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621) ~[na:1.8.0_222]
	... 12 common frames omitted
$ keytool -list -v -keystore /etc/cassandra-a/tls/server.key | grep -i "valid from"
Valid from: Mon Aug 13 22:16:11 UTC 2018 until: Tue Aug 13 22:16:11 UTC 2019
$

I'd put together a Gerrit change for this, but I don't have access to private.git. :( In the past this has been done by @fgiunchedi.

Documentation: https://wikitech.wikimedia.org/wiki/Cassandra/Tools/cassandra-ca-manager and https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates
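
For anyone who wants to catch this before a bootstrap attempt fails, here is a minimal sketch of an expiry check built on the keytool output shown above. The function name expired_certs and the optional reference-time argument are made up for illustration, and the date parsing assumes GNU date.

```shell
# Read keytool -list -v output on stdin and print every "Valid from" line
# whose "until" date is earlier than the reference time.
# $1: reference time in epoch seconds (optional, defaults to now).
# Assumes GNU date, which can parse "Tue Aug 13 22:16:11 UTC 2019" via -d.
expired_certs() {
    ref="${1:-$(date +%s)}"
    grep -i 'valid from' | while IFS= read -r line; do
        # Keep only the text after "until: ".
        until_date="${line##*until: }"
        # Skip lines whose date cannot be parsed.
        until_epoch="$(date -d "$until_date" +%s)" || continue
        if [ "$until_epoch" -lt "$ref" ]; then
            printf 'EXPIRED: %s\n' "$line"
        fi
    done
}
```

Usage would look something like: keytool -list -v -keystore /etc/cassandra-a/tls/server.key 2>/dev/null | expired_certs. No output means none of the listed validity periods has ended.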

Mentioned in SAL (#wikimedia-operations) [2019-09-06T06:03:01Z] <mutante> puppetmaster1001 - copying cassandra-ca-manager to /usr/local/bin - deleting expired restbase-dev1004 certs - running cassandra-ca-manager services-dev.yaml T224554

Mentioned in SAL (#wikimedia-operations) [2019-09-06T06:09:47Z] <mutante> puppetmaster1001 - same for restbase-dev1005 and restbase-dev1006 (T224554)

Dzahn added a subscriber: Dzahn. Fri, Sep 6, 6:14 AM

@Eevans I recreated the certs for restbase-dev1004 through restbase-dev1006 and committed in the private repo. Please try again now.

Dzahn added a comment. Fri, Sep 6, 6:17 AM

@restbase-dev1004 :  keytool -list -v -keystore /etc/cassandra-a/tls/server.key 2>/dev/null | grep "Valid from"

Valid from: Fri Sep 06 06:01:30 UTC 2019 until: Sat Sep 05 06:01:30 UTC 2020
Valid from: Thu Jan 05 22:53:00 UTC 2017 until: Fri Dec 24 22:53:00 UTC 2066
Valid from: Thu Jan 05 22:53:00 UTC 2017 until: Fri Dec 24 22:53:00 UTC 2066

Mentioned in SAL (#wikimedia-operations) [2019-09-09T14:22:08Z] <urandom> bootstrapping Cassandra, restbase-dev1004-a -- T224554

Mentioned in SAL (#wikimedia-operations) [2019-09-09T20:50:39Z] <urandom> bootstrapping Cassandra, restbase-dev1004-b -- T224554

Thanks @Dzahn !

Mentioned in SAL (#wikimedia-operations) [2019-09-10T00:41:38Z] <urandom> decommissioning Cassandra, restbase-dev1005-a -- T224554

Eevans updated the task description. Tue, Sep 10, 12:41 AM

Mentioned in SAL (#wikimedia-operations) [2019-09-10T05:33:02Z] <urandom> decommissioning Cassandra, restbase-dev1005-b -- T224554

I started the decommission of -dev1005-b quite late in my evening; it should be complete by EU morning. If the following command produces no output, the node can be taken down for reimage: ssh restbase-dev1004.eqiad.wmnet -- c-any-nt status -r | grep 1005
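
If someone would rather not check by hand, a rough sketch of a polling loop around that same check is below. The function name wait_until_gone and the interval argument are hypothetical; the status command is passed in unchanged, so nothing here assumes anything about c-any-nt beyond what is written above.

```shell
# Poll a status command until none of its output lines mention the node.
# $1: pattern to look for (e.g. 1005)
# $2: seconds to sleep between polls
# remaining arguments: the status command to run, passed through as-is
wait_until_gone() {
    node="$1"
    interval="$2"
    shift 2
    # grep -q succeeds while the node is still listed, so keep waiting.
    while "$@" | grep -q -- "$node"; do
        sleep "$interval"
    done
}
```

For example: wait_until_gone 1005 300 ssh restbase-dev1004.eqiad.wmnet -- c-any-nt status -r. Note that the loop also ends if the status command itself prints nothing (for instance because ssh failed), so it is worth confirming the result with one manual run.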

Mentioned in SAL (#wikimedia-operations) [2019-09-10T16:24:12Z] <urandom> disabling reserved space on restbase-dev1005:/dev/mapper/restbase--dev1005--vg-srv -- T224554

restbase-dev1005 has been decommissioned and is ready to be reimaged.

$ ssh restbase-dev1004.eqiad.wmnet -- c-any-nt status -r |grep 1005
$

Mentioned in SAL (#wikimedia-operations) [2019-09-11T07:52:50Z] <moritzm> reimaging restbase-dev1005 to Stretch T224554

restbase-dev1005 is now running Stretch and good to bootstrap.

Mentioned in SAL (#wikimedia-operations) [2019-09-11T16:24:21Z] <urandom> bootstrapping Cassandra, restbase-dev1005-a -- T224554

Mentioned in SAL (#wikimedia-operations) [2019-09-11T20:57:32Z] <urandom> bootstrapping Cassandra, restbase-dev1005-b -- T224554

MoritzMuehlenhoff closed this task as Resolved. Thu, Sep 12, 12:27 PM
MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description.

This is complete; restbase-dev is running Stretch.
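
For future reference, the per-host order followed in this task (decommission both Cassandra instances, reimage, bootstrap both instances, then move to the next host) can be summarized with a small dry-run sketch. It only prints the steps and runs nothing; the function name print_plan is made up, and the instance suffixes a and b are taken from the SAL entries above.

```shell
# Print the rolling-reimage order used in this task for each host given.
# This is a dry run: it echoes the steps and executes nothing.
print_plan() {
    for host in "$@"; do
        for inst in a b; do
            echo "decommission Cassandra instance ${host}-${inst}"
        done
        echo "reimage ${host} to Stretch"
        for inst in a b; do
            echo "bootstrap Cassandra instance ${host}-${inst}"
        done
    done
}

print_plan restbase-dev1004 restbase-dev1005
```

Each host is finished completely before the next one starts, so at most one host's instances are out of the cluster at a time.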