Puppetize Skein certificate generation
Closed, Resolved, Public

Description

While debugging a failure of all Airflow jobs running via Skein, we found an expired certificate (details below). It seems to have been generated only once and probably needs to be puppetized.

(base) btullis@an-launcher1002:/srv/airflow-analytics/.skein$ sudo openssl x509 -in skein.crt -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            60:82:23:5a:85:63:ab:b4:cb:a4:ab:c0:3b:15:91:7a:be:3e:c1:14
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = skein-internal
        Validity
            Not Before: Feb 10 16:52:16 2022 GMT
            Not After : Feb 10 16:52:16 2023 GMT
        Subject: CN = skein-internal
        Subject
(base) btullis@an-launcher1002:/srv/airflow-analytics/.skein$ skein config --help
usage: skein config [--help] command ...

Manage configuration

positional arguments:
  command
    gencerts      Generate security credentials. Creates a self-signed TLS
                  key/certificate pair for securing Skein communication, and
                  writes it to the skein configuration directory ("~.skein/"
                  by default).

optional arguments:

AC:

  • skein certificates are managed with Puppet
  • certificates are renewed automatically before their expiration date
  • an alert is raised if certificates are about to expire

Event Timeline

HuH! Who knew?!

I guess we could also just periodically remove the cert and it would be recreated? Or maybe not, because running jobs would lose the ability to open connections?

I don't recall ever generating this, so it must have been auto-generated.

Just passing by to +1 this effort since T329398 just hit us. Let me clean up the tags..

Gehel updated the task description.

@fgiunchedi - I wonder if you might be able to advise here, please?

We have an x509 certificate on-disk, but it's not exposed via a TCP service.
We would like to check its expiry via Alertmanager but we currently don't have the date of expiry in Prometheus.

How would you go about collecting and alerting on this data? Would we want to use a prometheus::node_textfile resource, for example?
https://github.com/wikimedia/operations-puppet/blob/production/modules/prometheus/manifests/node_textfile.pp

Is there any prior art on collecting this kind of expiry date from a certificate, where we can't access it with a prometheus::blackbox::check::tcp check?
Thanks.

> @fgiunchedi - I wonder if you might be able to advise here, please?
>
> We have an x509 certificate on-disk, but it's not exposed via a TCP service.
> We would like to check its expiry via Alertmanager but we currently don't have the date of expiry in Prometheus.
>
> How would you go about collecting and alerting on this data? Would we want to use a prometheus::node_textfile resource, for example?
> https://github.com/wikimedia/operations-puppet/blob/production/modules/prometheus/manifests/node_textfile.pp

Yes this IMHO is the simplest approach

> Is there any prior art on collecting this kind of expiry date from a certificate, where we can't access it with a prometheus::blackbox::check::tcp check?

There is indeed: take a look at modules/puppetmaster/manifests/ca_monitoring.pp, which does something along the lines of what you are describing. cc @taavi as the author, just in case.
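To make the textfile approach a little more concrete, here is a minimal sketch of what such a collector script could do: read the `notAfter` date of an on-disk certificate with `openssl x509 -enddate` and render it in the Prometheus text exposition format. The metric name, the `path` label and the helper names are assumptions made for this illustration; they are not taken from ca_monitoring.pp or from any merged patch.

```python
import ssl
import subprocess


def read_enddate(cert_path: str) -> str:
    """Return the 'notAfter=...' line for a certificate (needs openssl on PATH)."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_not_after(enddate_line: str) -> int:
    """Convert a line such as 'notAfter=Feb 10 16:52:16 2023 GMT' to a Unix timestamp."""
    _, _, cert_time = enddate_line.strip().partition("=")
    return int(ssl.cert_time_to_seconds(cert_time))


def render_metric(cert_path: str, expiry_ts: int) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    # Hypothetical metric name, chosen for this example only.
    name = "skein_certificate_expiry_timestamp_seconds"
    return (
        f"# HELP {name} Expiry date of the certificate, as a Unix timestamp\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{path="{cert_path}"}} {expiry_ts}\n'
    )
```

A wrapper would write `render_metric(path, parse_not_after(read_enddate(path)))` to a `.prom` file in the node exporter's textfile directory, and an alert could then compare the exported value against `time()`.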

@fgiunchedi Nice, thanks! According to you, would it be ok to include this module in a non-puppetmaster-related role?

> @fgiunchedi Nice, thanks! According to you, would it be ok to include this module in a non-puppetmaster-related role?

In general yes, although from a quick skim of the code it seems it won't work as-is. It can be used as a base for your needs, though.

Indeed. A ca_monitoring module with a puppet-agnostic script exporting metrics to prometheus might be the way to go then.

> Indeed. A ca_monitoring module with a puppet-agnostic script exporting metrics to prometheus might be the way to go then.

I think it's more likely that the techniques used within that class are applicable here, rather than creating a ca_monitoring module.

So we can define a prometheus::node_textfile resource.

The open question I have is where we would create this new resource. I think it would probably be best within the airflow::instance defined type, so that every time we create an instance, it also creates one of these Prometheus exporters automatically.

Does this make sense?

It does! I'm not yet familiar enough with Puppet to explain my train of thought in the appropriate terms, but I was indeed thinking about something like this. I didn't know about prometheus::node_textfile. That should prove handy.

Change 966553 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Publish metrics reflecting skein certificate expiry

https://gerrit.wikimedia.org/r/966553

brouberol changed the task status from Open to In Progress. Oct 17 2023, 3:35 PM

Change 966553 merged by Brouberol:

[operations/puppet@production] Publish metrics reflecting skein certificate expiry

https://gerrit.wikimedia.org/r/966553

Change 967404 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Export the certificate path as a label of the expiration date metric

https://gerrit.wikimedia.org/r/967404

Change 967404 merged by Brouberol:

[operations/puppet@production] Export the certificate path as a label of the expiration date metric

https://gerrit.wikimedia.org/r/967404

Change 967409 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/alerts@master] Monitor the expiration date of the skein x509 certificates

https://gerrit.wikimedia.org/r/967409

Change 967409 merged by jenkins-bot:

[operations/alerts@master] Monitor the expiration date of the skein x509 certificates

https://gerrit.wikimedia.org/r/967409

Change 968112 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-engineering: fix deploy-tag for skein cert expiry

https://gerrit.wikimedia.org/r/968112

Change 968112 merged by Filippo Giunchedi:

[operations/alerts@master] data-engineering: fix deploy-tag for skein cert expiry

https://gerrit.wikimedia.org/r/968112

Change 968612 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Enable the management of the skein certificate via Puppet

https://gerrit.wikimedia.org/r/968612

Change 968613 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Enable the management of the skein certificate via Puppet on one instance

https://gerrit.wikimedia.org/r/968613

Change 968612 merged by Brouberol:

[operations/puppet@production] Enable the management of the skein certificate via Puppet

https://gerrit.wikimedia.org/r/968612

Change 968613 merged by Brouberol:

[operations/puppet@production] Enable the management of the skein certificate via Puppet on one instance

https://gerrit.wikimedia.org/r/968613

Mentioned in SAL (#wikimedia-analytics) [2023-10-31T09:26:06Z] <brouberol> I replaced the self-signed skein certificate by one issued by our cfssl PKI on an-test1002 - T329398

I've played around with the cfssl-generated chained certificate, to see whether I could have Skein accept it as a valid x509 certificate.

brouberol@an-test-client1002:~$ sudo mv /srv/airflow-analytics_test/.skein/skein.crt /srv/airflow-analytics_test/.skein/skein.crt
brouberol@an-test-client1002:~$ sudo ln -s /etc/cfssl/ssl/discovery__an-test-client1002_eqiad_wmnet/discovery__an-test-client1002_eqiad_wmnet.chained.pem /srv/airflow-analytics_test/.skein/skein.crt
brouberol@an-test-client1002:~$ sudo su - analytics
analytics@an-test-client1002:~$ source /lib/airflow/bin/activate
(airflow) analytics@an-test-client1002:~$ export HOME=/srv/airflow-analytics_test
(airflow) analytics@an-test-client1002:~$ cd
(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test$ skein application ls
Error: Security cert file not found at '/srv/airflow-analytics_test/.skein/skein.crt'
Exception ignored in: <function Client.__del__ at 0x7fc6944183a0>
Traceback (most recent call last):
  File "/usr/lib/airflow/lib/python3.10/site-packages/skein/core.py", line 492, in __del__
    self.close()
  File "/usr/lib/airflow/lib/python3.10/site-packages/skein/core.py", line 479, in close
    if self._proc is not None:
AttributeError: 'Client' object has no attribute '_proc'

Skein does not seem to like its certificate being a symlink. Fair enough. I then copied the cfssl certificate to ~/.skein:
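One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the file check behind that error message follows symlinks (as Python's `os.path.exists` does), a link whose target is missing, or sits in a directory the `analytics` user cannot traverse, is reported as not found even though the link itself is there. The sketch below shows that behaviour with a dangling link in a temporary directory; the file names and helper names are made up for the example.

```python
import os
import tempfile


def describe_link(path: str) -> dict:
    """Report how a path looks with and without following symlinks."""
    return {
        "is_symlink": os.path.islink(path),
        # lexists() inspects the link itself.
        "link_present": os.path.lexists(path),
        # exists() follows the link and is False when the target cannot be stat()ed.
        "target_reachable": os.path.exists(path),
    }


def demo_dangling_link() -> dict:
    """Create a symlink to a missing target and describe it."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "skein.crt")
        os.symlink(os.path.join(tmp, "missing.chained.pem"), link)
        return describe_link(link)
```

Running `describe_link` on the real symlink as the `analytics` user would show whether the target was reachable from that account.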

brouberol@an-test-client1002:~$ sudo cp /etc/cfssl/ssl/discovery__an-test-client1002_eqiad_wmnet/discovery__an-test-client1002_eqiad_wmnet.chained.pem /srv/airflow-analytics_test/.skein/skein.crt
(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test$ skein application ls
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
23/10/31 09:49:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/31 09:49:53 WARN skein.Driver: Kerberos ticket not found, please kinit and restart
23/10/31 09:49:55 INFO skein.Driver: Driver started, listening on 36327
E1031 09:49:55.227525641   48433 ssl_transport_security.cc:780]        Invalid private key.
E1031 09:49:55.227899176   48433 ssl_security_connector.cc:128]        Handshaker factory creation failed with TSI_INVALID_ARGUMENT.
E1031 09:49:55.228005927   48433 chttp2_connector.cc:268]              Failed to create channel args during subchannel creation: INTERNAL: Failed to create secure subchannel for secure name 'skein-internal'; Got args: {grpc.client_channel_factory=0x14bd480, grpc.default_authority=skein-internal, grpc.internal.channel_credentials=0x16bdec0, grpc.internal.event_engine=0x16b9c70, grpc.internal.subchannel_pool=0x168b080, grpc.primary_user_agent=grpc-python/1.56.0, grpc.resource_quota=0x184b100, grpc.server_uri=dns:///127.0.0.1:36327, grpc.ssl_target_name_override=skein-internal}
Error: Unable to connect to driver

I think the error comes from the fact that the Skein gRPC client assumes that the CN and issuer name will be skein-internal.

Actually, I realized that I had only changed the _certificate_ but not the private key..
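A mismatch like this can be caught before starting the driver by comparing the public key embedded in the certificate with the one derived from the private key. The sketch below is one way to do that by shelling out to `openssl`; the helper names are hypothetical, and it assumes `openssl` is on the PATH and that the key file is not passphrase-protected.

```python
import subprocess


def normalize_pem(pem: str) -> str:
    """Drop all whitespace so two PEM blobs can be compared reliably."""
    return "".join(pem.split())


def public_keys_match(cert_pubkey_pem: str, key_pubkey_pem: str) -> bool:
    """True when both PEM-encoded public keys are identical."""
    return normalize_pem(cert_pubkey_pem) == normalize_pem(key_pubkey_pem)


def cert_matches_key(cert_path: str, key_path: str) -> bool:
    """Compare the public key of a certificate with that of a private key."""
    # For a chained file, openssl x509 reads the first certificate only.
    cert_pub = subprocess.run(
        ["openssl", "x509", "-in", cert_path, "-noout", "-pubkey"],
        check=True, capture_output=True, text=True,
    ).stdout
    key_pub = subprocess.run(
        ["openssl", "pkey", "-in", key_path, "-pubout"],
        check=True, capture_output=True, text=True,
    ).stdout
    return public_keys_match(cert_pub, key_pub)
```

For instance, `cert_matches_key("skein.crt", "skein.pem")` would be expected to return False when a new certificate is paired with the old key.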

# the original skein.crt had been restored at this point
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.crt skein.crt.bak
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.pem skein.pem.bak
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo cp /etc/cfssl/ssl/discovery__an-test-client1002_eqiad_wmnet/discovery__an-test-client1002_eqiad_wmnet.chained.pem /srv/airflow-analytics_test/.skein/skein.crt
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo cp /etc/cfssl/ssl/discovery__an-test-client1002_eqiad_wmnet/discovery__an-test-client1002_eqiad_wmnet-key.pem skein.pem
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo chown analytics:analytics skein.pem skein.crt
(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test$ skein application ls
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
23/10/31 10:12:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/31 10:12:22 WARN skein.Driver: Kerberos ticket not found, please kinit and restart
23/10/31 10:12:23 ERROR skein.Driver: Error running Driver
java.lang.IllegalArgumentException: Input stream does not contain valid private key.
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:296)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:236)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.forServer(SslContextBuilder.java:65)
	at com.anaconda.skein.shaded.io.grpc.netty.GrpcSslContexts.forServer(GrpcSslContexts.java:151)
	at com.anaconda.skein.Driver.startServer(Driver.java:124)
	at com.anaconda.skein.Driver.run(Driver.java:287)
	at com.anaconda.skein.Driver.main(Driver.java:175)
Caused by: java.security.spec.InvalidKeySpecException: Neither RSA, DSA nor EC worked
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContext.getPrivateKeyFromByteBuffer(SslContext.java:1046)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContext.toPrivateKey(SslContext.java:1025)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:294)
	... 6 more
Caused by: java.security.spec.InvalidKeySpecException: java.security.InvalidKeyException: IOException : version mismatch: (supported:     00, parsed:     01
	at sun.security.ec.ECKeyFactory.engineGeneratePrivate(ECKeyFactory.java:169)
	at java.security.KeyFactory.generatePrivate(KeyFactory.java:372)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContext.getPrivateKeyFromByteBuffer(SslContext.java:1044)
	... 8 more
Caused by: java.security.InvalidKeyException: IOException : version mismatch: (supported:     00, parsed:     01
	at sun.security.pkcs.PKCS8Key.decode(PKCS8Key.java:351)
	at sun.security.pkcs.PKCS8Key.decode(PKCS8Key.java:356)
	at sun.security.ec.ECPrivateKeyImpl.<init>(ECPrivateKeyImpl.java:79)
	at sun.security.ec.ECKeyFactory.implGeneratePrivate(ECKeyFactory.java:237)
	at sun.security.ec.ECKeyFactory.engineGeneratePrivate(ECKeyFactory.java:165)
	... 10 more
Error: Failed to start java process

This issue seems to be caused by skein not being able to read the elliptic curve private key format (source).
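The stack trace above ends in `sun.security.pkcs.PKCS8Key.decode`, which suggests the JVM is trying to read the key as PKCS#8. A key stored in the traditional EC (SEC1) or traditional RSA (PKCS#1) PEM encoding would fail that parse. As a quick check, the PEM header line tells the encodings apart; the sketch below classifies a key file by that header. The function name and the returned labels are chosen for this example.

```python
# PEM header lines for the common private key encodings.
PEM_KEY_FORMATS = {
    "-----BEGIN PRIVATE KEY-----": "PKCS#8 (unencrypted)",
    "-----BEGIN ENCRYPTED PRIVATE KEY-----": "PKCS#8 (encrypted)",
    "-----BEGIN EC PRIVATE KEY-----": "SEC1 (traditional EC)",
    "-----BEGIN RSA PRIVATE KEY-----": "PKCS#1 (traditional RSA)",
}


def private_key_format(pem_text: str) -> str:
    """Return the encoding named by the first private key header found."""
    # Scan every line: EC key files may start with an 'EC PARAMETERS' block.
    for line in pem_text.splitlines():
        label = PEM_KEY_FORMATS.get(line.strip())
        if label is not None:
            return label
    return "unknown"
```

Only a file reported as "PKCS#8 (unencrypted)" would be readable by a loader that expects unencrypted PKCS#8; the others would need converting first.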

The suggested private key conversion command seemed to work better:

brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo openssl pkey -in skein.pem.nw -out skein.pkcs8.pem
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.pem skein.pem.bak
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.crt.pem skein.crt.bak
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.crt.nw skein.crt
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.pkcs8.pem skein.pem
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo chown analytics:analytics skein.pem skein.crt
(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test$ skein application ls
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
23/10/31 10:22:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/31 10:22:56 WARN skein.Driver: Kerberos ticket not found, please kinit and restart
23/10/31 10:22:58 INFO skein.Driver: Driver started, listening on 34121
Error: Unable to connect to driver

We might need to tweak the algo parameter we pass to gen_cert, to mirror the output of the skein config gencerts command, which seems to be a 2048-bit RSA key:

brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo openssl x509 -in skein.crt -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            27:10:a7:88:8c:3a:95:7f:9e:df:16:84:86:01:dc:42:96:40:ac:1a
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = skein-internal
        Validity
            Not Before: Oct 31 10:25:55 2023 GMT
            Not After : Oct 30 10:25:55 2024 GMT
        Subject: CN = skein-internal
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
...

Change 970331 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Generate an RSA2048-encrypted private key for Skein

https://gerrit.wikimedia.org/r/970331

Change 970331 merged by Brouberol:

[operations/puppet@production] Generate an RSA 4096-encrypted private key for Skein

https://gerrit.wikimedia.org/r/970331

After Puppet generates the certificate and a 4096-bit RSA private key, we encounter the following error when executing Skein:

(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test$ skein application ls
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
23/10/31 12:33:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/31 12:33:27 WARN skein.Driver: Kerberos ticket not found, please kinit and restart
23/10/31 12:33:28 ERROR skein.Driver: Error running Driver
java.lang.IllegalArgumentException: Input stream does not contain valid private key.
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:296)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:236)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.forServer(SslContextBuilder.java:65)
	at com.anaconda.skein.shaded.io.grpc.netty.GrpcSslContexts.forServer(GrpcSslContexts.java:151)
	at com.anaconda.skein.Driver.startServer(Driver.java:124)
	at com.anaconda.skein.Driver.run(Driver.java:287)
	at com.anaconda.skein.Driver.main(Driver.java:175)
Caused by: java.security.spec.InvalidKeySpecException: Neither RSA, DSA nor EC worked
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContext.getPrivateKeyFromByteBuffer(SslContext.java:1046)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContext.toPrivateKey(SslContext.java:1025)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:294)
	... 6 more
Caused by: java.security.spec.InvalidKeySpecException: java.security.InvalidKeyException: IOException : algid parse error, not a sequence
	at sun.security.ec.ECKeyFactory.engineGeneratePrivate(ECKeyFactory.java:169)
	at java.security.KeyFactory.generatePrivate(KeyFactory.java:372)
	at com.anaconda.skein.shaded.io.netty.handler.ssl.SslContext.getPrivateKeyFromByteBuffer(SslContext.java:1044)
	... 8 more
Caused by: java.security.InvalidKeyException: IOException : algid parse error, not a sequence
	at sun.security.pkcs.PKCS8Key.decode(PKCS8Key.java:351)
	at sun.security.pkcs.PKCS8Key.decode(PKCS8Key.java:356)
	at sun.security.ec.ECPrivateKeyImpl.<init>(ECPrivateKeyImpl.java:79)
	at sun.security.ec.ECKeyFactory.implGeneratePrivate(ECKeyFactory.java:237)
	at sun.security.ec.ECKeyFactory.engineGeneratePrivate(ECKeyFactory.java:165)
	... 10 more
Error: Failed to start java process

According to this post, this means that the private key format needs to be converted to PKCS#8.

brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo openssl pkcs8 -topk8 -in skein.pem -nocrypt -out skein.pem.pkcs8
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo mv skein.pem.pkcs8 skein.pem
brouberol@an-test-client1002:/srv/airflow-analytics_test/.skein$ sudo chown analytics:analytics skein.pem skein.crt

Which _seems_ to work:

(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test$ skein application ls
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
23/10/31 12:38:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/31 12:38:04 WARN skein.Driver: Kerberos ticket not found, please kinit and restart
23/10/31 12:38:06 INFO skein.Driver: Driver started, listening on 43387
Error: Unable to connect to driver

Change 970378 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Convert the Skein private key to the PKCS#8 format

https://gerrit.wikimedia.org/r/970378

Change 970378 merged by Brouberol:

[operations/puppet@production] Convert the Skein private key to the PKCS#8 format

https://gerrit.wikimedia.org/r/970378

Change 970403 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Fix puppet error by providing the openssl absolute path

https://gerrit.wikimedia.org/r/970403

Change 970403 merged by Brouberol:

[operations/puppet@production] Fix puppet error by providing the openssl absolute path

https://gerrit.wikimedia.org/r/970403

Change 970408 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Hide skein private key diff in puppet logs

https://gerrit.wikimedia.org/r/970408

I'm not sure I understand why, but as soon as I deploy the PKI-generated certificate/private key, the aqs_hourly jobs start being rescheduled indefinitely.

I'll disable puppet on an-test-client1002 for now, and re-generate self-signed skein certificates to unblock the airflow jobs.

Looking at the Airflow logs, it seems that every time I change the Skein certificate to the PKI-generated one, we suddenly can't find the appropriate Hive data partition:

[2023-11-01, 16:16:14 UTC] {named_hive_partition.py:94} INFO - Poking for wmf.webrequest/webrequest_source=test_text/year=2023/month=10/day=31/hour=17
[2023-11-01, 16:16:15 UTC] {taskinstance.py:1784} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE

in a loop.

This does not make a whole lot of sense to me, as I'd imagine that Skein would be involved after that, once that task found the Hive partition, to schedule the Spark job to query it.

This is what openssl s_client outputs with the Skein self-signed certificate:

(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test/.skein$ openssl s_client -connect 127.0.0.1:40831  -cert ./skein.crt -key ./skein.pem 
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = skein-internal
verify error:num=18:self-signed certificate
verify return:1
depth=0 CN = skein-internal
verify return:1
---
Certificate chain
 0 s:CN = skein-internal
   i:CN = skein-internal
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Nov  2 09:44:11 2023 GMT; NotAfter: Nov  1 09:44:11 2024 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIICvjCCAaagAwIBAgIUB3FZ7y+SqQDLJbR5fvzts6nLjhEwDQYJKoZIhvcNAQEL
BQAwGTEXMBUGA1UEAwwOc2tlaW4taW50ZXJuYWwwHhcNMjMxMTAyMDk0NDExWhcN
MjQxMTAxMDk0NDExWjAZMRcwFQYDVQQDDA5za2Vpbi1pbnRlcm5hbDCCASIwDQYJ
KoZIhvcNAQEBBQADggEPADCCAQoCggEBAMpuSBaYy2DT4+eLu8HDlpTvw2WtxHpn
TcAvMT5eNeosfu43TYMo96ZnNOcAfyyBuFIPQUsC2djE0i6UGM/SPi2q3BALhrDW
UkXZ46rbfdkKtu8SQMQDbdfrh+VWDAPz9I/mBnaNY/ij19sxf68gGNnZS4WZFQYq
ZJTAsSRXP7UC4O7wCAf3QlwqZuambr905Osw+ePdVaIKowfSBOsbHEBuT1zX5xDI
qyuR3ejSD+xcO897yixHEGmCghO/BRqJgcSTpFqIiN+VkpXfhrklFC09molLgIc8
z+NJqLyxsy03Q/qevjagC8sXhu/A8kuGij/iSzNehb6rb4MICNWB7xkCAwEAATAN
BgkqhkiG9w0BAQsFAAOCAQEAiDujQRda3EXCHSiO2xdFKnDtkO+Gjmt4nu4hadW8
WxG+OFcLvRb32D+ocTGwZV2QTNZrL308J1o0NdQfOJpEcVVE8KSB9NgFbGTX0tnM
BnfTT0xmarKAeIhX1oDhckd1fDnf0butJHPoqMVMmMqk1eONllDjZHSFgd4JiZM1
7Dimh1atHHKLIKVgXw+tzglsv9I3ACblhniO5e9N60Zz6IUa1yxYe8i7yShe4ubS
wbfu0nK+zLz6/uOB3QL9N4+rvYJ19NXkB9eSF0MQpJo6ekArmWGu7fjt9yU+kkm2
AdR2MsNtX+uFJeFI+KbawQD4+XxVtXtmrZSRmglJQGEvfA==
-----END CERTIFICATE-----
subject=CN = skein-internal
issuer=CN = skein-internal
---
Acceptable client certificate CA names
CN = skein-internal
Client Certificate Types: RSA sign, ECDSA sign
Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512:RSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1245 bytes and written 1380 bytes
Verification error: self-signed certificate
---
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: E8779E866BCA9213AB6872E19F184B4758B281E8A35B69F2053D738EEF17B73E
    Session-ID-ctx: 
    Master-Key: 230ED241E174554176883DC7B594EA768A70EA74E6439B445DA5318BECB0CADFD13099B37A35DE2A887DCA46ADA8E800
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1698918366
    Timeout   : 7200 (sec)
    Verify return code: 18 (self-signed certificate)
    Extended master secret: yes
---
closed

This is what it outputs with the PKI-generated certificate:

(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test/.skein$ openssl s_client -connect 127.0.0.1:45883  -cert ./skein.crt -key ./skein.pem 
CONNECTED(00000003)
Can't use SSL_get_servername
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = an-test-client1002.eqiad.wmnet
verify return:1
80B28A06297F0000:error:0A000438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error:ssl/record/rec_layer_s3.c:1586:SSL alert number 80
---
Certificate chain
 0 s:CN = an-test-client1002.eqiad.wmnet
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: ecdsa-with-SHA512
   v:NotBefore: Oct 31 12:25:00 2023 GMT; NotAfter: Nov 28 12:25:00 2023 GMT
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
   a:PKEY: id-ecPublicKey, 521 (bit); sigalg: ecdsa-with-SHA512
   v:NotBefore: May  4 13:54:00 2021 GMT; NotAfter: May  3 13:54:00 2026 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIE2DCCBDmgAwIBAgIUM0bmd3mx85GHAukgsJDnYcYLzP0wCgYIKoZIzj0EAwQw
dzELMAkGA1UEBhMCVVMxFjAUBgNVBAcTDVNhbiBGcmFuY2lzY28xIjAgBgNVBAoT
GVdpa2ltZWRpYSBGb3VuZGF0aW9uLCBJbmMxGDAWBgNVBAsTD1NSRSBGb3VuZGF0
aW9uczESMBAGA1UEAxMJZGlzY292ZXJ5MB4XDTIzMTAzMTEyMjUwMFoXDTIzMTEy
ODEyMjUwMFowKTEnMCUGA1UEAxMeYW4tdGVzdC1jbGllbnQxMDAyLmVxaWFkLndt
bmV0MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA2HINLINAp6yH7zc9
S4YJsxHt+lMfXNTUa5gSjK58QVjKfZN4KYZ8xIEIViyqaG/OuJwkATY45t+Bru26
vsKXSxLfAmts2me3XnI97QONAjuqgDoO3Hn/fvmHcCAcIC9O+htDtLotx8ns/2ws
WIMrUyAGTlbq5IpHZ0v2n0ifIT2Eq7Dxl/5gq24snUjyRbyyPNwNTNiguy5+Grt/
cwo5iNv6j8QAoOu7jamQf/rrYVsWL8rKBO19WC3fNFMKQaw2jewK77C4aurDnCsq
14APTJlqbmPDeHFvAxorra7wU3tcoUYCHeEjQBsvEbLPIZ4Xw1DrXUNwt6M1EMux
8Omb6sngdNm42hzNcelfWt6k9L/h93UXpfJtec6r1ooAaM4yVgUY9CdGtfpseX5w
hbSVfG6KQnAnxaatbEwpUC1NgnpjFT6iWC/gjxTzyx+wpyNpdAZBTFHefB/0dllR
T8EcCRpYTvSHK6iahoE0ZDXdAvkqlN26rKRImSPKdFcT9GKRvD5cQAQ+rIfNpwUR
xUsmEmVhCcZINoRURYyFmbIAOL0vDB2tZ0nM+wxqGJIxW6ZdZhBBy8Zcp1Hkxgp/
ACqTLbr8fmbMtE5oZzt0+luMGqoLnz1EhmqBRbF9r0cxwHx+MEl25SNJnKiY4WqU
bJ1Yra44iKosyKsqS6UKe7gUxk0CAwEAAaOCASQwggEgMA4GA1UdDwEB/wQEAwIF
oDATBgNVHSUEDDAKBggrBgEFBQcDATAMBgNVHRMBAf8EAjAAMB0GA1UdDgQWBBSW
8q5NcrfEvtrHKVGfzulqHb4lxjAfBgNVHSMEGDAWgBSJ76Rw5IjvCdxrdGkh4PHU
L7PCBjBFBggrBgEFBQcBAQQ5MDcwNQYIKwYBBQUHMAGGKWh0dHA6Ly9wa2kuZGlz
Y292ZXJ5LndtbmV0L29jc3AvZGlzY292ZXJ5MCkGA1UdEQQiMCCCHmFuLXRlc3Qt
Y2xpZW50MTAwMi5lcWlhZC53bW5ldDA5BgNVHR8EMjAwMC6gLKAqhihodHRwOi8v
cGtpLmRpc2NvdmVyeS53bW5ldC9jcmwvZGlzY292ZXJ5MAoGCCqGSM49BAMEA4GM
ADCBiAJCAI+5IJ3w9JY95d0DldpOIhZpCzcIvkxqHkbKCHx4S2wwh73qTEpjKqrE
hL//Z00p0d1os2/bJC6Mu4Ay+slUUO67AkIBrvvuqRmamLwOLkYg5w6igLeMgTrt
/YSVm7hENJWy4sGosuSIhkpxCa4gRpVXPfBvtrZkHKz15kGtB8CaLqJu1CI=
-----END CERTIFICATE-----
subject=CN = an-test-client1002.eqiad.wmnet
issuer=C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
---
Acceptable client certificate CA names
C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
CN = an-test-client1002.eqiad.wmnet
Client Certificate Types: RSA sign, ECDSA sign
Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512:RSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 3081 bytes and written 1380 bytes
Verification error: unable to get local issuer certificate
---
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 4096 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: 96EB00B26422963BD6E8A4AA392550A4200E192BD032B17D54A5F55E70EE7B97
    Session-ID-ctx: 
    Master-Key: F3350FBAA23566EF719A98F15DB0953CD149133C5D01303BE42B11DF7562E49700884E18916BE878846CC6E4DFA5654B
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1698918253
    Timeout   : 7200 (sec)
    Verify return code: 20 (unable to get local issuer certificate)
    Extended master secret: yes
---

I wonder if we'd need to add the intermediate CA certificate to the Java truststore.

I also wonder if we couldn't just regenerate these certificates with a systemd timer running skein config gencerts --force every month.
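As a sketch of that idea, a timer-driven script could regenerate the pair only when the current certificate is close to expiry, rather than unconditionally. The helper names and the 30-day threshold below are assumptions for illustration; only the `skein config gencerts --force` command comes from this thread.

```python
import ssl
import subprocess

SECONDS_PER_DAY = 86400


def days_remaining(not_after: str, now: float) -> float:
    """Days between `now` (Unix time) and a notAfter date such as
    'Oct 30 10:25:55 2024 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def should_regenerate(not_after: str, now: float, min_days: float = 30) -> bool:
    """True when fewer than `min_days` of validity are left."""
    return days_remaining(not_after, now) < min_days


def regenerate() -> None:
    """Overwrite the existing key/certificate pair in the Skein config directory."""
    subprocess.run(["skein", "config", "gencerts", "--force"], check=True)
```

The concern raised earlier in the thread still applies: replacing the pair while jobs are running may prevent them from opening connections, so the timing of `regenerate()` would need care.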

I've tried to import the intermediate CA into a JKS truststore and to pass options pointing the Skein driver at this truststore, to no avail.

brouberol@an-test-client1002:~$ sudo keytool -import -trustcacerts -alias wmf-discovery -keystore skein.key -file /etc/cfssl/ssl/discovery__an-test-client1002_eqiad_wmnet/discovery__an-test-client1002_eqiad_wmnet.chain.pem
brouberol@an-test-client1002:~$ sudo chown analytics:analytics skein.key 
brouberol@an-test-client1002:~$ sudo mv skein.key /srv/airflow-analytics_test/.skein/skein.key
brouberol@an-test-client1002:~$ keytool -list -keystore /srv/airflow-analytics_test/.skein/skein.key
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Enter keystore password:  
Keystore type: jks
Keystore provider: SUN

Your keystore contains 1 entry

wmf-discovery, Nov 2, 2023, trustedCertEntry, 
Certificate fingerprint (SHA-256): CA:0F:21:6D:60:45:F0:BC:A0:24:AD:CB:D1:41:07:11:CD:D2:EB:A6:08:18:9D:B1:8C:FA:FB:22:6B:99:39:A1
(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test/.skein$ skein driver start --java-option='-Djavax.net.ssl.trustStore=/srv/airflow-analytics_test/.skein/skein.key' --java-option='-Djavax.net.ssl.trustStorePassword=plopplop' 
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
127.0.0.1:32961
brouberol@an-test-client1002:~$ sudo lsof -i tcp:32961
COMMAND     PID      USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java    1023433 analytics  377u  IPv6 38569315      0t0  TCP localhost:32961 (LISTEN)
brouberol@an-test-client1002:~$ sudo cat /proc/1023433/cmdline | tr '\0' ' '
java -Dskein.log.level=INFO -Djavax.net.ssl.trustStore=/srv/airflow-analytics_test/.skein/skein.key -Djavax.net.ssl.trustStorePassword=plopplop com.anaconda.skein.Driver --jar /usr/lib/airflow/lib/python3.10/site-packages/skein/java/skein.jar --daemon
(airflow) analytics@an-test-client1002:/srv/airflow-analytics_test/.skein$ openssl s_client -connect 127.0.0.1:32961  -cert ./skein.crt -key ./skein.pem 
CONNECTED(00000003)
Can't use SSL_get_servername
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = an-test-client1002.eqiad.wmnet
verify return:1
---
Certificate chain
 0 s:CN = an-test-client1002.eqiad.wmnet
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: ecdsa-with-SHA512
   v:NotBefore: Oct 31 12:25:00 2023 GMT; NotAfter: Nov 28 12:25:00 2023 GMT
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
   a:PKEY: id-ecPublicKey, 521 (bit); sigalg: ecdsa-with-SHA512
   v:NotBefore: May  4 13:54:00 2021 GMT; NotAfter: May  3 13:54:00 2026 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIE2DCCBDmgAwIBAgIUM0bmd3mx85GHAukgsJDnYcYLzP0wCgYIKoZIzj0EAwQw
dzELMAkGA1UEBhMCVVMxFjAUBgNVBAcTDVNhbiBGcmFuY2lzY28xIjAgBgNVBAoT
GVdpa2ltZWRpYSBGb3VuZGF0aW9uLCBJbmMxGDAWBgNVBAsTD1NSRSBGb3VuZGF0
aW9uczESMBAGA1UEAxMJZGlzY292ZXJ5MB4XDTIzMTAzMTEyMjUwMFoXDTIzMTEy
ODEyMjUwMFowKTEnMCUGA1UEAxMeYW4tdGVzdC1jbGllbnQxMDAyLmVxaWFkLndt
bmV0MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA2HINLINAp6yH7zc9
S4YJsxHt+lMfXNTUa5gSjK58QVjKfZN4KYZ8xIEIViyqaG/OuJwkATY45t+Bru26
vsKXSxLfAmts2me3XnI97QONAjuqgDoO3Hn/fvmHcCAcIC9O+htDtLotx8ns/2ws
WIMrUyAGTlbq5IpHZ0v2n0ifIT2Eq7Dxl/5gq24snUjyRbyyPNwNTNiguy5+Grt/
cwo5iNv6j8QAoOu7jamQf/rrYVsWL8rKBO19WC3fNFMKQaw2jewK77C4aurDnCsq
14APTJlqbmPDeHFvAxorra7wU3tcoUYCHeEjQBsvEbLPIZ4Xw1DrXUNwt6M1EMux
8Omb6sngdNm42hzNcelfWt6k9L/h93UXpfJtec6r1ooAaM4yVgUY9CdGtfpseX5w
hbSVfG6KQnAnxaatbEwpUC1NgnpjFT6iWC/gjxTzyx+wpyNpdAZBTFHefB/0dllR
T8EcCRpYTvSHK6iahoE0ZDXdAvkqlN26rKRImSPKdFcT9GKRvD5cQAQ+rIfNpwUR
xUsmEmVhCcZINoRURYyFmbIAOL0vDB2tZ0nM+wxqGJIxW6ZdZhBBy8Zcp1Hkxgp/
ACqTLbr8fmbMtE5oZzt0+luMGqoLnz1EhmqBRbF9r0cxwHx+MEl25SNJnKiY4WqU
bJ1Yra44iKosyKsqS6UKe7gUxk0CAwEAAaOCASQwggEgMA4GA1UdDwEB/wQEAwIF
oDATBgNVHSUEDDAKBggrBgEFBQcDATAMBgNVHRMBAf8EAjAAMB0GA1UdDgQWBBSW
8q5NcrfEvtrHKVGfzulqHb4lxjAfBgNVHSMEGDAWgBSJ76Rw5IjvCdxrdGkh4PHU
L7PCBjBFBggrBgEFBQcBAQQ5MDcwNQYIKwYBBQUHMAGGKWh0dHA6Ly9wa2kuZGlz
Y292ZXJ5LndtbmV0L29jc3AvZGlzY292ZXJ5MCkGA1UdEQQiMCCCHmFuLXRlc3Qt
Y2xpZW50MTAwMi5lcWlhZC53bW5ldDA5BgNVHR8EMjAwMC6gLKAqhihodHRwOi8v
cGtpLmRpc2NvdmVyeS53bW5ldC9jcmwvZGlzY292ZXJ5MAoGCCqGSM49BAMEA4GM
ADCBiAJCAI+5IJ3w9JY95d0DldpOIhZpCzcIvkxqHkbKCHx4S2wwh73qTEpjKqrE
hL//Z00p0d1os2/bJC6Mu4Ay+slUUO67AkIBrvvuqRmamLwOLkYg5w6igLeMgTrt
/YSVm7hENJWy4sGosuSIhkpxCa4gRpVXPfBvtrZkHKz15kGtB8CaLqJu1CI=
-----END CERTIFICATE-----
subject=CN = an-test-client1002.eqiad.wmnet
issuer=C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
---
Acceptable client certificate CA names
C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = discovery
CN = an-test-client1002.eqiad.wmnet
Client Certificate Types: RSA sign, ECDSA sign
Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512:RSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA256:RSA-PSS+SHA256:RSA+SHA256:ECDSA+SHA384:RSA-PSS+SHA384:RSA+SHA384:RSA-PSS+SHA512:RSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 3125 bytes and written 2174 bytes
Verification error: unable to get local issuer certificate
---
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 4096 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: 0DD45B5F315C2CC5999F6086F76BA8065F143B4B816D47FBABDAB65E27FBA301
    Session-ID-ctx: 
    Master-Key: 2088C2E9E6524C01DD96B0901FEAE6BD89CEE9C4BB1244D5D9CF46CD30AD7C048582048730855F5305C1870DC8B31499
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1698922580
    Timeout   : 7200 (sec)
    Verify return code: 20 (unable to get local issuer certificate)
    Extended master secret: yes
---

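No dice: the TLS handshake completes, but verification still fails with error 20 (unable to get local issuer certificate).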

Change 971196 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Renew skein certificate every month via systemd timers

https://gerrit.wikimedia.org/r/971196

Gehel triaged this task as Medium priority. Nov 3 2023, 10:27 AM

Change 970408 abandoned by Brouberol:

[operations/puppet@production] Hide skein private key diff in puppet logs

Reason:

I'm closing this PR as part of an overall simplification of how we renew skein certificates: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971196

https://gerrit.wikimedia.org/r/970408

Change 971196 merged by Brouberol:

[operations/puppet@production] Renew skein certificate every month via systemd timers

https://gerrit.wikimedia.org/r/971196

The simpler avenue of using systemd timers seems to work nicely:

brouberol@an-test-client1002:~$ sudo openssl x509 -in /srv/airflow-analytics_test/.skein/skein.crt -text | grep After
            Not After : Nov  1 14:05:20 2024 GMT
brouberol@an-test-client1002:~$ sudo systemctl cat regenerate-skein-certificate.timer 
# /lib/systemd/system/regenerate-skein-certificate.timer
[Unit]
Description=Periodic execution of regenerate-skein-certificate.service

[Timer]
Unit=regenerate-skein-certificate.service
# Accuracy sets the maximum time interval around the execution time we want to allow
AccuracySec=15sec
OnCalendar=monthly
RandomizedDelaySec=0

[Install]
WantedBy=multi-user.target
brouberol@an-test-client1002:~$ sudo systemctl cat regenerate-skein-certificate.service
# /lib/systemd/system/regenerate-skein-certificate.service
[Unit]
Description=refresh the x509 self-signed Skein certificate
Documentation=https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

[Service]
Type=oneshot
User=analytics
Environment="HOME=/srv/airflow-analytics_test"
ExecStart=/lib/airflow/bin/skein config gencerts --force
brouberol@an-test-client1002:~$ sudo systemctl start regenerate-skein-certificate.service
brouberol@an-test-client1002:~$ echo $?
0
brouberol@an-test-client1002:~$ sudo openssl x509 -in /srv/airflow-analytics_test/.skein/skein.crt -text | grep After
            Not After : Nov  5 13:08:34 2024 GMT

The expiry date went from Nov 1 14:05:20 2024 GMT to Nov 5 13:08:34 2024 GMT.
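
The before/after comparison above can also be scripted rather than eyeballed. The following is a minimal sketch, not part of the merged Puppet change: the helper names are hypothetical, and it assumes GNU `date` (for `date -d`) plus the `openssl` CLI for reading the certificate.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not taken from the Puppet change.
# Assumes GNU date (for `date -d`) and, for cert_not_after_epoch, the openssl CLI.

# Convert a line such as "notAfter=Nov  5 13:08:34 2024 GMT"
# (the format printed by `openssl x509 -enddate -noout`) to a Unix timestamp.
# A bare date string without the "notAfter=" prefix is accepted as well.
not_after_to_epoch() {
    date -u -d "${1#notAfter=}" +%s
}

# Read the expiry date of the certificate at path $1 as a Unix timestamp.
# The caller needs read access to the file (the transcript above uses sudo).
cert_not_after_epoch() {
    end=$(openssl x509 -enddate -noout -in "$1") || return 1
    not_after_to_epoch "$end"
}

# Succeed if the expiry date in $2 is strictly later than the one in $1.
expiry_moved_forward() {
    [ "$(not_after_to_epoch "$2")" -gt "$(not_after_to_epoch "$1")" ]
}
```

With the two dates from the transcript, `expiry_moved_forward 'Nov  1 14:05:20 2024 GMT' 'Nov  5 13:08:34 2024 GMT'` succeeds, which matches the manual check.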

Change 971947 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Enable monthly skein certificate renewal on airflow launchers

https://gerrit.wikimedia.org/r/971947

Change 971947 merged by Brouberol:

[operations/puppet@production] Enable monthly skein certificate renewal on airflow launchers

https://gerrit.wikimedia.org/r/971947

I've enabled monthly renewal of Skein certificates. As a second line of safety, we now also have alerting based on new Prometheus metrics that reflect the certificate expiration date. We can close.
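
For illustration, the expiry-based alerting idea could look roughly like the sketch below. This is not the actual implementation from the Puppet change: the metric name `skein_certificate_expiry_timestamp_seconds`, the `path` label and the helper names are all hypothetical, and the threshold is passed in by the caller.

```shell
#!/bin/sh
# Sketch only: the metric name, label and helper names are hypothetical.

# Whole days between now ($2) and expiry ($1), both given as Unix timestamps.
days_until_expiry() {
    echo $(( ($1 - $2) / 86400 ))
}

# Print one sample in the Prometheus text exposition format:
# certificate path in $1, expiry as a Unix timestamp in $2.
expiry_metric_line() {
    printf 'skein_certificate_expiry_timestamp_seconds{path="%s"} %s\n' "$1" "$2"
}

# Succeed when the certificate expires in fewer than $3 days.
expires_soon() {
    [ "$(days_until_expiry "$1" "$2")" -lt "$3" ]
}
```

Since the generated certificates are valid for about a year and are now renewed monthly, a check such as `expires_soon "$expiry" "$(date +%s)" 30` should only ever succeed if several consecutive renewals have failed.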

Change 972397 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Fix: make sure to enable skein certificate renewal on airflow launchers

https://gerrit.wikimedia.org/r/972397

Change 972397 merged by Brouberol:

[operations/puppet@production] Fix: make sure to enable skein certificate renewal on airflow launchers

https://gerrit.wikimedia.org/r/972397