
Encrypt Airflow connections to AQS Cassandra
Closed, Resolved (Public)

Description

Hi folks!

When the AQS cluster is migrated to use PKI TLS certs (I'll post a message in this task) we should force Airflow to load data via TLS. After a chat with Ben, the following came up: default args for Cassandra don't mention any SSL/TLS setting, at least according to the Spark Cassandra connector's docs.

AQS Cassandra supports both encrypted and unencrypted clients, so we should force TLS in the Airflow configs wherever possible.

Event Timeline

Eevans triaged this task as Medium priority. Apr 15 2024, 11:48 PM

@JAllemandou Hi! I have a question for you when you have a moment :)

To make this work, we need to deploy a Java Truststore that contains the right CA certs to a place where the Spark Cassandra Connector can read and use it. With Puppet we can deploy it to any host (hadoop workers, airflow nodes, etc.), but I am wondering whether that is enough or whether we need to explicitly copy it to HDFS and force Spark to fetch it. What do you think?

I'm not sure if the spark-cassandra-connector can read a Java Truststore on HDFS! I'd go for an automated deployment of the truststore on every cluster host. For the moment it'll be enough, as our prod jobs are launched from the cluster (skein). It would probably also be good to have the truststore deployed on stat machines, to allow for manual runs. This should be enough for now, until we move launchers away from skein to use k8s. We'll revisit at that time (ping @BTullis :)

Change #1023454 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Deploy the Java Truststore with PKI Root CA on Hadoop and Stat nodes

https://gerrit.wikimedia.org/r/1023454

Filed a change for the stat nodes, the hadoop worker nodes already have the truststore!

The file is deployed on all nodes at /etc/ssl/localcerts/wmf-java-cacerts and the password is changeit (most Java apps need one, but having it in the clear does not allow any manipulation of the truststore's content; it is just a convenient/necessary setting to add).

Also, I can confirm that AQS Cassandra now runs with PKI TLS certs, so we can start encrypting connections with TLS anytime.
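To double-check a host after the Puppet run, the truststore can be listed with `keytool`. A minimal sketch, assuming `keytool` (shipped with the JDK) is on the PATH; the helper names `keytool_list_command` and `list_truststore` are made up for illustration and are not part of any existing tooling:

```python
import subprocess

# Path and password as stated in this task.
TRUSTSTORE_PATH = "/etc/ssl/localcerts/wmf-java-cacerts"
TRUSTSTORE_PASSWORD = "changeit"


def keytool_list_command(path=TRUSTSTORE_PATH, password=TRUSTSTORE_PASSWORD):
    """Build the keytool invocation that lists the entries of a truststore."""
    return ["keytool", "-list", "-keystore", path, "-storepass", password]


def list_truststore(path=TRUSTSTORE_PATH, password=TRUSTSTORE_PASSWORD):
    """Run keytool and return its output; raises CalledProcessError on failure."""
    result = subprocess.run(
        keytool_list_command(path, password),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `list_truststore()` on a stat or Hadoop worker node should list the CA entries in the truststore; a missing file or wrong password surfaces as an exception instead.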

Change #1023454 merged by Elukey:

[operations/puppet@production] Deploy the Java Truststore with PKI Root CA on Stat nodes

https://gerrit.wikimedia.org/r/1023454

Change #1026964 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the wmf-java-cacerts truststore to all remaining airflow hosts

https://gerrit.wikimedia.org/r/1026964

I believe that this is ready to go.
We will have to deploy https://gerrit.wikimedia.org/r/1026964 first, which will deploy the truststore to all of the remaining airflow instances.

I think that the only other instance that loads to Cassandra is the platform_eng instance on an-airflow1004, but I think we want to put the truststore on all instances anyway.

Once that is rolled out, I believe that we can then roll out the change to wmf_airflow_common/config/dag_default_args.py to enable TLS for all Cassandra DAGs.
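For reference, a rough sketch of what the Spark-side settings might look like. The three property names come from the Spark Cassandra Connector configuration reference, and the path and password are the ones given earlier in this task; the helper names `cassandra_tls_conf` and `merge_spark_conf` are assumptions for illustration, not the actual contents of dag_default_args.py:

```python
# Hypothetical helpers; not the actual code in wmf_airflow_common.
TRUSTSTORE_PATH = "/etc/ssl/localcerts/wmf-java-cacerts"
TRUSTSTORE_PASSWORD = "changeit"


def cassandra_tls_conf(path=TRUSTSTORE_PATH, password=TRUSTSTORE_PASSWORD):
    """Return Spark Cassandra Connector properties that turn on TLS."""
    return {
        "spark.cassandra.connection.ssl.enabled": "true",
        "spark.cassandra.connection.ssl.trustStore.path": path,
        "spark.cassandra.connection.ssl.trustStore.password": password,
    }


def merge_spark_conf(base_conf, extra_conf):
    """Return a new dict with extra_conf layered over base_conf."""
    merged = dict(base_conf)
    merged.update(extra_conf)
    return merged
```

The resulting dict could then be merged into whatever Spark `conf` mapping the Cassandra-loading jobs already receive; how dag_default_args.py actually wires this in is not shown in this task.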

Change #1026964 merged by Btullis:

[operations/puppet@production] Add the wmf-java-cacerts truststore to all remaining airflow hosts

https://gerrit.wikimedia.org/r/1026964

The first two DAG runs failed with an error like this:

Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 3 nodes, use getAllErrors() for more): Node(endPoint=aqs1012-a.eqiad.wmnet/10.64.32.128:9042, hostId=null, hashCode=5306ac2d): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|id: 0xf56a9051, L:/10.64.161.4:49460 - R:aqs1012-a.eqiad.wmnet/10.64.32.128:9042] Protocol initialization request, step 1 (OPTIONS): failed to send request (javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)], Node(endPoint=aqs1010-a.eqiad.wmnet/10.64.0.88:9042, hostId=null, hashCode=27a13cbc): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|id: 0x846635d9, L:/10.64.161.4:36012 - R:aqs1010-a.eqiad.wmnet/10.64.0.88:9042] Protocol initialization request, step 1 (OPTIONS): failed to send request (javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)], Node(endPoint=aqs1011-a.eqiad.wmnet/10.64.16.204:9042, hostId=null, hashCode=2f77f074):

We have paused the DAG and are investigating. There is a revert MR here, should that be necessary: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/684

Mentioned in SAL (#wikimedia-analytics) [2024-05-07T14:03:40Z] <btullis> deploying airflow analytics instance for T362181 to fix cassandra cipher list

Mentioned in SAL (#wikimedia-analytics) [2024-05-07T14:05:42Z] <btullis> unpaused cassandra_load_pageview_per_project_hourly for T362181
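The cipher list fix itself is not shown in this task. As a rough sketch of the idea: a `handshake_failure` alert usually means the client and server have no cipher suite (or protocol version) in common, and the Spark Cassandra Connector lets the client's list be overridden via `spark.cassandra.connection.ssl.enabledAlgorithms`, given as a comma-separated string. The suite names below are standard JSSE names picked purely for illustration; they are an assumption, not the values that were actually deployed, and `with_cipher_suites` is a hypothetical helper:

```python
# Illustrative only: the real cipher list used in the fix is not in this task.
EXAMPLE_CIPHER_SUITES = (
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
)


def with_cipher_suites(spark_conf, cipher_suites=EXAMPLE_CIPHER_SUITES):
    """Return a copy of spark_conf with the connector's cipher list overridden."""
    conf = dict(spark_conf)
    conf["spark.cassandra.connection.ssl.enabledAlgorithms"] = ",".join(cipher_suites)
    return conf
```

The chosen suites have to overlap with what the Cassandra nodes accept and with what the JVM running Spark supports, otherwise the handshake fails in the same way.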

This is now working as expected.


Thanks again @elukey for all your help to get this working.

Niceee thanks a lot for all the work!