
Enable encryption and authentication for TLS-based Hadoop services
Closed, ResolvedPublic21 Estimated Story Points

Description

Part of the documentation related to securing Hadoop is concerned with enabling SSL/TLS encryption and authentication for TLS-based services, such as:

  • UI HTTP web interfaces
  • Mapreduce/Spark shuffle services
  • Namenode/Journalnode protocol for the HDFS edit log

The main idea is to deploy a Java keystore and a Java truststore on each node that will need to use a TLS certificate (more precisely: on each node where a daemon related to a service will need to use TLS). High level plan:

  • review all protocols that can benefit from TLS authentication/encryption and establish which ones need to be migrated to TLS. A good indicator that encryption is needed is whether sniffing traffic on a given port exposes PII data.
  • create TLS certificates via cergen in the puppet private repo, together with truststores and keystores
  • add support in puppet to deploy truststores/keystores and the related ssl-client.xml and ssl-server.xml Hadoop configuration files (a rough sketch of what ends up on each node follows this list).
  • roll out the changes to the Hadoop testing cluster and then to the Analytics one.
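
As an illustration, here is a minimal sketch of the keystore/truststore pair each node would carry. Hostnames, paths and passwords are placeholders, and cergen is meant to generate these artifacts directly, so the keytool commands only illustrate what ends up on disk:

# Keystore: holds the host's private key + certificate (here imported from a PKCS12 bundle).
keytool -importkeystore \
    -srckeystore analytics1028.eqiad.wmnet.p12 -srcstoretype PKCS12 -srcstorepass changeit \
    -destkeystore /etc/hadoop/conf/keystore.jks -deststorepass changeit

# Truststore: only needs the CA certificate, so every node can verify its peers.
keytool -importcert -noprompt -alias hadoop_ca \
    -file hadoop_ca.crt \
    -keystore /etc/hadoop/conf/truststore.jks -storepass changeit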

Interesting links:

https://risdenk.github.io/2018/11/15/apache-hadoop-tls-ssl-notes.html

Details

Repo | Branch | Lines +/-
operations/puppet | production | +3 -0
operations/puppet | production | +19 -6
operations/puppet | production | +0 -4
operations/puppet | production | +0 -6
operations/puppet | production | +0 -5
operations/puppet/cdh | master | +4 -4
operations/puppet/cdh | master | +21 -4
operations/puppet | production | +31 -50
operations/puppet | production | +4 -24
operations/puppet | production | +16 -1
operations/puppet | production | +11 -1
operations/puppet/cdh | master | +10 -10
operations/puppet/cdh | master | +32 -7
operations/puppet | production | +3 -3
operations/puppet | production | +18 -0
operations/puppet | production | +6 -1
operations/puppet | production | +2 -0
operations/puppet | production | +3 -0
operations/puppet | production | +4 -0
operations/puppet | production | +6 -0
operations/puppet | production | +26 -16
operations/puppet | production | +1 -1
operations/puppet/cdh | master | +5 -5
operations/puppet | production | +13 -0
operations/puppet | production | +1 -1
operations/puppet | production | +102 -10
operations/puppet | production | +18 -3
operations/puppet/cdh | master | +4 -16

Event Timeline

elukey triaged this task as High priority. Mar 1 2019, 3:08 PM

Change 493693 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: allow the configuration of ssl-(server|client).xml configs

https://gerrit.wikimedia.org/r/493693

A first attempt at a sane/configurable TLS certificate deployment is in https://gerrit.wikimedia.org/r/493693. The following is essential to keep in mind:

Ensure that common name (CN) matches exactly with the fully qualified domain name (FQDN) of the server. The client compares the CN with the DNS domain name to ensure that it is indeed connecting to the desired server, not the malicious one.

https://it.hortonworks.com/blog/deploying-https-hdfs/
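
For example, once a daemon serves TLS on a port, the subject CN can be checked against the host's FQDN with something like the following (hostname and port are placeholders, not specific Hadoop defaults):

# Print the subject of the served certificate and compare its CN with the host's FQDN.
echo | openssl s_client -connect analytics1028.eqiad.wmnet:50470 \
    -servername analytics1028.eqiad.wmnet 2>/dev/null | openssl x509 -noout -subject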

The idea is to have something like the following in the private repo:

modules/secret/etc../certificates/hadoop_analytics-test-hadoop_eqiad/analytics1028.eqiad.wmnet/... (where the last ... is a placeholder for truststore, keystore, etc..)
...
modules/secret/etc../certificates/hadoop_analytics-test-hadoop_eqiad/analytics1041.eqiad.wmnet/...

So essentially one TLS certificate, generated via cergen, for each master/worker Hadoop node in the testing cluster. Then something like the following in hiera:

  • private repo:

hadoop_secrets_clusters:
  analytics-test-hadoop:
    ssl_keystore_keypassword: batman
    ssl_keystore_password: batman2
    ssl_truststore_password: batman3

  • public repo:

hadoop_clusters:
  analytics-test-hadoop:
    ensure_ssl_config: true

The idea is to deploy truststores/keystores using the same passwords, and then have a global way to share the config rather than repeating it for each role (we could also think about per-role passwords, I wouldn't mind, it's only a bit more cumbersome).

The above would only take care of deploying the truststores/keystores and the ssl-(client|server).xml configs for all the Hadoop hosts of the testing cluster; then we'd need to turn the TLS settings on via the regular Yarn/HDFS hiera config.

@Ottomata Let me know your thoughts, this is only a proposal, completely open to change/discuss etc.. :)

Sounds great! Only comment I would make is to leave out the _eqiad bit in the private repo cergen/certificates dir hierarchy. It should probably just be hadoop_${cluster_name}. If we do have an eqiad/codfw Hadoop cluster in the future, we'd likely include the DC name in the $cluster_name, e.g. 'hdfs://analytics-eqiad' or 'hdfs://ml-codfw'.

+1! Thanks! Will amend the code review :)

Change 494739 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hadoop::ssl_config: remove redundant xml.erb files

https://gerrit.wikimedia.org/r/494739

Change 494739 merged by Elukey:
[operations/puppet/cdh@master] hadoop::ssl_config: remove redundant xml.erb files

https://gerrit.wikimedia.org/r/494739

The next step is to figure out how to deploy the Java truststore (where the TLS CA's certificate is stored) and the keystore (where the host's TLS public/private key pair is stored). An important note to keep in mind (it concerns TLS communication between Hadoop daemons/hosts):

Ensure that common name (CN) matches exactly with the fully qualified domain name (FQDN) of the server. The client compares the CN with the DNS domain name to ensure that it is indeed connecting to the desired server, not the malicious one.

There are two possible roads that I can see:

  • use the Puppet CA: since a certificate for every hostname is already generated for puppet, re-use it by creating the keystore/truststore on the fly (via an exec) when needed, namely when a Hadoop host is configured via puppet. This would avoid issuing extra TLS certificates for Hadoop and would be ideal for maintenance, since puppet would take care of everything with no extra steps when adding new nodes, etc. The big downside is of course re-using the host's puppet certificate/key: it would be guarded by a password when stored in the keystore, but that password needs to be added to the Hadoop config so that its daemons can read it. Even with good file permissions there is the chance that an exploit of a Hadoop daemon grants access to the host's puppet certificate. (A rough sketch of this on-the-fly approach follows after this list.)
  • use a self signed CA, generate one certificate for each hostname via cergen, and deploy them via puppet. This is more cumbersome maintenance-wise but it would completely separate concerns from puppet.
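
To make the trade-off concrete, the first option would boil down to something like the following, exec'd by puppet on each host. Paths and passwords are only illustrative, and the exact puppet ssl directory depends on the agent setup:

fqdn=$(hostname -f)
# Bundle the puppet agent's existing key + certificate into a PKCS12 file...
openssl pkcs12 -export \
    -in /var/lib/puppet/ssl/certs/${fqdn}.pem \
    -inkey /var/lib/puppet/ssl/private_keys/${fqdn}.pem \
    -name "${fqdn}" -passout pass:changeit -out /tmp/${fqdn}.p12

# ...and import it into a Java keystore readable by the Hadoop daemons.
keytool -importkeystore \
    -srckeystore /tmp/${fqdn}.p12 -srcstoretype PKCS12 -srcstorepass changeit \
    -destkeystore /etc/hadoop/conf/keystore.jks -deststorepass changeit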

@MoritzMuehlenhoff @chasemp thoughts?

use a self signed CA

Another option would be to use the Puppet CA to sign cergen created certificates. The truststore doesn't need the private key of the CA, so this shouldn't have any security problems.

I might be confused, but what I thought was:

  • the truststore only needs the public key of the CA to trust
  • the keystore needs the private key + cert of the host

My main doubt about using the puppet CA to sign cergen certs is that I'd need, for example, to issue certificates for name.eqiad.wmnet hostnames that are already taken by puppet. Is there another way?

Ah I see! Is that a problem? Can a CA not create multiple certificates with the same CN?

I have to admit my ignorance, but I'd bet on no, since you'd have two certs that could identify themselves with the same CN, and that would be bad..

use a self signed CA, generate one certificate for each hostname via cergen, and deploy them via puppet. This is more cumbersome maintenance-wise but it would completely separate concerns from puppet.

Luca and I discussed this a bit on IRC and my preference is the approach above. It avoids a number of risks related to Hadoop, and for Kerberos we'll most definitely need some manual steps on the Kerberos side anyway when adding a new Hadoop node to the cluster.

Change 496127 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::kafka::broker: move java.security settings to its own class

https://gerrit.wikimedia.org/r/496127

Change 496127 merged by Elukey:
[operations/puppet@production] profile::kafka::broker: move java.security settings to its own class

https://gerrit.wikimedia.org/r/496127

Change 493693 merged by Elukey:
[operations/puppet@production] hadoop: allow the configuration of ssl-(server|client).xml configs

https://gerrit.wikimedia.org/r/493693

Change 496401 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add TLS configuration to the Hadoop testing cluster

https://gerrit.wikimedia.org/r/496401

Change 496402 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: fix a missing '.'

https://gerrit.wikimedia.org/r/496402

Change 496402 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: fix a missing '.'

https://gerrit.wikimedia.org/r/496402

Change 496401 merged by Elukey:
[operations/puppet@production] Add TLS configuration to the Hadoop testing cluster

https://gerrit.wikimedia.org/r/496401

Change 496426 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hadoop::ssl_config: separate ssl client/server configs

https://gerrit.wikimedia.org/r/496426

Change 496426 merged by Elukey:
[operations/puppet/cdh@master] hadoop::ssl_config: separate ssl client/server configs

https://gerrit.wikimedia.org/r/496426

Change 496428 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update cdh module to its latest version

https://gerrit.wikimedia.org/r/496428

Change 496428 merged by Elukey:
[operations/puppet@production] Update cdh module to its latest version

https://gerrit.wikimedia.org/r/496428

Change 496430 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::firewall::master: break down ssl config

https://gerrit.wikimedia.org/r/496430

Change 496430 merged by Elukey:
[operations/puppet@production] profile::hadoop::firewall::master: break down ssl config

https://gerrit.wikimedia.org/r/496430

Change 496486 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: add ssl parameter to ssl-config.xml's set

https://gerrit.wikimedia.org/r/496486

Change 496725 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add Yarn TLS settings to the Hadoop testing cluster

https://gerrit.wikimedia.org/r/496725

Change 496725 merged by Elukey:
[operations/puppet@production] Add Yarn TLS settings to the Hadoop testing cluster

https://gerrit.wikimedia.org/r/496725

Change 496727 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::hadoop::standby: set Yarn TLS settings

https://gerrit.wikimedia.org/r/496727

Change 496727 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::hadoop::standby: set Yarn TLS settings

https://gerrit.wikimedia.org/r/496727

Change 496733 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add mapreduce.ssl.enabled setting to the Hadoop test cluster

https://gerrit.wikimedia.org/r/496733

Change 496733 merged by Elukey:
[operations/puppet@production] Add mapreduce.ssl.enabled setting to the Hadoop test cluster

https://gerrit.wikimedia.org/r/496733

Change 496743 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable TLS configuration for mapreduce on Hadoop Test masters

https://gerrit.wikimedia.org/r/496743

Change 496743 merged by Elukey:
[operations/puppet@production] Enable TLS configuration for mapreduce on Hadoop Test masters

https://gerrit.wikimedia.org/r/496743

Change 496747 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add a core-site.xml global property to the Hadoop test cluster

https://gerrit.wikimedia.org/r/496747

Change 496747 merged by Elukey:
[operations/puppet@production] Add a core-site.xml global property to the Hadoop test cluster

https://gerrit.wikimedia.org/r/496747

Change 496775 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add core-site.xml TLS properties to Hadoop Test Analytics

https://gerrit.wikimedia.org/r/496775

Change 496775 merged by Elukey:
[operations/puppet@production] Add core-site.xml TLS properties to Hadoop Test Analytics

https://gerrit.wikimedia.org/r/496775

Change 496784 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix TLS parameter in Hadoop Test Analytics config

https://gerrit.wikimedia.org/r/496784

Change 496784 merged by Elukey:
[operations/puppet@production] Fix TLS parameter in Hadoop Test Analytics config

https://gerrit.wikimedia.org/r/496784

Change 497264 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hue: add class parameters to configure Yarn/HDFS/MapRed TLS ports

https://gerrit.wikimedia.org/r/497264

Change 497264 merged by Elukey:
[operations/puppet/cdh@master] hue: add class parameters to configure Yarn/HDFS/MapRed TLS ports

https://gerrit.wikimedia.org/r/497264

Change 497267 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hue: configure Yarn|HDFS|MapRed SSL ports for analytics1039

https://gerrit.wikimedia.org/r/497267

Change 497270 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hue: change name of the hdfs/yarn/mapred tls config

https://gerrit.wikimedia.org/r/497270

Change 497270 merged by Elukey:
[operations/puppet/cdh@master] hue: change name of the hdfs/yarn/mapred tls config

https://gerrit.wikimedia.org/r/497270

Change 497267 merged by Elukey:
[operations/puppet@production] profile::hue: configure Yarn|HDFS|MapRed SSL ports for analytics1039

https://gerrit.wikimedia.org/r/497267

Change 498329 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add Hadoop TLS config to analytics1037

https://gerrit.wikimedia.org/r/498329

Change 498329 merged by Elukey:
[operations/puppet@production] Add Hadoop TLS config to analytics1037

https://gerrit.wikimedia.org/r/498329

Change 498333 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Rely on Hadoop defaults for the TLS config of the Analytics Test cluster

https://gerrit.wikimedia.org/r/498333

Change 498333 merged by Elukey:
[operations/puppet@production] Rely on Hadoop defaults for the TLS config of the Analytics Test cluster

https://gerrit.wikimedia.org/r/498333

Change 498365 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: explicitly set if TLS keys are deployed or not

https://gerrit.wikimedia.org/r/498365

Change 498365 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: explicitly set if TLS keys are deployed or not

https://gerrit.wikimedia.org/r/498365

Change 498375 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hue: add ssl_ca_certs config tunable and https/http variations

https://gerrit.wikimedia.org/r/498375

Change 498375 merged by Elukey:
[operations/puppet/cdh@master] hue: add ssl_ca_certs config tunable and https/http variations

https://gerrit.wikimedia.org/r/498375

Change 499418 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hadoop::ssl_config: fix permissions for xml files following CDH guidelines

https://gerrit.wikimedia.org/r/499418

Change 499418 merged by Elukey:
[operations/puppet/cdh@master] hadoop::ssl_config: fix permissions for xml files following CDH guidelines

https://gerrit.wikimedia.org/r/499418

elukey lowered the priority of this task from High to Medium. Apr 1 2019, 1:19 PM

Summary of the work done up to now:

  • Puppet groundwork to deploy truststores/keystores on Hadoop masters/workers and the related ssl-server|client.xml configs is done.
  • The above was used to encrypt shuffle HTTP traffic (to allow map-reduce's reducers to pull data from mappers via HTTPS).
  • Enabling TLS for the HTTP UIs (Yarn, HDFS, etc..) seems possible but is a bit of a hassle, since using self signed certificates means that the CA's certificate needs to be deployed on all clients to allow proper verification. Hue is a notable example: it uses Yarn's HTTP port to pull data related to jobs and their logs, but it fails if the self signed CA cert is not deployed on the same host.

The rest of the Hadoop traffic uses RPC, which is encrypted/authenticated by other means and requires Kerberos. I think that encrypting the shuffler's HTTP traffic is desirable (PII data), but I cannot say the same for the UIs' traffic (it could be a plus to do later on down the road).
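
As a quick sanity check for the encrypted shuffle, something like the following can be run against a worker (the hostname is a placeholder, and 13562 is Hadoop's stock mapreduce.shuffle.port; adjust if the cluster overrides it):

# Plain HTTP should fail once the shuffle handler only speaks TLS...
curl -sv http://analytics1028.eqiad.wmnet:13562/ -o /dev/null || true
# ...while an HTTPS handshake should complete; -k skips verification since the CA is self signed.
curl -svk https://analytics1028.eqiad.wmnet:13562/ -o /dev/null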

Change 504275 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove HTTPS config from the Hadoop testing cluster

https://gerrit.wikimedia.org/r/504275

Change 504275 merged by Elukey:
[operations/puppet@production] Remove HTTPS config from the Hadoop testing cluster

https://gerrit.wikimedia.org/r/504275

Change 504277 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::an_test_cluster::hadoop::master|standby: unset ferm TLS config

https://gerrit.wikimedia.org/r/504277

Change 504277 merged by Elukey:
[operations/puppet@production] role::an_test_cluster::hadoop::master|standby: unset ferm TLS config

https://gerrit.wikimedia.org/r/504277

Change 504278 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::hadoop::ui: remove yarn/hdfs/mr TLS conf

https://gerrit.wikimedia.org/r/504278

Change 504278 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::hadoop::ui: remove yarn/hdfs/mr TLS conf

https://gerrit.wikimedia.org/r/504278

Change 496486 abandoned by Elukey:
profile::hadoop::common: add ssl parameter to ssl-config.xml's set

https://gerrit.wikimedia.org/r/496486

Change 517983 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add dfs.http.policy: 'HTTPS_ONLY' to the Hadoop test cluster

https://gerrit.wikimedia.org/r/517983

Change 517983 merged by Elukey:
[operations/puppet@production] Add dfs.http.policy: 'HTTPS_ONLY' to the Hadoop test cluster

https://gerrit.wikimedia.org/r/517983

We have carefully selected which services need TLS and which ones can live without it. HDFS requires it when enabling Kerberos and encryption of the data transfer protocol, while it is desirable for the Yarn shuffler since PII data flows between Hadoop workers through it. The main remaining question is whether we need to take a step further and require TLS client authentication for the shuffler (so that a rogue client cannot issue GET requests and sniff PII data during the shuffle step).

https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html

Client Certificates
Using Client Certificates does not fully ensure that the client is a reducer task for the job. Currently, Client Certificates (their private key) keystore files must be readable by all users submitting jobs to the cluster. This means that a rogue job could read such keystore files and use the client certificates in them to establish a secure connection with a Shuffle server. However, unless the rogue job has a proper JobToken, it won’t be able to retrieve shuffle data from the Shuffle server. A job, using its own JobToken, can only retrieve shuffle data that belongs to itself.

Just tried to test it with a curl request to GET /mapOutput?job=job_1561367702623_49144&reduce=0&map=attempt_1561367702623_49144_m_000023_0 HTTP/1.1, but it returns "incompatible mapreduce version" and Yarn logs things like "can't find job token for etc.., unknown_ca, etc..". I'd say that in our use case it would be pretty difficult to bypass all the fences in place and gather data from the shuffler (basically running as a rogue reducer).
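
For the record, the probe looked roughly like the following (the worker hostname and port are placeholders; the job/attempt ids are the ones from the test above). Without the shuffle handler's expected headers and a valid JobToken the request is rejected, so even a client that completes the TLS handshake cannot pull shuffle data:

curl -vk "https://analytics1028.eqiad.wmnet:13562/mapOutput?job=job_1561367702623_49144&reduce=0&map=attempt_1561367702623_49144_m_000023_0"
# -> HTTP error ("incompatible mapreduce version"); the NodeManager logs the missing job token.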

elukey set the point value for this task to 21. Jul 5 2019, 2:29 PM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.