
Enable encryption in Spark 2.4 by default
Closed, ResolvedPublic

Description

After enabling Kerberos and encryption for the Hadoop RPC protocol, we should also move Spark's configuration to full encryption between workers.

Should be something like:

profile::hadoop::spark2::extra_settings:
  spark.authenticate: true
  spark.network.crypto.enabled: true
  spark.network.crypto.keyLength: 128
  spark.network.crypto.keyFactoryAlgorithm: PBKDF2WithHmacSHA1
  spark.io.encryption.enabled: true
  spark.io.encryption.keySizeBits: 128
  spark.io.encryption.keygen.algorithm: HmacSHA1
  spark.network.crypto.saslFallback: false

yarn_site_extra_config:
  spark.authenticate: true

The Spark settings need to be present on all nodes using Spark to work properly.
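For background, spark.network.crypto.keyFactoryAlgorithm: PBKDF2WithHmacSHA1 together with spark.network.crypto.keyLength: 128 means the AES session key is derived from the application's shared secret via PBKDF2. A minimal Python sketch of that kind of derivation (the secret, salt, and iteration count here are made-up illustration values; Spark generates and exchanges the real secret internally):

```python
import hashlib
import os

# Hypothetical inputs for illustration only: Spark negotiates the
# shared secret itself at application startup.
shared_secret = b"app-shared-secret"
salt = os.urandom(16)

# PBKDF2 with HMAC-SHA1 deriving a 128-bit (16-byte) key, mirroring
# keyFactoryAlgorithm=PBKDF2WithHmacSHA1 and keyLength=128 above.
key = hashlib.pbkdf2_hmac("sha1", shared_secret, salt, 100000, dklen=16)
print(len(key) * 8)  # 128
```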

SASL encryption is considered deprecated after Spark 2.2, so the following should not be added:

spark.authenticate.enableSaslEncryption: true
spark.network.sasl.serverAlwaysEncrypt: true

Oozie workflows also need spark.authenticate: true, since we set Spark settings there (spark-defaults.conf is not read by Oozie).
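An abbreviated sketch of what that would look like in a workflow's spark action (the action name, class, and jar are hypothetical, and required elements like ok/error transitions are omitted):

```xml
<action name="spark-example">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>spark-example</name>
        <class>org.example.Job</class>
        <jar>example.jar</jar>
        <spark-opts>--conf spark.authenticate=true</spark-opts>
    </spark>
</action>
```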

Event Timeline

Change 558453 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add Spark encryption settings to the Hadoop test cluster

https://gerrit.wikimedia.org/r/558453

Change 558453 merged by Elukey:
[operations/puppet@production] Add Spark encryption settings to the Hadoop test cluster

https://gerrit.wikimedia.org/r/558453

Change 558563 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable Spark RPC encryption for the Yarn shuffler

https://gerrit.wikimedia.org/r/558563

Change 558563 merged by Elukey:
[operations/puppet@production] Enable Spark RPC encryption for the Yarn shuffler in Hadoop test

https://gerrit.wikimedia.org/r/558563

Change 558601 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add Spark encryption settings to the Hadoop test coordinator

https://gerrit.wikimedia.org/r/558601

Change 558601 merged by Elukey:
[operations/puppet@production] Add Spark encryption settings to the Hadoop test coordinator

https://gerrit.wikimedia.org/r/558601

Tested the following:

  • spark2-shell --master yarn with spark.sql("SELECT * FROM wmf.webrequest where year=2019 and month=12 and day=16 and hour=0 limit 10").show();
  • same as above but without --master yarn

I replicated the Spark tests originally devised for Kerberos and found 3 error cases:

  • spark2-submit in yarn mode for python script

Repro:

spark2-submit \
--master yarn \
/home/joal/test_spark_submit/spark-2.4.4-bin-hadoop2.6/examples/src/main/python/pi.py \
100

Error: the job starts correctly and then gets stuck repeating

19/12/18 20:34:27 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
19/12/18 20:34:27 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
19/12/18 20:34:28 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
19/12/18 20:34:28 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
...
  • pyspark2 in yarn mode (seems to be the same issue as the previous one; the error is the same)
  • oozie spark action in scala

Repro: Run dedicated oozie test job, which fails, then look at logs

sudo -u analytics oozie job --oozie $OOZIE_URL \
  -Dname_node=hdfs://analytics-test-hadoop \
  -Drefinery_directory=hdfs://analytics-test-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +%Y)* | tail -n 1 | awk '{print $NF}') \
  -Doozie_directory=hdfs://analytics-test-hadoop/user/joal/oozie \
  -Dqueue_name=production \
  -Doozie_launcher_queue_name=production \
  -Dstart_time=2019-08-26T00:00Z \
  -Dstop_time=2019-08-30T00:00Z \
  -Dspark_job_jar=hdfs://analytics-test-hadoop/user/joal/refinery-job-0.0.109-SNAPSHOT.jar \
  -Dperiod_days=3 \
  -Dsplit_by_os=false \
  -Doutput_directory=hdfs://analytics-test-hadoop/wmf/data/test_oozie_spark \
  -config /home/joal/refinery/oozie/mobile_apps/session_metrics/coordinator.properties \
  -run

Error:

2019-12-18 20:20:15,653 [dispatcher-event-loop-1] ERROR org.apache.spark.storage.BlockManager  - Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.lang.RuntimeException: java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)

I think maybe an Oozie restart could do the trick (to pick up the new Spark config). Let's try.

Other cases have worked for me (spark2-submit (scala) in local/yarn mode, spark2-shell in local/yarn mode).

Change 559418 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove Spark SASL encryption options from Hadoop test

https://gerrit.wikimedia.org/r/559418

Change 559418 merged by Elukey:
[operations/puppet@production] Remove Spark SASL encryption options from Hadoop test

https://gerrit.wikimedia.org/r/559418

Change 559488 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Disable SASL fallback for Yarn Spark Shuffle service in Hadoop test

https://gerrit.wikimedia.org/r/559488

Change 559488 merged by Elukey:
[operations/puppet@production] Disable SASL fallback for Yarn Spark Shuffle service in Hadoop test

https://gerrit.wikimedia.org/r/559488

It seems that in the Hadoop test cluster all jobs are now logging the following over and over before executing and completing:

WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!

Resources on the test cluster seem available, so it is not really clear why this happens. In the node manager's logs there are errors like:

2019-12-19 14:41:07,308 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 5477041805314078655
java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)

In theory those should not happen, since the custom RPC+AES protocol should be used. I also tried adding spark.network.crypto.saslFallback to the yarn-site config, but that didn't work.

Change 559517 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Sync Yarn and Spark2 encryption config in Hadoop Test

https://gerrit.wikimedia.org/r/559517

Change 559517 merged by Elukey:
[operations/puppet@production] Sync Yarn and Spark2 encryption config in Hadoop Test

https://gerrit.wikimedia.org/r/559517

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

I replicated the Spark tests originally devised for Kerberos and found 3 error cases:

  • spark2-submit in yarn mode for python script

Repro:

spark2-submit \
--master yarn \
/home/joal/test_spark_submit/spark-2.4.4-bin-hadoop2.6/examples/src/main/python/pi.py \
100

This now works fine in Hadoop test; I just ran multiple submits and never hit the previous error.

@EBernhardson hi! I am looping you in since you are our top spark user :D

We are testing encryption for Spark RPCs in Hadoop test, and excluding some heisenbugs we are in good shape. The procedure to enable Spark encryption is a bit invasive; for example, Oozie spark actions will need new settings, otherwise they will not be able to talk to the Yarn Spark shuffler.

Do you have a preference about how to test/roll out this for your jobs? I'd like to avoid impacting any production job that your team relies on.

@joal I found https://www.ericlin.me/2018/06/oozie-spark-action-not-loading-spark-configurations/ today, there is an option listed that seems good to test:

<property>
    <name>oozie.service.SparkConfigurationService.spark.configurations</name>
    <value>*=/etc/spark/conf</value>
</property>

The link https://oozie.apache.org/docs/4.1.0/oozie-default.xml doesn't list it (it appears only from 4.2), but https://archive.cloudera.com/cdh5/cdh/5/oozie/oozie-default.xml does:

Comma separated AUTHORITY=SPARK_CONF_DIR, where AUTHORITY is the HOST:PORT of the ResourceManager of a YARN cluster. The wildcard '*' configuration is used when there is no exact match for an authority. The SPARK_CONF_DIR contains the relevant spark-defaults.conf properties file. If the path is relative is looked within the Oozie configuration directory; though the path can be absolute. This is only used when the Spark master is set to either "yarn-client" or "yarn-cluster".

The default is *=spark-conf, so we should use something like *=/etc/spark2/conf?
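Presumably the oozie-site.xml entry would then be:

```xml
<property>
    <name>oozie.service.SparkConfigurationService.spark.configurations</name>
    <value>*=/etc/spark2/conf</value>
</property>
```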

Applied manually on analytics1030 in Hadoop test:

2020-01-09 16:08:58,351  INFO SparkConfigurationService:520 - SERVER[analytics1030.eqiad.wmnet] Loaded Spark Configuration: *=/etc/spark2/conf/spark-defaults.conf

Latest development on my end:

  • Oozie worked with Luca's patch above
  • spark-submit with python worked as well
  • pyspark2 still fails

More investigations tomorrow

Change 563414 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] oozie: add spark conf directory in oozie-site.xml

https://gerrit.wikimedia.org/r/563414

Change 563414 merged by Elukey:
[operations/puppet@production] oozie: add spark conf directory in oozie-site.xml

https://gerrit.wikimedia.org/r/563414

Just restarted Oozie; it picked up the new config, so in theory we won't need to change any workflows when enabling Spark encryption.

Change 563521 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove spark-specific options from Hadoop Test's Yarn config

https://gerrit.wikimedia.org/r/563521

Change 563521 merged by Elukey:
[operations/puppet@production] Remove spark-specific options from Hadoop Test's Yarn config

https://gerrit.wikimedia.org/r/563521

All the symptoms of the last issue to solve are highlighted in https://issues.apache.org/jira/browse/SPARK-19528

The application master in Yarn seems to be created and reaches the RUNNING state, but then there seems to be a problem with the Yarn Spark shufflers.

I checked https://github.com/apache/spark/blob/branch-2.4/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java and the only option mentioned is spark.authenticate, as most of the tutorials show (the other ones should be inherited from the ones set in the driver, IIUC).

SASL is mentioned, though: import org.apache.spark.network.sasl.ShuffleSecretManager

So I tried to disable Spark's AES crypto and set only SASL in the driver's defaults, but the session failed with horrible errors.

Change 563651 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add spark encryption option to Hadoop test's yarn configuration

https://gerrit.wikimedia.org/r/563651

Change 563651 merged by Elukey:
[operations/puppet@production] Add spark encryption option to Hadoop test's yarn configuration

https://gerrit.wikimedia.org/r/563651

Really interesting: after leaving only 'spark.authenticate=true' in Yarn's config, Spark Refine started failing with:

Failure(org.wikimedia.analytics.refinery.job.refine.RefineTargetException: Failed refinement of hdfs://analytics-test-hadoop/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2020/01/10/17 -> `event`.`NavigationTiming` (year=2020,month=1,day=10,hour=17). Original exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, analytics1036.eqiad.wmnet, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
    at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
    at org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:90)

spark.network.crypto.enabled=true (added to Yarn's config) solved the problem, so the shuffler can effectively handle AES encryption as far as I can see (since we block SASL in the driver's settings).
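So on the Yarn side both options end up being needed; in the Hiera terms used in the description above, that would be something like:

```yaml
yarn_site_extra_config:
  spark.authenticate: true
  spark.network.crypto.enabled: true
```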

Today I started with the following, removing one option at a time until the problem came up:

spark2-submit \
--conf spark.io.encryption.enabled=false \
--conf spark.network.crypto.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--master yarn \
/home/joal/test_spark_submit/spark-2.4.4-bin-hadoop2.6/examples/src/main/python/pi.py 100

I narrowed down the issue to spark.io.encryption.enabled, the option enabling encryption of temporary shuffle data spilled to disk.

What I am wondering now is whether our setup (Spark2 packaging, jars, etc.) includes what is needed to make AES encryption work.

A starting point could be https://dzone.com/articles/apache-hadoop-code-quality-production-vs-test and https://issues.apache.org/jira/browse/SPARK-5682 to get some info.

commons-crypto 1.0.0 is contained in the spark-assembly jar on HDFS, but possibly this is only a Python issue with crypto libraries?

Today I can't repro anymore; pyspark --master yarn and spark-submit all work fine. Could it be some weird capacity issue with dynamic allocation that happens in Hadoop test only under certain conditions? I'd be inclined to test this in Hadoop Analytics and see if it works; rollback is very quick, and we'd have more data points to debug further.

Change 564562 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Increase Spark's crypto settings in Hadoop test

https://gerrit.wikimedia.org/r/564562

Change 564562 merged by Elukey:
[operations/puppet@production] Increase Spark's crypto settings in Hadoop test

https://gerrit.wikimedia.org/r/564562

Comparing logs of the same application ID, with one container successfully registering with the AM and the other one failing (causing the overall failure):

20/01/14 09:45:21 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1576771377404_19608_000001
20/01/14 09:45:21 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 09:45:52 ERROR TransportClientFactory: Exception while bootstrapping client after 30120 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:263)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
        at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
        at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:259)
        ... 11 more

vs

20/01/14 09:45:58 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 09:46:14 INFO TransportClientFactory: Successfully created connection to an-tool1006.eqiad.wmnet/10.64.5.32:12000 after 16195 ms (16113 ms spent in bootstraps)

A timeout is mentioned in the code:

https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java#L106

https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L129-L133

It seems to be 30s, more or less what I can see in Yarn's logs. If I add --conf spark.network.auth.rpcTimeout=300 or similar to spark-submit, I can see higher timings in the logs and fewer timeouts. Without it, the 30s timeouts can easily be seen across AM creation attempts.

The option spark.network.auth.rpcTimeout does not seem to be highlighted in the Spark docs. IIUC, when the AM is bootstrapped it tries to authenticate with the driver via encrypted RPC, but in the Python case this takes a long time and may cause timeouts. This in turn causes Yarn to retry with another AM until the maximum number of attempts is reached.

Example of app logs without the option set:

elukey@an-tool1006:~$ yarn logs -applicationId application_1576771377404_19628 | grep -A 2 "Registering the ApplicationMaster"
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
20/01/14 10:36:46 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:37:16 ERROR TransportClientFactory: Exception while bootstrapping client after 30101 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
--
20/01/14 10:36:09 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:36:39 ERROR TransportClientFactory: Exception while bootstrapping client after 30101 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
--
20/01/14 10:37:23 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:37:53 ERROR TransportClientFactory: Exception while bootstrapping client after 30121 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
--
20/01/14 10:39:14 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:39:44 ERROR TransportClientFactory: Exception while bootstrapping client after 30093 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
--
20/01/14 10:38:37 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:39:07 ERROR TransportClientFactory: Exception while bootstrapping client after 30115 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
--
20/01/14 10:38:00 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:38:30 ERROR TransportClientFactory: Exception while bootstrapping client after 30098 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.

5 timeouts of ~30s each and eventually final state FAILED.

With --conf spark.network.auth.rpcTimeout=300:

...skipping...
20/01/14 10:41:49 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:43:49 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
20/01/14 10:43:49 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from an-tool1006.eqiad.wmnet:12000 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout

20/01/14 10:43:55 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:45:55 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
20/01/14 10:45:55 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout

20/01/14 10:46:01 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 10:46:55 INFO TransportClientFactory: Successfully created connection to an-tool1006.eqiad.wmnet/10.64.5.32:12000 after 53309 ms (53221 ms spent in bootstraps)

And finally another try with --conf spark.network.auth.rpcTimeout --conf spark.network.timeout=300:

20/01/14 10:56:18 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 11:01:18 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
20/01/14 11:01:18 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from an-tool1006.eqiad.wmnet:12000 in 300 seconds. This timeout is controlled by spark.network.timeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)

20/01/14 11:01:24 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 11:01:35 INFO TransportClientFactory: Successfully created connection to an-tool1006.eqiad.wmnet/10.64.5.32:12000 after 10656 ms (10570 ms spent in bootstraps)
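For reference, if we wanted to bake these timeouts into the defaults rather than pass them per job, it would presumably look like this in spark-defaults.conf (300s is just the value used in the tests above, not a tuned recommendation):

```
spark.network.auth.rpcTimeout    300s
spark.network.timeout            300s
```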

Maybe material for a GitHub issue? 2.4.4 is very new, and there might be issues with encryption.

Sent an email to users@spark.apache.org, let's see if anybody comes back with suggestions!

OK, so today I found in the debug logs a warning indicating a failure to load openssl's crypto libs and a fallback to standard JCE crypto. After a bit of digging I found this: https://issues.apache.org/jira/browse/HADOOP-12845

elukey@an-worker1080:~$ hadoop checknative
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
20/01/20 15:27:37 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
20/01/20 15:27:37 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/x86_64-linux-gnu/libz.so.1
snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared object file: No such file or directory)!

Debian adds the libcrypto.so symlink via libssl-dev, but currently only for libssl 1.1.0, which ends up with:

openssl: false EVP_CIPHER_CTX_cleanup

This is probably a function that was present in libssl 1.0.2 but not in libssl 1.1.0. On Hadoop test's worker/client nodes I created the following symlink: sudo ln -s /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.2 /usr/lib/x86_64-linux-gnu/libcrypto.so, ending up with:

elukey@an-tool1006:~$ hadoop checknative -a
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
20/01/20 15:30:14 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
20/01/20 15:30:14 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/x86_64-linux-gnu/libz.so.1
snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: true /usr/lib/x86_64-linux-gnu/libcrypto.so

Since then I cannot reproduce the bug anymore. I also had a chat with Moritz, and the openssl option seems better than Java JCE for various reasons, so one solution to unblock the rollout could be to add the symlink via Puppet. It will also affect the Yarn shuffler's TLS encryption, probably improving its performance a bit (thanks to openssl's native code paths on some architectures).
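A minimal sketch of what the Puppet side could look like (paths taken from the manual test above; the exact resource placement in profile::analytics::cluster::packages::common is an assumption):

```puppet
# Hypothetical sketch: expose a libcrypto.so name that Hadoop's native
# crypto code can dlopen, pointing at the 1.0.2 library it expects.
file { '/usr/lib/x86_64-linux-gnu/libcrypto.so':
    ensure => 'link',
    target => '/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.2',
}
```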

Change 566062 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::common: add libcrypto.so link

https://gerrit.wikimedia.org/r/566062

Very interesting: the heisenbug now seems to trigger only a warning, without stopping pyspark:

elukey@an-tool1006:~$ spark2-submit --master yarn /home/joal/test_spark_submit/spark-2.4.4-bin-hadoop2.6/examples/src/main/python/pi.py 100
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
20/01/20 15:52:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Pi is roughly 3.142704

Change 566231 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Spark2 encryption options as default for Hadoop

https://gerrit.wikimedia.org/r/566231

Status: from my tests everything seems to be working fine, but this delicate change should probably be applied after all hands to avoid headaches :)

Next steps:

Review/Merge https://gerrit.wikimedia.org/r/566062 https://gerrit.wikimedia.org/r/566231

this delicate change should probably be applied before all hands to avoid headaches

You mean after all hands? :)

Yes just corrected :)

Change 566062 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::common: add libcrypto.so link

https://gerrit.wikimedia.org/r/566062

Change 566231 merged by Elukey:
[operations/puppet@production] Set Spark2 encryption options as default for Hadoop

https://gerrit.wikimedia.org/r/566231

Change 569530 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] spark: remove spark.io.encryption settings from defaults

https://gerrit.wikimedia.org/r/569530

Change 569530 merged by Elukey:
[operations/puppet@production] spark: remove spark.io.encryption settings from defaults

https://gerrit.wikimedia.org/r/569530

elukey set Final Story Points to 13.

Due to some issues with Spark Refine, we removed the spark.io.encryption settings (encryption of temporary shuffle files spilled to disk), since they do not seem to work properly on this version of Spark 2.4. The important part was the RPC encryption (ensuring that traffic between nodes of the cluster is secured and authenticated); encryption of on-disk files remains desirable but is not a mandatory requirement at the moment (HDFS blocks are not encrypted on the Hadoop worker nodes either, but those nodes are accessible only to Analytics and SRE).
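For clarity, the defaults that remain enabled after this change should look something like this (reconstructed from the description above, minus the spark.io.encryption.* settings; the final Hiera may differ):

```yaml
profile::hadoop::spark2::extra_settings:
  spark.authenticate: true
  spark.network.crypto.enabled: true
  spark.network.crypto.keyLength: 128
  spark.network.crypto.keyFactoryAlgorithm: PBKDF2WithHmacSHA1
  spark.network.crypto.saslFallback: false
```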

Some bug reports to keep an eye on:
https://issues.apache.org/jira/browse/SPARK-30225