T170496 mentioned that a non-detached Spark YARN CLI needs to communicate with Spark workers on an ephemeral port. This keeps us from enabling base::firewall on the stat boxes. We should investigate this and see whether we can restrict the range of ports that Spark might use.
To reproduce: from stat1004 or stat1005, run either spark-shell --master yarn or pyspark --master yarn.
This will start a Spark REPL, with workers running in Hadoop. The Hadoop workers need a way to send results back to the local REPL, and since there can be many workers at once, the REPL listens on an ephemeral port. It also starts up a local web GUI on a port, but I believe this starts at a defined port and then increments until it finds an unused one.
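To make the "start at a defined port, then increment" behaviour concrete, here is a minimal Python sketch of that bind-with-retry pattern. It is not Spark's code; the function name `bind_first_free` and the default of 16 retries are assumptions made for illustration only.

```python
import socket


def bind_first_free(start_port, max_retries=16, host="127.0.0.1"):
    """Try start_port, start_port + 1, ... and return (socket, port)
    for the first port that can be bound.

    Hypothetical helper: it only illustrates the retry pattern
    described above and is not taken from Spark.
    """
    for offset in range(max_retries + 1):
        port = start_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port already in use (or otherwise unavailable): try the next one.
            sock.close()
            continue
        sock.listen(1)
        return sock, port
    raise OSError(
        "no free port in range %d-%d" % (start_port, start_port + max_retries)
    )
```

The useful property for firewalling is that a service behaving this way can only ever end up on a port between the start port and the start port plus the retry count, so the range to open is bounded and known in advance.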
All Hadoop workers need to be able to reach the REPL port, while (I think) only localhost needs to reach the web GUI port, since it is expected to be accessed through an SSH tunnel, if at all.
spark-shell and pyspark are part of the Spark packages provided by Cloudera.
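One possible way to restrict the ports would be to pin them through Spark configuration properties. The Python sketch below builds `--conf` arguments for the properties `spark.driver.port`, `spark.blockManager.port`, `spark.ui.port` and `spark.port.maxRetries`, which are documented in the Apache Spark configuration reference, and derives the port range each one could occupy. Whether the Cloudera-packaged Spark version honours all of them has not been checked here. The port numbers in the usage note and the helper names are hypothetical.

```python
def spark_port_conf(driver_port, block_manager_port, ui_port=4040, max_retries=16):
    """Return Spark properties that pin the driver-side ports.

    The port values are whatever the caller picks; nothing here is a
    recommendation for the stat boxes.
    """
    return {
        "spark.driver.port": driver_port,
        "spark.blockManager.port": block_manager_port,
        "spark.ui.port": ui_port,
        "spark.port.maxRetries": max_retries,
    }


def port_ranges(conf):
    """Map each port property to the (low, high) range it may end up on.

    Assumes Spark retries by incrementing the port by one, up to
    spark.port.maxRetries times, when the configured port is busy.
    """
    retries = conf["spark.port.maxRetries"]
    return {
        key: (port, port + retries)
        for key, port in conf.items()
        if key != "spark.port.maxRetries"
    }


def cli_args(conf):
    """Render the properties as --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "%s=%s" % (key, value)]
    return args
```

For example, `["spark-shell", "--master", "yarn"] + cli_args(spark_port_conf(12000, 13000))` would give a command line to try, and `port_ranges(...)` would give the ranges a firewall rule has to allow from the Hadoop workers. If this works with our Spark version, the web GUI range could stay closed to everything but localhost.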