
Enable base::firewall on stat boxes after restricting Spark REPL ports.
Closed, ResolvedPublic5 Estimated Story Points

Description

In T170496 it was mentioned that a non-detached Spark YARN CLI needs to communicate with Spark workers on an ephemeral port. This keeps us from enabling base::firewall on the stat boxes. We should investigate this, and see if we can restrict the range of ports that Spark might use.

To reproduce: from stat1004 or stat1005, run either spark-shell --master yarn or pyspark --master yarn.

This will start a Spark REPL, with workers running in Hadoop. The Hadoop workers need a way to send results back to the local REPL, and since there can be many workers at once, the REPL listens on an ephemeral port. It also starts a local web GUI on a port; I believe this starts at a defined port and then increments until it finds an unused one.

All Hadoop workers need to be able to talk to the REPL port, while (I think) only localhost needs to reach the web GUI port, since it is expected that this will be accessed via an SSH tunnel, if at all.

spark-shell and pyspark are part of the Spark packages provided by Cloudera.

Event Timeline

while (I think) only localhost needs to reach the web GUI port, since it is expected that this will be accessed via an SSH tunnel, if at all.

Currently I access this through the yarn.wikimedia.org proxy, basically by clicking 'Application Master' from the running tasks list. It generates URLs like https://yarn.wikimedia.org/proxy/application_1498042433999_98885/

Ahh, yeah maybe the GUI port is only for local mode.

If there is a deterministic range of ports this should be easy but we need to research it first.

Found something interesting in https://spark.apache.org/docs/2.3.1/configuration.html#networking

spark.port.maxRetries (default: 16): Maximum number of retries when binding to a port before giving up. When a port is given a specific value (non-zero), each subsequent retry will increment the port used in the previous attempt by 1 before retrying. This essentially allows it to try a range of ports from the start port specified to port + maxRetries.

In theory IIUC, we could do something like:

spark.driver.port: 5000
spark.blockManager.port: 6000
spark.port.maxRetries: 100

The above should allow 100 ports for executors and drivers (the 100 is an example). It might be worth trying to restrict the ports used by Spark so we can finally enable the base firewall on a lot of nodes. @Ottomata thoughts?
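If I read the maxRetries semantics correctly, a firewall rule would have to allow the start port plus all of its increments. A quick sketch of the resulting ranges (the port values and the 100 are the examples above, not a deployed config):

```python
# Sketch: compute the port range a ferm rule must allow, given Spark's
# retry behaviour: it tries the configured start port and, on failure,
# increments by 1 up to spark.port.maxRetries times.
def allowed_ports(start_port, max_retries):
    # Ports Spark may end up binding: start_port .. start_port + max_retries
    return range(start_port, start_port + max_retries + 1)

driver = allowed_ports(5000, 100)         # spark.driver.port example above
block_manager = allowed_ports(6000, 100)  # spark.blockManager.port example above
print(f"driver: {driver.start}-{driver.stop - 1}")                # driver: 5000-5100
print(f"blockManager: {block_manager.start}-{block_manager.stop - 1}")
```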

elukey triaged this task as Medium priority.Jul 3 2019, 8:37 AM

Change 520683 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::spark2: allow a port range for the driver

https://gerrit.wikimedia.org/r/520683

Change 520683 merged by Elukey:
[operations/puppet@production] profile::hadoop::spark2: allow a port range for the driver

https://gerrit.wikimedia.org/r/520683

Change 520688 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: fix spark2 config

https://gerrit.wikimedia.org/r/520688

Change 520688 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: fix spark2 config

https://gerrit.wikimedia.org/r/520688

an-tool1006 seems to work fine with the new settings and ferm enabled! I opened pyspark2 and spark2-shell (both with --master yarn) and they bound to ports 12000 and 12001 as expected.

Change 520706 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::explorer: add base firewall

https://gerrit.wikimedia.org/r/520706

I checked via netstat all the ports opened on stat boxes, and I have a few comments:

  1. People seem to use Python and ipykernel_launcher a lot, on various ports.
  2. People are used to testing whatever they want on these boxes, so restricting ports might mean that all of a sudden things start breaking or not behaving as expected.

Do they need those ports open? Are we going to impact people testing stuff? My assumption is that all users need is the listening port available on localhost, and probably not much else. Before restricting ports we should do some homework to avoid impacting people when we enable ferm.

I can see from https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#firewall-setup that high ports are needed by Jupyter to communicate with its running kernels, but I am pretty sure that localhost is enough (and that traffic would still be allowed after enabling the base firewall).

How would this impact the ability to install Python and R packages? Will export https_proxy=http://webproxy.eqiad.wmnet:8080 be enough for when we need to access the Internet to download a package or a file programmatically?

How would this impact the ability to install Python and R packages? Will export https_proxy=http://webproxy.eqiad.wmnet:8080 be enough for when we need to access the Internet to download a package or a file programmatically?

Hey Mikhail! The firewall only restricts inbound traffic; outbound traffic will be unaffected, so installing pip or R packages should be completely fine.
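A minimal sketch of routing package downloads through the webproxy; the exports mirror the env variable quoted above, and the pip/R commands are only illustrative placeholders:

```shell
# Outbound connections are not blocked by the inbound firewall, but hosts
# without direct Internet access still need the HTTP proxy for downloads.
export https_proxy=http://webproxy.eqiad.wmnet:8080
export http_proxy=http://webproxy.eqiad.wmnet:8080

# then, for example:
#   pip install --user <package>
#   R -e 'install.packages("<package>")'
echo "$https_proxy"
```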

Change 520706 merged by Elukey:
[operations/puppet@production] role::statistics::explorer: add base firewall

https://gerrit.wikimedia.org/r/520706

Mentioned in SAL (#wikimedia-operations) [2019-07-08T07:00:07Z] <elukey> add base::firewall to stat1004 - T170826

elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 521479 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable base::firewall on stat1007

https://gerrit.wikimedia.org/r/521479

Change 521479 merged by Elukey:
[operations/puppet@production] Enable base::firewall on stat1007

https://gerrit.wikimedia.org/r/521479

Change 521494 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::swap: enable base::firewall

https://gerrit.wikimedia.org/r/521494

Change 521494 abandoned by Elukey:
role::swap: enable base::firewall

Reason:
Splitting the change into two parts to minimize the impact on users as much as possible

https://gerrit.wikimedia.org/r/521494

Change 521516 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::swap: restrict Spark driver port range

https://gerrit.wikimedia.org/r/521516

Change 521516 merged by Elukey:
[operations/puppet@production] role::swap: restrict Spark driver port range

https://gerrit.wikimedia.org/r/521516

Not sure if it's related, but starting this week when I start a pyspark REPL on stat1007 connected to YARN, it shows up in yarn.wikimedia.org but the ApplicationMaster link just times out. Seems plausibly related to firewalls.

Not sure if it's related, but starting this week when I start a pyspark REPL on stat1007 connected to YARN, it shows up in yarn.wikimedia.org but the ApplicationMaster link just times out. Seems plausibly related to firewalls.

Thanks a lot for the report! If I understood correctly, the Spark session works and it is only the Yarn link that times out, right?

Found the problem: the spark.ui.port (starting from 4040) is not whitelisted by the firewall, and this is why Yarn times out. I created two separate Spark sessions; one got port 4040 and the other 4041, so they seem to follow the same scheme as spark.driver.port.

Change 521900 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::spark2: specify the spark.ui.port and its firewall range

https://gerrit.wikimedia.org/r/521900

Change 521900 merged by Elukey:
[operations/puppet@production] profile::spark2: specify the spark.ui.port and its firewall range

https://gerrit.wikimedia.org/r/521900

The UI now works; unfortunately broadcasts appear to be broken in Spark. To reproduce, run the following and compare notebook1004 vs stat1007 running pyspark2 --master yarn. For me notebook1004 returns relatively quickly, while stat1007 hangs for a minute or two before failing.

bc = sc.broadcast("abcdefghijklmnopqrstuvwxyz")    # distributed via the block manager
sc.range(10).map(lambda i: bc.value[i]).collect()  # executors fetch bc from the driver

Broadcasts go through the block manager. The Spark docs say the default value of spark.blockManager.port is random. Indeed, setting it to an allowed port with --conf spark.driver.blockManager.port=4050 seems to fix pyspark2 running from stat1004. This handles port failover much like the other ports, reporting the following when the port is in use:

19/07/10 22:26:46 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 4050. Attempting port 4051.

Per the Spark docs, the UI, driver and blockManager ports are the only three we should need to worry about, as I don't think we run the Spark history server.
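Putting the three together, the resulting defaults might look like the sketch below (the port values are the ones observed later in this task, and the maxRetries value is illustrative; the actual deployed config may differ):

```
spark.driver.port                12000
spark.ui.port                    4040
spark.driver.blockManager.port   4050
spark.port.maxRetries            100
```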

Very nice investigation, I was in fact trying to figure out the purpose of the last port and you solved it :) I'll make sure that port gets a range too!

Change 522024 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::spark2: limit the driver block manager's ports

https://gerrit.wikimedia.org/r/522024

Change 522024 merged by Elukey:
[operations/puppet@production] profile::hadoop::spark2: limit the driver block manager's ports

https://gerrit.wikimedia.org/r/522024

This is a pyspark2 session opened on stat1007:

elukey@stat1007:~$ sudo netstat -nlpt | grep 91543
tcp6       0      0 10.64.21.118:13000      :::*                    LISTEN      91543/java
tcp6       0      0 :::4040                 :::*                    LISTEN      91543/java
tcp6       0      0 127.0.0.1:41329         :::*                    LISTEN      91543/java
tcp6       0      0 10.64.21.118:12000      :::*                    LISTEN      91543/java

The only random port remaining is 41329, but it binds on localhost; the other ones are fixed and whitelisted via ferm. Should work now!

Change 522105 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::spark2: fix port range

https://gerrit.wikimedia.org/r/522105

Change 522105 merged by Elukey:
[operations/puppet@production] profile::hadoop::spark2: fix port range

https://gerrit.wikimedia.org/r/522105

Change 523090 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::swap: add profile::base::firewall

https://gerrit.wikimedia.org/r/523090

Change 523090 merged by Elukey:
[operations/puppet@production] role::swap: add profile::base::firewall

https://gerrit.wikimedia.org/r/523090

elukey set the point value for this task to 5.Jul 15 2019, 2:57 PM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.