
Configure the YARN resource manager with the spark history service URL
Closed, ResolvedPublic

Assigned To
Authored By
brouberol
Dec 6 2023, 1:02 PM

Description

Definition of done:

  • The test YARN interface points to the test spark-history service
  • The production YARN interface points to the spark-history service

Event Timeline

Change 981948 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] [yarn] Add the option to configure the spark history server address

https://gerrit.wikimedia.org/r/981948

Change 981949 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Configure the Spark History server host for the an-test yarn

https://gerrit.wikimedia.org/r/981949

Change 981950 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Configure the Spark History server host for the analytics yarn

https://gerrit.wikimedia.org/r/981950

Change 981948 merged by Brouberol:

[operations/puppet@production] [yarn] Add the option to configure the spark history server address

https://gerrit.wikimedia.org/r/981948

Change 981949 merged by Brouberol:

[operations/puppet@production] Configure the Spark History server host for the an-test yarn

https://gerrit.wikimedia.org/r/981949

Change 982656 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Revert "Configure the Spark History server host for the an-test yarn"

https://gerrit.wikimedia.org/r/982656

Change 982657 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Revert "[yarn] Add the option to configure the spark history server address"

https://gerrit.wikimedia.org/r/982657

Change 982797 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] spark3: add option to specify spark history server address to yarn

https://gerrit.wikimedia.org/r/982797

Change 982798 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] spark3: Specify the history server endoint for the test-analytics cluster

https://gerrit.wikimedia.org/r/982798

Change 982656 merged by Brouberol:

[operations/puppet@production] Revert "Configure the Spark History server host for the an-test yarn"

https://gerrit.wikimedia.org/r/982656

Change 982657 merged by Brouberol:

[operations/puppet@production] Revert "[yarn] Add the option to configure the spark history server address"

https://gerrit.wikimedia.org/r/982657

Change 982797 merged by Brouberol:

[operations/puppet@production] spark3: add option to specify spark history server address to yarn

https://gerrit.wikimedia.org/r/982797

Change 982798 merged by Brouberol:

[operations/puppet@production] spark3: Specify the history server endoint for the test-analytics cluster

https://gerrit.wikimedia.org/r/982798

This is proving a little tricky to test on the hadoop-test cluster, because we don't have ready access to a YARN job browser UI. However, I ran the following small test, which might be helpful.
I set up an SSH tunnel to the YARN UI in test with:

ssh -N -L 8088:an-test-master1001:8088 an-test-master1001.eqiad.wmnet

Note that we can't use localhost as the forwarding target here, as the port does not seem to be open on the loopback interface of an-test-master1001.
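
Before opening the browser, it can help to confirm that the forwarded port accepts connections locally. The following is a minimal sketch: the helper name port_is_open is hypothetical, and 8088 is simply the local port used in the ssh command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8088 is the local end of the tunnel to the YARN UI set up above.
    print("YARN UI tunnel reachable:", port_is_open("127.0.0.1", 8088))
```

A False result only says that nothing is listening locally. It does not distinguish a tunnel that was never started from one that has dropped.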

Then I opened a tunnel to a Jupyter notebook on an-test-client1002 with:

ssh -N an-test-client1002.eqiad.wmnet -L 8880:127.0.0.1:8880

I made sure that I had authenticated with Kerberos, then I entered the following code into the notebook.

image.png (887×1 px, 119 KB)

import wmfdata as wmf
ss = wmf.spark.create_custom_session(
    master="yarn",
    spark_config={
        "spark.driver.memory": "2g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "8g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.yarn.historyServer.address": "yarn.wikimedia.org"
    }
)
ss.sql("""
SELECT count(1) as count
FROM wmf.webrequest
WHERE year = 2023
  AND month = 12
  AND day = 12
""").show(100)

I then checked the YARN UI for this Spark session, which is still running even though this particular query has finished.
We can see the application listed.

image.png (574×1 px, 132 KB)

When we click through, we can see that the ApplicationMaster link points to http://an-test-master1001.eqiad.wmnet:8088/proxy/application_1702472457465_0153/
image.png (794×1 px, 189 KB)

When I click on that link, my browser obviously can't get there, but if I manually enter http://localhost:8088/proxy/application_1702472457465_0153/ into my browser, then I can see the spark UI for the active session.
image.png (543×1 px, 54 KB)
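
Since the links in the YARN UI point at the cluster-internal hostname, each one has to be rewritten by hand to go through the tunnel. A small sketch of that rewrite, assuming the tunnel listens on localhost:8088 as above (via_tunnel is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit


def via_tunnel(url: str, local_netloc: str = "localhost:8088") -> str:
    """Swap the host:port of a YARN UI link for the local end of the SSH tunnel."""
    parts = urlsplit(url)
    # Only the network location changes; scheme, path and query are kept.
    return urlunsplit(parts._replace(netloc=local_netloc))


link = "http://an-test-master1001.eqiad.wmnet:8088/proxy/application_1702472457465_0153/"
print(via_tunnel(link))
# prints http://localhost:8088/proxy/application_1702472457465_0153/
```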

Next I can stop the session by entering this into the Jupyter notebook:

ss.stop()

Now we can see that the History link is shown instead of ApplicationMaster, but the address appears to be the same: http://an-test-master1001.eqiad.wmnet:8088/proxy/application_1702472457465_0153/

image.png (796×1 px, 189 KB)

If I manually enter that address in the browser again, it redirects me to https://yarn.wikimedia.org/history/application_1702472457465_0153/1
I wouldn't be surprised if the HTTPS scheme here is enforced because of HSTS, but I haven't tested that.

So I think that we can add a redirect for the /history URL paths in the Apache virtualhost config.

It looks like this has been done for the MapReduce history server here, but I haven't verified this yet.

Change 983192 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] spark3: set the spark history server domain as yarn.wikimedia.org

https://gerrit.wikimedia.org/r/983192

Change 983193 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] yarn: proxy the spark job history requests to the spark history service

https://gerrit.wikimedia.org/r/983193

Change 983193 abandoned by Brouberol:

[operations/puppet@production] yarn: proxy the spark job history requests to the spark history service

Reason:

Bundled with 983192 now

https://gerrit.wikimedia.org/r/983193

Change 983193 restored by Brouberol:

[operations/puppet@production] yarn: proxy the spark job history requests to the spark history service

https://gerrit.wikimedia.org/r/983193

Change 983192 merged by Brouberol:

[operations/puppet@production] spark3: set the spark history server domain for analytics-hadoop

https://gerrit.wikimedia.org/r/983192

We have set up a proxy_pass rule from https://yarn.wikimedia.org/history to https://spark-history.svc.eqiad:30443/history.

When we start Spark jobs with spark.yarn.historyServer.address: yarn.wikimedia.org (which is evaluated as http://yarn.wikimedia.org), Apache takes care of the http->https redirection.

We'll see if things work as expected.
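
To illustrate the address handling described above: the setting is given without a scheme, the resulting link is plain http, and Apache then redirects it to https. The sketch below only mimics that behaviour for illustration. It is not Spark's code, and history_link is a hypothetical helper. The /history/<application id>/<attempt> path layout is taken from the redirect observed in the test cluster.

```python
def history_link(address: str, app_id: str, attempt: int = 1) -> str:
    """Build the history URL that a stopped application would link to.

    Assumption: an address without a scheme is treated as plain http,
    as observed in this task.
    """
    if "://" not in address:
        address = "http://" + address
    return f"{address.rstrip('/')}/history/{app_id}/{attempt}"


print(history_link("yarn.wikimedia.org", "application_1702472457465_0153"))
# prints http://yarn.wikimedia.org/history/application_1702472457465_0153/1
```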

Change 983712 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] yarn: configure Apache to only listen to port 80

https://gerrit.wikimedia.org/r/983712

Change 983712 merged by Brouberol:

[operations/puppet@production] yarn: configure Apache to only listen to port 80

https://gerrit.wikimedia.org/r/983712

Change 983748 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history: enable definition of spark env vars in spark-env.sh

https://gerrit.wikimedia.org/r/983748

Change 983749 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history: set public DNS to yarn.wikimedia.org

https://gerrit.wikimedia.org/r/983749

Change 983748 abandoned by Brouberol:

[operations/deployment-charts@master] spark-history: enable definition of spark env vars in spark-env.sh

Reason:

Experimentation has shown this does not work

https://gerrit.wikimedia.org/r/983748

Change 984127 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history-analytics-hadoop: fix redirect and static links

https://gerrit.wikimedia.org/r/984127

Change 984128 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] httpd-yarn: proxy reqs with a /spark-history prefix to the spark-history svc

https://gerrit.wikimedia.org/r/984128

Change 983749 abandoned by Brouberol:

[operations/deployment-charts@master] spark-history: set public DNS to yarn.wikimedia.org

Reason:

Experiment has shown that this does not work as expected

https://gerrit.wikimedia.org/r/983749

Change 984127 merged by Brouberol:

[operations/deployment-charts@master] spark-history-analytics-hadoop: fix redirect and static links

https://gerrit.wikimedia.org/r/984127

Change 984128 merged by Brouberol:

[operations/puppet@production] httpd-yarn: proxy reqs with a /spark-history prefix to the spark-history svc

https://gerrit.wikimedia.org/r/984128

We had to slightly tweak the Spark UI config as well as the Apache config to make the whole thing work:

  • we added the spark.ui.proxyBase: /spark-history Spark parameter to tell Spark to prefix all its URLs with /spark-history
  • we added the following proxy stanzas to Apache to tell it to proxy all requests with URLs starting with /spark-history to the spark-history service
ProxyPass /spark-history/ https://spark-history.svc.eqiad.wmnet:30443/
ProxyPassReverse /spark-history/ https://spark-history.svc.eqiad.wmnet:30443/

This fixed the serving of the Spark history static assets.
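
As a rough illustration of what the two stanzas do, the sketch below maps a public request path to the backend URL (the ProxyPass direction) and maps a backend Location header back under the public prefix (the ProxyPassReverse direction). It is a simplification of Apache's behaviour: to_public returns only a path, whereas Apache emits a full public URL. The function names and the example static path are hypothetical.

```python
from typing import Optional

PUBLIC_PREFIX = "/spark-history/"
BACKEND = "https://spark-history.svc.eqiad.wmnet:30443/"


def to_backend(path: str) -> Optional[str]:
    """ProxyPass direction: map a public request path to the backend URL."""
    if not path.startswith(PUBLIC_PREFIX):
        return None  # not covered by the stanza, so not proxied
    return BACKEND + path[len(PUBLIC_PREFIX):]


def to_public(location: str) -> str:
    """ProxyPassReverse direction: rewrite a backend Location header."""
    if location.startswith(BACKEND):
        return PUBLIC_PREFIX + location[len(BACKEND):]
    return location  # redirects to other hosts are left untouched


print(to_backend("/spark-history/static/webui.css"))
# prints https://spark-history.svc.eqiad.wmnet:30443/static/webui.css
print(to_public(BACKEND + "history/application_1695896957545_507754/jobs/"))
# prints /spark-history/history/application_1695896957545_507754/jobs/
```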

We also added the spark.ui.proxyRedirectUri: https://yarn.wikimedia.org/ Spark parameter to tell Spark to use the https://yarn.wikimedia.org/ base URL for all redirections, to make sure they get served / proxied by Apache.

And with this, a YARN link to the Spark history server (e.g. https://yarn.wikimedia.org/proxy/application_1695896957545_507754/) now ping-pongs between redirections until the browser gets redirected to https://yarn.wikimedia.org/spark-history/history/application_1695896957545_507754/jobs/, which displays the following:

image.png (758×3 px, 114 KB)
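
To inspect such a redirect chain hop by hop, something like the following sketch could be used. It takes a fetch callable so that it can be tried without network access. The name trace_redirects and the fake response table are hypothetical, and the table collapses the intermediate hops, which are not recorded in this task, into a single redirect.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# fetch(url) returns (status code, Location header or None)
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def trace_redirects(url: str, fetch: Fetch, max_hops: int = 10) -> List[str]:
    """Return every URL visited while following redirects from url."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain


# Fake responses for illustration only; the real intermediate hops are unknown.
START = "https://yarn.wikimedia.org/proxy/application_1695896957545_507754/"
FAKE = {START: (302, "/spark-history/history/application_1695896957545_507754/jobs/")}

for hop in trace_redirects(START, lambda u: FAKE.get(u, (200, None))):
    print(hop)
```

The max_hops limit guards against redirect loops, which are easy to create while the proxy and Spark settings are still being adjusted.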

Change 983193 abandoned by Brouberol:

[operations/puppet@production] yarn: proxy the spark job history requests to the spark history service

Reason:

Already released

https://gerrit.wikimedia.org/r/983193

Change 981950 abandoned by Brouberol:

[operations/puppet@production] Configure the Spark History server host for the analytics yarn

Reason:

Moot

https://gerrit.wikimedia.org/r/981950