
[Data Platform] Deploy Spark History Service
Closed, Resolved (Public)

Assigned To
Authored By: nfraison, Feb 21 2023, 4:10 PM

Description

The Spark History Service allows cluster users to examine the retrospective status of their jobs, including many details that facilitate troubleshooting. Without this service we lose valuable insights into the performance characteristics of both our production and ad-hoc cluster jobs.

Whilst it can be run as a standalone service, the most useful configuration for us is one in which the Spark History Service is integrated with the YARN job browser interface, which we run at https://yarn.wikimedia.org. We already have the Spark UI available for running jobs, but once a job has finished we lose access to this information. By integrating it with the YARN job browser we can make this a unified interface for current and historical jobs.

The most useful information about the Spark History Service is here: https://spark.apache.org/docs/latest/monitoring.html

In terms of its architecture, it relies on Spark jobs being instrumented so that they write event log files (one per application) to a common directory on HDFS.
The history server then loads these event log files and presents the information via its UI. The history server can also be responsible for log rotation and compression, if required.
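As a sketch of how the two halves fit together (these are standard Spark monitoring properties; the HDFS path shown is the /var/log/spark directory referenced in the task list below, and the values are illustrative rather than our confirmed settings):

```properties
# Job side: each Spark application appends its event log (one file per
# app) to a shared HDFS directory. Path is illustrative.
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///var/log/spark
spark.eventLog.compress          true

# History server side: read the same directory and serve the UI.
spark.history.fs.logDirectory    hdfs:///var/log/spark
```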

We have decided that we would like to run the Spark History Server under Kubernetes on the DSE-K8S cluster.

We have a set of Spark images that we already use for the spark-operator, so we are adapting this image so that it is suitable for running the history server daemon too.

A number of other tasks remain to be completed, as per: https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service

  • Decide how to handle the test cluster - T351716

On the server side

  • Set up the kubeconfig files - T351711
  • Add two namespaces (spark-history and spark-history-test) - T351713
  • Create a docker image containing kerberos-related tooling - T352406
  • Create a helm-chart for spark-history - T351722
  • Define the two helmfile deployments - T352860
  • Grant permission to /var/log/spark for the associated principal - T352838
  • Add private data (i.e. the keytab) - T351816
  • Configure ingress to the services - T352639
  • Deploy the services - T352861
  • Ensure that suitable metrics are being gathered - T353694
  • Configure availability monitoring - T353717
  • Test visibility of spark job data - T352882
  • Configure the YARN resourcemanager with the history service URL - T352863
  • Document the service - T353232
  • Investigate Spark History Server silent errors when downloading some files from HDFS - T354777
  • Setup an appropriate retention policy - T354927
  • Tweak memory settings - T354929
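For the retention and memory items above, the history server has built-in housekeeping settings; a hedged sketch (property names are standard Spark options; values are illustrative, matching the 60-day retention eventually chosen):

```properties
# Periodically delete event logs older than the retention window.
spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.interval   1d
spark.history.fs.cleaner.maxAge     60d
# The daemon's heap size is controlled separately, via the
# SPARK_DAEMON_MEMORY environment variable.
```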

On the client side

  • Configure the spark defaults with the required options - T352849
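A hedged sketch of what those client-side defaults might contain (the exact option set is tracked in T352849; property names are standard Spark options, and the path and host:port values are illustrative):

```properties
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///var/log/spark
spark.eventLog.compress           true
# Lets the YARN ResourceManager link finished applications to the
# history server UI (value illustrative).
spark.yarn.historyServer.address  yarn.wikimedia.org:443
```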

Acceptance Criteria

  • Spark history is accessible on the test cluster via SSH tunnelling to an-test-master*
  • Spark history is accessible on the production cluster via https://yarn.wikimedia.org
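For the test-cluster criterion, access would be via an SSH tunnel; a hypothetical example (the hostname is a placeholder, and 18080 is the Spark History Server's default port):

```shell
# Forward the history server port to localhost, then browse
# http://localhost:18080/ — hostname below is a placeholder.
ssh -N -L 18080:localhost:18080 an-test-master1001.example.wmnet
```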

Related Objects

Status    Assigned
Resolved  brouberol
Resolved  BTullis
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  BTullis
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol

Event Timeline


Change 896363 merged by Btullis:

[operations/docker-images/production-images@master] spark: add support for spark-history on the spark image

https://gerrit.wikimedia.org/r/896363

I have merged the patch to the spark images and I've triggered a manual rebuild of the production images repo, as per:
https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images

I see we have "Configure ingress to the services" as part of the list of things to do. Do we need to though? IIRC this will only be an internal tool, not exposed to the outside world, and only accessed by Yarn. So maybe the k8s service DNS should suffice?

> I see we have "Configure ingress to the services" as part of the list of things to do. Do we need to though? IIRC this will only be an internal tool, not exposed to the outside world, and only accessed by Yarn. So maybe the k8s service DNS should suffice?

Oh yes, I see what you mean.
I only meant it to be ingress into the kubernetes cluster from elsewhere in production. Not public facing.

As an example, we have the datahub-gms service configured here with ingress enabled.
The service name we use for this private service is datahub-gms.discovery.wmnet (the discovery part is because it's available in both DCs using discovery DNS).

I was thinking that maybe we would be using something like spark-history.svc.eqiad.wmnet for which I think we would need to have ingress enabled, but maybe it's not required if we use the k8s service DNS.
https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress

I'm just not sure that the k8s services addresses are resolvable by everything outside of the clusters, but I could be completely wrong about this.

Oh I see. If we can't resolve the k8s service names from outside the k8s cluster, then yes, we'd indeed need that. Point taken, thank you!

brouberol updated the task description.

Getting there. This is from an SSH port forwarding session in the test cluster.

[Screenshot: Spark History Server UI, accessed via SSH port forwarding]

We're currently investigating why only four applications are shown, but this is a good start.

Change 984130 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Retrict access to the spark-history k8s API tokens

https://gerrit.wikimedia.org/r/984130

The server is available at https://yarn.wikimedia.org/history-server/. We will enable the collection of historical metrics for all spark jobs at the start of 2024, once we're out of code freeze. I have added documentation about the service in Wikitech.

Change 984130 merged by Btullis:

[operations/puppet@production] Retrict access to the spark-history k8s API tokens

https://gerrit.wikimedia.org/r/984130

brouberol updated the task description.

Change 989786 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/docker-images/production-images@master] Add base production images containing Java 8 JDK and JRE

https://gerrit.wikimedia.org/r/989786

Change 989786 merged by Btullis:

[operations/docker-images/production-images@master] Add base production images containing Java 8 JDK and JRE

https://gerrit.wikimedia.org/r/989786

We have fixed the last remaining issue with the Spark History Server. It is now considered live and operational, with 60 days of retention. We expect the whole dataset of lz4-compressed event files to take about 2-3 TB in HDFS altogether.

Please file a bug for any issues you encounter, and we'll address them ASAP.

I hope it's useful for y'all!

> I hope it's useful for y'all!

Just passing by to thank you for this work! It will definitely make debugging easier, and collaboration on debugging easier as well!