
[Data Platform] Deploy Spark History Service
Closed, Resolved (Public)

Assigned To
Authored By: nfraison, Feb 21 2023, 4:10 PM

Description

The Spark History Service allows cluster users to examine the retrospective status of their jobs, including many details that facilitate troubleshooting. Without this service we lose valuable insights into the performance characteristics of both our production and ad-hoc cluster jobs.

Whilst it can be run as a standalone service, the most useful configuration for us is one in which the Spark History Service is integrated with the YARN job browser interface, which we run at https://yarn.wikimedia.org. We already have the Spark UI available for running jobs, but once a job has finished we lose access to this information. By integrating it with the YARN job browser we can make this a unified interface for current and historical jobs.

The most useful information about the Spark History Service is here: https://spark.apache.org/docs/latest/monitoring.html

In terms of its architecture, it relies on Spark jobs being instrumented so that they write event log files (one per application) to a common directory on HDFS.
The history server then loads these event log files and presents the information via its UI. The history server can also be responsible for log rotation and compression, if required.
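As a sketch of how the two halves fit together (these are standard Spark monitoring properties; the HDFS path shown is the /var/log/spark directory referenced in the task list below, and the values are illustrative rather than our confirmed settings):

```properties
# Job side: each Spark application appends its event log (one file per
# app) to a shared HDFS directory. Path is illustrative.
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///var/log/spark
spark.eventLog.compress          true

# History server side: read the same directory and serve the UI.
spark.history.fs.logDirectory    hdfs:///var/log/spark
```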

We have decided that we would like to run the Spark History Server under Kubernetes on the DSE-K8S cluster.

We have a set of Spark images that we already use for the spark-operator, so we are adapting this image so that it is suitable for running the history server daemon too.

A number of other tasks remain to be completed, as per: https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service

  • Decide how to handle the test cluster - T351716

On the server side

  • Set up the kubeconfig files - T351711
  • Add two namespaces (spark-history and spark-history-test) - T351713
  • Create a docker image containing kerberos-related tooling - T352406
  • Create a helm-chart for spark-history - T351722
  • Define the two helmfile deployments - T352860
  • Grant permission to /var/log/spark for the associated principal - T352838
  • Add private data (i.e. the keytab) - T351816
  • Configure ingress to the services - T352639
  • Deploy the services - T352861
  • Ensure that suitable metrics are being gathered - T353694
  • Configure availability monitoring - T353717
  • Test visibility of spark job data - T352882
  • Configure the YARN resourcemanager with the history service URL - T352863
  • Document the service - T353232
  • Investigate Spark History Server silent errors when downloading some files from HDFS - T354777
  • Setup an appropriate retention policy - T354927
  • Tweak memory settings - T354929
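For the retention and memory items above, the history server has built-in housekeeping settings; a hedged sketch (property names are standard Spark options; values are illustrative, matching the 60-day retention eventually chosen):

```properties
# Periodically delete event logs older than the retention window.
spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.interval   1d
spark.history.fs.cleaner.maxAge     60d
# The daemon's heap size is controlled separately, via the
# SPARK_DAEMON_MEMORY environment variable.
```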

On the client side

  • Configure the spark defaults with the required options - T352849
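A hedged sketch of what those client-side defaults might contain (the exact option set is tracked in T352849; property names are standard Spark options, and the path and host:port values are illustrative):

```properties
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///var/log/spark
spark.eventLog.compress           true
# Lets the YARN ResourceManager link finished applications to the
# history server UI (value illustrative).
spark.yarn.historyServer.address  yarn.wikimedia.org:443
```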

Acceptance Criteria

  • Spark history is accessible on the test cluster via SSH tunnelling to an-test-master*
  • Spark history is accessible on the production cluster via https://yarn.wikimedia.org
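For the test-cluster criterion, access would be via an SSH tunnel; a hypothetical example (the hostname is a placeholder, and 18080 is the Spark History Server's default port):

```shell
# Forward the history server port to localhost, then browse
# http://localhost:18080/ — hostname below is a placeholder.
ssh -N -L 18080:localhost:18080 an-test-master1001.example.wmnet
```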

Related Objects

Status    Assigned
Resolved  brouberol
Resolved  BTullis
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  BTullis
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol
Resolved  brouberol

Event Timeline


Change 896363 merged by Btullis:

[operations/docker-images/production-images@master] spark: add support for spark-history on the spark image

https://gerrit.wikimedia.org/r/896363

I have merged the patch to the spark images and I've triggered a manual rebuild of the production images repo, as per:
https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images

I see we have "Configure ingress to the services" as part of the list of things to do. Do we need to though? IIRC this will only be an internal tool, not exposed to the outside world, and only accessed by Yarn. So maybe the k8s service DNS should suffice?

> I see we have "Configure ingress to the services" as part of the list of things to do. Do we need to though? IIRC this will only be an internal tool, not exposed to the outside world, and only accessed by Yarn. So maybe the k8s service DNS should suffice?

Oh yes, I see what you mean.
I only meant it to be ingress into the kubernetes cluster from elsewhere in production. Not public facing.

As an example, we have the datahub-gms service configured here with ingress enabled.
The service name we use for this private service is datahub-gms.discovery.wmnet (the discovery part is because it's available in both DCs using discovery DNS).

I was thinking that maybe we would be using something like spark-history.svc.eqiad.wmnet for which I think we would need to have ingress enabled, but maybe it's not required if we use the k8s service DNS.
https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress

I'm just not sure that the k8s services addresses are resolvable by everything outside of the clusters, but I could be completely wrong about this.

Oh I see. If we can't resolve the k8s service names from outside the k8s cluster, then yes, we'd indeed need that. Point taken, thank you!

brouberol updated the task description.

Getting there. This is from an SSH port forwarding session in the test cluster.

[Screenshot: Spark History Server UI, accessed via SSH port forwarding]

We're currently investigating why only four applications are shown, but this is a good start.

Change 984130 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Retrict access to the spark-history k8s API tokens

https://gerrit.wikimedia.org/r/984130

The server is available at https://yarn.wikimedia.org/history-server/. We will enable the collection of historical metrics for all spark jobs at the start of 2024, once we're out of code freeze. I have added documentation about the service in Wikitech.

Change 984130 merged by Btullis:

[operations/puppet@production] Retrict access to the spark-history k8s API tokens

https://gerrit.wikimedia.org/r/984130

brouberol updated the task description.

Change 989786 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/docker-images/production-images@master] Add base production images containing Java 8 JDK and JRE

https://gerrit.wikimedia.org/r/989786

Change 989786 merged by Btullis:

[operations/docker-images/production-images@master] Add base production images containing Java 8 JDK and JRE

https://gerrit.wikimedia.org/r/989786

We have fixed the last remaining issue with the Spark History Server. It is now considered live and operational, with 60 days of retention. We expect the whole dataset of lz4-compressed event files to take about 2-3 TB in HDFS altogether.

Please file a bug for any issues you encounter, and we'll address them ASAP.

I hope it's useful for y'all!

> I hope it's useful for y'all!

Just passing by to thank you for this work! It will definitely make debugging easier, and collaboration on debugging easier as well!