
Submit a spark job to the dse-k8s cluster
Closed, Resolved · Public

Description

We need to verify that we can create a SparkApplication object in Kubernetes and have it create the necessary driver and executor pods.

At this stage it does not matter whether the job consumes any input or produces any output; it is simply a test of the correct functioning of the spark-operator.

Event Timeline

BTullis subscribed.

Moving this into the current sprint in order to reflect the work currently being undertaken.

The spec of the current test job is as follows.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark-driver
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

We are currently submitting this job with the spark-deploy kubeconfig credentials, using kubectl apply -f spark-pi-wmf.yaml

BTullis edited subscribers, added: nfraison, Stevemunene; removed: EChetty.

Errors from the driver pods whilst trying to create executors, related to resource limits:

23/03/09 17:54:43 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: Pod "spark-pi-82463e86c781f0f4-exec-50" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.containers[0].resources.requests, message=Invalid value: "1": must be less than or equal to cpu limit, reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=spark-pi-82463e86c781f0f4-exec-50, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "spark-pi-82463e86c781f0f4-exec-50" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

I will try reducing the requested number of cpu units.

The executor pods now launch successfully:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark-driver
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    coreLimit: "1"
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

But there are connection issues between the executors and the driver:

23/03/09 19:24:33 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
23/03/09 19:24:33 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to spark-pi-wmf-3115df86c7d25eb9-driver-svc.spark.svc/10.67.26.77:7078 timed out (120000 ms)

Interesting, thanks. Was it simply the addition of the coreLimit: "1" that fixed this?

23/03/09 19:24:33 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to spark-pi-wmf-3115df86c7d25eb9-driver-svc.spark.svc/10.67.26.77:7078 timed out (120000 ms)

Do we need another network policy object to allow these pods to talk to each other?

Yes, the coreLimit fixed it.
And yes, we need a specific NetworkPolicy to allow the driver and executors to communicate -> pushing it this morning.
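
For reference, a minimal sketch of what such a NetworkPolicy could look like (this is illustrative only, not the actual patch linked below; it assumes the spark-role: driver / spark-role: executor pod labels that Spark on Kubernetes applies, and allows all ports between the two sets of pods within the spark namespace):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-driver-executor
  namespace: spark
spec:
  # Applies to both driver and executor pods.
  podSelector:
    matchExpressions:
      - key: spark-role
        operator: In
        values: ["driver", "executor"]
  policyTypes:
    - Ingress
  ingress:
    # Allow driver <-> executor traffic on any port within the namespace.
    - from:
        - podSelector:
            matchExpressions:
              - key: spark-role
                operator: In
                values: ["driver", "executor"]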

Change 896303 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/deployment-charts@master] spark: Authorize driver and executor pods to communicate

https://gerrit.wikimedia.org/r/896303

Here is the latest version of the SparkApplication definition, which sets the driver and block manager ports:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  sparkConf:
    spark.driver.port: "12000"
    spark.driver.blockManager.port: "13000"
    spark.ui.port: "4045"
  driver:
    cores: 1
    coreLimit: "1"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark-driver
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    coreLimit: "1"
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Change 896303 merged by Nicolas Fraison:

[operations/deployment-charts@master] spark: Authorize driver and executor pods to communicate

https://gerrit.wikimedia.org/r/896303

Spark job submission works with the new NetworkPolicy and the port config on the SparkApplication.
Now trying to run a job that accesses HDFS.

Because the mutating webhook is not enabled, we can't rely on the hadoopConfigMap spec on the SparkApplication -> TODO: create a Phabricator ticket to add the webhook.
Currently trying to perform manually the actions that the webhook would do (a sketch of this manual setup follows below):

  • ConfigMap hadoop-conf containing core-site.xml and hdfs-site.xml
  • volumes and volumeMounts configured
  • HADOOP_CONF_DIR env var exposed

However, volumes and volumeMounts also rely on the webhook...
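
For illustration, a rough sketch of that manual setup (the hadoop-conf name and the /etc/hadoop/conf mount path are assumptions, the site.xml contents are placeholders, and, as noted above, the volumes / volumeMounts / envVars fields on the SparkApplication are themselves applied by the webhook, so this shows the intent rather than a working configuration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf
  namespace: spark
data:
  core-site.xml: |
    <!-- placeholder: real core-site.xml content goes here -->
    <configuration></configuration>
  hdfs-site.xml: |
    <!-- placeholder: real hdfs-site.xml content goes here -->
    <configuration></configuration>
---
# Fragment of the SparkApplication spec mounting the ConfigMap and
# exposing HADOOP_CONF_DIR on the driver (the executor section would mirror it).
spec:
  volumes:
    - name: "hadoop-conf"
      configMap:
        name: hadoop-conf
  driver:
    envVars:
      HADOOP_CONF_DIR: "/etc/hadoop/conf"
    volumeMounts:
      - name: "hadoop-conf"
        mountPath: "/etc/hadoop/conf"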

@BTullis can you check the boxes before closing the ticket, please? Thanks!

BTullis updated the task description. (Show Details)