
Submit a spark job to the dse-k8s cluster
Closed, Resolved · Public

Description

We need to verify that we can create a SparkApplication object in Kubernetes and have it create the necessary driver and executor pods.

At this stage it does not matter whether the job consumes any input or produces any output; it is simply a test of the correct functioning of the spark-operator.

Event Timeline

BTullis subscribed.

Moving this into the current sprint in order to reflect the work currently being undertaken.

The spec of the current test job is as follows.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark-driver
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

We are currently submitting this job with the spark-deploy kubeconfig credentials, using kubectl apply -f spark-pi-wmf.yaml

BTullis edited subscribers, added: nfraison, Stevemunene; removed: EChetty.

Errors from the driver pods whilst trying to create executors, related to resource limits:

23/03/09 17:54:43 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: Pod "spark-pi-82463e86c781f0f4-exec-50" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.containers[0].resources.requests, message=Invalid value: "1": must be less than or equal to cpu limit, reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=spark-pi-82463e86c781f0f4-exec-50, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "spark-pi-82463e86c781f0f4-exec-50" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

I will try reducing the requested number of cpu units.

The executor pods now launch successfully:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark-driver
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    coreLimit: "1"
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

But there are connection issues between the executors and the driver:

23/03/09 19:24:33 ERROR RpcOutboxMessage: Ask terminated before connecting successfully
23/03/09 19:24:33 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to spark-pi-wmf-3115df86c7d25eb9-driver-svc.spark.svc/10.67.26.77:7078 timed out (120000 ms)

Interesting, thanks. Was it simply the addition of the coreLimit: "1" that fixed this?

23/03/09 19:24:33 WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to spark-pi-wmf-3115df86c7d25eb9-driver-svc.spark.svc/10.67.26.77:7078 timed out (120000 ms)

Do we need another network policy object to allow these pods to talk to each other?

Yes, the coreLimit fixed it.
And yes, we need a specific NetworkPolicy to allow the driver and executors to communicate -> pushing it this morning.
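
For reference, a minimal sketch of what such a NetworkPolicy could look like (this is illustrative only, not the actual patch linked below; it assumes the spark-role: driver / spark-role: executor pod labels that Spark on Kubernetes applies, and allows all ports between the two sets of pods within the spark namespace):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-driver-executor
  namespace: spark
spec:
  # Applies to both driver and executor pods.
  podSelector:
    matchExpressions:
      - key: spark-role
        operator: In
        values: ["driver", "executor"]
  policyTypes:
    - Ingress
  ingress:
    # Allow driver <-> executor traffic on any port within the namespace.
    - from:
        - podSelector:
            matchExpressions:
              - key: spark-role
                operator: In
                values: ["driver", "executor"]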

Change 896303 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/deployment-charts@master] spark: Authorize driver and executor pods to communicate

https://gerrit.wikimedia.org/r/896303

Here is the latest version of the SparkApplication definition, which sets the driver and block manager ports:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  sparkConf:
    spark.driver.port: "12000"
    spark.driver.blockManager.port: "13000"
    spark.ui.port: "4045"
  driver:
    cores: 1
    coreLimit: "1"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark-driver
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    coreLimit: "1"
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Change 896303 merged by Nicolas Fraison:

[operations/deployment-charts@master] spark: Authorize driver and executor pods to communicate

https://gerrit.wikimedia.org/r/896303

Spark job submission works with the new NetworkPolicy and the port config on the SparkApplication.
Now trying to run a job that accesses HDFS.

Because the mutating webhook is not enabled, we can't rely on the hadoopConfigMap spec on the SparkApplication -> TODO: create a Phabricator ticket to add the webhook.
Currently trying to perform manually the actions that the webhook would do (a sketch of this manual setup follows below):

  • ConfigMap hadoop-conf containing core-site.xml and hdfs-site.xml
  • volumes and volumeMounts configured
  • HADOOP_CONF_DIR env var exposed

However, volumes and volumeMounts also rely on the webhook...
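
For illustration, a rough sketch of that manual setup (the hadoop-conf name and the /etc/hadoop/conf mount path are assumptions, the site.xml contents are placeholders, and, as noted above, the volumes / volumeMounts / envVars fields on the SparkApplication are themselves applied by the webhook, so this shows the intent rather than a working configuration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf
  namespace: spark
data:
  core-site.xml: |
    <!-- placeholder: real core-site.xml content goes here -->
    <configuration></configuration>
  hdfs-site.xml: |
    <!-- placeholder: real hdfs-site.xml content goes here -->
    <configuration></configuration>
---
# Fragment of the SparkApplication spec mounting the ConfigMap and
# exposing HADOOP_CONF_DIR on the driver (the executor section would mirror it).
spec:
  volumes:
    - name: "hadoop-conf"
      configMap:
        name: hadoop-conf
  driver:
    envVars:
      HADOOP_CONF_DIR: "/etc/hadoop/conf"
    volumeMounts:
      - name: "hadoop-conf"
        mountPath: "/etc/hadoop/conf"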

@BTullis can you check the boxes before closing the ticket, please? Thanks!

BTullis updated the task description. (Show Details)