
Deploy spark-operator to the dse-k8s cluster
Closed, ResolvedPublic5 Estimated Story Points

Description

Primary Task
Modify the key deployment helm charts to grant the Kubernetes operator the privileged rights that it requires.

Key tasks:

  • Define a helm chart for spark-operator using the upstream chart for inspiration
  • Define a helmfile deployment for spark-operator on the dse-cluster
  • Configure the new service and any requirements such as users/tokens/namespace etc.
  • Deploy the new service and test it
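
A minimal helmfile release for the new service might look like the following sketch; the chart path, release layout, and values file name are assumptions for illustration, not the final configuration:

```yaml
# Hypothetical helmfile entry for deploying spark-operator to dse-k8s.
# Chart location and values file are placeholders, not the merged layout.
releases:
  - name: spark-operator
    namespace: spark-operator
    chart: wmf-stable/spark-operator
    values:
      - values-dse-k8s-eqiad.yaml
```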

Event Timeline

BTullis renamed this task from Add the spark-on-k8s operator privileged components to the dse-k8s cluster to Deploy spark-operator to the dse-k8s cluster.Oct 7 2022, 1:14 PM
BTullis claimed this task.
BTullis triaged this task as High priority.
BTullis updated the task description. (Show Details)

Change 855674 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a spark-operator chart

https://gerrit.wikimedia.org/r/855674

I feel that I'm making reasonable progress on this now. I have a helm chart and a helmfile deployment that I think is almost ready for review. https://gerrit.wikimedia.org/r/855674

So far I have been able to deploy our custom spark-operator to minikube and verify the following behaviour:

  • the operator is created in the spark-operator namespace
  • it monitors the spark namespace for SparkApplication and ScheduledSparkApplication API requests
  • when it detects a valid request, it launches a Spark driver pod using our custom Spark image
  • the driver pod then runs spark-submit with the given application parameters
  • a specified number of executor pods are then launched, which copy the jar files from the driver pod and begin execution
The only application I have run so far is the SparkPi example, whose code is local to the Spark pod.

This is the SparkApplication definition that I applied with kubectl apply -f spark-pi-wmf.yaml

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi-wmf
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "docker-registry.wikimedia.org/spark:3.3.0-2"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar"
  arguments: ["1000"]
  sparkVersion: "3.3.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.3.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 5
    memory: "512m"
    labels:
      version: 3.3.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

This is the tail end of kubectl logs -n spark spark-pi-wmf-driver, showing that the run was successful and including the computed value of Pi.

+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.6 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
Pi is roughly 3.1416266714162666

This is the output from kubectl get events -n spark --watch during this operation.

0s          Normal    SparkApplicationAdded     sparkapplication/spark-pi-wmf   SparkApplication spark-pi-wmf was added, enqueuing it for submission
0s          Normal    Scheduled                 pod/spark-pi-wmf-driver         Successfully assigned spark/spark-pi-wmf-driver to minikube
0s          Normal    SparkApplicationSubmitted   sparkapplication/spark-pi-wmf   SparkApplication spark-pi-wmf was submitted successfully
0s          Normal    Pulled                      pod/spark-pi-wmf-driver         Container image "docker-registry.wikimedia.org/spark:3.3.0-2" already present on machine
0s          Normal    Created                     pod/spark-pi-wmf-driver         Created container spark-kubernetes-driver
0s          Normal    Started                     pod/spark-pi-wmf-driver         Started container spark-kubernetes-driver
0s          Normal    SparkDriverRunning          sparkapplication/spark-pi-wmf   Driver spark-pi-wmf-driver is running
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf   Executor [spark-pi-effe538467bf5147-exec-1] is pending
0s          Normal    Scheduled                   pod/spark-pi-effe538467bf5147-exec-1   Successfully assigned spark/spark-pi-effe538467bf5147-exec-1 to minikube
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-2] is pending
0s          Normal    Scheduled                   pod/spark-pi-effe538467bf5147-exec-2   Successfully assigned spark/spark-pi-effe538467bf5147-exec-2 to minikube
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-3] is pending
0s          Normal    Scheduled                   pod/spark-pi-effe538467bf5147-exec-3   Successfully assigned spark/spark-pi-effe538467bf5147-exec-3 to minikube
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-3] is pending
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-4] is pending
0s          Normal    Scheduled                   pod/spark-pi-effe538467bf5147-exec-4   Successfully assigned spark/spark-pi-effe538467bf5147-exec-4 to minikube
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-4] is pending
0s          Normal    Scheduled                   pod/spark-pi-effe538467bf5147-exec-5   Successfully assigned spark/spark-pi-effe538467bf5147-exec-5 to minikube
0s          Normal    SparkExecutorPending        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-5] is pending
0s          Normal    Pulled                      pod/spark-pi-effe538467bf5147-exec-1   Container image "docker-registry.wikimedia.org/spark:3.3.0-2" already present on machine
0s          Normal    Created                     pod/spark-pi-effe538467bf5147-exec-1   Created container spark-kubernetes-executor
0s          Normal    Started                     pod/spark-pi-effe538467bf5147-exec-1   Started container spark-kubernetes-executor
0s          Normal    Pulled                      pod/spark-pi-effe538467bf5147-exec-2   Container image "docker-registry.wikimedia.org/spark:3.3.0-2" already present on machine
0s          Normal    Created                     pod/spark-pi-effe538467bf5147-exec-2   Created container spark-kubernetes-executor
0s          Normal    Started                     pod/spark-pi-effe538467bf5147-exec-2   Started container spark-kubernetes-executor
0s          Normal    Pulled                      pod/spark-pi-effe538467bf5147-exec-3   Container image "docker-registry.wikimedia.org/spark:3.3.0-2" already present on machine
0s          Normal    Created                     pod/spark-pi-effe538467bf5147-exec-3   Created container spark-kubernetes-executor
0s          Normal    Started                     pod/spark-pi-effe538467bf5147-exec-3   Started container spark-kubernetes-executor
0s          Normal    Pulled                      pod/spark-pi-effe538467bf5147-exec-4   Container image "docker-registry.wikimedia.org/spark:3.3.0-2" already present on machine
0s          Normal    Created                     pod/spark-pi-effe538467bf5147-exec-4   Created container spark-kubernetes-executor
0s          Normal    Pulled                      pod/spark-pi-effe538467bf5147-exec-5   Container image "docker-registry.wikimedia.org/spark:3.3.0-2" already present on machine
0s          Normal    Created                     pod/spark-pi-effe538467bf5147-exec-5   Created container spark-kubernetes-executor
0s          Normal    SparkExecutorRunning        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-1] is running
1s          Normal    Started                     pod/spark-pi-effe538467bf5147-exec-4   Started container spark-kubernetes-executor
0s          Normal    SparkExecutorRunning        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-3] is running
0s          Normal    Started                     pod/spark-pi-effe538467bf5147-exec-5   Started container spark-kubernetes-executor
0s          Normal    SparkExecutorRunning        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-2] is running
0s          Normal    SparkExecutorRunning        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-5] is running
0s          Normal    SparkExecutorRunning        sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-4] is running
0s          Normal    Killing                     pod/spark-pi-effe538467bf5147-exec-1   Stopping container spark-kubernetes-executor
0s          Normal    Killing                     pod/spark-pi-effe538467bf5147-exec-2   Stopping container spark-kubernetes-executor
0s          Normal    Killing                     pod/spark-pi-effe538467bf5147-exec-3   Stopping container spark-kubernetes-executor
0s          Normal    Killing                     pod/spark-pi-effe538467bf5147-exec-4   Stopping container spark-kubernetes-executor
0s          Normal    Killing                     pod/spark-pi-effe538467bf5147-exec-5   Stopping container spark-kubernetes-executor
0s          Normal    SparkExecutorCompleted      sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-2] completed
0s          Normal    SparkExecutorCompleted      sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-1] completed
0s          Normal    SparkExecutorCompleted      sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-3] completed
0s          Normal    SparkExecutorCompleted      sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-4] completed
0s          Normal    SparkDriverCompleted        sparkapplication/spark-pi-wmf          Driver spark-pi-wmf-driver completed
0s          Normal    SparkApplicationCompleted   sparkapplication/spark-pi-wmf          SparkApplication spark-pi-wmf completed
0s          Normal    SparkExecutorCompleted      sparkapplication/spark-pi-wmf          Executor [spark-pi-effe538467bf5147-exec-5] completed

I'd like to deploy this to the dse-k8s cluster next, so that I can repeat this same test. Once the SparkPi test has been shown to work there, we can look at some more complicated examples where access to remote storage can be configured.
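
As a sketch of what such a remote-storage example might look like, a SparkApplication spec can carry Hadoop configuration for an object store. The endpoint, bucket, and jar path below are placeholders, not a working configuration:

```yaml
# Hypothetical fragment of a SparkApplication spec adding S3A-style
# remote storage access; the endpoint and bucket names are placeholders.
spec:
  hadoopConf:
    "fs.s3a.endpoint": "https://object-store.example.internal"
    "fs.s3a.path.style.access": "true"
  mainApplicationFile: "s3a://example-bucket/jobs/spark-job.jar"
```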

Change 856938 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Test CR - not to be merged

https://gerrit.wikimedia.org/r/856938

Change 856938 abandoned by Btullis:

[operations/deployment-charts@master] Test CR - not to be merged

Reason:

This was never intended to be merged, only for troubleshooting purposes.

https://gerrit.wikimedia.org/r/856938

I have marked the helm chart, the helmfile configuration and the supporting items as complete, although technically they are still in review.
I have been able to deploy our custom chart to minikube, so I am now hoping to test this deployment on the dse-k8s cluster.
The RBAC rules have been split into a sub-ticket T322635: Define necessary RBAC rules for spark on dse-k8s cluster

I'm rewriting these to take account of the improvements brought about by the new organization of the helm chart templates.

Change 864770 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/docker-images/production-images@master] Update the spark images to remove upstream support for the webhook

https://gerrit.wikimedia.org/r/864770

Change 864770 merged by Btullis:

[operations/docker-images/production-images@master] Update the spark images to remove upstream support for the webhook

https://gerrit.wikimedia.org/r/864770

Change 884896 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/docker-images/production-images@master] Revert changes to the maven proxy configuration that didn't work

https://gerrit.wikimedia.org/r/884896

Change 884896 merged by Btullis:

[operations/docker-images/production-images@master] Revert changes to the maven proxy configuration that didn't work

https://gerrit.wikimedia.org/r/884896

Thanks @bd808 - that's very helpful. For the time being I have simply reverted to using the hard-coded webproxy host and port 8080 on the command line, which does work.
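
For reference, the hard-coded form amounts to passing the standard Java proxy system properties on the build command line. This is a sketch; the exact invocation inside the image build may differ:

```shell
# Sketch of the hard-coded proxy settings using the standard JVM proxy
# properties; webproxy:8080 is the proxy host/port mentioned above.
mvn --batch-mode package \
  -Dhttp.proxyHost=webproxy  -Dhttp.proxyPort=8080 \
  -Dhttps.proxyHost=webproxy -Dhttps.proxyPort=8080
```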

Change 887994 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the kubectl config files generated for the dse-k8s cluster

https://gerrit.wikimedia.org/r/887994

Change 887994 merged by Btullis:

[operations/puppet@production] Update the kubectl config files generated for the dse-k8s cluster

https://gerrit.wikimedia.org/r/887994

Change 855674 merged by jenkins-bot:

[operations/deployment-charts@master] Add a spark-operator chart and helmfile configuration

https://gerrit.wikimedia.org/r/855674

Change 895799 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove an invalid namespace definition from the spark-operator chart

https://gerrit.wikimedia.org/r/895799

Change 895799 merged by jenkins-bot:

[operations/deployment-charts@master] Remove an invalid namespace definition from the spark-operator chart

https://gerrit.wikimedia.org/r/895799

We attempted a deployment, but encountered an error when deploying the changes to the namespaces.

The initial deployment command was helmfile -e dse-k8s-eqiad -i apply

The changes to the rbac-rules release were deployed successfully, but it failed on the namespaces change.
We later confirmed this with a more focused command: helmfile -e dse-k8s-eqiad -l name=namespaces sync

The error displayed was:

STDERR:
  Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "deploy" with kind RoleBinding: RoleBinding.rbac.authorization.k8s.io "deploy" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"deploy-sparkapplications"}: cannot change roleRef: cannot patch "deploy" with kind RoleBinding: RoleBinding.rbac.authorization.k8s.io "deploy" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"deploy-sparkapplications"}: cannot change roleRef

We're now looking into the cause of this error.
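
The roleRef field of an existing RoleBinding is immutable in Kubernetes, which is why the upgrade cannot patch it in place. One likely remedy (a sketch, assuming the binding is safe to recreate and lives in the spark namespace) is to delete the old binding and re-run the sync so that it is recreated with the new roleRef:

```shell
# roleRef on a RoleBinding cannot be changed after creation, so delete the
# stale binding and let helmfile recreate it. The namespace is an assumption.
kubectl -n spark delete rolebinding deploy
helmfile -e dse-k8s-eqiad -l name=namespaces sync
```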

Change 895837 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the version of the spark-operator that we deploy

https://gerrit.wikimedia.org/r/895837

Change 895837 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the version of the spark-operator that we deploy

https://gerrit.wikimedia.org/r/895837

Change 895842 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Specify docker image and version consistently

https://gerrit.wikimedia.org/r/895842

Change 896053 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/docker-images/production-images@master] spark-operator: rely on exec entrypoint instead of shell one

https://gerrit.wikimedia.org/r/896053

This is now successfully deployed.

Any further changes will be more incremental, as we begin to test and expand on the functionality.

Change 896053 merged by Nicolas Fraison:

[operations/docker-images/production-images@master] spark-operator: rely on exec entrypoint instead of shell one

https://gerrit.wikimedia.org/r/896053

Change 895842 merged by jenkins-bot:

[operations/deployment-charts@master] Update the spark-operator chart with consistent image details

https://gerrit.wikimedia.org/r/895842