Update October 2025
We have installed the spark-operator on the dse-k8s-eqiad cluster and it can be used to execute an example job.
However, we now need to facilitate more meaningful testing by the Data-Engineering team, which means that we need spark to be able to reach out to data sources and sinks, with appropriate authentication, authorization, and monitoring capabilities.
Our goal is to have Airflow be able to launch spark jobs that run on the dse-k8s cluster, so we have shifted the focus away from regular users on stat servers, for the time being.
The sparkctl binary has been deprecated and dropped from recent versions of the spark-operator.
The spark-operator project itself has been adopted by kubeflow, so is now hosted at: https://github.com/kubeflow/spark-operator
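With sparkctl gone, a job would presumably be submitted by creating a SparkApplication custom resource directly, for example with kubectl apply or from an Airflow task. The sketch below builds such a manifest in Python and prints it as JSON, which kubectl accepts alongside YAML. The job name, namespace, image, jar path, service account, and resource sizes are all hypothetical placeholders, not values taken from the dse-k8s-eqiad deployment, and the field set is a minimal subset of the v1beta2 SparkApplication spec rather than a recommended configuration.

```python
import json


def build_spark_application(name, namespace, image, main_class, main_file,
                            spark_version="3.5.0", executors=2,
                            service_account="spark"):
    """Return a minimal SparkApplication manifest as a plain dict.

    All argument values are supplied by the caller; the defaults here are
    illustrative assumptions only.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            "driver": {
                "cores": 1,
                "memory": "512m",
                "serviceAccount": service_account,
            },
            "executor": {
                "instances": executors,
                "cores": 1,
                "memory": "512m",
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical values for a SparkPi-style example job.
    manifest = build_spark_application(
        name="spark-pi-example",
        namespace="spark-example",
        image="example.registry/spark:3.5.0",
        main_class="org.apache.spark.examples.SparkPi",
        main_file="local:///opt/spark/examples/jars/spark-examples.jar",
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could then be piped to kubectl apply -f - by any client with permission to create SparkApplication objects in the target namespace.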
Original ticket description follows:
This ticket is closely aligned with T318535: Document ideas & investigation results from out spike with "Spark on k8s" [SPIKE - 1.5 Sprints] and forms part of an early experimentation phase of T308317: Data Infrastructure as a Service MVP, in support of T302728: Analytics Platform Future State Planing.
We would like to be able to test the Spark on K8S operator on the DSE cluster: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
The intended outcome is to be able to execute a spark job as a normal user on a stat box using sparkctl create: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/sparkctl/README.md#create
The nature of the spark job itself is not important at this stage. It could be stateless.
In future we will need to investigate the capabilities of both the HDFS and Ceph storage back-ends.
Goal:
Run the Spark on K8s operator on the DSE cluster
Task:
- Make the spark-on-k8s operator packages/images available for use
- Add the spark-on-k8s operator privileged components to the dse-k8s cluster
- Add the sparkctl binary to the stat boxes
- Submit a spark job to the dse-k8s cluster
Outcomes:
- Can successfully launch a spark job on the dse-k8s cluster with sparkctl from a stat box and monitor/log its execution.