Update January 2026
Our work is currently focused on: T413977: Deploy Kyuubi to enable dynamic spark-sql clusters in dse-k8s
We had hoped to use spark-thriftserver in a Kubernetes sidecar model.
This would have allowed the creation of a custom, ephemeral spark cluster, alongside each dbt job launched by airflow.
Achieving this has proved difficult because our Hive Metastore and HDFS cluster are protected by Kerberos.
Specifically, we cannot run the spark thriftserver and its JDBC interface without also having it be kerberos-enabled.
Such a kerberos-enabled service is broadly incompatible with ephemeral services in short-lived Kubernetes pods, because of the way in which Kerberos resolves host names when validating service principals.
For this reason, we are now moving towards having a central Kyuubi service per analytics namespace.
This will still allow us to create ephemeral Spark clusters with Thrift and JDBC interfaces, but the service hostnames and pod names will be static, so we will be able to make Kerberos work using our current keytab deployment mechanism.
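With a static, per-namespace Kyuubi service, the kerberized JDBC connection string can be fixed at deploy time. A minimal sketch of how a client might build such a URL follows; the service DNS pattern, port, and realm are illustrative assumptions, not the final deployment values.

```python
def kyuubi_jdbc_url(namespace: str, realm: str = "WIKIMEDIA", port: int = 10009) -> str:
    """Build a kerberized JDBC URL for a hypothetical per-namespace Kyuubi service.

    Because the k8s service name is stable, the Kerberos principal embedded
    in the URL does not change between pod restarts.
    """
    host = f"kyuubi.{namespace}.svc.cluster.local"  # assumed k8s service DNS
    principal = f"kyuubi/{host}@{realm}"            # assumed service principal
    return f"jdbc:hive2://{host}:{port}/default;principal={principal}"

print(kyuubi_jdbc_url("analytics"))
```

The point of the sketch is only that every moving part of the URL is known statically, which is what makes the keytab-based Kerberos setup workable.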
Original description follows
For testing dbt jobs, Spark's session mode lets us start Spark jobs directly from the stat machines.
For production jobs launched via Airflow, using session mode will be more complicated, as it would mean running dbt from within a skein container.
A different approach would be to provide one SparkThriftSQLServer per production user, allowing Airflow instances to run dbt jobs against those servers.
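A dbt profile targeting such a per-user Thrift server could look roughly like the following, using dbt-spark's `thrift` connection method. The host, port, schema, and Kerberos settings here are illustrative assumptions, not agreed values.

```yaml
# profiles.yml sketch for dbt-spark against a per-user Thrift server
analytics:
  target: prod
  outputs:
    prod:
      type: spark
      method: thrift
      host: spark-thrift.analytics.example.wmnet  # assumed DNS name
      port: 10000
      schema: analytics
      auth: KERBEROS
      kerberos_service_name: hive
```

Each production user would get a copy of this profile pointing at their own server, which is what makes the per-user replication step below mostly a matter of templating.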
- If using skein, the idea would be to test launching dbt from a skein container, with a specific profile for spark to run in session mode.
- If using a Spark Thrift Server, the idea would be to:
  - install one SparkThriftServer using the analytics user
  - in k8s if possible, on bare metal/VM otherwise,
  - with auto-restart in case of failure (k8s would help with this), as it would be used regularly from Airflow,
  - with a defined DNS name to be accessed from Airflow
- Verify that the job can execute SQL and write production data
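For the skein option above, the application spec could be sketched as follows. The application name, queue, resources, file layout, and dbt invocation are all illustrative assumptions.

```yaml
# skein ApplicationSpec sketch: run dbt in a YARN container in session mode
name: dbt-session-mode
queue: production          # assumed YARN queue
services:
  dbt:
    resources:
      memory: 4 GiB
      vcores: 2
    files:
      dbt-project: ./dbt-project   # local dbt project uploaded into the container
    script: |
      cd dbt-project
      dbt run --profiles-dir . --target spark_session
```

This would be submitted with `skein application submit spec.yaml`; the specific dbt profile used for session mode would still need to be worked out as part of the test.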
If we can have all that, the next step will be to make replicating such a server easy, so that we can provide this capability for every prod user on the Hadoop cluster.