
Provide a Spark-on-k8s access for sql tools (dbt)
Open, High, Public

Description

Update January 2026

Our work is currently focused on: T413977: Deploy Kyuubi to enable dynamic spark-sql clusters in dse-k8s

We had hoped to use spark-thriftserver in a Kubernetes sidecar model.
This would have allowed the creation of a custom, ephemeral spark cluster, alongside each dbt job launched by airflow.

Achieving this has proved difficult because our Hive Metastore and HDFS cluster are protected by Kerberos.
Specifically, we cannot run the spark thriftserver and its JDBC interface without also having it be kerberos-enabled.
This kerberos-enabled service is broadly incompatible with ephemeral services in short-lived kubernetes pods, due to the way in which kerberos wants to resolve host names.

For this reason, we are now moving towards having a central Kyuubi service per analytics namespace.
This will still allow us to create ephemeral spark clusters with thrift and JDBC interfaces, but the hostnames and pod names will be static and therefore we will be able to make kerberos work using our current keytab deployment mechanism.
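As an illustration of the direction of travel, a per-namespace Kyuubi instance might carry a kyuubi-defaults.conf along these lines. This is a hedged sketch: the keys are standard Kyuubi settings, but the principal, keytab path, and port values are assumptions pieced together from elsewhere in this task, not a deployed configuration.

```properties
# Sketch only - values are illustrative assumptions.

# Kerberos for the Kyuubi server itself, using the statically deployed keytab.
# The instance part of the principal must match the service's fixed hostname,
# which is why a static per-namespace deployment works where ephemeral pods did not.
kyuubi.authentication                     KERBEROS
kyuubi.kinit.principal                    analytics/analytics-test.discovery.wmnet@WIKIMEDIA
kyuubi.kinit.keytab                       /etc/security/keytabs/analytics.keytab

# One ephemeral spark cluster per JDBC connection, as described below.
kyuubi.engine.share.level                 CONNECTION

# Fixed frontend port for clients (e.g. dbt via JDBC/PyHive).
kyuubi.frontend.thrift.binary.bind.port   10009
```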


Original description follows

For testing dbt jobs, spark session-mode starts spark jobs from stat machines.

For production jobs launched via airflow, getting session-mode working will be more complicated, as it would mean running dbt from within a skein container.

A different approach would be to provide one SparkThriftSQLServer per production user, allowing Airflow instances to run dbt jobs against those servers.

  • If using skein, the idea would be to test launching dbt from a skein container, with a specific profile for spark to run in session mode.
  • If using Spark Thrift-Server the idea would be to
    • install one SparkThriftServer using the analytics user
    • in k8s if possible, on bare metal/VM otherwise,
    • with auto-restart in case of failure (k8s would help with this), as this would be used regularly from Airflow
    • with a defined DNS to be accessed from Airflow
    • Verify that the job can execute SQL and write production data

If we can have all that, the next step will be to make replicating such a server easy, so as to provide this capability for every prod user on the Hadoop cluster.

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +3 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +10 -10
operations/deployment-charts | master | +129 -100
operations/deployment-charts | master | +33 -13
operations/deployment-charts | master | +60 -12
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +8 -5
operations/deployment-charts | master | +2 -3
operations/deployment-charts | master | +32 -2
operations/deployment-charts | master | +4 -4
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +5 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +0 -19
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +29 -36
operations/deployment-charts | master | +1 -67
operations/puppet | production | +8 -0
operations/deployment-charts | master | +4 -0
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +40 -0
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Update the kyuubi variants | repos/data-engineering/spark!50 | btullis | add_kyuubi | main
Add a variant for kyuubi | repos/data-engineering/spark!48 | btullis | add_kyuubi | main
Add the pykerberos package to the dbt venv | repos/data-engineering/dbt-jobs!14 | btullis | add_python_kerberos | main
Add support for spark in thrift mode | repos/data-engineering/dbt-jobs!13 | btullis | add_pyhive | main
Add Kerberos packages to the docker image | repos/data-engineering/dbt-jobs!12 | javiermonton | feature/add-kerberos-packages | main
Add kerberos user utilities to the spark distribution | repos/data-engineering/spark!47 | btullis | add_krb5_spark | main
Update the JDK/JRE to Java 11 | repos/data-engineering/spark!45 | btullis | update_jre | main
Add the libisal2 and libssl-dev packages | repos/data-engineering/spark!44 | btullis | add_missing_runtime_libs | main
Rework the blubber file to reduce repetition and add the iceberg jar | repos/data-engineering/spark!43 | btullis | add_iceberg_support | main
Upgrade the Hadoop distribution to 3.4.2 when building spark images | repos/data-engineering/spark!42 | btullis | update_hadoop_version | main
Add snappy and ssl support to spark images | repos/data-engineering/spark!39 | btullis | add_snappy_ssl_support | main
Use the lowercase env variables for proxy settings | repos/data-engineering/spark!26 | btullis | fix_mvn_proxy_opts | main
Use a different method for setting the maven proxy configuration | repos/data-engineering/spark!25 | btullis | fix_maven_proxy | main
Update the maven opts for the publish stage | repos/data-engineering/spark!24 | btullis | fix_maven_opt | main
Set the MAVEN_OPTS variable when building on trusted runners | repos/data-engineering/spark!23 | btullis | fix_maven_proxy_opts | main
Build a custom spark distribution with hive and thriftserver support | repos/data-engineering/spark!22 | btullis | build_spark | main
Add python3-dev and procps to the spark distribution image | repos/data-engineering/spark!21 | btullis | update_spark_image | main
Create a dbt DAG that launches a pod containing a dbt-jobs pod and a spark-thriftserver sidecar | repos/data-engineering/airflow-dags!1820 | btullis | spark_thrift_dbt | main

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1212130 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bind the spark-job-orchestration role to the default serviceaccount

https://gerrit.wikimedia.org/r/1212130

I ran into a bit of a problem because the pre-built versions of spark did not include support for hive or the thriftserver.

btullis@deploy2002:~$ kubectl exec -it dbt-debug-ry7aiyd -c spark-thriftserver -- /entrypoint.sh bash
<snip>
spark@dbt-debug-ry7aiyd:/tmp$ /opt/spark/sbin/start-thriftserver.sh
<snip>
  ========================================
  Error: Failed to load class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
  Failed to load main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
  You need to build Spark with -Phive and -Phive-thriftserver.

So in order to get:

  • version 3.5.7 of spark
  • alongside version 2.10.2 of hadoop
  • with the correct build flags to support the thriftserver

... I needed to update the build process to build spark from source.

I've done that now, so the latest version of the spark image now has support for running the thriftserver:

The next step is to get the default serviceaccount in the analytics-test namespace the required privileges to create and manage pods etc.

I've been running this command as a test:

spark-submit --master k8s://https://kubernetes.default \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.instances=5 \
 --conf spark.kubernetes.container.image=docker-registry.discovery.wmnet/repos/data-engineering/spark:3.5.7-2025-12-01-152105-a7a1e7a2edebcf5343245f56a7efab1be4b317b4 \
 --conf spark.kubernetes.namespace=analytics-test \
 local:///opt/spark/examples/jars/spark-examples_2.12-3.5.7.jar

The error messages indicate that the default serviceaccount cannot create pods.

16:34:53.098 [main] ERROR org.apache.spark.deploy.k8s.submit.Client - Please check "kubectl auth can-i create pod" first. It should be yes.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default/api/v1/namespaces/analytics-test/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods is forbidden: User "system:serviceaccount:analytics-test:default" cannot create resource "pods" in API group "" in the namespace "analytics-test".

We could also look at using a custom serviceaccount. e.g.

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark

...but in my testing, this didn't seem to be applied correctly, and I was led to this bug: https://issues.apache.org/jira/browse/SPARK-26295

Therefore, it might be better to modify the default serviceaccount for each spark-enabled namespace.
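For reference, granting the default serviceaccount these rights would look roughly like the following Role and RoleBinding. This is a hypothetical sketch: the role name and the exact resource/verb list are assumptions about what spark-submit's driver needs, not the contents of the spark-job-orchestration patch itself.

```yaml
# Sketch only: the Spark driver needs to create and manage pods (and usually
# services/configmaps) in its namespace. Names and verbs are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-job-orchestration
  namespace: analytics-test
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["create", "get", "list", "watch", "delete", "deletecollection"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-job-orchestration
  namespace: analytics-test
subjects:
  - kind: ServiceAccount
    name: default
    namespace: analytics-test
roleRef:
  kind: Role
  name: spark-job-orchestration
  apiGroup: rbac.authorization.k8s.io
```

After applying something like this, `kubectl auth can-i create pod` as that serviceaccount should answer yes, which is the check the Spark error message suggests.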

Change #1214020 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove unnecessary and/or incorrect hadoop/spark config options

https://gerrit.wikimedia.org/r/1214020

Change #1214020 merged by jenkins-bot:

[operations/deployment-charts@master] Remove unnecessary and/or incorrect hadoop/spark config options

https://gerrit.wikimedia.org/r/1214020

Change #1214091 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the spark configuration

https://gerrit.wikimedia.org/r/1214091

Change #1214091 merged by jenkins-bot:

[operations/deployment-charts@master] Update the spark configuration

https://gerrit.wikimedia.org/r/1214091

I'm still making steady progress on this, but there is one more thing that I might have to fix, which is the openssl support.

I'm getting the following output from hadoop checknative in the spark container.

spark@dbt-debug-m5phhkk:/tmp$ hadoop checknative
25/12/02 19:02:58 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
25/12/02 19:02:58 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/x86_64-linux-gnu/libz.so.1
snappy:  true /lib/x86_64-linux-gnu/libsnappy.so.1
zstd  :  true /lib/x86_64-linux-gnu/libzstd.so.1
lz4:     true revision:10301
bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: false EVP_CIPHER_CTX_block_size

This led me to these bugs:

Ultimately, I believe that this is related to the fact that Debian bookworm ships openssl version 3, while Hadoop 2.10.2 still needs libssl1.1.

I think that I can get around this by using the same workaround that I did here.

I'm taking a slightly different approach to this, since I discovered a problem using Spark 3.5.7 with Hadoop 2.10.2.

I had got to the point where I was running the following command in a spark container:

pyspark --conf spark.kerberos.keytab=/etc/security/keytabs/analytics.keytab --conf spark.kerberos.principal=analytics/analytics-test.discovery.wmnet@WIKIMEDIA

The spark-defaults.conf file at this point contains the following configuration statements.

spark@dbt-debug-m5phhkk:/tmp$ cat /opt/spark/conf/spark-defaults.conf

spark.authenticate  true
spark.dynamicAllocation.cachedExecutorIdleTimeout  3600s
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout  60s
spark.eventLog.compress  true
spark.eventLog.dir  hdfs:///var/log/spark
spark.eventLog.enabled  true
spark.kubernetes.authenticate.driver.serviceAccountName  spark
spark.network.crypto.enabled  true
spark.network.crypto.keyFactoryAlgorithm  PBKDF2WithHmacSHA256
spark.network.crypto.keyLength  256
spark.network.crypto.saslFallback  false
spark.sql.catalog.spark_catalog  org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type  hive
spark.sql.catalogImplementation  hive
spark.sql.extensions  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.files.maxPartitionBytes  268435456
spark.sql.warehouse.dir  hdfs:///user/hive/warehouse

The following error message appeared.

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.DistributedFileSystem$HdfsDataOutputStreamBuilder.replicate()Lorg/apache/hadoop/hdfs/DistributedFileSystem$HdfsDataOutputStreamBuilder;
	at org.apache.spark.deploy.SparkHadoopUtil$.createFile(SparkHadoopUtil.scala:578)
	at org.apache.spark.deploy.history.EventLogFileWriter.initLogFile(EventLogFileWriters.scala:98)
	at org.apache.spark.deploy.history.SingleEventLogFileWriter.start(EventLogFileWriters.scala:223)
	at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:81)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:632)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)

I believe that this indicates a serious impediment to continuing to use Hadoop 2.10.2 with Spark 3.5.7 because this method:

DistributedFileSystem$HdfsDataOutputStreamBuilder.replicate()

...was only introduced in Hadoop 3.3.

At this point, I believe that we have two options to continue using Spark on Kubernetes for this project:

  1. Upgrade Hadoop and rely on the backward compatibility of the HDFS version 3.3 jars
  2. Downgrade Spark to a version that we know works with the HDFS version 2.10 jars

I'm investigating option 1 first. It does mean that we have to drop support for YARN from this spark build, because we know that the YARN client jars are not backward compatible.
However, I don't believe that this poses any difficulty for using our existing hive metastore, since that uses a Thrift API, which should be stable between these versions.

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/43

Rework the blubber file to reduce repetition and add the iceberg jar

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/43

Rework the blubber file to reduce repetition and add the iceberg jar

Change #1215556 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump spark image

https://gerrit.wikimedia.org/r/1215556

I am continuing to make progress, but now I have an issue with the iceberg version:

spark@dbt-debug-kjlbkxo:/tmp$ spark-sql
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/iceberg/spark/ExtendedParser has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0

I have been trying to use Iceberg version 1.10.0

The last version of Iceberg to support Java 8 was version 1.6.1 - but that would mean downgrading Spark to version 3.5.1

I will try with a Java 11 based build, instead.
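For context, the two numbers in that UnsupportedClassVersionError follow the standard mapping of class-file major version to Java release (major version minus 44): 52 is Java 8, 55 is Java 11. A small illustrative check, not part of any of our tooling:

```python
import struct

def java_version_for_class_major(major: int) -> int:
    """Translate a class-file major version into the Java release number.

    Class-file major versions are offset by 44: 52 -> Java 8, 55 -> Java 11,
    which is exactly the mismatch reported by the error above.
    """
    return major - 44

def class_major_version(class_file_bytes: bytes) -> int:
    """Read the major version from the 8-byte .class header (magic, minor, major)."""
    magic, _minor, major = struct.unpack(">IHH", class_file_bytes[:8])
    assert magic == 0xCAFEBABE, "not a Java class file"
    return major

# Example: a synthetic header for a class compiled targeting Java 11 (major 55).
header = struct.pack(">IHH", 0xCAFEBABE, 0, 55)
print(java_version_for_class_major(class_major_version(header)))  # -> 11
```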

Change #1215556 merged by jenkins-bot:

[operations/deployment-charts@master] Bump spark image

https://gerrit.wikimedia.org/r/1215556

Change #1217495 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the spark-support chart

https://gerrit.wikimedia.org/r/1217495

Change #1217495 merged by jenkins-bot:

[operations/deployment-charts@master] Update the spark-support chart

https://gerrit.wikimedia.org/r/1217495

Change #1217531 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the spark image that is deployed to analytics-test

https://gerrit.wikimedia.org/r/1217531

Change #1217531 merged by jenkins-bot:

[operations/deployment-charts@master] Update the spark image that is deployed to analytics-test

https://gerrit.wikimedia.org/r/1217531

Change #1217535 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the kerberos settings for the hive thriftserver

https://gerrit.wikimedia.org/r/1217535

Change #1217535 merged by jenkins-bot:

[operations/deployment-charts@master] Update the kerberos settings for the hive thriftserver

https://gerrit.wikimedia.org/r/1217535

Change #1217554 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the keytab path for spark-support

https://gerrit.wikimedia.org/r/1217554

Change #1217554 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the keytab path for spark-support

https://gerrit.wikimedia.org/r/1217554

Change #1217568 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow the spark serviceaccount to perform more actions

https://gerrit.wikimedia.org/r/1217568

Change #1217568 merged by jenkins-bot:

[operations/deployment-charts@master] Allow the spark serviceaccount to perform more actions

https://gerrit.wikimedia.org/r/1217568

Change #1218271 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a dbt profiles.yml file to the spark-support chart

https://gerrit.wikimedia.org/r/1218271

Change #1218271 merged by jenkins-bot:

[operations/deployment-charts@master] Add a dbt profiles.yml file to the spark-support chart

https://gerrit.wikimedia.org/r/1218271

Change #1218379 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the hostname used by dbt in the spark-support chart

https://gerrit.wikimedia.org/r/1218379

Change #1218379 merged by jenkins-bot:

[operations/deployment-charts@master] Update the hostname used by dbt in the spark-support chart

https://gerrit.wikimedia.org/r/1218379

Change #1218741 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the spark config

https://gerrit.wikimedia.org/r/1218741

Change #1218741 merged by jenkins-bot:

[operations/deployment-charts@master] Update the spark config

https://gerrit.wikimedia.org/r/1218741

I've run into a hard blocker with the spark-thriftserver sidecar method, so we are currently looking at alternative options.

The reason for the blocker is as follows:

  • When connecting to an upstream hive-metastore that uses Kerberos, it seems to be required also to use kerberos for the local thrift server port.
  • The kerberos authentication of the hive thriftserver service is also failing, because it wants the hostname part of the principal to match the reverse DNS lookup of the pod.

I.e. if we configure the principal of the service as analytics/analytics-test.discovery.wmnet@WIKIMEDIA, then the thriftserver process wants the pod's reverse DNS to resolve to analytics-test.discovery.wmnet.
If this isn't the case, then the service will not start.
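To make the constraint concrete: Kerberos host-based service principals have the form primary/instance@REALM, and the server-side library insists that the instance part match the host's canonical (reverse-DNS) name, which an ephemeral pod with a generated name can never satisfy. A small illustrative parser (a sketch for explanation, not code we run):

```python
import re

def parse_principal(principal: str) -> dict:
    """Split a Kerberos service principal primary/instance@REALM into parts.

    Illustrative helper: for host-based service authentication, the server
    checks that the 'instance' component matches the canonical hostname.
    """
    match = re.fullmatch(r"([^/@]+)/([^@]+)@(.+)", principal)
    if not match:
        raise ValueError(f"not a service principal: {principal}")
    primary, instance, realm = match.groups()
    return {"primary": primary, "instance": instance, "realm": realm}

parts = parse_principal("analytics/analytics-test.discovery.wmnet@WIKIMEDIA")
# The thriftserver only starts if the pod's reverse DNS resolves to this name:
print(parts["instance"])  # -> analytics-test.discovery.wmnet
```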

I tried to use some hostAlias entries that caused analytics-test.discovery.wmnet to resolve back to 127.0.0.1 but this didn't work, either.
The kerberos authentication might have been OK, but this DNS trick caused Java to bind only to the localhost interface, so it couldn't reach out to HDFS etc.

So now we're looking into alternative ways of getting this to work.

One of the most promising methods looks to be using Apache Kyuubi in this sidecar pattern.
We will investigate whether we can do this and make use of its connection-level isolation.

Change #1220369 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Revert changes to the principal for the spark-thriftserver

https://gerrit.wikimedia.org/r/1220369

Change #1220369 merged by jenkins-bot:

[operations/deployment-charts@master] Revert changes to the principal for the spark-thriftserver

https://gerrit.wikimedia.org/r/1220369

Change #1220644 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a kyuubi deployment to the spark-support chart for analytics-test

https://gerrit.wikimedia.org/r/1220644

Change #1220644 merged by jenkins-bot:

[operations/deployment-charts@master] Add a kyuubi deployment to the spark-support chart for analytics-test

https://gerrit.wikimedia.org/r/1220644

Change #1223646 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure the contents of /etc/kyuubi/conf for the kyuubi toolbox pod

https://gerrit.wikimedia.org/r/1223646

Change #1223646 merged by jenkins-bot:

[operations/deployment-charts@master] Configure the contents of /etc/kyuubi/conf for the kyuubi toolbox pod

https://gerrit.wikimedia.org/r/1223646

Change #1223661 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the volumes and volumemounts of the kyuubi toolbox

https://gerrit.wikimedia.org/r/1223661

Change #1223661 merged by jenkins-bot:

[operations/deployment-charts@master] Update the volumes and volumemounts of the kyuubi toolbox

https://gerrit.wikimedia.org/r/1223661

Change #1223663 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the configmap names for spark-support

https://gerrit.wikimedia.org/r/1223663

Change #1223663 merged by jenkins-bot:

[operations/deployment-charts@master] Update the configmap names for spark-support

https://gerrit.wikimedia.org/r/1223663

Change #1223668 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the name of the kerberos-client-configuration configmap

https://gerrit.wikimedia.org/r/1223668

Change #1223668 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the name of the kerberos-client-configuration configmap

https://gerrit.wikimedia.org/r/1223668

Time for a further update on this.

We have been experimenting with running kyuubi in a toolbox pod, in the same way that we were attempting to run the spark-thriftserver.

The intention had been to use this in an ephemeral manner, in the way that we had previously been attempting to do with the spark-thriftserver.
However, we have run up against a very similar problem with kyuubi in this configuration to the one we faced with the spark-thriftserver.

Namely, when the Hive Metastore and HDFS cluster to which the Kyuubi engine connects are protected by Kerberos, the Kyuubi server itself also needs to run with Kerberos enabled.
We cannot easily do this with ephemeral pods because we do not automatically generate kerberos principals that can match the pod name. Our principals are managed manually and the keytabs are statically deployed to kubernetes, so this does not work well when we wish to have multiple kyuubi servers.

Having discussed it with @JAllemandou, we have come up with an alternative plan. We now intend to change the deployment plan for Kyuubi, so that instead of creating ephemeral kyuubi servers, we will use a single instance per namespace.
This will allow us to use a fixed pod name that will match our kerberos principal FQDN and we will be able to use a kubernetes service to allow clients to connect using the same hostname.

Our DBT containers will need to have their own kerberos keytab and credential renewer, but we will no longer be looking at running a kyuubi or spark-thriftserver sidecar.

We will still use the CONNECTION share level, so we will still get a spark cluster created per JDBC connection from a DBT job.
This will also allow customization of each spark cluster using the JDBC connection string, such as:

jdbc:hive2://localhost:10009/default;#spark.sql.shuffle.partitions=2;spark.executor.memory=5g
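To illustrate how those per-connection overrides are encoded: everything after the # in the URL is a semicolon-separated list of Spark properties that Kyuubi applies to the engine it launches for that connection. A hypothetical parser (for explanation only, not Kyuubi's own implementation):

```python
def spark_confs_from_jdbc_url(url: str) -> dict:
    """Extract the per-connection Spark overrides from a Kyuubi JDBC URL.

    Illustrative sketch: the fragment after '#' carries semicolon-separated
    key=value pairs that customise the spark cluster for this connection.
    """
    if "#" not in url:
        return {}
    _, _, conf_part = url.partition("#")
    pairs = (item.partition("=") for item in conf_part.split(";") if item)
    return {key: value for key, _, value in pairs}

url = "jdbc:hive2://localhost:10009/default;#spark.sql.shuffle.partitions=2;spark.executor.memory=5g"
print(spark_confs_from_jdbc_url(url))
# -> {'spark.sql.shuffle.partitions': '2', 'spark.executor.memory': '5g'}
```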

I will create a new ticket for this work to deploy and test a single Kyuubi instance per analytics-enabled namespace.

Is this task paused because it's dependent on the success of T413977?

Change #1226261 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the image used for the spark-toolbox

https://gerrit.wikimedia.org/r/1226261

Change #1226261 merged by jenkins-bot:

[operations/deployment-charts@master] Update the image used for the spark-toolbox

https://gerrit.wikimedia.org/r/1226261

Change #1212130 abandoned by Btullis:

[operations/deployment-charts@master] Bind the spark-job-orchestration role to the default serviceaccount

Reason:

No longer needed

https://gerrit.wikimedia.org/r/1212130

JAllemandou renamed this task from "Provide a Spark production access for dbt with Airflow" to "Provide a Spark-on-k8s access for sql tools (dbt)". Feb 4 2026, 11:24 AM

I'm moving this back to the main Data-Platform-SRE backlog, since we're not working on it in the immediate future.
We have implemented a solution for dbt by adding it to conda-analytics and launching this on YARN, via skein.

That is good enough for now, but we still want a way to do this nicely with spark running natively under Kubernetes.
Therefore, I will leave it open and unassign myself.