
Research and test methods for accessing kerberized services from spark running on the DSE K8S cluster
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To
Authored By
nfraison
Feb 21 2023, 2:33 PM
Referenced Files
F36887144: spark-pi.yaml
Mar 1 2023, 4:35 PM
F36875615: constraint.yaml
Feb 28 2023, 8:57 AM
F36875614: template-constraint.yaml
Feb 28 2023, 8:57 AM
F36874275: spark-pi.yaml
Feb 27 2023, 10:52 AM
F36874276: svc-account.yaml
Feb 27 2023, 10:52 AM
F36866606: spark-pi.yaml
Feb 23 2023, 3:47 PM

Description

Perform some tests to answer the following points:

  • Is it possible to provide Kerberos user creds to the spark job submitted or will it be a shared key?
  • Is the mechanism secure enough? Is there some auditing?

Done is

Event Timeline

Side note, for my own understanding of our current setup:

DSE-K8S cluster:

# new dse-k8s-crtl control plane servers T310171
node /^dse-k8s-ctrl100[12]\.eqiad\.wmnet$/ {
    role(dse_k8s::master)
}

# new dse-k8s-etcd etcd cluster servers T310170
node /^dse-k8s-etcd100[1-3]\.eqiad\.wmnet$/ {
    role(etcd::v3::dse_k8s_etcd)
}

# new dse-k8s-workers T29157 and T3074009
node /^dse-k8s-worker100[1-8]\.eqiad\.wmnet$/ {
    role(dse_k8s::worker)
}
JArguello-WMF set the point value for this task to 5.

Here is how long-running jobs are managed within a Hadoop cluster (by long-running jobs we mean jobs for which the maximum renewable period of tokens is not enough, so a Kerberos re-authentication is somehow required): https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html#Securing_Long-lived_YARN_Services
It could be interesting to look at how they deal with it, to see if we can somehow reproduce the approach.

Another idea would be to rely on the local Kerberos ticket cache and send it to the Spark jobs / spark-submit: https://web.mit.edu/kerberos/krb5-1.12/doc/basic/ccache_def.html

The Spark operator manages secrets with a specific secretType for HadoopDelegationToken: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#mounting-secrets

This relies on https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook to expose secrets of type HadoopDelegationToken through the HADOOP_TOKEN_FILE_LOCATION env var, looking for a hadoop.token key.

From that we could imagine sending either the Hadoop token file or the Kerberos ticket cache as a secret, and creating our own mutating admission webhook for this.
One potential advantage of the Kerberos ticket cache is that it could also provide the capability to connect to services like Presto (which doesn't seem to provide delegation tokens).

If we only need access to HDFS, which is probably the case for now, a potential solution is:

  • To be run from a stat or equivalent node
  • kinit
  • hdfs fetchdt hdfs_token_nfraison => creates a file containing the HDFS token
  • push this token, base64-encoded, into a k8s Secret named hdfs_token_nfraison under the key hadoop.token (see the Secret sketch below)
  • push a SparkApplication YAML with the secret config below
spec:
  driver:
    secrets:
      - name: hdfs-token-nfraison
        path: /mnt/secrets
        secretType: HadoopDelegationToken
  executor:
    secrets:
      - name: hdfs-token-nfraison
        path: /mnt/secrets
        secretType: HadoopDelegationToken

All of those steps can be bundled in a script to reduce toil for users.
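As a minimal sketch of the Secret from the step above (name, namespace and value are illustrative assumptions, not the exact manifest used in the test):

apiVersion: v1
kind: Secret
metadata:
  name: hdfs-token-nfraison
  namespace: spark            # assumed: the namespace where the SparkApplication runs
type: Opaque
data:
  # base64-encoded content of the file produced by `hdfs fetchdt`
  hadoop.token: <base64-encoded token file>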

Test performed on the Spark operator with an HDFS delegation token

  • Start locally a spark-operator on minikube
minikube start
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace
  • Create a SparkApplication with a secret containing the base64-encoded token (containing only the HDFS token; the token file was created on stat1005 using the command hdfs fetchdt token)

kubectl apply -f spark-pi.yaml

Result: the job ran well, with the HADOOP_TOKEN_FILE_LOCATION env variable correctly set to /mnt/secrets/hadoop.token.
Also, if the data in the hadoop.token file is not a valid token file, the Spark job fails while decoding it, which confirms that the token/env var is indeed taken into account.

As no command is available to create a Hive delegation token, I wrote the Java code below to generate a token file containing both the HDFS and Hive delegation tokens.

package org.example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.shims.Utils;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import java.io.IOException;
import java.security.PrivilegedExceptionAction;


public class Main {
    public static void main(String[] args) throws IOException, InterruptedException {
        final Configuration conf = new HdfsConfiguration();

        FileSystem local = FileSystem.getLocal(conf);
        final Path tokenFile = new Path(local.getWorkingDirectory(), "token");
        
        UserGroupInformation.getCurrentUser().doAs(
                new PrivilegedExceptionAction<Object>() {
                    @Override
                    public Object run() throws Exception {
                        // Fetch HDFS delegation tokens for the current user into a Credentials object
                        FileSystem fs = FileSystem.get(conf);
                        Credentials cred = new Credentials();
                        Token<?>[] tokens = fs.addDelegationTokens(null, cred);
                        for (Token<?> token : tokens) {
                            System.out.println("Fetched token for " + token.getService()
                                    + " into " + tokenFile);
                        }

                        // Fetch a Hive metastore delegation token for the current user
                        HiveConf hiveConf = new HiveConf();
                        Hive hive = Hive.get(hiveConf);

                        // Token service name derived from the metastore URI (computed but not used below)
                        String tokenService = "thrift:" + hiveConf.get("hive.metastore.uris")
                                .replace("thrift://", "")
                                .split(":")[0]
                                .split("\\.")[0];

                        String hiveToken = hive.getDelegationToken(
                                UserGroupInformation.getCurrentUser().getShortUserName(),
                                UserGroupInformation.getCurrentUser().getShortUserName());
                        System.out.println(hiveToken);

                        // Attach the Hive token to the current user under the signature expected by
                        // hive.metastore.token.signature (see the hive-site.xml change below)
                        Utils.setTokenStr(UserGroupInformation.getCurrentUser(), hiveToken, "DelegationTokenForHiveMetaStoreServer");

                        // Merge in the HDFS tokens and write all credentials to the token file
                        UserGroupInformation.getCurrentUser().addCredentials(cred);
                        UserGroupInformation.getCurrentUser().getCredentials().writeTokenStorageFile(tokenFile, conf);
                        return null;
                    }
                });
    }
}

Then execute it from a stat server: java -cp spark-delegation-token-1.0-SNAPSHOT.jar:/etc/hive/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/share/java/apache-log4j-extras.jar:/usr/lib/hive/lib/* org.example.Main
It generates a token file containing both delegation tokens.

Validation of this token file:
On the same stat server, try to run the pyspark3 command with no Kerberos ticket, enforcing usage of the token.

export HADOOP_TOKEN_FILE_LOCATION=/home/nfraison/token
# we need to add the hive.metastore.token.signature property to hive-site.xml so the hive library knows which token to select from the token file
cp -Rp /etc/spark3 /home/nfraison
vi spark3/conf/hive-site.xml
# Add this property
>  <property>
>      <name>hive.metastore.token.signature</name>
>      <value>DelegationTokenForHiveMetaStoreServer</value>
>  </property>
export SPARK_CONF_DIR=/home/nfraison/spark3/conf
pyspark3
> In [2]: spark.sql("select * from wmf.geoeditors_monthly limit 1000").schema
> Out[2]: StructType(List(StructField(wiki_db,StringType,true),StructField(country_code,StringType,true),StructField(users_are_anonymous,BooleanType,true),StructField(activity_level,StringType,true),StructField(distinct_editors,LongType,true),StructField(namespace_zero_distinct_editors,LongType,true),StructField(month,StringType,true)))

Security aspects to tackle:

  • Users should be able to deploy their secret data and Spark apps on the k8s cluster
  • At a minimum, secret data should not be readable by other users, either through kubectl commands or through a Spark app running inside the cluster (e.g. launching a SparkApplication that relies on another user's secret)

Idea for secret management/access rights per user:

Rely on sub-namespaces: https://github.com/kubernetes-sigs/hierarchical-namespaces
For example, rely on one sub-namespace per user with appropriate RBAC, so that only the owner of the sub-namespace (along with the spark-operator/spark service accounts) can read/modify objects in it.

One root NS: spark
One sub-NS per user
Appropriate RBAC for the spark-operator at the spark NS level so the operator can start SparkApplications within the sub-NS.
One spark-<username> account per user sub-NS with appropriate RBAC to run Spark apps within the sub-NS (read all secrets in that sub-NS + configmaps/secrets from the spark NS + pods + nodes)
One RBAC rule authorizing users to interact in their respective sub-NS: RW secrets and SparkApplications + R pods (a RoleBinding sketch is shown below). If authorizing users directly is not possible, we can create a dedicated spark-deploy-<username> per sub-NS with the same rights.
Puppet will be in charge of provisioning the sub-NS and dedicated svc accounts from an extract of the users belonging to a specific group.
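As an illustrative sketch only (user and binding names are assumptions following the scheme above, not a tested manifest), the per-user rights in a sub-NS could be granted with a RoleBinding like:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-user-nfraison
  namespace: spark-nfraison              # the user's sub-NS
subjects:
  # the user directly, or a dedicated spark-deploy-nfraison ServiceAccount if
  # binding users directly is not possible
  - kind: User
    name: nfraison
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: spark-user-role                  # RW secrets/sparkapplications + R pods
  apiGroup: rbac.authorization.k8s.io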

Is the operator able to watch those sub-NS (does it require a watch on spark, or declaring all sub-NS individually and a restart for any change of sub-NS)?

Rely on https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/ or on a validating admission webhook https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/.
For example, a validating admission webhook which validates access to secret/sparkapp objects depending on the username accessing the object and the object name (if you are user nfraison you should only be able to access nfraison_secret_* or nfraison_sparkapp_*).

One NS: spark
One validating admission webhook which will ensure that any secret/SparkApplication within the spark NS is created according to the following rules:

  • if RW from admin user (SRE) => allowed: true
  • if RW from spark-<username> svc account and naming match username => allowed: true
  • if RW from regular user and naming match username => allowed: true
  • otherwise => allowed: false

One spark-<username> service account per user with appropriate RBAC to run Spark apps within the spark namespace.
To ensure that the svc account used to run the SparkApplication is the user's own, we can add a check in the validating admission webhook, or create a mutating admission webhook which will override the service account field in the SparkApplication with spark-<username> and potentially add some dedicated labels.
This mutating admission webhook could also be useful with the sub-NS approach, to enforce a set of labels/other fields.

Puppet will be in charge of provisioning the dedicated svc accounts from an extract of the users belonging to a specific group.

Have a look at https://github.com/open-policy-agent/gatekeeper, https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/ and https://kyverno.io/docs/introduction/, which rely on this validating/mutating admission control mechanism.

Just rely on one spark-<username> NS per user.
While the sub-namespace mechanism provides a nice inheritance feature and potentially a nicer view when running kubectl get ns, it is possible to achieve the same by having one spark-<username> NS per user and appropriate RBAC for the user/svc account and for the spark-operator user.

The idea here is to create one NS per deployed app => I would not go with that one for now, as it can lead to quite a lot of NS for our use case of running Spark apps, plus some cleanup to manage.

Resource quotas are not addressed here, but from this doc https://kubernetes.io/docs/concepts/policy/resource-quotas/ it seems better to have some sub-NS mechanism, as quotas are defined per NS, so we could ensure no single user takes all the k8s cluster resources (see the ResourceQuota sketch below). If relying on a validating admission webhook within a single spark NS, we could rely on PriorityClass and create one per user, but that looks artificial.
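For example (a sketch only, with arbitrary illustrative limits), a per-user quota would simply be a ResourceQuota in the user's (sub-)NS:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-user-quota
  namespace: spark-nfraison    # one quota per user (sub-)NS
spec:
  hard:
    requests.cpu: "16"         # illustrative values only
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
    pods: "50"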

Test relying on multiple NS:

  • spark-operator
  • spark-nfraison
  • spark-btullis

With multiple svc accounts

  • spark-operator, ns: spark-operator
  • spark-nfraison-run, ns: spark-nfraison
  • spark-btullis-run, ns: spark-btullis
  • spark-nfraison-deploy, ns: spark-nfraison
  • spark-btullis-deploy, ns: spark-btullis

As expected, with appropriate RBAC the operator can control the launch of Spark apps from multiple spark-* NS, and a user can only deploy to their own NS, relying on the spark-<username>-deploy svc account:

apiVersion: v1
kind: Config
preferences: {}
current-context: minikube-nfraison
clusters:
- cluster:
    certificate-authority: /home/nfraison/.minikube/ca.crt
    extensions:
    - extension:
        provider: minikube.sigs.k8s.io
        version: v1.29.0
      name: cluster_info
    server: https://192.168.49.2:8443
  name: minikube-nfraison
contexts:
- context:
    cluster: minikube-nfraison
    extensions:
    - extension:
        provider: minikube.sigs.k8s.io
        version: v1.29.0
      name: context_info
    namespace: spark-nfraison
    user: spark-nfraison-deploy
  name: minikube-nfraison
users:
- name: spark-nfraison-deploy
  user:
    token: eyJhbGciOiJSUzI1NiIsImtpZCI6IkxVbkZIMVNHWTdHUkxYcTRHNXJiRHRfbG5VRXc0NEVfN2h1TVZKd3RPUlkifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNjc3NDk4MzExLCJpYXQiOjE2Nzc0OTQ3MTEsImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJzcGFyay1uZnJhaXNvbiIsInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJzcGFyay1uZnJhaXNvbi1kZXBsb3kiLCJ1aWQiOiJhYTk2YjVkZi1jMWIwLTQxZGYtYmM1NS1jNzEwMTRlNmU5YmIifX0sIm5iZiI6MTY3NzQ5NDcxMSwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OnNwYXJrLW5mcmFpc29uOnNwYXJrLW5mcmFpc29uLWRlcGxveSJ9.n9aMv2aBoy5Aa9dws3dVOMsrC7WAdPlhr8Md30PKuEcLLUU-HmslpV-g7wNMNaUKSvx1q4dVIzlrGzANptMCPBbYfLIM2TFagnGVAsOhW1Vmjas-UJ5hDW69vySM2YtrCkDY36PYqHQln5tL6k7xsI_ScQnoJZnX2E5GAoQQe_k0Pyv2xsBYw-X60oz14HmsmxMkGCGEaAc1z02SShw2qUQLUaXLLkq608wojMtZCuWv1_A640s0kLmmFYQ4e2WEss2iORvr_YJvnNcFwIgotUNu1S-NY60YNwEYmnG-nX-dkc0_i0hAa5DLdE-kwlBz36pbHxB64in8jWgGtnzbSg
  1. Get token for user spark-nfraison-deploy
  2. kubectl create token spark-nfraison-deploy
  3. eyJhbGciOiJSUzI1NiIsImtpZCI6IkxVbkZIMVNHWTdHUkxYcTRHNXJiRHRfbG5VRXc0NEVfN2h1TVZKd3RPUlkifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNjc3NDk3OTkzLCJpYXQiOjE2Nzc0OTQzOTMsImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJzcGFyay1uZnJhaXNvbiIsInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJzcGFyay1uZnJhaXNvbi1kZXBsb3kiLCJ1aWQiOiJhMDBiZDcwZC1iZWUxLTQxZTgtOTgyZi0wZjRhMWQ4OTc2ZTEifX0sIm5iZiI6MTY3NzQ5NDM5Mywic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OnNwYXJrLW5mcmFpc29uOnNwYXJrLW5mcmFpc29uLWRlcGxveSJ9.2NKKuUJ9V1ykcOVyNINktwMp1BT_IRR3vP0sakbghU1Yi06_WdShVFei12AZjOzu4BTkZujVimK5flslanoedFZDuLCWYMeGaJj5HqkmvwU86x-iG6L_kp2ilJlCmpLCnWOo7lNvNX9qYksKgGR0YshwgtQsqtdQewO0yfbG_X8yjph5leGni6chr8G4LU-fFssnEdkyqUvByZJ6XLoFn1N5LQx2Fr0juvCNSx1S6GLcK7UjlseS2fm9hdYzqUcCQoBGY6uQJtPYRjRLyjBSzGMBLQrJOS_Gpbh7PXyDeiG7fN37kGh02dI1tFkxFeflyLDGB1uEbte7BIEX_RKbdg

Ex. of user RBAC (deploy + run):
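As a rough sketch of the deploy side (names and verbs here are assumptions, not the actual attached manifest), a spark-<username>-deploy account could be limited to its own NS like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-deploy-role
  namespace: spark-nfraison
rules:
  # let the deploy account manage its own secrets and SparkApplications, and read pods
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "list", "update", "delete"]
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "update", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-deploy-binding
  namespace: spark-nfraison
subjects:
  - kind: ServiceAccount
    name: spark-nfraison-deploy
    namespace: spark-nfraison
roleRef:
  kind: Role
  name: spark-deploy-role
  apiGroup: rbac.authorization.k8s.io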


Ex. of app deployment:

Test relying on hierarchical-namespaces aka hns.
2 NS

  • spark
  • spark-operator

2 sub-NS on spark

  • spark-nfraison
  • spark-btullis

With multiple svc accounts

  • spark-operator, ns: spark-operator
  • spark-nfraison-run, ns: spark-nfraison
  • spark-btullis-run, ns: spark-btullis
  • spark-nfraison-deploy, ns: spark-nfraison
  • spark-btullis-deploy, ns: spark-btullis

Deployment on the minikube cluster with kubectl commands:

# Select the latest version of HNC
HNC_VERSION=v1.0.0

# Install HNC. Afterwards, wait up to 30s for HNC to refresh the certificates on its webhooks.
kubectl apply -f https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/download/${HNC_VERSION}/default.yaml

# Need first to install krew https://krew.sigs.k8s.io/docs/user-guide/setup/install/
kubectl krew update && kubectl krew install hns

Creation of the spark NS and the 2 sub-NS

k create ns spark
k hns  create spark-nfraison -n spark
k hns  create spark-btullis -n spark

From the kube point of view, those sub-NS are still seen as standard NS:

k get ns
#NAME              STATUS   AGE
...
#spark             Active   98s
#spark-btullis     Active   74s
#spark-nfraison     Active   80s
#spark-operator    Active   9m44s

A dedicated operator deployed in hnc-system manages the sub-NS. The dependency between namespaces is managed by SubnamespaceAnchor objects:

> k get SubnamespaceAnchor
#NAME             AGE
#spark-btullis    3m31s
#spark-nfraison   78s

> k get SubnamespaceAnchor -n spark spark-nfraison -o yaml
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  finalizers:
  - hnc.x-k8s.io
  name: spark-nfraison
  namespace: spark

The hns kubectl plugin makes it possible to interact with sub-NS without having to manually create SubnamespaceAnchor objects, and to display the tree of NS:

k hns tree -A
#default
#hnc-system
#kube-node-lease
#kube-public
#kube-system
#spark
#├── [s] spark-btullis
#└── [s] spark-nfraison
#spark-operator
k hns tree spark
#spark
#├── [s] spark-btullis
#└── [s] spark-nfraison

Finally, rights management is handled as in the previous test relying on multiple NS, since sub-NS are just standard NS.
In addition, HNC provides a tree view and rights inheritance, which helps for the spark-operator (we can give it rights to interact with the spark NS and the same rights will be propagated to the sub-NS of the spark NS).
For example, this spark-user-role role is only defined on the spark NS and is automatically propagated to the sub-NS:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-user-role
  namespace: spark
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - "*"
- apiGroups:
  - sparkoperator.k8s.io
  resources:
  - sparkapplications
  - sparkapplications/status
  - scheduledsparkapplications
  - scheduledsparkapplications/status
  verbs:
  - "*"
k get roles -n spark
NAME              CREATED AT
spark-user-role   2023-02-27T13:04:27Z

k get roles -n spark-nfraison
NAME              CREATED AT
spark-user-role   2023-02-27T13:04:27Z

This can also be done easily with an appropriate Helm chart, without relying on this operator.

Test with validating admission webhook relying on gatekeeper engine
2 NS

  • spark-operator
  • spark

With multiple svc accounts

  • spark-operator, ns: spark-operator
  • spark-nfraison-run, ns: spark
  • spark-btullis-run, ns: spark
  • spark-nfraison-deploy, ns: spark
  • spark-btullis-deploy, ns: spark

First we need a dedicated operator to manage the validating webhook (constraints):

kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/deploy/gatekeeper.yaml

Then we need to define the constraint templates and constraints.


Those constraints ensure that users can only create secrets named hdfs-token-<username>, that the SparkApplication defines the spark-run-<username> service account to run the executor, and that the executor only uses secrets whose names end with the username.
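As a hedged sketch of what such a Gatekeeper policy could look like (the Rego, naming and match scope here are illustrative, not the exact attached files; exemptions for admin/SRE users and the SparkApplication checks are omitted), a ConstraintTemplate plus Constraint for the secret-naming rule:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: sparksecretnaming              # must be the lowercase of the CRD kind below
spec:
  crd:
    spec:
      names:
        kind: SparkSecretNaming
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package sparksecretnaming

        # deny secrets whose name is not hdfs-token-<requesting username>
        violation[{"msg": msg}] {
          input.review.kind.kind == "Secret"
          username := input.review.userInfo.username
          expected := sprintf("hdfs-token-%v", [username])
          input.review.object.metadata.name != expected
          msg := sprintf("secret must be named %v", [expected])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: SparkSecretNaming
metadata:
  name: spark-secret-naming
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Secret"]
    namespaces: ["spark"]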

Validating/mutating webhooks are only applicable when creating, updating or deleting objects, which means that we can't prevent a user from reading another user's secret; we can only ensure that a submitted SparkApplication doesn't try to read a secret belonging to another user.

The gatekeeper solution only configures a ValidatingWebhookConfiguration for CREATE/UPDATE operations, so we can't rely on it to ensure that READ is only performed on the appropriate secrets.

Trying to update it failed; it seems that READ cannot be checked by a validating webhook:

The ValidatingWebhookConfiguration "gatekeeper-validating-webhook-configuration" is invalid: webhooks[0].rules[0].operations[2]: Unsupported value: "READ": supported values: "*", "CONNECT", "CREATE", "DELETE", "UPDATE"

From the source code there is indeed no support for READ: https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/admissionregistration/types.go#L743

Test relying on Vault and one spark namespace for applications

Users authenticate to Vault and have RW rights to their specific user path: kubernetes/<username>.
Users push the Hadoop token, base64-encoded, into a kv secret inside that path:

vault secrets enable -path=kubernetes kv-v2
vault kv put kubernetes/nfraison/hadoop token="base64_encoded_token"

Service account running the spark driver and executor has read rights to that key

vault auth enable kubernetes
vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443"
vault policy write spark-run-nfraison - <<EOF
path " kubernetes/nfraison/hadoop" {
  capabilities = ["read"]
}
EOF
vault write auth/kubernetes/role/spark-run-nfraison \
    bound_service_account_names=spark-run-nfraison \
    bound_service_account_namespaces=spark \
    policies=spark-run-nfraison \
    ttl=24h

This ensures that no other user can read or modify the token, and that only the service account running the Spark driver/executor pods will be able to read that secret.

Then, in order to have the secret injected into the containers, we need to add these annotations to the Spark driver and executor specs:

annotations:
  vault.hashicorp.com/agent-inject: 'true'
  vault.hashicorp.com/agent-inject-status: 'update'
  vault.hashicorp.com/role: 'spark-run-nfraison'
  vault.hashicorp.com/agent-inject-secret-hadoop-token: 'kubernetes/data/nfraison/hadoop'
  vault.hashicorp.com/agent-inject-template-hadoop-token: |
    {{- with secret "kubernetes/data/nfraison/hadoop" -}}
    {{ .Data.data.token | base64Decode }}
    {{- end -}}

And to set the HADOOP_TOKEN_FILE_LOCATION env variable to "/vault/secrets/hadoop-token".

Ex. of SparkApplication file:
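The attached spark-pi.yaml is not reproduced here; as an illustrative sketch only (image, versions, paths and resource sizes are assumptions), the relevant parts could look like:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Python
  mode: cluster
  image: spark-py:3.1.2                 # illustrative image
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.1.2"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark-run-nfraison
    annotations:
      vault.hashicorp.com/agent-inject: 'true'
      vault.hashicorp.com/role: 'spark-run-nfraison'
      vault.hashicorp.com/agent-inject-secret-hadoop-token: 'kubernetes/data/nfraison/hadoop'
      vault.hashicorp.com/agent-inject-template-hadoop-token: |
        {{- with secret "kubernetes/data/nfraison/hadoop" -}}
        {{ .Data.data.token | base64Decode }}
        {{- end -}}
    envVars:
      HADOOP_TOKEN_FILE_LOCATION: /vault/secrets/hadoop-token
  executor:
    instances: 1
    cores: 1
    memory: 1g
    serviceAccount: spark-run-nfraison
    annotations:
      # same Vault annotations as the driver
      vault.hashicorp.com/agent-inject: 'true'
      vault.hashicorp.com/role: 'spark-run-nfraison'
      vault.hashicorp.com/agent-inject-secret-hadoop-token: 'kubernetes/data/nfraison/hadoop'
      vault.hashicorp.com/agent-inject-template-hadoop-token: |
        {{- with secret "kubernetes/data/nfraison/hadoop" -}}
        {{ .Data.data.token | base64Decode }}
        {{- end -}}
    envVars:
      HADOOP_TOKEN_FILE_LOCATION: /vault/secrets/hadoop-token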

But there is still one issue: the service accounts read secrets from Vault, while users assign the service account to use on their driver/executor.
For example, user1 can decide to run their Spark driver/executor with spark-run-user2, which would allow reading from user2's Vault secret store.
One possible solution here is to rely on a mutating admission webhook to enforce usage of the spark-run-user1 service account for user1's Spark driver and executor.
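As a sketch of that idea using Kyverno (mentioned earlier) rather than a hand-rolled webhook, and assuming the submitting username maps directly onto the spark-run-<username> naming scheme (policy name and match scope are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: enforce-spark-run-service-account
spec:
  rules:
    - name: set-spark-service-account
      match:
        any:
          - resources:
              kinds:
                - SparkApplication
              namespaces:
                - spark
      preconditions:
        any:
          - key: "{{ request.operation }}"
            operator: Equals
            value: CREATE
      mutate:
        patchStrategicMerge:
          spec:
            driver:
              # force the submitter's own spark-run-<username> service account
              serviceAccount: "spark-run-{{ request.userInfo.username }}"
            executor:
              serviceAccount: "spark-run-{{ request.userInfo.username }}"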

Question - is something like https://engineering.linkedin.com/blog/2020/open-sourcing-kube2hadoop taken into consideration for evaluation? It seems the same use case, I recall that we wanted to follow up but we never found the time.


Yes. I think we are looking at it now.
There are some additional comments and suggestions on this doc and I believe that kube2hadoop is under consideration.

@elukey could you please review https://docs.google.com/document/d/1Aub7lUr1nPGN3MXz8FI7CCCZ5a5Y1BRpY3poVmui6AM/edit# with our proposal for the Hadoop access mechanism for Spark jobs on K8S?

Rolling to next sprint.
Next step: Team review complete by Friday, March 17, 2023.

Following review, we will determine which solution to pursue.
Testing the PoC for handling the Kerberos problem also depends on getting this ticket done, whichever solution is chosen for the PoC: https://phabricator.wikimedia.org/T331859 (Enable egress traffic from spark pods to HDFS and HIVE).

Make sure the decision is logged on the main board.

Send messages on the #wikimedia-serviceops IRC channel to get some reviews from SRE and confirm whether or not the chosen Vault mechanism is acceptable.

BTullis renamed this task from DSE Experiment - PoC how to Address Kerberos from spark running on DSE K8S cluster to Research and test methods for accessing kerberized services from spark running on the DSE K8S cluster.Mar 24 2023, 2:33 PM
BTullis updated the task description. (Show Details)
BTullis triaged this task as High priority.
BTullis updated the task description. (Show Details)

Marking this ticket as Done.

We have a draft document, Access to hadoop platform from spark running on DSE K8S cluster, which I will continue to refine as we develop the solution.

We can execute test Spark jobs that access HDFS and Hive. There are some notes here that I will write up as we continue development and testing.

We also have an early-stage CLI application for users to submit jobs to the DSE cluster: spark8s

I believe that the next steps will involve a more wide-ranging review of potential secrets management systems for Kubernetes.
The document selected Vault as the proposed solution, but the consensus of those reviewing it seems to be that we should carry out this review of potential solutions more thoroughly before making a decision.

BTullis moved this task from Incoming to Needs Reporting on the Data-Platform-SRE board.