
Create a helm chart that manages the resources required for running useful spark jobs on kubernetes
Closed, ResolvedPublic

Description

We now have the spark-operator version 2.2.1 running on the dse-k8s-eqiad cluster.

There are two configured spark job namespaces (spark and analytics-test) where any created SparkApplication objects will be handled by the operator and executed as spark jobs.
The spark-operator helm chart creates several objects in each of these spark job namespaces. These enable the basic functionality of the spark-operator and include:

RBAC Roles

  • role.rbac.authorization.k8s.io/production-spark-operator-controller
  • role.rbac.authorization.k8s.io/production-spark-operator-webhook
  • role.rbac.authorization.k8s.io/production-spark-operator-spark

RBAC Rolebindings

  • rolebinding.rbac.authorization.k8s.io/production-spark-operator-controller
  • rolebinding.rbac.authorization.k8s.io/production-spark-operator-webhook
  • rolebinding.rbac.authorization.k8s.io/production-spark-operator-spark

A serviceaccount

  • serviceaccount/production-spark-operator-spark

Network policies - Note that these networkpolicy objects are WMF additions to the spark-operator chart

  • networkpolicy.crd.projectcalico.org/spark-driver-k8s-api
  • networkpolicy.crd.projectcalico.org/spark-executor-to-driver
  • networkpolicy.crd.projectcalico.org/spark-executor-to-executor
  • networkpolicy.crd.projectcalico.org/spark-operator-webhook-to-driver

With these resources available, we can run a self-contained spark job such as the sparkPi example.
However, we cannot yet do the following:

  • Authenticate using kerberos
  • Connect to the hive metastore
  • Connect to the HDFS file system
  • Connect to our Ceph/S3 rados gateway
  • Experiment with running spark-submit by hand
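For reference, a self-contained SparkApplication such as the sparkPi example mentioned above might be shaped like the following sketch. The image name, versions, and resource figures here are illustrative placeholders, not the exact values used on dse-k8s-eqiad:

```yaml
# Hypothetical sketch of a self-contained SparkApplication; image and versions are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: analytics-test
spec:
  type: Scala
  mode: cluster
  image: spark:3.4.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.4.1"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: production-spark-operator-spark
  executor:
    instances: 2
    cores: 1
    memory: 512m
```

The driver runs under the production-spark-operator-spark serviceaccount listed above, which is what the operator-created RBAC objects are bound to.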

Enabling this functionality will require more kubernetes resources to be installed into each of the spark job namespaces.
These will include:

  • Network policies allowing egress to:
    • Kerberos KDCs
    • The Hive metastore
    • The HDFS nameservers and datanodes
    • The Ceph/S3 rados gateway
  • Configmaps defining
    • Common hadoop configuration
    • Common spark configuration
  • Secrets containing:
    • A Kerberos keytab (containing one or more principals)
    • Credentials for accessing the Ceph/S3 service
  • Utility pod(s)
    • A pod running a spark image, in which engineers can start a shell

One option would be to add all of these resources to the spark-operator helm chart, in the way that we have added the networkpolicy objects.
However, since we have recently switched to using the upstream chart, we would ideally like to keep modifications to this chart to a minimum.

Therefore, if we create a new chart that manages the resources we need for this, we can deploy a release of this chart into each spark-enabled namespace.

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +4 -4
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +40 -3
operations/deployment-charts | master | +47 -9
operations/deployment-charts | master | +8 -4
operations/deployment-charts | master | +13 -4
operations/deployment-charts | master | +91 -1
operations/deployment-charts | master | +4 -1
operations/deployment-charts | master | +6 -7
operations/deployment-charts | master | +8 -1
operations/deployment-charts | master | +6 -2
operations/deployment-charts | master | +46 -2
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +2 -5
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +4 -3
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +38 -38
operations/deployment-charts | master | +1 -67
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +106 -0
operations/deployment-charts | master | +783 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

I added a new kerberos principal for the analytics-test namespace, using the following command:

root@krb1002:~# kadmin.local addprinc -randkey analytics/analytics-test.discovery.wmnet@WIKIMEDIA

I then created a keytab for this principal, using these commands:

root@krb1002:~# mkdir -p /srv/kerberos/keytabs/analytics-test.discovery.wmnet/analytics/

root@krb1002:~# kadmin.local ktadd -norandkey -k /srv/kerberos/keytabs/analytics-test.discovery.wmnet/analytics/analytics.keytab analytics/analytics-test.discovery.wmnet@WIKIMEDIA
Entry for principal analytics/analytics-test.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/srv/kerberos/keytabs/analytics-test.discovery.wmnet/analytics/analytics.keytab.

This keytab goes into the private repo, but as a base64 encoded representation of the file.

root@krb1002:~# cat /srv/kerberos/keytabs/analytics-test.discovery.wmnet/analytics/analytics.keytab|base64

I then put the output from that command into hieradata/role/common/deployment_server/kubernetes.yaml in the private repo, so that it will be available for helmfile deploys to the analytics-test namespace.
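Since the keytab is a binary file, it is worth sanity-checking that the base64 representation round-trips to the original bytes before the encoded form goes into the private repo. A minimal sketch (the byte string here is a fake stand-in for the real keytab file):

```python
import base64

# Hypothetical sketch: verify that a base64-encoded keytab decodes back to the
# original bytes before committing the encoded form to the private repo.
# The bytes below are a placeholder, not a real keytab.
keytab_bytes = b"\x05\x02fake-keytab-contents"

encoded = base64.b64encode(keytab_bytes).decode("ascii")
decoded = base64.b64decode(encoded)

assert decoded == keytab_bytes
print("round-trip OK")
```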

Change #1195178 merged by jenkins-bot:

[operations/deployment-charts@master] Add a new spark-support chart

https://gerrit.wikimedia.org/r/1195178

Change #1195182 merged by jenkins-bot:

[operations/deployment-charts@master] Add a deployment of the spark-support chart to our analytics-test namespace

https://gerrit.wikimedia.org/r/1195182

Change #1211712 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the helmfile values paths for the analytics-test spark-support

https://gerrit.wikimedia.org/r/1211712

Change #1211712 merged by jenkins-bot:

[operations/deployment-charts@master] Update the helmfile values paths for the analytics-test spark-support

https://gerrit.wikimedia.org/r/1211712

I'm making good progress on this now, although it is primarily in support of T410017: Provide a Spark production access for dbt with Airflow so the details are a little different from what was originally stated.

The spark-support chart currently configures the following, in each namespace where it is applied:

  • A hadoop configmap object, containing:
    • core-site.xml
    • hdfs-site.xml
    • hive-site.xml
  • A spark configmap object, containing:
    • hive-site.xml
    • spark-defaults.conf
  • A kerberos configmap object, containing:
    • krb5.conf
  • A role object, permitting:
    • All actions on pods
    • Various read-only operations on events, cronjobs, pod logs etc.
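As a rough illustration, that role object might be shaped like the sketch below. The rule list is an approximation of the description above, not the chart's exact contents:

```yaml
# Hypothetical sketch of the spark-support role; the rules approximate the
# description above rather than reproducing the chart's actual template.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-support
  namespace: analytics-test
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["events", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs"]
    verbs: ["get", "list", "watch"]
```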

Then the following objects can be defined within the helmfile values, for each of the namespaces where the chart is deployed:

  • One or more keytabs, deployed within a secret object. Each keytab may include one or more principals.
  • The external-service network policy objects, which allow spark jobs to reach out to services outside of kubernetes
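In the helmfile values, those per-namespace objects might be declared along these lines. All of the key names and fields here are illustrative, not the chart's real values schema:

```yaml
# Hypothetical values sketch; the key names are illustrative, not the chart's real schema.
keytabs:
  analytics:
    principal: analytics/analytics-test.discovery.wmnet@WIKIMEDIA
    keytab: "<base64-encoded keytab from the private repo>"
external_services:
  - kerberos
  - hive
  - hadoop
```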

So far, this has been deployed to the analytics-test namespace, and we have configured the following:

  • A single analytics keytab, containing the analytics/analytics-test.discovery.wmnet@WIKIMEDIA principal
  • External services for Hive and Hadoop on the test cluster, plus Kerberos
  • All configuration files reference the hadoop test cluster

I'm currently testing this with a spark_thrift_dbt DAG that will launch a spark-thriftserver sidecar, alongside a dbt-jobs container.

After a little testing, I will also deploy the spark-support chart to the analytics namespace and configure it for the production cluster.

Change #1214020 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove unnecessary and/or incorrect hadoop/spark config options

https://gerrit.wikimedia.org/r/1214020

Change #1214021 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Sort the hadoop/spark config items alphabetically

https://gerrit.wikimedia.org/r/1214021

Change #1214020 merged by jenkins-bot:

[operations/deployment-charts@master] Remove unnecessary and/or incorrect hadoop/spark config options

https://gerrit.wikimedia.org/r/1214020

Change #1214021 merged by jenkins-bot:

[operations/deployment-charts@master] Sort the hadoop/spark config items alphabetically

https://gerrit.wikimedia.org/r/1214021

Change #1214130 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Correct an error in the selector for external-services in analytics-test

https://gerrit.wikimedia.org/r/1214130

Change #1214130 merged by jenkins-bot:

[operations/deployment-charts@master] Correct an error in the selector for external-services in analytics-test

https://gerrit.wikimedia.org/r/1214130

Change #1214135 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Correct the external-services definition for analytics-test

https://gerrit.wikimedia.org/r/1214135

Change #1214135 merged by jenkins-bot:

[operations/deployment-charts@master] Correct the external-services definition for analytics-test

https://gerrit.wikimedia.org/r/1214135

Change #1214141 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use the correct calico selector syntax for analytics-test

https://gerrit.wikimedia.org/r/1214141

Change #1214141 merged by jenkins-bot:

[operations/deployment-charts@master] Use the correct calico selector syntax for analytics-test

https://gerrit.wikimedia.org/r/1214141

Change #1214492 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add kerberos related configuration to the spark-defaults.conf file

https://gerrit.wikimedia.org/r/1214492

Change #1214492 merged by jenkins-bot:

[operations/deployment-charts@master] Add kerberos related configuration to the spark-defaults.conf file

https://gerrit.wikimedia.org/r/1214492

Change #1215627 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove incorrect hive.server2 settings and correct the k8s URL

https://gerrit.wikimedia.org/r/1215627

Change #1215627 merged by jenkins-bot:

[operations/deployment-charts@master] Remove incorrect hive.server2 settings and correct the k8s URL

https://gerrit.wikimedia.org/r/1215627

Change #1217145 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the spark master parameter to use the correct option

https://gerrit.wikimedia.org/r/1217145

Change #1217145 merged by jenkins-bot:

[operations/deployment-charts@master] Update the spark master parameter to use the correct option

https://gerrit.wikimedia.org/r/1217145

Change #1217183 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add two basic spark pod templates in a configmap

https://gerrit.wikimedia.org/r/1217183

Change #1217183 merged by jenkins-bot:

[operations/deployment-charts@master] Add two basic spark pod templates in a configmap

https://gerrit.wikimedia.org/r/1217183

Change #1217766 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow the spark serviceaccount to manage PVCs

https://gerrit.wikimedia.org/r/1217766

Change #1217766 merged by jenkins-bot:

[operations/deployment-charts@master] Allow the spark serviceaccount to manage PVCs

https://gerrit.wikimedia.org/r/1217766

I'm creating an S3 user called analytics-test.

btullis@cephosd1001:~$ sudo radosgw-admin user create --uid=analytics-test --display-name="Analytics Test"
{
    "user_id": "analytics-test",
    "display_name": "Analytics Test",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "analytics-test",
            "access_key": "redacted",
            "secret_key": "redacted"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
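The access_key and secret_key from that JSON output are what end up in the namespace's S3 secret. A small sketch of pulling them out of the radosgw-admin output (the JSON below is a trimmed fake mirroring the real structure, with fake credentials):

```python
import json

# Hypothetical sketch: parse `radosgw-admin user create` output and extract the
# S3 credentials. The JSON is a trimmed fake mirroring the real structure.
raw = """
{
  "user_id": "analytics-test",
  "keys": [
    {"user": "analytics-test", "access_key": "FAKEACCESSKEY", "secret_key": "fakesecret"}
  ]
}
"""

user = json.loads(raw)
key = user["keys"][0]
print(key["access_key"], key["secret_key"])
```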

Change #1217792 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add resources to the spark executor pod template

https://gerrit.wikimedia.org/r/1217792

Change #1217792 merged by jenkins-bot:

[operations/deployment-charts@master] Add resources to the spark executor pod template

https://gerrit.wikimedia.org/r/1217792

Change #1218219 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the network policies that are deployed by the spark-operator

https://gerrit.wikimedia.org/r/1218219

Change #1218219 merged by jenkins-bot:

[operations/deployment-charts@master] Update the network policies that are deployed by the spark-operator

https://gerrit.wikimedia.org/r/1218219

Change #1218225 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add more entries to the spark-defaults.conf file

https://gerrit.wikimedia.org/r/1218225

Change #1218225 merged by jenkins-bot:

[operations/deployment-charts@master] Add more entries to the spark-defaults.conf file

https://gerrit.wikimedia.org/r/1218225

Change #1218318 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a spark-toolbox pod

https://gerrit.wikimedia.org/r/1218318

Change #1218318 merged by jenkins-bot:

[operations/deployment-charts@master] Add a spark-toolbox pod

https://gerrit.wikimedia.org/r/1218318

Change #1218336 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the resources available to the spark-toolbox pod

https://gerrit.wikimedia.org/r/1218336

Change #1218336 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the resources available to the spark-toolbox pod

https://gerrit.wikimedia.org/r/1218336

Change #1218748 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow the spark serviceaccount to manage secrets within the namespace

https://gerrit.wikimedia.org/r/1218748

Change #1218748 merged by jenkins-bot:

[operations/deployment-charts@master] Allow the spark serviceaccount to manage secrets within the namespace

https://gerrit.wikimedia.org/r/1218748

Change #1218778 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update spark/hadoop mountpoints and environment variables

https://gerrit.wikimedia.org/r/1218778

Change #1218778 merged by jenkins-bot:

[operations/deployment-charts@master] Update spark/hadoop mountpoints and environment variables

https://gerrit.wikimedia.org/r/1218778

Change #1219845 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a secret object to the spark-support chart

https://gerrit.wikimedia.org/r/1219845

Change #1219855 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow the analytics-test namespace to access the s3 endpoint in eqiad

https://gerrit.wikimedia.org/r/1219855

Change #1219845 merged by jenkins-bot:

[operations/deployment-charts@master] Add S3 support to the spark-support chart

https://gerrit.wikimedia.org/r/1219845

Change #1219855 merged by jenkins-bot:

[operations/deployment-charts@master] Allow the analytics-test namespace to access the s3 endpoint in eqiad

https://gerrit.wikimedia.org/r/1219855

Change #1219886 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the kerberos principal used by spark in the analytics-test namespace

https://gerrit.wikimedia.org/r/1219886

Change #1219886 merged by jenkins-bot:

[operations/deployment-charts@master] Update the kerberos principal used by spark in the analytics-test namespace

https://gerrit.wikimedia.org/r/1219886

I believe that this is all done now.