The Machine Learning team wants to start using Airflow to schedule the creation of datasets and models.
We would like our own instance on the dse-k8s cluster similar to what is set up for other teams (assuming that dse-k8s is the suggested way moving forward).
To start with, we want the team to get acquainted with the processes and tools, and to have a place to create pipelines (DAGs) and experiment with building datasets and models.
Access to a GPU is not necessary initially, and we can discuss later how to configure the new ml-train machines with Airflow.
Thanks @isarantopoulos - I'm sure that we can help you, here.
> We would like our own instance on the dse-k8s cluster similar to what is set up for other teams (assuming that dse-k8s is the suggested way moving forward).
Yes, dse-k8s is definitely the suggested way forward.
We will be able to skip past a VM based Airflow instance as well as a metadata database on the shared bare-metal postgresql servers.
However, you've caught us at a rather interesting time too, because we're about to start switching our existing instances from the LocalExecutor to the KubernetesExecutor in T364389: Migrate the airflow scheduler components to Kubernetes.
So it probably makes sense for your instance to start out with the KubernetesExecutor, as the test_k8s instance does.
In addition to that, as you're looking at new workloads, rather than migrating existing DAGs, you might be interested in the work on the alternative operators that we're looking at, such as:
I imagine that the KubernetesPodOperator would be particularly useful for the GPU based workloads, but you will probably know best.
If you prefer to use the Hadoop cluster and YARN workers for your distributed workloads, then you can keep a close eye on our work to support the Spark and Skein jobs on YARN in T364389.
You will also have the BashOperator and PythonOperator readily available, if these meet your needs.
We can probably get your instance up and running fairly quickly, but we are also quite busy with migrating the existing instances. What sort of timetable would you prefer for getting started with your instance?
Thanks for the info. The Spark operator seems like a great choice but we'll get to that when we actually start developing stuff. It also depends on the resources available on the dse-k8s cluster vs YARN.
> What is the sort of timetable that you would prefer, to get started working with your instance?
We would like to have something set up so that we can work on it next quarter. In the meantime we could get away with local development + docker. Could we also use the test_k8s instance for dev/testing?
Change #1102249 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/dns@master] airflow-ml: define DNS records
Change #1102254 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-ml: define helmfile and values
Change #1102255 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] deployment_server: define airflow-ml users
Change #1102256 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] airflow-ml: define ATS mapping rules and cache settings
Change #1102257 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] airflow-ml: define CAS config
Change #1102258 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] openldap: define new offloaded airflow-ml-ops group
brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/967
Draft: Define an ml DAG folder
brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-ml --display-name="postgresql-airflow-ml"
{
    "user_id": "postgresql-airflow-ml",
    "display_name": "postgresql-airflow-ml",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "postgresql-airflow-ml",
            "access_key": "REDACTED",
            "secret_key": "REDACTED"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=airflow-ml --display-name="airflow-ml"
{
    "user_id": "airflow-ml",
    "display_name": "airflow-ml",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "airflow-ml",
            "access_key": "REDACTED",
            "secret_key": "REDACTED"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://postgresql-airflow-ml.dse-k8s-eqiad
Bucket 's3://postgresql-airflow-ml.dse-k8s-eqiad/' created
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://logs.airflow-ml.dse-k8s-eqiad
Bucket 's3://logs.airflow-ml.dse-k8s-eqiad/' created
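The $access_key and $secret_key passed to s3cmd above come from the "keys" array in the radosgw-admin output. As a rough sketch only (the function name extract_s3_credentials and the trimmed sample are illustrative and not part of the actual deployment tooling), the pair could be pulled out of that JSON like this:

```python
import json
from typing import Dict, Tuple


def extract_s3_credentials(radosgw_output: str) -> Dict[str, Tuple[str, str]]:
    """Map each key's user to (access_key, secret_key).

    Expects the JSON document printed by `radosgw-admin user create`,
    which lists credentials under the top-level "keys" array.
    """
    user = json.loads(radosgw_output)
    return {
        key["user"]: (key["access_key"], key["secret_key"])
        for key in user.get("keys", [])
    }


# Trimmed sample in the same shape as the output above (values redacted).
sample = """
{
    "user_id": "airflow-ml",
    "display_name": "airflow-ml",
    "keys": [
        {
            "user": "airflow-ml",
            "access_key": "REDACTED",
            "secret_key": "REDACTED"
        }
    ]
}
"""

creds = extract_s3_credentials(sample)
access_key, secret_key = creds["airflow-ml"]
```

In practice the secrets should go straight into a secret store rather than being printed or pasted, which is why the values are redacted in this task.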
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey airflow/airflow-ml.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local ktadd -norandkey -k analytics-ml.keytab airflow/airflow-ml.discovery.wmnet@WIKIMEDIA HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA
Entry for principal airflow/airflow-ml.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-ml.keytab.
Entry for principal HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-ml.keytab.
No analytics user is defined for ML yet, so for now we'll just include the 2 principals related to the HTTP API.
Change #1102268 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/967
Define an ml DAG folder
Change #1102249 merged by Brouberol:
[operations/dns@master] airflow-ml: define DNS records
Change #1102255 merged by Brouberol:
[operations/puppet@production] deployment_server: define airflow-ml users
Change #1102280 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-ml: define kubernetes namespace
Change #1102280 merged by jenkins-bot:
[operations/deployment-charts@master] airflow-ml: define kubernetes namespace
Change #1102268 merged by jenkins-bot:
[operations/deployment-charts@master] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces
Change #1102254 merged by jenkins-bot:
[operations/deployment-charts@master] airflow-ml: define helmfile and values
Change #1102257 merged by Brouberol:
[operations/puppet@production] airflow-ml: define CAS config
Change #1102308 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-ml: fix typo
Change #1102308 merged by Brouberol:
[operations/deployment-charts@master] airflow-ml: fix typo
The following group members will get the Op Airflow role: https://ldap.toolforge.org/group/airflow-ml-ops
Both airflow and the cloudnative PG cluster were deployed
brouberol@deploy2002:~$ kubectl get pod -n airflow-ml
NAME                                               READY   STATUS    RESTARTS   AGE
airflow-gitsync-d76b48d7f-d6tpg                    1/1     Running   0          57s
airflow-kerberos-bcf697468-hmhwm                   1/1     Running   0          57s
airflow-scheduler-85b68dc7fd-zh86v                 2/2     Running   0          57s
airflow-webserver-56767b7f46-g2gpx                 2/2     Running   0          57s
postgresql-airflow-ml-1                            1/1     Running   0          9m6s
postgresql-airflow-ml-2                            1/1     Running   0          8m22s
postgresql-airflow-ml-pooler-rw-7d5d74cb69-4zg5x   1/1     Running   0          9m20s
postgresql-airflow-ml-pooler-rw-7d5d74cb69-m2x6s   1/1     Running   0          9m20s
postgresql-airflow-ml-pooler-rw-7d5d74cb69-zcgkt   1/1     Running   0          9m20s
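For anyone who wants to check a listing like the one above without reading it by eye, here is a minimal sketch that flags pods that are not fully ready. The helper name unhealthy_pods is hypothetical, and the check is deliberately simplistic: it treats anything other than STATUS "Running" with all containers ready as unhealthy, so a legitimately "Completed" pod would also be flagged.

```python
from typing import List


def unhealthy_pods(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running with all containers ready.

    Expects the default table printed by `kubectl get pod`, with a header
    row containing at least the NAME, READY and STATUS columns.
    """
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_i = header.index("NAME")
    ready_i = header.index("READY")
    status_i = header.index("STATUS")

    bad = []
    for row in lines[1:]:
        cols = row.split()
        ready, total = cols[ready_i].split("/")
        if cols[status_i] != "Running" or ready != total:
            bad.append(cols[name_i])
    return bad


# A few rows copied from the listing above.
sample_listing = """
NAME READY STATUS RESTARTS AGE
airflow-gitsync-d76b48d7f-d6tpg 1/1 Running 0 57s
airflow-scheduler-85b68dc7fd-zh86v 2/2 Running 0 57s
postgresql-airflow-ml-1 1/1 Running 0 9m6s
"""

problems = unhealthy_pods(sample_listing)
```

For the rows shown, every pod is Running with all containers ready, so the returned list is empty.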
Change #1102258 merged by Brouberol:
[operations/puppet@production] openldap: define new offloaded airflow-ml-ops group
Change #1102256 merged by Brouberol:
[operations/puppet@production] airflow-ml: define ATS mapping rules and cache settings
All done!
The URL is https://airflow-ml.wikimedia.org
For all of y'all who are new to how we manage airflow in kubernetes, here's the main documentation page: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes
Any change you make under https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/ml/dags will be automatically reflected in your airflow instance within 5 minutes after the MR is merged.
Have fun!
Thanks @brouberol !
I'm getting an error when I check the k8s namespace for airflow-ml.
kube_env airflow-ml dse-k8s-eqiad
You don't have permission to read the configuration for airflow-ml/dse-k8s-eqiad (try sudo)
I assume there is no such configuration atm, is that right?
Ah, the configuration does exist, but it's owned by root and analytics-deployers, which you might not be a member of. Let me see whether I can adjust the ownership.
@isarantopoulos I've temporarily chowned the user config files to root:deploy-ml-service. Can you confirm that it works better for you? If so, I'll make the change permanent. Thanks!
Ah yes, I was just about to say that maybe deploy-ml-service would be the right group here.
In one sense, maybe @isarantopoulos doesn't actually need deployment access.
The only thing that I can think of would be using the airflow cli, as per: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_airflow_CLI
Everything else should be GitOps, based on a merge to main in airflow-dags. But maybe I'm missing something.
I can now use the configuration, but it throws an error:
isaranto@deploy2002:~$ kube_env airflow-ml dse-k8s-eqiad
isaranto@deploy2002:~$ kubectl get pods
Error in configuration:
* unable to read client-cert /etc/kubernetes/pki/dse__airflow-ml.pem for airflow-ml due to open /etc/kubernetes/pki/dse__airflow-ml.pem: permission denied
* unable to read client-key /etc/kubernetes/pki/dse__airflow-ml-key.pem for airflow-ml due to open /etc/kubernetes/pki/dse__airflow-ml-key.pem: permission denied
We have used analytics-deployers for all of the other instances, but even then I think that it's mainly Data-Platform-SRE who would need to interact with these namespaces.
> I'm getting an error when I check the k8s namespace for airflow-ml.
@isarantopoulos have you got something specific in mind that you would like to use kubectl to achieve?
Maybe there is a way of doing the same from the UI, or from the airflow cli.
I'm not trying to be mean with privileges, just working on getting the right level of functionality for you. You're the first team who has onboarded directly to a k8s-based instance.
@BTullis It just struck me that you need access to the -deploy user credentials to use the airflow CLI, as you need to exec into an airflow container to run it.
Yes, that's true, but maybe that should be a driver for us to implement a better way of accessing the airflow cli, rather than relaxing the permissions on the namespaces. Just thinking aloud, really. Probably not something we need to fix today, either.
I was just following the documentation to see if everything works. At the moment I don't need it for anything specific, just to be able to explore things and debug anything that may come up.
I assumed that since it is our instance and we're not interfering with anybody else's work, we would have ownership of it.
All good for now, we can revisit privileges when we start working on this if such a requirement arises.
thank you both!