
Create an Airflow instance for ML
Closed, Resolved · Public

Description

The Machine Learning team wants to start using Airflow to schedule the creation of datasets and models.
We would like our own instance on the dse-k8s cluster similar to what is set up for other teams (assuming that dse-k8s is the suggested way moving forward).
To start with, we want the team to get acquainted with the processes and tools, have a place to create pipelines (DAGs), and experiment with building datasets and models.
Access to a GPU is not necessary initially, and we can discuss later how to configure the new ml-train machines with Airflow.

Event Timeline

isarantopoulos renamed this task from Create an aiflow instance for ML to Create an Airflow instance for ML.Nov 19 2024, 10:29 AM
BTullis moved this task from Incoming to SRE on the Data-Platform board.
BTullis subscribed.

Thanks @isarantopoulos - I'm sure that we can help you, here.

We would like our own instance on the dse-k8s cluster similar to what is set up for other teams (assuming that dse-k8s is the suggested way moving forward).

Yes, dse-k8s is definitely the suggested way forward.

We will be able to skip past a VM-based Airflow instance, as well as a metadata database on the shared bare-metal PostgreSQL servers.

However, you've caught us at a rather interesting time too, because we're about to start switching our existing instances from the LocalExecutor to the KubernetesExecutor in T364389: Migrate the airflow scheduler components to Kubernetes.

So it probably makes sense for your instance to start out with the KubernetesExecutor, as the test_k8s instance does.

In addition to that, since you're looking at new workloads rather than migrating existing DAGs, you might be interested in the alternative operators that we're looking at.

I imagine that the KubernetesPodOperator would be particularly useful for the GPU-based workloads, but you will probably know best.

If you prefer to use the Hadoop cluster and YARN workers for your distributed workloads, then you can keep a close eye on our work to support the Spark and Skein jobs on YARN in T364389.
You will also have the BashOperator and PythonOperator readily available, if these meet your needs.
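To make the operator choice a bit more concrete, here is a rough sketch of a DAG that mixes a KubernetesPodOperator task with a BashOperator task. This is a hypothetical, untested illustration: the DAG id, image name, and schedule are made up, and the KubernetesPodOperator import path varies with the installed cncf.kubernetes provider version.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
# In older provider versions this lives under operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="ml_dataset_example",  # hypothetical DAG id
    start_date=datetime(2024, 12, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    # Runs in its own pod, so per-task resource requests (and, later on,
    # GPU requests) can be declared on the task itself.
    build_dataset = KubernetesPodOperator(
        task_id="build_dataset",
        name="build-dataset",
        image="example-registry/ml-dataset-builder:latest",  # hypothetical image
        cmds=["python", "build_dataset.py"],
    )

    # A simple shell step, executed by the worker itself.
    report = BashOperator(
        task_id="report",
        bash_command="echo 'dataset build finished'",
    )

    build_dataset >> report
```

The main design point is that the KubernetesPodOperator decouples the task's runtime environment (image, resources) from the Airflow workers, which is why it tends to suit specialised workloads such as GPU jobs.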

We can probably get your instance up and running fairly quickly, but we are also quite busy with migrating the existing instances. What is the sort of timetable that you would prefer, to get started working with your instance?

Thanks for the info. The Spark operator seems like a great choice but we'll get to that when we actually start developing stuff. It also depends on the resources available on the dse-k8s cluster vs YARN.

What is the sort of timetable that you would prefer, to get started working with your instance?

We would like to have something set up so we can work on it next quarter. In the meantime we could get away with local development + Docker. Could we also use the test_k8s instance for dev/testing?

Gehel triaged this task as Medium priority.Nov 25 2024, 1:25 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Change #1102249 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/dns@master] airflow-ml: define DNS records

https://gerrit.wikimedia.org/r/1102249

Change #1102254 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-ml: define helmfile and values

https://gerrit.wikimedia.org/r/1102254

Change #1102255 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] deployment_server: define airflow-ml users

https://gerrit.wikimedia.org/r/1102255

Change #1102256 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] airflow-ml: define ATS mapping rules and cache settings

https://gerrit.wikimedia.org/r/1102256

Change #1102257 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] airflow-ml: define CAS config

https://gerrit.wikimedia.org/r/1102257

Change #1102258 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] openldap: define new offloaded airflow-ml-ops group

https://gerrit.wikimedia.org/r/1102258

brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-ml --display-name="postgresql-airflow-ml"
{
    "user_id": "postgresql-airflow-ml",
    "display_name": "postgresql-airflow-ml",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "postgresql-airflow-ml",
            "access_key": "REDACTED",
            "secret_key": "REDACTED"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}

brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=airflow-ml --display-name="airflow-ml"
{
    "user_id": "airflow-ml",
    "display_name": "airflow-ml",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "airflow-ml",
            "access_key": "REDACTED",
            "secret_key": "REDACTED"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://postgresql-airflow-ml.dse-k8s-eqiad
Bucket 's3://postgresql-airflow-ml.dse-k8s-eqiad/' created
brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://logs.airflow-ml.dse-k8s-eqiad
Bucket 's3://logs.airflow-ml.dse-k8s-eqiad/' created
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey airflow/airflow-ml.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA
brouberol@krb1001:~$ sudo kadmin.local ktadd -norandkey -k analytics-ml.keytab airflow/airflow-ml.discovery.wmnet@WIKIMEDIA HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA
Entry for principal airflow/airflow-ml.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-ml.keytab.
Entry for principal HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-ml.keytab.

No analytics user is defined for ML yet, so we'll just include the 2 principals related to the HTTP API for now.

Change #1102268 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces

https://gerrit.wikimedia.org/r/1102268

Change #1102249 merged by Brouberol:

[operations/dns@master] airflow-ml: define DNS records

https://gerrit.wikimedia.org/r/1102249

Change #1102255 merged by Brouberol:

[operations/puppet@production] deployment_server: define airflow-ml users

https://gerrit.wikimedia.org/r/1102255

Change #1102280 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-ml: define kubernetes namespace

https://gerrit.wikimedia.org/r/1102280

Change #1102280 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-ml: define kubernetes namespace

https://gerrit.wikimedia.org/r/1102280

Change #1102268 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces

https://gerrit.wikimedia.org/r/1102268

Change #1102254 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-ml: define helmfile and values

https://gerrit.wikimedia.org/r/1102254

Change #1102257 merged by Brouberol:

[operations/puppet@production] airflow-ml: define CAS config

https://gerrit.wikimedia.org/r/1102257

Change #1102308 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-ml: fix typo

https://gerrit.wikimedia.org/r/1102308

Change #1102308 merged by Brouberol:

[operations/deployment-charts@master] airflow-ml: fix typo

https://gerrit.wikimedia.org/r/1102308

Both Airflow and the CloudNativePG cluster were deployed:

brouberol@deploy2002:~$ kubectl get pod -n airflow-ml
NAME                                               READY   STATUS    RESTARTS   AGE
airflow-gitsync-d76b48d7f-d6tpg                    1/1     Running   0          57s
airflow-kerberos-bcf697468-hmhwm                   1/1     Running   0          57s
airflow-scheduler-85b68dc7fd-zh86v                 2/2     Running   0          57s
airflow-webserver-56767b7f46-g2gpx                 2/2     Running   0          57s
postgresql-airflow-ml-1                            1/1     Running   0          9m6s
postgresql-airflow-ml-2                            1/1     Running   0          8m22s
postgresql-airflow-ml-pooler-rw-7d5d74cb69-4zg5x   1/1     Running   0          9m20s
postgresql-airflow-ml-pooler-rw-7d5d74cb69-m2x6s   1/1     Running   0          9m20s
postgresql-airflow-ml-pooler-rw-7d5d74cb69-zcgkt   1/1     Running   0          9m20s

Change #1102258 merged by Brouberol:

[operations/puppet@production] openldap: define new offloaded airflow-ml-ops group

https://gerrit.wikimedia.org/r/1102258

Change #1102256 merged by Brouberol:

[operations/puppet@production] airflow-ml: define ATS mapping rules and cache settings

https://gerrit.wikimedia.org/r/1102256

All done!

Screenshot 2024-12-11 at 15.30.31.png (2×2 px, 304 KB)

The URL is https://airflow-ml.wikimedia.org

For all of y'all who are new to how we manage airflow in kubernetes, here's the main documentation page: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes

Any change you make under https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/ml/dags will be automatically reflected in your airflow instance within 5 minutes after the MR is merged.
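For reference, a minimal sketch of what a DAG file under ml/dags could look like. The file name and dag_id here are hypothetical, not an existing DAG; once an MR adding such a file is merged, gitsync would pick it up within the 5-minute window mentioned above.

```python
# Hypothetical file: ml/dags/hello_ml_dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_ml",               # hypothetical DAG id
    start_date=datetime(2024, 12, 1),
    schedule=None,                   # trigger manually while experimenting
    catchup=False,
) as dag:
    BashOperator(task_id="hello", bash_command="echo 'hello from airflow-ml'")
```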

Have fun!

Thanks @brouberol !
I'm getting an error when checking the k8s namespace for airflow-ml.

kube_env airflow-ml dse-k8s-eqiad
You don't have permission to read the configuration for airflow-ml/dse-k8s-eqiad (try sudo)

I assume there is no such configuration atm, is that right?

Ah, it does, but it's owned by root and analytics-deployers, which you might not be a member of. Let me see whether I can adjust the ownership.

@isarantopoulos I've temporarily chowned the user config files to root:deploy-ml-service. Can you confirm that it works better for you? If so, I'll make the change permanent. Thanks!

Ah, it does, but it's owned by root and analytics-deployers, which you might not be a member of. Let me see whether I can adjust the ownership.

Ah yes, I was just about to say that maybe deploy-ml-service would be the right group here.

In one sense, maybe @isarantopoulos doesn't actually need deployment access.
The only thing that I can think of would be using the airflow cli, as per: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_airflow_CLI

Everything else should be GitOps, based on a merge to main in airflow-dags. But maybe I'm missing something.

I can now use the configuration but it throws an error:

isaranto@deploy2002:~$ kube_env airflow-ml dse-k8s-eqiad
isaranto@deploy2002:~$ kubectl get pods
Error in configuration:
* unable to read client-cert /etc/kubernetes/pki/dse__airflow-ml.pem for airflow-ml due to open /etc/kubernetes/pki/dse__airflow-ml.pem: permission denied
* unable to read client-key /etc/kubernetes/pki/dse__airflow-ml-key.pem for airflow-ml due to open /etc/kubernetes/pki/dse__airflow-ml-key.pem: permission denied

We have used analytics-deployers for all of the other instances, but even then I think that it's mainly Data-Platform-SRE who would need to interact with these namespaces.

I'm getting an error when checking the k8s namespace for airflow-ml.

@isarantopoulos have you got something specific in mind that you would like to use kubectl to achieve?
Maybe there is a way of doing the same from the UI, or from the airflow cli.
I'm not trying to be mean with privileges, just working on getting the right level of functionality for you. You're the first team who has onboarded directly to a k8s-based instance.

@BTullis It just struck me that you need access to the -deploy user credentials to use the airflow CLI, as you need to exec into an airflow container to run it.

@BTullis It just struck me that you need access to the -deploy user credentials to use the airflow CLI, as you need to exec into an airflow container to run it.

Yes, that's true, but maybe that should be a driver for us to implement a better way of accessing the airflow cli, rather than relaxing the permissions on the namespaces. Just thinking aloud, really. Probably not something we need to fix today, either.

I was just following the documentation to see if everything works. At the moment I don't need it for anything specific, just to be able to explore things and debug anything that may come up.
I assumed that since it is our instance and we're not interfering with anybody else's work, we would have ownership of it.
All good for now, we can revisit privileges when we start working on this if such a requirement arises.
thank you both!