
Airflow instance for wikidata platform
Closed, Resolved · Public

Description

The new Wikidata Platform team will need:

  1. A dedicated Airflow scheduler instance.
  2. The corresponding Kubernetes resources and deployment.
  3. A new Airflow DAGs monorepo setup for their pipelines.

Could you help set this up (or advise on the process/owners) so the team can start taking ownership of data pipelines currently
deployed on the Search Platform instance?

The setup follows the process outlined in https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes/Operations#Creating_a_new_instance:

  • Create Kubernetes read and deploy user credentials
  • Add a namespace
  • Create the public and internal DNS records for airflow-wikidata.wikimedia.org
  • Define the PG cluster and Airflow instance helmfile.yaml files and associated values (in review)
  • Generate the S3 keypairs for both PG and Airflow
  • Create the S3 buckets for both PG and Airflow
  • Register the service in our IDP server
  • Issue a Kerberos keytab
  • Generate the secrets for both the PG cluster and the Airflow instance
  • Register the PG bucket name and keys
  • Create the ops group for the instance
  • Create the dags folder and a sample DAG
  • Create UNIX user/group analytics-wikidata and the corresponding analytics-wikidata-users
  • Create the HDFS folders
  • Configure out-of-band backups

Event Timeline

Gehel triaged this task as Medium priority. Sep 9 2025, 2:02 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Hi @gmodena - this all sounds fine, but I have a question about point 3:

A new Airflow DAGs monorepo setup for their pipelines.

All of our other Airflow instances share the same airflow-dags git repository, and we have a subdirectory per team.
Is there a reason why this shared repository model wouldn't work for the Wikidata Platform team?

Welp. Phrasing. I meant having a new directory in airflow-dags.

We'll need an instance that behaves exactly like the others.

Apologies for the confusion.
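The shared-repository model discussed above can be sketched as follows. This is a minimal illustration, assuming the convention of one subdirectory per team in the shared airflow-dags repository; the `wikidata` subdirectory and file names here are hypothetical, not the actual repo layout:

```python
# Minimal sketch of the per-team layout convention in the shared airflow-dags
# repository (directory and file names here are hypothetical).
import tempfile
from pathlib import Path

repo = Path(tempfile.mkdtemp()) / "airflow-dags"

# Each instance parses only its own team subdirectory.
team_dags = repo / "wikidata" / "dags"
team_dags.mkdir(parents=True)

# A placeholder DAG module that the instance's gitsync sidecar would pick up.
(team_dags / "sample_dag.py").write_text(
    "# Placeholder DAG for the airflow-wikidata instance\n"
)

layout = sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*.py"))
print(layout)  # ['wikidata/dags/sample_dag.py']
```

The point of the convention: every team shares one git repository and one review process, while each Airflow instance only loads DAGs from its own subdirectory.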

Change #1190968 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow: Setup an instance for wikidata platform team

https://gerrit.wikimedia.org/r/1190968

Change #1190974 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] Add airflow-wikidata namespace in admin_ng

https://gerrit.wikimedia.org/r/1190974

Change #1190975 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] Define airflow-wikidata PG cluster and airflow instance

https://gerrit.wikimedia.org/r/1190975

Change #1190977 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/dns@master] dns: provision airflow-wikidata domain

https://gerrit.wikimedia.org/r/1190977

Change #1190979 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] idp: Register airflow-wikidata IDP services

https://gerrit.wikimedia.org/r/1190979

Change #1190968 merged by Stevemunene:

[operations/puppet@production] airflow: Setup an instance for wikidata platform team

https://gerrit.wikimedia.org/r/1190968

Change #1190977 merged by Stevemunene:

[operations/dns@master] dns: provision airflow-wikidata domain

https://gerrit.wikimedia.org/r/1190977

Change #1191190 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] idp: Add dummy data for airflow-wikidata

https://gerrit.wikimedia.org/r/1191190

Change #1191190 merged by Stevemunene:

[labs/private@master] idp: Add dummy data for airflow-wikidata

https://gerrit.wikimedia.org/r/1191190

Change #1190979 merged by Stevemunene:

[operations/puppet@production] idp: Register airflow-wikidata IDP services

https://gerrit.wikimedia.org/r/1190979

Generated the S3 keypairs for both PG and Airflow with:

sudo radosgw-admin user create --uid=postgresql-airflow-wikidata --display-name="postgresql-airflow-wikidata"
sudo radosgw-admin user create --uid=airflow-wikidata --display-name="airflow-wikidata"

Created the S3 buckets for both PG and Airflow:

stevemunene@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://postgresql-airflow-wikidata.dse-k8s-eqiad
Bucket 's3://postgresql-airflow-wikidata.dse-k8s-eqiad/' created
stevemunene@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad.dpe.anycast.wmnet --region=dpe --host-bucket=no mb s3://logs.airflow-wikidata.dse-k8s-eqiad
Bucket 's3://logs.airflow-wikidata.dse-k8s-eqiad/' created

Issued a keytab for the instance with:

stevemunene@krb1002:~$ sudo kadmin.local addprinc -randkey analytics-wikidata/airflow-wikidata.discovery.wmnet@WIKIMEDIA
stevemunene@krb1002:~$ sudo kadmin.local addprinc -randkey airflow/airflow-wikidata.discovery.wmnet@WIKIMEDIA
stevemunene@krb1002:~$ sudo kadmin.local addprinc -randkey HTTP/airflow-wikidata.discovery.wmnet@WIKIMEDIA
stevemunene@krb1002:~$ sudo kadmin.local ktadd -norandkey -k analytics-wikidata.keytab \
    analytics-wikidata/airflow-wikidata.discovery.wmnet \
    airflow/airflow-wikidata.discovery.wmnet@WIKIMEDIA \
    HTTP/airflow-wikidata.discovery.wmnet@WIKIMEDIA
Entry for principal analytics-wikidata/airflow-wikidata.discovery.wmnet with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-wikidata.keytab.
Entry for principal airflow/airflow-wikidata.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-wikidata.keytab.
Entry for principal HTTP/airflow-wikidata.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:analytics-wikidata.keytab.
stevemunene@krb1002:~

Added the corresponding keys and secrets

Change #1190974 merged by jenkins-bot:

[operations/deployment-charts@master] Add airflow-wikidata namespace in admin_ng

https://gerrit.wikimedia.org/r/1190974

Change #1191349 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] admin/data: add the analytics-wikidata system user and user groups

https://gerrit.wikimedia.org/r/1191349

Change #1191578 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] airflow-wikidata: define ATS mapping rules and cache settings

https://gerrit.wikimedia.org/r/1191578

Change #1191349 merged by Stevemunene:

[operations/puppet@production] admin/data: add the analytics-wikidata system user and user groups

https://gerrit.wikimedia.org/r/1191349

Completed the Hadoop setup by creating the UNIX user/group and the ops group, then:

Created a home directory in HDFS for the analytics-wikidata user:

stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -mkdir /user/analytics-wikidata
mkdir: `/user/analytics-wikidata': File exists
stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -chown analytics-wikidata /user/analytics-wikidata
stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -chgrp analytics-wikidata-users /user/analytics-wikidata
stevemunene@an-master1003:~$

Created a temporary directory for the airflow instance:

stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -mkdir /tmp/wikidata
stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -chown analytics-wikidata /tmp/wikidata
stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -chown analytics-privatedata-users /tmp/wikidata

Created an artifact directory for the airflow instance:

stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -mkdir /wmf/cache/artifacts/airflow/wikidata
stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -chown blunderbuss /wmf/cache/artifacts/airflow/wikidata
stevemunene@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs dfs -chgrp blunderbuss /wmf/cache/artifacts/airflow/wikidata
stevemunene@an-master1003:~$

Change #1190975 merged by jenkins-bot:

[operations/deployment-charts@master] Define airflow-wikidata airflow instance

https://gerrit.wikimedia.org/r/1190975

Deployed the airflow-wikidata instance, but the kerberos, scheduler, and webserver pods are failing to start.

airflow-task-shell-847d6c7df6-nn6ft     0/1     ContainerCreating   0          4s
airflow-scheduler-7ffcc89767-sdkxg      0/1     ContainerCreating   0          4s
airflow-scheduler-7ffcc89767-sdkxg      0/1     CreateContainerConfigError   0          5s
airflow-webserver-6d5c4bb98f-t8gz8      0/2     Init:CreateContainerConfigError   0          5s
airflow-kerberos-58df6b54d5-lkbq4       0/1     ContainerCreating                 0          5s
airflow-gitsync-7bf8cb5cd8-mvzxv        1/1     Running                           0          5s
airflow-task-shell-847d6c7df6-nn6ft     1/1     Running                           0          7s
airflow-statsd-7c7f99f8c8-6xbmt         1/1     Running                           0          7s
airflow-envoy-864456f7d-cgc8x           1/1     Running                           0          8s
airflow-kerberos-58df6b54d5-lkbq4       0/1     ContainerCreating                 0          19s
airflow-kerberos-58df6b54d5-lkbq4       0/1     CreateContainerConfigError        0          20s

The pods report the following event: `Warning Failed 2m18s (x11 over 4m6s) kubelet Error: secret "postgresql-airflow-wikidata-app" not found`.
Checking against the secrets listed at https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes/Operations#Creating_a_new_instance, all seem to be present.

@Stevemunene You don't seem to have deployed PG, which is a prerequisite for deploying Airflow.

As Steve is OOO, I'm going to take this over to unblock @gmodena.

Change #1191578 merged by Brouberol:

[operations/puppet@production] airflow-wikidata: define ATS mapping rules and cache settings

https://gerrit.wikimedia.org/r/1191578

I deployed postgresql-airflow-wikidata, after which I deployed airflow-wikidata.

NAME                                                     READY   STATUS    RESTARTS      AGE
airflow-envoy-96566d9ff-jzgh8                            1/1     Running   0             64s
airflow-gitsync-766585dc4f-9mqb2                         1/1     Running   0             64s
airflow-hadoop-shell-5c459d6b76-96bjr                    1/1     Running   0             64s
airflow-kerberos-7665f486b8-28t4l                        1/1     Running   1 (30s ago)   64s
airflow-scheduler-6c7794c4d8-xn77s                       1/1     Running   2 (40s ago)   64s
airflow-statsd-7c7f99f8c8-89sjm                          1/1     Running   0             64s
airflow-task-shell-67d7784dd9-q7h8b                      1/1     Running   0             64s
airflow-webserver-77bb66f496-7n7wc                       2/2     Running   0             64s
postgresql-airflow-wikidata-1                            1/1     Running   0             5m5s
postgresql-airflow-wikidata-2                            1/1     Running   0             3m29s
postgresql-airflow-wikidata-pooler-rw-596f66c9cc-fh8wp   1/1     Running   0             5m35s
postgresql-airflow-wikidata-pooler-rw-596f66c9cc-kqg6c   1/1     Running   0             5m35s
postgresql-airflow-wikidata-pooler-rw-596f66c9cc-sr4wh   1/1     Running   0             5m35s

We can now merge the ATS config change.

Screenshot 2025-10-14 at 16.31.44.png (1×3 px, 315 KB)
The instance is now running, and our "filler" DAG is failing to import due to some cross-folder imports we need to remove.
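The cross-folder import failure can be reproduced with a small sketch (all folder and module names below are hypothetical, for illustration only): a DAG file that reaches into a sibling team's folder fails to import when only the instance's own subdirectory is on the Python path.

```python
# Sketch of why a DAG that imports from another team's folder fails when
# only its own subdirectory is importable (names are hypothetical).
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "search").mkdir()
(root / "search" / "helpers.py").write_text("VALUE = 42\n")
(root / "wikidata").mkdir()
(root / "wikidata" / "filler_dag.py").write_text(
    "from search.helpers import VALUE\n"  # cross-folder import
)

# Only the instance's own DAGs folder is on the path, not the repo root.
sys.path.insert(0, str(root / "wikidata"))

try:
    import filler_dag  # noqa: F401
    error = None
except ModuleNotFoundError as exc:
    error = str(exc)

print("import failed:", error)  # No module named 'search'
```

Removing the cross-folder import (or moving the shared helper into the instance's own subdirectory) makes the DAG import cleanly.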