The new Wikidata Platform team will need:
- A dedicated Airflow scheduler instance.
- The corresponding Kubernetes resources and deployment.
- A new Airflow DAGs monorepo setup for their pipelines.
Could you help set this up (or advise on the process/owners) so the team can start taking ownership of data pipelines currently
deployed on the Search Platform instance?
The setup for this is outlined in https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes/Operations#Creating_a_new_instance:
- Create Kubernetes read and deploy user credentials
- Add a namespace
- Create the public and internal DNS records (airflow-wikidata.wikimedia.org)
- Define the PG cluster and airflow instance helmfile.yaml files and associated values (in Review)
- Generate the S3 keypairs for both PG and Airflow
- Create the S3 buckets for both PG and Airflow
- Register the service in our IDP server
- Issue a Kerberos keytab
- Generate the secrets for both the PG cluster and the Airflow instance
- Register the PG bucket name and keys
- Create the ops group for the instance
- Create the dags folder and a sample DAG
- Create UNIX user/group analytics-wikidata and the corresponding analytics-wikidata-users
- Create the HDFS folders
- Configure out-of-band backups

