
Finish and deploy scraper Airflow job
Closed, Resolved · Public

Description

This subtask is about finishing the Airflow integration.

Status:

SQL and Airflow job:

We can now run the full job in an Airflow devenv. Only small deployment steps should be required to move this into production:

  • Final package release of the scraper, from the main branch.
  • Warm up the artifact cache with this packaged release.
  • Create the tables in production.

Follow up:

  • Write up some reflections on Airflow integration on Wikimedia infrastructure. (This will become a "deep dive" talk on 5 May.)
  • Document the new tables in https://datahub.wikimedia.org/, including the completeness and time frame of the backfilled data.

Details

Related Changes in GitLab:
Title                                            | Reference                                | Author | Source Branch | Dest Branch
Only keep the latest month of Cite per-page data | repos/wmde/analytics!26                  | awight | delete-more   | main
Increase Cite Refs scraper memory to 6GB         | repos/data-engineering/airflow-dags!2071 | awight | more-mem      | main
Rely on SkeinOperator krb5 setup                 | repos/data-engineering/airflow-dags!2070 | awight | krb5-defaults | main

Event Timeline

awight updated the task description.
awight removed awight as the assignee of this task. (Edited Feb 27 2026, 9:51 AM)
awight updated the task description.
awight updated the task description.
awight added subscribers: xcollazo, JAllemandou, brouberol.

(CC'ing data platform engineers who have generously helped us, and who might be interested in watching the exciting conclusion.)

awight updated the task description.

andrewtavis-wmde merged https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/23

Template all source and destination table names in Cite refs script
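
(Illustration only: a rough sketch of what templating source and destination table names might look like in a query-building step. The table names and query below are invented placeholders, not the actual Cite refs script.)

```
# Hypothetical example of templating table names instead of hard-coding them;
# all identifiers below are made up for illustration.
SOURCE_TABLE = "some_db.cite_refs_source"    # placeholder
DEST_TABLE = "some_db.cite_refs_per_page"    # placeholder

QUERY_TEMPLATE = """
INSERT OVERWRITE TABLE {dest}
SELECT page_id, COUNT(*) AS ref_count
FROM {source}
GROUP BY page_id
"""

# Substitute the configured table names into the query before running it.
query = QUERY_TEMPLATE.format(source=SOURCE_TABLE, dest=DEST_TABLE)
print(query)
```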

I found that the BashSensor is not getting the correct KRB5CCNAME, so I'll have to implement an append_env mechanism that adds my additional variables to the executor's environment.

These variables can be extracted from os.environ at DAG parse time (a sketch of the append_env idea follows the list below):

HADOOP_CONF_DIR: /etc/hadoop/conf
KRB5CCNAME: /tmp/airflow_krb5_ccache/krb5cc
KRB5_CONFIG: /etc/krb5.conf
KRB5_KEYTAB: /etc/kerberos/keytabs/airflow.keytab
KRB5_PRINCIPAL: analytics-wmde/airflow-wmde.discovery.wmnet
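
(A minimal sketch of the append_env idea, assuming the stock BashSensor hands its env argument straight to the subprocess, so setting env alone would replace rather than extend the executor's environment. The task id, bash command, and variable selection are illustrative only, not the production DAG.)

```
import os

from airflow.sensors.bash import BashSensor


class AppendEnvBashSensor(BashSensor):
    """BashSensor variant that merges extra variables into the inherited
    environment instead of replacing it, as the plain `env` argument would."""

    def __init__(self, *, append_env=None, **kwargs):
        super().__init__(**kwargs)
        self.append_env = append_env or {}

    def poke(self, context):
        # Start from the executor's runtime environment, then layer on any
        # explicit `env` values and finally the extra append_env variables.
        self.env = {**os.environ, **(self.env or {}), **self.append_env}
        return super().poke(context)


# Hypothetical usage (would normally sit inside a `with DAG(...)` block):
# forward the Kerberos/Hadoop settings captured at DAG parse time so that
# kinit/hdfs calls inside the sensor can find them.
wait_for_source = AppendEnvBashSensor(
    task_id="wait_for_source_data",               # illustrative task id
    bash_command="hdfs dfs -test -e /some/path",  # illustrative command
    append_env={
        key: os.environ[key]
        for key in ("HADOOP_CONF_DIR", "KRB5CCNAME", "KRB5_CONFIG")
        if key in os.environ
    },
)
```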

However, the k8s pod spec places the krb5cc somewhere else:

mountPath: "/tmp/airflow_krb5_ccache"

This causes my sensor to fail on its first attempt to access Hadoop:

kinit: Failed to store credentials: No credentials cache found (filename: /tmp/airflow_krb5_ccache/krb5cc) while getting initial credentials

The job seems to be hitting an out-of-memory error now. I increased the memory from 2GB to 4GB but now it crashes in the third chunk of dewiki.

This is surprising because earlier runs showed flat memory usage, well under 2GB.

Running locally shows that memory stabilizes just above 4GB, so I'll set the job limit to 6GB. We'll need to come back later and diagnose whether there is a new memory leak or duplication, maybe caused by the change in output methods.
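
(Side note, not part of the job itself: one quick way to confirm peak memory in a local run is to check the process's maximum resident set size at the end; on Linux ru_maxrss is reported in KiB.)

```
import resource

# Peak resident set size of the current process; on Linux ru_maxrss is in KiB.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / (1024 * 1024):.2f} GiB")
```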

For anyone who wants to monitor memory usage: Thanos