[Epic] Schedule scraper and aggregations as an Airflow job
Open, Needs Triage, Public

Description

Write a new Airflow job that calls the Technical Wishes scraper to create per-page summaries, then aggregates them into per-wiki summaries.

Limit the initial scope to German Wikipedia; we will expand to other wikis in later work.
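
As a rough illustration of the per-page to per-wiki aggregation step, here is a minimal sketch. It runs the aggregation as SQL against an in-memory SQLite database, used purely so the example is self-contained; the real job would run under Airflow against the actual summary tables. The table name `page_summary`, its columns (`wiki`, `page_id`, `ref_count`), and the sample rows are all hypothetical and are not taken from the scraper's actual schema.

```python
import sqlite3

# Hypothetical per-page summary rows standing in for the scraper output.
# The columns (wiki, page_id, ref_count) are illustrative assumptions.
PAGE_SUMMARIES = [
    ("dewiki", 1, 12),
    ("dewiki", 2, 0),
    ("dewiki", 3, 5),
]

# Roll per-page rows up into one row per wiki.
AGGREGATE_SQL = """
SELECT
    wiki,
    COUNT(*) AS page_count,
    SUM(ref_count) AS total_refs,
    SUM(CASE WHEN ref_count > 0 THEN 1 ELSE 0 END) AS pages_with_refs
FROM page_summary
GROUP BY wiki
ORDER BY wiki
"""


def aggregate_per_wiki(rows):
    """Load per-page rows into a scratch table and return per-wiki summaries."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE page_summary "
            "(wiki TEXT, page_id INTEGER, ref_count INTEGER)"
        )
        conn.executemany("INSERT INTO page_summary VALUES (?, ?, ?)", rows)
        return conn.execute(AGGREGATE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in aggregate_per_wiki(PAGE_SUMMARIES):
        print(row)
```

Keeping the aggregation as a single `GROUP BY wiki` query means that expanding beyond German Wikipedia would only require more input rows, not a change to the query.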

sudo -u analytics-wmde kerberos-run-command analytics-wmde

hdfs dfs -chown -R analytics-wmde:analytics-privatedata-users
  • Enable job in production.

Resources

Details

Related Changes in GitLab:
Title: T412019: Monthly aggregation job for Cite references
Reference: repos/wmde/analytics!20
Author: awight
Source Branch: page-summary
Dest Branch: main
Event Timeline

Aggregations have been reimplemented as SQL and can be code reviewed. Further progress is somewhat blocked by not having actual event data yet.

awight removed awight as the assignee of this task. Jan 8 2026, 8:49 AM
awight claimed this task.
awight updated the task description.
awight added a subscriber: A.Wiki1.
awight removed awight as the assignee of this task. Jan 8 2026, 1:22 PM
awight removed a subscriber: A.Wiki1.
awight renamed this task from "Recreate scraper aggregations as an Airflow job" to "Schedule scraper and aggregations as an Airflow job". Jan 16 2026, 12:24 PM
awight claimed this task.
awight updated the task description.
awight renamed this task from "Schedule scraper and aggregations as an Airflow job" to "[Epic] Schedule scraper and aggregations as an Airflow job". Jan 16 2026, 2:10 PM
awight removed awight as the assignee of this task.
awight added a project: Epic.
awight updated the task description.