
Update the editor_month table with an Airflow job
Open, High, Public

Description

Among the movement_metric job queries, one inserts data into wmf_product.editor_month. We should migrate it to an Airflow job.
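
For orientation, here's a minimal sketch of the kind of DAG this implies, using the stock SparkSqlOperator from the apache-spark provider. The real job in data-engineering/airflow-dags uses WMF's own operator wrappers and conventions, so the DAG id, schedule, and inlined query below are all illustrative assumptions, not the production code.

```
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

with DAG(
    dag_id="editor_month",            # illustrative; the real DAG id may differ
    start_date=datetime(2024, 1, 1),  # assumption
    schedule="@monthly",              # the table is updated monthly
    catchup=False,
) as dag:
    SparkSqlOperator(
        task_id="update_editor_month",
        # The real query lives in repos/movement-insights/sql; this
        # stand-in only shows where it would be wired in.
        sql="INSERT INTO wmf_product.editor_month SELECT /* ... */",
    )
```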

  • Write updated SQL queries for table creation and updating
  • Compare new query output with old
  • Write Airflow job
  • Test job on analytics client
  • Deploy job
  • Create new table
  • Run backfill
  • Re-run backfill with time bounds fix
  • Redeploy job with time bounds fix
  • Update metric queries to use new table
  • Update Superset dashboard to use new table
  • Document new table in DataHub
  • Remove old generation code
  • Drop old table

Details

Title | Reference | Author | Source Branch | Dest Branch
update_editor_month: Fix performance and typo | repos/movement-insights/sql!2 | nshahquinn-wmf | work/nshahquinn-wmf/editor_month | main
Add new editor_month job | repos/data-engineering/airflow-dags!702 | nshahquinn-wmf | editor_month | main
Add new queries for the editor_month table | repos/movement-insights/sql!1 | nshahquinn-wmf | work/nshahquinn-wmf/editor_month | main

Event Timeline

nshahquinn-wmf created this task.

Currently blocked on T364359 since I need to sudo as analytics-product to spin up a test Airflow instance.

This is now finished and awaiting review.

The job has been deployed to production, but it failed because wmf_contributors.editor_month hasn't been created yet. I don't have the permissions to create it, so I've asked a member of Data Platform to do it instead.

Once the table has been created, we will need to clear the status of the failed run so that it gets re-run, and then unpause the DAG. After that, the job should be fully operational.
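
For reference, both steps can be done from the UI, the airflow CLI, or the stable REST API. Here is a hedged sketch of the REST version, with placeholder base URL, credentials, and an assumed DAG id:

```
import requests

BASE = "https://airflow.example.org/api/v1"  # placeholder instance URL
AUTH = ("user", "password")                  # placeholder credentials

# Clear the failed run's task instances so the scheduler retries them
requests.post(
    f"{BASE}/dags/editor_month/clearTaskInstances",  # DAG id assumed
    json={"dry_run": False, "only_failed": True},
    auth=AUTH,
).raise_for_status()

# Unpause the DAG so scheduling resumes
requests.patch(
    f"{BASE}/dags/editor_month",
    json={"is_paused": False},
    auth=AUTH,
).raise_for_status()
```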

Getting the new table created is currently blocked because it's not clear where the data should be stored in HDFS (T367243).

Following the discussion on Slack, I've created the table and changed the directory owner to analytics-product. The long-term discussion about HDFS directory architecture and ownership is in T367243.

Spark allows you to update the location of Iceberg tables, so it shouldn't be a problem to update it later.
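
A minimal sketch of what that later change could look like from PySpark, assuming the catalog accepts Iceberg's SET LOCATION DDL; the HDFS path is a placeholder pending the outcome of T367243:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the Iceberg table at a new data location (placeholder path)
spark.sql("""
    ALTER TABLE wmf_contributors.editor_month
    SET LOCATION 'hdfs://analytics-hadoop/path/decided/in/T367243'
""")
```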

nshahquinn-wmf moved this task from Waiting on others to Doing on the Movement-Insights board.

The job is unpaused and running correctly.

It turned out that I hadn't coded the job in a way that allows us to run a backfill with a fixed mediawiki_history snapshot, and when I looked back at the advice I'd gotten on how to do that, all the possible approaches looked very complicated. In the end, I decided to give up on using Airflow for the backfill and instead wrote a simple Python script to submit the backfill queries directly. That's running now.
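
The script isn't included here, but it amounted to something like the sketch below: loop over months and submit each month's query with the snapshot pinned. The query text, source table, and parameterization are assumptions; the real query is in repos/movement-insights/sql.

```
from pyspark.sql import SparkSession

SNAPSHOT = "2024-04"  # the pinned mediawiki_history snapshot (illustrative)

spark = SparkSession.builder.getOrCreate()

months = [f"{y}-{m:02d}" for y in range(2001, 2025) for m in range(1, 13)]
for month in months:
    # Stand-in for the real editor_month insert query, assumed to take
    # a snapshot and a month as parameters
    spark.sql(f"""
        INSERT INTO wmf_contributors.editor_month
        SELECT *  -- stand-in for the real editor_month aggregation
        FROM editor_month_staging  -- hypothetical source
        WHERE snapshot = '{SNAPSHOT}' AND month = '{month}'
    """)
```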

There are some odd inconsistencies between the old and new tables in the early months backfilled so far. I had compared the query outputs for 2024-03 in detail and they looked very consistent, so this is likely specific to early history. Still, I should look into it more.
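
For the record, a comparison along these lines (the month column name and matching schemas are assumptions) is enough to spot row-level differences for a single month:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

old = spark.table("wmf_product.editor_month").where("month = '2024-03'")
new = spark.table("wmf_contributors.editor_month").where("month = '2024-03'")

# Rows present in one table but not the other, in each direction
print(old.exceptAll(new).count(), new.exceptAll(old).count())
```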

I did a deep dive into the inconsistencies between the two versions of the table. More than 99% of the difference was due to the underlying shift in mediawiki_history, but there was also a tiny error in the new dataset: I had accidentally included edits occurring during the very first second of the following month.
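
In miniature, the class of bug (the real predicate may have looked different): with second-granularity timestamps, an inclusive upper bound admits edits from the first second of the next month, while a half-open interval does not.

```
# Hypothetical month filter for 2024-03, showing the off-by-one boundary
BUGGY = (
    "event_timestamp >= '2024-03-01 00:00:00' "
    "AND event_timestamp <= '2024-04-01 00:00:00'"  # admits 2024-04-01 00:00:00
)
FIXED = (
    "event_timestamp >= '2024-03-01 00:00:00' "
    "AND event_timestamp < '2024-04-01 00:00:00'"   # half-open: April excluded
)
```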

Since backfilling is cheap, I'm going to re-do it with a fix for that tiny issue.