
Update Commons Impact Metrics to account for new File table
Closed, Resolved (Public)

Description

The parent task introduces a new table, File, to which images will eventually be migrated, if I understand correctly. Commons Impact Metrics should be updated to work with it.

Details

Related Changes in GitLab:
Title: Migrate image table to file and filetypes tables for Commons Impact Metrics
Reference: repos/data-engineering/airflow-dags!1268
Author: mforns
Source Branch: cim-migrate-image-table-to-file
Dest Branch: main

Event Timeline

Ahoelzl added subscribers: mforns, Ahoelzl.

@mforns can you please clarify impact and urgency?

TL;DR: @Ahoelzl, I think we should act on this ASAP.

@mforns, I spoke to Amir and he gave me this picture of where the migration stands:

  • The File table has now been created on all wikis
  • Migration is mostly done on wikis other than Commons
  • Commons will be migrated soon
  • Once this migration is done, MediaWiki will write to both tables, so technically we don't need to change anything for as long as that remains the case
  • After this migration is finished, they will start the migration to deprecate and remove the image table

When that last step starts, we should have the updates to Commons Impact Metrics ready. If we don't, Commons Impact Metrics will break. Amir expects that last step to be completed by June at the latest, and possibly earlier.

Ahoelzl triaged this task as High priority. Apr 3 2025, 3:46 PM

Change #1139115 had a related patch set uploaded (by Mforns; author: Mforns):

[operations/puppet@production] Add file and filetypes tables to the mediawiki-not-history sqoop

https://gerrit.wikimedia.org/r/1139115

mforns updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1268

Draft: Migrate image table to file and filetypes tables for Commons Impact Metrics

There's also this change, which for some reason wasn't automatically posted by gerritbot:

https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1139117

I've tested the changes:

  • create statements work well
  • sqoop library works well
  • modifications to the HQL query seem to produce the expected data
  • the Airflow DAG works as expected

Moved to In Code Review!


However, the file table is still being populated in MediaWiki databases, currently at about 82%.
If we deploy now (before the next sqoop run), we might get incomplete data:

  • CIM API would not be affected (the incomplete data would not reach it).
  • In the Data Lake, the only affected CIM table would be wmf_contributors.commons_media_file_metrics_snapshot for year_month="2025-04", in which fewer than 15% of the records would have media_type=NULL.
  • The public CIM dumps would have the same incompleteness as the Data Lake table above.

NOTE: It may be best to wait for the 2025-04 Sqoop run and the corresponding CIM DAG runs to finish before deploying this fix.
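
To make the incompleteness concern above concrete, here is a minimal sketch of how the share of records with media_type=NULL could be measured before deciding to deploy. It runs in plain Python over in-memory rows rather than against the Data Lake. The function name, the sample rows and the sample media type values are illustrative assumptions, not part of the CIM codebase; only the column name media_type comes from the comment above.

```python
from typing import Iterable, Mapping, Optional


def null_media_type_share(rows: Iterable[Mapping[str, Optional[str]]]) -> float:
    """Return the fraction of rows whose media_type is missing (None).

    Hypothetical helper for illustration only; an empty input yields 0.0.
    """
    total = 0
    nulls = 0
    for row in rows:
        total += 1
        # A missing key is treated the same as an explicit NULL.
        if row.get("media_type") is None:
            nulls += 1
    return nulls / total if total else 0.0


# Made-up sample rows standing in for one monthly snapshot.
sample = [
    {"media_type": "BITMAP"},
    {"media_type": None},
    {"media_type": "AUDIO"},
    {"media_type": "BITMAP"},
]

share = null_media_type_share(sample)
print(f"{share:.0%} of sample records have media_type=NULL")
```

In practice the same ratio would more likely be computed with a query against the snapshot table for the relevant year_month; the sketch only shows the arithmetic.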

Change #1139115 merged by Bking:

[operations/puppet@production] Add file and filetypes tables to the mediawiki-not-history sqoop

https://gerrit.wikimedia.org/r/1139115