The parent task introduces a new table, File, to which image data will eventually be migrated, if I understand correctly. Commons Impact Metrics should be updated to work with it.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Add file and filetypes tables to the mediawiki-not-history sqoop | operations/puppet | production | +1 -1 |
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Migrate image table to file and filetypes tables for Commons Impact Metrics | repos/data-engineering/airflow-dags!1268 | mforns | cim-migrate-image-table-to-file | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Ladsgroup | T368113 Design and merge the new tables of file tables |
| Resolved | | mforns | T389800 Update Commons Impact Metrics to account for new File table |
Event Timeline
TL;DR: @Ahoelzl I think we should action this ASAP.
@mforns I spoke to Amir and he gave me this picture of where the migration is:
- The file table has now been created on all wikis
- Migration is mostly done on wikis other than Commons
- Commons will be migrated soon
- Once this migration is done, MW will be writing to both the image and file tables, so we don't technically need to change anything as long as that is the case
- After this migration is finished, they will start the migration to deprecate and remove the image table
When that last step starts, we should have the updates to Commons Impact Metrics ready. If we don't, it'll break. Amir expects that last step to be completed by June at the latest, and he thinks it may be earlier.
Change #1139115 had a related patch set uploaded (by Mforns; author: Mforns):
[operations/puppet@production] Add file and filetypes tables to the mediawiki-not-history sqoop
mforns updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1268
Draft: Migrate image table to file and filetypes tables for Commons Impact Metrics
There's also this change (which for some reason wasn't automatically posted by gerritbot?):
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1139117
I've tested the changes:
- create statements work well
- sqoop library works well
- modifications to the HQL query seem to produce the expected data
- the Airflow DAG works as expected
Moved to In Code Review!
However, the file table is still being populated in MediaWiki databases, currently at about 82%.
If we deploy now (before the next sqoop run), we might get incomplete data:
- CIM API would not be affected (the incomplete data would not reach it).
- In the Data Lake, the only affected CIM table would be wmf_contributors.commons_media_file_metrics_snapshot for year_month="2025-04", in which <15% of the records would have media_type=NULL.
- The public CIM dumps would have the same incompleteness as the Data Lake tables above.
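If it helps to verify this after the next run, the share of records with a NULL media_type can be checked with something like the sketch below. This is only an illustration, not the actual CIM code: the helper name `null_media_type_share`, the `media_file` key, and the sample rows are all made up, and rows are assumed to be already fetched as dictionaries with a `media_type` key.

```python
from typing import Iterable, Mapping, Optional


def null_media_type_share(rows: Iterable[Mapping[str, Optional[str]]]) -> float:
    """Return the fraction of rows whose media_type is None (0.0 if there are no rows)."""
    total = 0
    nulls = 0
    for row in rows:
        total += 1
        # A missing key is treated the same as an explicit NULL.
        if row.get("media_type") is None:
            nulls += 1
    return nulls / total if total else 0.0


# Hypothetical sample rows; real rows would come from the snapshot table.
sample = [
    {"media_file": "A.jpg", "media_type": "BITMAP"},
    {"media_file": "B.svg", "media_type": "DRAWING"},
    {"media_file": "C.ogg", "media_type": None},
    {"media_file": "D.png", "media_type": "BITMAP"},
]

share = null_media_type_share(sample)
print(f"Share of records with NULL media_type: {share:.1%}")
```

Comparing the resulting share against the <15% estimate above would show whether the snapshot is as incomplete as expected.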
mforns merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1268
Migrate image table to file and filetypes tables for Commons Impact Metrics
Change #1139115 merged by Bking:
[operations/puppet@production] Add file and filetypes tables to the mediawiki-not-history sqoop