
Adapt Sqoop for imagelinks schema changes
Open, Needs Triage, Public

Description

Similar to T397923 for categorylinks, the imagelinks MediaWiki table is being normalized in T299953 and we should adapt to it.

In addition to Sqoop, it's possible that Commons Impact Metrics also depends on this table directly.

I think T415786 is the ticket that tracks the actual migration in the production database.

We need to do the following:

  • Add il_target_id to Sqoop's imagelinks table definition in python/refinery/sqoop.py
  • Drop the il_to column from Sqoop's imagelinks table definition by setting its value to null
  • Update the CREATE HQL script of wmf_raw.mediawiki_imagelinks so that it includes the new columns
  • Run an ALTER TABLE statement on wmf_raw.mediawiki_imagelinks in Hive to add the new column (ALTER TABLE ADD COLUMN...)
  • Update the commons_impact_metrics HQL script
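
As a rough illustration of the Sqoop and Hive steps above, here is a minimal sketch. It is not the actual contents of python/refinery/sqoop.py. The helper names (build_select, build_add_columns), the column list other than il_to and il_target_id, and the bigint type are all assumptions for illustration. It only shows the idea of selecting null in place of il_to, adding il_target_id, and producing a matching Hive ADD COLUMNS statement.

```python
# Hypothetical sketch only; the real definitions live in python/refinery/sqoop.py
# and may be structured differently.

# Each entry is (output column name, SQL expression used to select it).
# il_to stays in the output schema but is selected as null, per the task list.
# The other pre-existing columns are listed for illustration (assumption).
IMAGELINKS_COLUMNS = [
    ("il_from", "il_from"),
    ("il_to", "null"),
    ("il_from_namespace", "il_from_namespace"),
    ("il_target_id", "il_target_id"),
]


def build_select(table, columns):
    """Build a free-form Sqoop query; $CONDITIONS is Sqoop's split placeholder."""
    select_list = ",\n       ".join(
        expr if expr == name else f"{expr} as {name}" for name, expr in columns
    )
    return f"select {select_list}\n  from {table}\n where $CONDITIONS"


def build_add_columns(table, new_columns):
    """Build a Hive ALTER TABLE ... ADD COLUMNS statement."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in new_columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


imagelinks_query = build_select("imagelinks", IMAGELINKS_COLUMNS)
alter_statement = build_add_columns(
    "wmf_raw.mediawiki_imagelinks",
    [("il_target_id", "bigint")],  # column type is an assumption
)

if __name__ == "__main__":
    print(imagelinks_query)
    print(alter_statement)
```

Keeping il_to in the select list as null (rather than removing it) means the Hive table schema stays backward compatible for existing readers, which appears to be the intent of "setting value to null".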

Event Timeline

Change #1239200 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery@master] Adapt imagelinks pipeline and consumers for imagelink normalization

https://gerrit.wikimedia.org/r/1239200

Given the changes to Commons Impact Metrics (CIM), can we add a step that compares a previous CIM run against a manual CIM run made with the new changes, to confirm the output still matches?
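
One possible shape for that comparison step, as a sketch only: load the metric rows of both runs into dictionaries keyed by whatever identifies a row, then list the keys whose values differ or that exist in only one run. The function name compare_runs, the key layout, and the sample values below are hypothetical and not taken from the CIM pipeline.

```python
def compare_runs(baseline, candidate, rel_tol=0.0):
    """Return {key: (baseline_value, candidate_value)} for rows that differ.

    A row counts as different if it is missing from one run, or if the
    absolute difference exceeds rel_tol times the larger magnitude.
    """
    diffs = {}
    for key in sorted(set(baseline) | set(candidate)):
        old = baseline.get(key)
        new = candidate.get(key)
        if old is None or new is None:
            diffs[key] = (old, new)
        elif abs(new - old) > rel_tol * max(abs(old), abs(new)):
            diffs[key] = (old, new)
    return diffs


# Hypothetical sample data: (category, metric) -> value
previous_run = {("Cat_A", "used_files"): 10, ("Cat_B", "used_files"): 5}
manual_run = {
    ("Cat_A", "used_files"): 10,
    ("Cat_B", "used_files"): 6,
    ("Cat_C", "used_files"): 1,
}

run_diffs = compare_runs(previous_run, manual_run)

if __name__ == "__main__":
    for key, (old, new) in run_diffs.items():
        print(key, old, new)
```

An empty result would suggest the new il_target_id based logic reproduces the previous output. Some drift may still be expected if the two runs read different snapshots, so a small rel_tol could be reasonable.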