This is a general tracking task to make sure that the work being done to normalize the MediaWiki links tables (T300222) is appropriately reflected in HDFS, and that folks know when to update scripts and other code that depend on these tables. That task and the associated communication suggest the migration will be gradual, with various stages of overlap, but I'm not sure of all the actions we need to take or the options we have.
- Decide on a timeline for making the changes. It's not fully clear to me whether we have to track the MediaWiki change timeline directly, or whether there are windows where the data is duplicated and the old tables remain usable.
- Update the sqooping of the SQL tables into HDFS to account for the new schemas, including importing the new `linktarget` table (see the first sketch after this list).
- Update any scripts that depend on the links tables -- e.g., more "official" jobs like ClickstreamBuilder.scala, but also ad-hoc scripts, dashboards, etc. (see the second sketch below).
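
As a rough sketch of the sqoop-side change, assuming the design described in T300222 (link tables drop their `*_namespace` / `*_title` columns in favor of a `*_target_id` pointing at a new `linktarget(lt_id, lt_namespace, lt_title)` table): once both tables are sqooped, the old denormalized shape could be re-derived in Spark as a compatibility view, which would let downstream consumers migrate on their own schedule during any overlap window. The table names here (`wmf_raw.mediawiki_pagelinks`, `wmf_raw.mediawiki_linktarget`) are placeholders, not the real locations.

```scala
import org.apache.spark.sql.SparkSession

object PagelinksCompatView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pagelinks-compat").getOrCreate()

    // Placeholder table names; substitute the actual sqooped tables/paths.
    val pagelinks  = spark.read.table("wmf_raw.mediawiki_pagelinks")
    val linktarget = spark.read.table("wmf_raw.mediawiki_linktarget")

    // Re-derive the old denormalized (pl_from, pl_namespace, pl_title) shape
    // by joining through the new linktarget table.
    val compat = pagelinks
      .join(linktarget, pagelinks("pl_target_id") === linktarget("lt_id"))
      .select(
        pagelinks("pl_from"),
        linktarget("lt_namespace").as("pl_namespace"),
        linktarget("lt_title").as("pl_title")
      )

    compat.createOrReplaceTempView("pagelinks_compat")
  }
}
```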
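And a hedged before/after of the kind of join a consumer would need to change. This is illustrative only, not the actual ClickstreamBuilder.scala logic: it assumes the job resolves link targets to page IDs by joining against the `page` table.

```scala
import org.apache.spark.sql.DataFrame

object LinkResolution {
  // Old shape: pagelinks carried the target namespace/title directly.
  def resolveLinksOld(pagelinks: DataFrame, page: DataFrame): DataFrame =
    pagelinks.join(
      page,
      pagelinks("pl_namespace") === page("page_namespace") &&
        pagelinks("pl_title") === page("page_title")
    ).select(pagelinks("pl_from"), page("page_id").as("pl_to"))

  // New shape: hop through linktarget first, then resolve to page.
  def resolveLinksNew(pagelinks: DataFrame, linktarget: DataFrame, page: DataFrame): DataFrame =
    pagelinks
      .join(linktarget, pagelinks("pl_target_id") === linktarget("lt_id"))
      .join(
        page,
        linktarget("lt_namespace") === page("page_namespace") &&
          linktarget("lt_title") === page("page_title")
      )
      .select(pagelinks("pl_from"), page("page_id").as("pl_to"))
}
```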