
Update HDFS links tables as Mediawiki changes
Open, Needs Triage, Public

Description

This is a general task to make sure that the work being done to normalize the MediaWiki links tables (T300222) is appropriately reflected in HDFS, and that folks know when to update scripts etc. that depend on these tables. That task and the associated communication suggest that these updates will be slow, with various overlapping stages, but I'm not sure of all the necessary actions / options we have.

Tasks

  • Decide on a timeline for making the changes. It's not fully clear to me whether we have to adhere directly to the MediaWiki change timeline, or whether there will be periods where the data is duplicated and the old tables remain viable.
  • Update the sqooping of SQL tables into HDFS to account for the new schemas
  • Update any scripts that depend on the links tables -- e.g., more "official" jobs like ClickstreamBuilder.scala, but also ad-hoc scripts / dashboards / etc.
  • ...others?
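For the sqoop update, the core change is that the links tables no longer carry the target namespace/title directly; as I understand the T300222 design, each templatelinks row gets a tl_target_id foreign key into a new linktarget table (lt_id, lt_namespace, lt_title). A minimal sketch of what reconstructing the old denormalized rows would look like, using an in-memory sqlite3 toy rather than the real replicas (column names should be checked against the actual production DDL):

```python
import sqlite3

# Toy in-memory copy of the (assumed) post-migration schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE linktarget "
            "(lt_id INTEGER PRIMARY KEY, lt_namespace INTEGER, lt_title TEXT)")
cur.execute("CREATE TABLE templatelinks (tl_from INTEGER, tl_target_id INTEGER)")
cur.executemany("INSERT INTO linktarget VALUES (?, ?, ?)",
                [(1, 10, "Infobox"), (2, 10, "Citation_needed")])
cur.executemany("INSERT INTO templatelinks VALUES (?, ?)",
                [(100, 1), (101, 1), (102, 2)])

# Equivalent of the old denormalized (tl_from, tl_namespace, tl_title) rows:
rows = cur.execute("""
    SELECT tl.tl_from, lt.lt_namespace, lt.lt_title
    FROM templatelinks tl
    JOIN linktarget lt ON lt.lt_id = tl.tl_target_id
    ORDER BY tl.tl_from
""").fetchall()
print(rows)  # [(100, 10, 'Infobox'), (101, 10, 'Infobox'), (102, 10, 'Citation_needed')]
```

A sqoop job could either import both tables separately and leave the join to consumers, or materialize this join once on import to keep existing downstream queries working.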

Event Timeline

I created this initial task as a starting place but really have no idea what the scope of work is for this so please edit / claim ownership boldly :)

Let me know if I can help with anything. One note: I hope these changes will make computation easier. For example, if you're building the list of most-used templates, you can do all the work with numeric ids first and only translate the final result from ids to strings at the end. That would reduce memory usage and computation load drastically.
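The id-first approach described above can be sketched in pure Python with toy data (the names here are illustrative, not the real job; the final dict lookup stands in for a join against the new linktarget table):

```python
from collections import Counter

# Toy templatelinks rows after normalization: (tl_from, tl_target_id).
templatelinks = [(100, 1), (101, 1), (102, 2), (103, 1), (104, 2), (105, 3)]

# Aggregate on the compact numeric id only -- no strings in the hot loop.
usage = Counter(target_id for _, target_id in templatelinks)
top = usage.most_common(2)  # the two most-used templates, by id

# Translate only the final few ids to titles at the very end.
id_to_title = {1: "Infobox", 2: "Citation_needed", 3: "Coord"}
result = [(id_to_title[tid], n) for tid, n in top]
print(result)  # [('Infobox', 3), ('Citation_needed', 2)]
```

At wiki scale the same shape applies: shuffle and count 8-byte ids instead of page-title strings, and resolve titles only for the small final result set.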

Thank you @Isaac for creating this task!
I have reviewed the planned change and it will indeed impact our sqooping/usage of the various links tables.
I've added myself as a subscriber to the migration task to follow the implementation. Would you have any details on the implementation timeline, @Ladsgroup?
I'll start planning the needed changes on the Data-engineering side early next week.

I've added myself as a subscriber to the migration task to follow the implementation. Would you have any details on the implementation timeline, @Ladsgroup?

It's a bit hard to say because it depends on the wiki. For some wikis it hasn't started yet, and for some it's already finished (it's going to write to both the old and new columns for a while, though). For some wikis it took an hour to finish; for some it will take months.

You can easily check if it's finished or not by a query like:

select * from templatelinks where tl_target_id is NULL limit 1;

And if it returns anything, it means it's not done yet.
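As a self-contained illustration of that probe (an in-memory sqlite3 toy rather than the production replicas), mid-migration rows have a NULL tl_target_id, and the check goes quiet once they've all been backfilled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE templatelinks (tl_from INTEGER, tl_target_id INTEGER)")

# Mid-migration state: one row not yet backfilled with a target id.
cur.executemany("INSERT INTO templatelinks VALUES (?, ?)", [(100, 1), (101, None)])
probe = "SELECT * FROM templatelinks WHERE tl_target_id IS NULL LIMIT 1"
print("done" if cur.execute(probe).fetchone() is None else "not done yet")
# -> not done yet

# After backfill, the same probe returns nothing.
cur.execute("UPDATE templatelinks SET tl_target_id = 2 WHERE tl_target_id IS NULL")
print("done" if cur.execute(probe).fetchone() is None else "not done yet")
# -> done
```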

Thank you @Ladsgroup for the update.
We need to prioritize the change in DE soon :)

Ping @EChetty - This task needs to be prioritized by the team - the SQL change is already happening and will impact some of our jobs.

@EChetty moving this to the DE planning backlog because:

This task needs to be prioritized by the team - the SQL change is already happening and will impact some of our jobs.

Thanks @JAllemandou