
Add normalized_host.project_family and deprecate and remove normalized_host.project_class
Closed, Resolved | Public | 5 Estimated Story Points


We think project_family is a more user-friendly name than project_class.
We started using it in unique_devices_per_project_family jobs.
This task is about updating the normalized_host structure in webrequest to reflect the change and be coherent.

Event Timeline

JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 362159 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Rename project_class to project_family

Change 362160 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add project_family to webrequest normalized_host

How to deploy (a bit tricky...):

  1. Deploy
    • Release a new refinery-source version containing the first patch above. It should be v0.0.49; if not, update the second patch with the correct version number.
    • Update refinery with v0.0.49 jars, and deploy it with the second patch above (on HDFS as well). Now, code is ready for runtime.
  2. Runtime
    • Stop the webrequest-load bundle. Before stopping it, note the time from which it will need to be restarted (the earliest hour not already finished in any of the 3 running coordinators).
    • Wait for analytics-related oozie jobs to drain in the cluster (pageview, projectview, cassandra, etc.).
    • Once no job is using the webrequest table any more, launch hive as the hdfs user: sudo -u hdfs hive
use wmf;
drop table webrequest;

-- AS IN refinery/hive/webrequest/create_webrequest_table.hql
create table webrequest ..... ;

-- Reload existing data
SET hive.mapred.mode = nonstrict;

The last step is to make hive recreate the existing partitions of the webrequest table from the existing folder hierarchy.
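The partition-recreation command itself is not shown above. A minimal sketch of one way to do it, assuming the partition folders follow Hive's key=value naming so that MSCK REPAIR TABLE can discover them (this is an assumption, not necessarily the command originally intended):

```sql
-- Sketch only: assumes partition directories are named key=value,
-- so the metastore can rediscover them from the folder hierarchy.
USE wmf;
SET hive.mapred.mode = nonstrict;

-- Scan the table location and register any partition found on disk
-- that is missing from the metastore.
MSCK REPAIR TABLE webrequest;

-- Sanity check: list the partitions that are now registered.
SHOW PARTITIONS webrequest;
```

On a table with many partitions the repair can take a while, so it is worth running before the bundle is restarted.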

  • You can now restart the webrequest-load oozie bundle, using as start time the time you marked when you killed the previous bundle.
  • Check that the load of the small misc partition succeeds.
  • You can also check in hive whether normalized_host.project_family has a value for newly computed partitions (for previously computed ones it is NULL).
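The check in the last bullet could look roughly like this. It is a sketch: the partition column names (webrequest_source, year, month, day, hour) and the date are assumptions for illustration and should be replaced with a partition computed after the restart.

```sql
-- Sketch only: partition column names and the date are placeholders.
-- Counts requests per project_family for one newly computed hour.
SELECT
  normalized_host.project_family AS project_family,
  COUNT(1) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'misc'
  AND year = 2017 AND month = 7 AND day = 6 AND hour = 12
GROUP BY normalized_host.project_family
ORDER BY requests DESC
LIMIT 20;
```

Running the same query against an hour computed before the deployment should return only NULL for project_family.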

I think that's it :)

JAllemandou set the point value for this task to 5. (Jul 6 2017, 8:32 AM)

Hm, I don't think you need to drop the table for this.

ALTER TABLE webrequest CHANGE COLUMN `normalized_host` `normalized_host` struct<project_class: string, project_family: string, project: string, qualifiers: array<string>, tld: string> COMMENT 'struct containing project_family (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m and/or zero) and tld (org most often)';

Or something like that :)
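If the ALTER TABLE route is taken, the new column definition can be inspected afterwards. A minimal sketch, assuming the table lives in the wmf database as in the deploy notes above:

```sql
-- Sketch only: inspect the table definition after the ALTER TABLE.
USE wmf;

-- normalized_host should now list project_family in its struct type.
DESCRIBE webrequest;

-- Full DDL, including the column comment.
SHOW CREATE TABLE webrequest;
```

Note that by default ALTER TABLE ... CHANGE COLUMN updates only the table-level metadata; whether existing partitions also need their metadata updated is something to verify before relying on this approach.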

Change 362160 merged by Joal:
[analytics/refinery@master] Add project_family to webrequest normalized_host

Change 362159 merged by jenkins-bot:
[analytics/refinery/source@master] Rename project_class to project_family