We think project_family is a more user-friendly name than project_class.
We started using it in unique_devices_per_project_family jobs.
This task is about updating the normalized_host structure in webrequest to reflect the change and be coherent.
Description
Details
Event Timeline
Comment Actions
Change 362159 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Rename project_class to project_family
Comment Actions
Change 362160 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add project_family to webrequest normalized_host
Comment Actions
How to deploy (a bit tricky...):
- Deploy
- Release a new refinery-source version containing the first patch above. It should be v0.0.49; if it is not, update the second patch with the correct version number.
- Update refinery with v0.0.49 jars, and deploy it with the second patch above (on HDFS as well). Now, code is ready for runtime.
- Runtime
- Stop the webrequest-load bundle. Before stopping it, note the time from which it will need to be restarted (the oldest hour not yet finished in any of the 3 running coordinators).
- Wait for analytics-related oozie jobs to drain in the cluster (pageview, projectview, cassandra etc)
- Once no job is using the webrequest table any more, launch hive as the hdfs user: sudo -u hdfs hive
    use wmf;
    drop table webrequest;
    -- As in refinery/hive/webrequest/create_webrequest_table.hql
    create table webrequest ..... ;
    -- Reload existing data
    SET hive.mapred.mode = nonstrict;
    MSCK REPAIR TABLE webrequest;
The last statement (MSCK REPAIR TABLE) makes hive recreate the existing partitions of the webrequest table from the existing folder hierarchy.
- You can now restart the webrequest-load oozie bundle, using as start-time the time you noted when you killed the previous bundle.
- Check that the load of a small partition (misc) succeeds.
- You can also check in hive that normalized_host.project_family has a value for newly computed partitions (for previously computed ones it is NULL, since the field did not exist when they were written).
I think that's it :)
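To make the "time at which it needs to be restarted" step concrete, here is a minimal sketch in Python. It assumes the restart time is the earliest hour not yet finished across the running coordinators. The coordinator names, the timestamps and the output format are hypothetical, made up for illustration. They are not taken from the cluster or from oozie output.

```python
from datetime import datetime

# Hypothetical values: for each running coordinator, the nominal time of the
# earliest hour it has NOT finished yet. Names and times are illustrative only.
first_unfinished = {
    "coordinator-a": datetime(2017, 7, 1, 14),
    "coordinator-b": datetime(2017, 7, 1, 13),
    "coordinator-c": datetime(2017, 7, 1, 15),
}


def restart_time(first_unfinished_by_coordinator):
    """Return the earliest not-yet-finished hour across all coordinators.

    Restarting the bundle from this time means no coordinator skips an hour;
    coordinators that were further ahead recompute a few hours.
    """
    if not first_unfinished_by_coordinator:
        raise ValueError("need at least one coordinator")
    return min(first_unfinished_by_coordinator.values())


start = restart_time(first_unfinished)
# Assumed timestamp format; adapt it to whatever the bundle properties expect.
print(start.strftime("%Y-%m-%dT%H:%MZ"))
```

Taking the minimum rather than the maximum is the conservative choice: some hours may be recomputed, but none is skipped.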
Comment Actions
Hm, I don't think you need to drop the table for this.
ALTER TABLE webrequest CHANGE COLUMN `normalized_host` `normalized_host` struct<project_class: string, project_family: string, project: string, qualifiers: array<string>, tld: string> COMMENT 'struct containing project_family (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m and/or zero) and tld (org most often)';
Or something like that :)
Comment Actions
Change 362160 merged by Joal:
[analytics/refinery@master] Add project_family to webrequest normalized_host
Comment Actions
Change 362159 merged by jenkins-bot:
[analytics/refinery/source@master] Rename project_class to project_family