Hive code to count global unique devices per top domain (like *.wikipedia.org). Initial work will be just quality checking, to make sure global counts and per-site counts are in agreement. Once that vetting is done, we will need to calculate global counts per domain and study what percentage of the total number the offset/estimate represents.
Things to do for deploy:
- Stop the currently running Oozie unique_devices_project_wide jobs (daily, monthly and daily-druid)
- Hive:
  - Create unique_devices_project_wide Hive tables (daily and monthly)
  - Move already computed time-partitioned folder structure from /user/joal/wmf/data/wmf/unique_devices/project_wide/ to /wmf/data/wmf/unique_devices/project_wide/
  - Run MSCK repair on both daily and monthly prod tables
  - Drop tables in joal database
- Archives: Move existing project-wide archives to /user/joal; they are not yet ready for external visibility
- Restart Oozie jobs with production settings and last-run dates, except for daily-druid, which needs to be fully re-run (bug in previous run)
- Don't forget to set the archive folder to /user/joal
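
The HDFS and Hive steps of the checklist above could be scripted roughly as follows. This is only a dry-run sketch, not the actual deploy procedure: the table names (unique_devices_project_wide_daily / _monthly), the prod database name "wmf", and the use of "joal" as the personal database name are assumptions, as is the run helper, which prints commands instead of executing them unless DRY_RUN=0 is set. It also assumes the prod tables have already been created.

```shell
#!/bin/sh
# Dry-run sketch of the HDFS/Hive part of the deploy checklist.
# ASSUMPTIONS (not confirmed by the notes above): table names,
# the "wmf" prod database, and the "joal" personal database.

SRC=/user/joal/wmf/data/wmf/unique_devices/project_wide
DST=/wmf/data/wmf/unique_devices/project_wide

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Move the already computed partition folders into the prod location.
# The glob is quoted so that HDFS, not the local shell, expands it;
# this assumes the destination directory already exists.
run hdfs dfs -mv "$SRC/*" "$DST/"

for granularity in daily monthly; do
    table="unique_devices_project_wide_${granularity}"
    # Register the moved partitions in the prod table.
    run hive -e "USE wmf; MSCK REPAIR TABLE ${table};"
    # Drop the temporary table from the personal database.
    run hive -e "DROP TABLE IF EXISTS joal.${table};"
done
```

Running the script as-is only prints the commands, which makes it easy to review paths and table names before executing anything with DRY_RUN=0.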