|analytics/refinery||master||+27 -5||Fix monthly insert and publish query|
|analytics/refinery||master||+2 -1||Correct overcounting of namespace zero editors|
|Resolved||Milimetric||T131280 Make aggregate data on editors per country per wiki publicly available|
|Resolved||Milimetric||T237072 Correct namespace zero editor counts on geoeditors_monthly table on hive and druid|
The code to calculate editors_per_country is over-counting namespace_zero editors since inception.
We need to:
- rerun data and druid indexation for the last 3 months.
- empty column for zero_namespace_editors for the 1-4 bucket rest of dataset (we will need to copy data aside, remove data on column namespace_zero_edit_count and put data back)
- rerun jobs for public data files
@Ijon, this is only an issue for the 1 to 4 activity level, but it raises another question we should've asked before:
We have editor counts for overall edits, and for just namespace zero edits, at the 5 to 99 and 100+ activity levels. The ns0 numbers are a bit lower, some editors only edit on other namespaces. Do you think we should publish just the ns0 counts or the overall?
I think @Asaf's o riginal request never included disclosing ns0 editors separately. Can we consider just releasing overall tally of editors without split per namespace? Meaning: we will be reporting "distinct_editors" and it is understood that that number includes editors that did not edit on the namespace zero.
Change 547689 abandoned by Nuria:
Correct overcounting of namespace zero editors
in favor of https://gerrit.wikimedia.org/r/#/c/analytics/refinery/ /547707
Plan to correct, with draft of scripts needed (EDITED using feedback below):
- null out namespace zero distinct editors for the 1 to 4 activity level for existing data
hdfs dfs -mv /wmf/data/wmf/mediawiki_private/geoeditors_monthly /wmf/data/wmf/mediawiki_private/geoeditors_monthly_backup # https://github.com/wikimedia/analytics-refinery/blob/master/hive/mediawiki/private/create_geoeditors_monthly_table.hql create table milimetric.geoeditors_monthly ... location 'hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private/geoeditors_monthly_backup' # start fresh with partitions, some will be added below, the last 2 months will be added by the job drop table wmf.geoeditors_monthly; create table (using above script) # for each month up to and including 2019-08 (there's no way to do this automatically right?) insert overwrite table wmf.geoeditors_monthly partition (month='YYYY-MM') select wiki_db, country_code, users_are_anonymous, activity_level, distinct_editors, case when activity_level in ('5 to 99', '100 or more') then namespace_zero_distinct_editors else null end as namespace_zero_distinct_editors from milimetric.geoeditors_monthly where month='YYYY-MM'; # write new success files up to 2019-08 sudo -u analytics hdfs dfs -put _SUCCESS /wmf/data/wmf/mediawiki_private/geoeditors_monthly/month=YYYY-MM/ # vet data... (should discuss how to do this as a team, I'm personally done pretending I know how to vet data) hdfs dfs -rm -r /wmf/data/wmf/mediawiki_private/geoeditors_monthly_backup
- deploy and launch the corrected monthly job to insert correct namespace_zero_distinct_editors numbers
- We'll publish overall numbers and if Asaf wants ns0 only, we can switch to that later.
- Deploy and re-run the bucketed job with the fix to ignore 0-valued distinct editors
- rerun Druid indexing coordinator (no deploy needed, this just copies the data from the monthly table, so that needs to be finished first)
Great plan @Milimetric! One precision: We only have geoeditors_daily up to 2019-09, so we can backfill from there but no more, and we need to nullify up to 2019-08.
hdfs dfs -ls /wmf/data/wmf/mediawiki_private/geoeditors_daily Found 1 items drwxr-x--- - analytics analytics-privatedata-users 0 2019-10-02 08:44 /wmf/data/wmf/mediawiki_private/geoeditors_daily/month=2019-09
Side note: This retention period feels small!! Is the deletion scheme deleting more than expected?
This retention period feels small!! Is the deletion scheme deleting more than expected?
Hehe, it feels small indeed.
I checked though, and it makes sense considering what we defined.
We didn't want to keep more than 90 days at any point.
So given that the data set is monthly and we have to either keep or delete whole months,
we chose to delete any month which has 1 day (or more) outside the retention period.
In this specific case, on 2019-11-01 the deletion script run and deleted the data for 2019-08, because already part of the first day of August was outside retention period, right?
As the mediawiki history train is long, the data for 2019-10 didn't land until a couple days after.
That's why you only saw one month of data there (the month of November will only land in December).
- nullified ns0 distinct editors in old data
- re-ran monthly insert with new logic, data will be good going forward
- bucketed data regenerated, will be synced soon
- bucketed data vetted in http://localhost:8000/user/milimetric/notebooks/Vet%20Geoeditors%20Bucketed.ipynb on notebook1003
- reindex druid https://hue.wikimedia.org/oozie/list_oozie_coordinator/0016013-190918123808661-oozie-oozi-C/
- delete backup