
Correct namespace zero editor counts on geoeditors_monthly table on hive and druid
Closed, Resolved · Public · 8 Estimated Story Points

Event Timeline

Change 547689 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery@master] Correct overcounting of namespace zero editors

https://gerrit.wikimedia.org/r/547689

The code that calculates editors_per_country has been over-counting namespace_zero editors since inception.

We need to:

  • re-run the data and the Druid indexation for the last 3 months.
  • empty the zero_namespace_editors column for the 1-4 bucket for the rest of the dataset (we will need to copy the data aside, remove the data in the namespace_zero_edit_count column, and put the data back).
  • re-run the jobs for the public data files.
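
As a first vetting step, one possible symptom of the over-counting described above is rows where the namespace-zero count exceeds the overall count. A rough sanity-check query (a sketch; it assumes the 1 to 4 bucket is stored as the literal '1 to 4' and that the column names match the plan further down):

# count rows where the ns0 figure is impossibly higher than the overall figure
select month, count(*) as suspicious_rows
  from wmf.geoeditors_monthly
 where activity_level = '1 to 4'
   and namespace_zero_distinct_editors > distinct_editors
 group by month
 order by month;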
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

@Ijon, this is only an issue for the 1 to 4 activity level, but it raises another question we should've asked before:

We have editor counts for overall edits, and for just namespace zero edits, at the 5 to 99 and 100+ activity levels. The ns0 numbers are a bit lower, since some editors only edit in other namespaces. Do you think we should publish just the ns0 counts or the overall ones?
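
To get a feel for the size of that gap, something like the following could be run (a sketch; the month value is a placeholder, and summing per-row distinct counts only approximates a true distinct total):

# compare overall vs. ns0 editor counts at the two higher activity levels
select activity_level,
       sum(distinct_editors)                as editors_all_namespaces,
       sum(namespace_zero_distinct_editors) as editors_namespace_zero
  from wmf.geoeditors_monthly
 where month = '2019-09'  -- placeholder month
   and activity_level in ('5 to 99', '100 or more')
 group by activity_level;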

Change 547707 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Fix monthly insert and publish query

https://gerrit.wikimedia.org/r/547707

I think @Asaf's original request never included disclosing ns0 editors separately. Can we consider just releasing an overall tally of editors without a split per namespace? Meaning: we would report "distinct_editors", with the understanding that that number includes editors who did not edit in namespace zero.

Change 547689 abandoned by Nuria:
Correct overcounting of namespace zero editors

Reason:
in favor of https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/547707

https://gerrit.wikimedia.org/r/547689

This comment has been deleted.

Plan to correct, with a draft of the scripts needed (EDITED using feedback below):

  • null out namespace zero distinct editors for the 1 to 4 activity level for existing data
hdfs dfs -mv /wmf/data/wmf/mediawiki_private/geoeditors_monthly /wmf/data/wmf/mediawiki_private/geoeditors_monthly_backup

# https://github.com/wikimedia/analytics-refinery/blob/master/hive/mediawiki/private/create_geoeditors_monthly_table.hql
create table milimetric.geoeditors_monthly ... location 'hdfs://analytics-hadoop/wmf/data/wmf/mediawiki_private/geoeditors_monthly_backup'

# start fresh with partitions, some will be added below, the last 2 months will be added by the job
drop table wmf.geoeditors_monthly;
create table (using above script)

# for each month up to and including 2019-08 (there's no way to do this automatically right? see the loop sketch after this plan)
  insert overwrite table wmf.geoeditors_monthly partition (month='YYYY-MM')
  select wiki_db, country_code, users_are_anonymous, activity_level, distinct_editors,
         case when activity_level in ('5 to 99', '100 or more')
              then namespace_zero_distinct_editors
              else null end as namespace_zero_distinct_editors
    from milimetric.geoeditors_monthly
   where month='YYYY-MM';

# write new success files up to 2019-08
sudo -u analytics hdfs dfs -put _SUCCESS /wmf/data/wmf/mediawiki_private/geoeditors_monthly/month=YYYY-MM/

# vet data... (should discuss how to do this as a team, I'm personally done pretending I know how to vet data)
hdfs dfs -rm -r /wmf/data/wmf/mediawiki_private/geoeditors_monthly_backup
  • deploy and launch the corrected monthly job to insert correct namespace_zero_distinct_editors numbers
  • We'll publish overall numbers, and if Asaf wants ns0 only, we can switch to that later.
  • Deploy and re-run the bucketed job with the fix to ignore 0-valued distinct editors
  • re-run the Druid indexing coordinator (no deploy needed; this just copies the data from the monthly table, so that needs to be finished first)
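
On the "can this be automated" question in the per-month insert step above: a small shell loop over the backup table's partitions could drive it. This is only a sketch (untested); it assumes the hive CLI is available on the launcher host and that the partitions keep the month=YYYY-MM naming.

# derive the month list from the backup table's partitions and re-insert each one
for p in $(hive -e "show partitions milimetric.geoeditors_monthly"); do
  m=${p#month=}
  # months after 2019-08 will be backfilled by the fixed job instead
  [[ "$m" > "2019-08" ]] && continue
  hive -e "
    insert overwrite table wmf.geoeditors_monthly partition (month='${m}')
    select wiki_db, country_code, users_are_anonymous, activity_level, distinct_editors,
           case when activity_level in ('5 to 99', '100 or more')
                then namespace_zero_distinct_editors
                else null end as namespace_zero_distinct_editors
      from milimetric.geoeditors_monthly
     where month='${m}';
  "
done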

Correcting the above: we only need to empty the zero_namespace_editors column for the 1-4 bucket; the other two buckets are correct.

Great plan @Milimetric! One clarification: we only have geoeditors_daily up to 2019-09, so we can backfill from there but no further, and we need to nullify up to 2019-08.

hdfs dfs -ls /wmf/data/wmf/mediawiki_private/geoeditors_daily
Found 1 items
drwxr-x---   - analytics analytics-privatedata-users          0 2019-10-02 08:44 /wmf/data/wmf/mediawiki_private/geoeditors_daily/month=2019-09

Side note: This retention period feels small!! Is the deletion scheme deleting more than expected?

We do not need to delete the Druid datasource; re-indexing should be sufficient. Let's amend the plan accordingly.

@JAllemandou

This retention period feels small!! Is the deletion scheme deleting more than expected?

Hehe, it feels small indeed.
I checked though, and it makes sense considering what we defined.
We didn't want to keep more than 90 days at any point.
So given that the data set is monthly and we have to either keep or delete whole months,
we chose to delete any month which has 1 day (or more) outside the retention period.

In this specific case, on 2019-11-01 the deletion script ran and deleted the data for 2019-08, because part of the first day of August was already outside the retention period, right?
As the mediawiki history train is long, the data for 2019-10 didn't land until a couple of days later.
That's why you only saw one month of data there (the month of November will only land in December).
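
For reference, 2019-08-01 to 2019-11-01 is 92 days, which is outside the 90-day window. A quick way to check that arithmetic (assuming GNU date):

# days between the first day of the deleted month and the deletion run
echo $(( ( $(date -d 2019-11-01 +%s) - $(date -d 2019-08-01 +%s) ) / 86400 ))
# prints 92, i.e. more than 90 days, so the whole 2019-08 partition gets dropped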

Makes sense - Thanks for the explanation @mforns :)

Change 547707 merged by Milimetric:
[analytics/refinery@master] Fix monthly insert and publish query

https://gerrit.wikimedia.org/r/547707

Status update:

TODO:

It's true that I never requested ns0 counts separately. Count of all namespaces is fine.

Nuria set the point value for this task to 8.