Page MenuHomePhabricator

Add dimensions to editors_daily dataset
Closed, DeclinedPublic

Description

Product team want to add a editors monthly table in Druid available to use in Superset/Turnilo with the following dimensions. Since the monthly aggregations are compiled from editors_daily dataset, we want to add following dimensions to editors_daily dataset.

columndata typeDescription
platformstringAccess method (iOS, Android, Mobile web, Desktop)
user_registration_timestampstring
talkpage_edit_countbigintThe total count of edits to talk pages for this grouping
userpage_edit_countbigintThe total count of edits to user pages for this grouping

This task is a result of T245049

Event Timeline

cchen triaged this task as Medium priority.Jun 22 2020, 7:40 PM
cchen created this task.
cchen moved this task from Triage to Tracking on the Product-Analytics board.

Change 607361 had a related patch set uploaded (by Conniecc1; owner: Conniecc1):
[analytics/refinery@master] Product team want to add a editors monthly table in Druid available to use in Superset/Turnilo. Since the monthly aggregations are compiled from editors_daily dataset, we want to add following dimensions to editors_daily dataset.

https://gerrit.wikimedia.org/r/607361

Milimetric raised the priority of this task from Medium to High.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria subscribed.

Update: we will be getting to this next quarter. Assigning to @JAllemandou as background task

Starting to work on this.
I have exchanged with @Milimetric about the platform field making the number of editors non-additive.

@cchen I know you know that, but I'm still wondering if presenting the data in druid is a good idea given it's non-additive: tools such as turnilo use aggregations by default and can lead to incorrect results because of wrong usage.
Happy to be told that you'll handle that, but I still prefer to confirm :)
Thanks

I think Connie's plan was to create dashboards for folks on top of superset, taking care to present the metrics properly. I also understood that some PMs don't mind non-additive reports but I'd love to hear more about that once they actually use it. This is the main reason we decided for a separate table, so we can experiment and adjust more easily without other dependencies.

@JAllemandou @Milimetric thanks for confirming this with me. I confirm that PMs are aware the data related to platforms are non-additive and they are interested to see editor counts in by platforms. We want to start experiment with creating some test aggregation tables and making sample charts and dashboards in Superset using Presto.

Thanks @cchen - Let's make the CR move for data to appear from next month :)

Hi @cchen - I have commented on your CR last week and wanted to be sure you noticed :)

Ping @cchen on the code review - The CR moves slowly, is that ok for you @cchen or should we try to find a better coordination?

hi @JAllemandou, sorry for the delay. I thought my last edit was saved, but it's not published...i resubmitted the change.

No problem @cchen - It's a shame if the patch stays stale while needed :)

Hi @cchen - Let's organize a meeting to try to debug this issue?

hi @JAllemandou , i put some time in you calendar. feel free to move it!

Ok I think we nailed it @cchen :)
I'm commenting on the CR and adding the result of our test below.

// Query with AND clause filtering out not null revisions
val dfbase = spark.sql("""
select cuc.wiki_db,
      --geocode(cuc_ip)['country_code'] as country_code,
      --network_origin(cuc_ip) as network_origin,
      coalesce(event_user_is_bot_by_historical, array()) as user_is_bot_by,
      cuc_type as action_type,
CASE
          WHEN array_contains(revision_tags, 'ios app edit') THEN 'iOS'
          WHEN array_contains(revision_tags, 'android app edit') THEN 'Android'
          WHEN array_contains(revision_tags, 'mobile app edit') THEN 'Mobile App'
          WHEN array_contains(revision_tags, 'mobile web edit') THEN 'Mobile web'
      ELSE 'Desktop'
      END AS platform,
      LEAST(
            COALESCE(event_user_registration_timestamp, event_user_creation_timestamp),
            COALESCE(event_user_creation_timestamp, event_user_registration_timestamp)
          ) as user_registration_timestamp, 
      if(cuc_user = 0, md5(concat(cuc_ip, cuc_agent)), cuc_user) as user_fingerprint_or_id,
      if(cuc_user = 0, 1, 0) as user_is_anonymous,
      if(cuc_namespace = 0, 1, 0) as page_is_namespace_zero,
      if(cuc_namespace % 2 = 1, 1, 0) as page_is_talk,
      if(cuc_namespace = 2, 1, 0) as page_is_user,

      concat(
          substring(cuc_timestamp, 0, 4), '-',
          substring(cuc_timestamp, 5, 2), '-',
          substring(cuc_timestamp, 7, 2)
      ) as date

 from wmf_raw.mediawiki_private_cu_changes cuc
          left join
      wmf.mediawiki_history mw on mw.snapshot = '2020-10'
                                    and mw.wiki_db = cuc.wiki_db
                                    and mw.revision_id = cuc.cuc_this_oldid
where cuc.month='2020-10'
  and (mw.revision_id IS NULL)
""")

dfbase.selectExpr("user_registration_timestamp IS NULL as empty_reg").groupBy("empty_reg").count().show(100, false)

/*
+---------+-------+                                                             
|empty_reg|count  |
+---------+-------+
|true     |7076332|
+---------+-------+
*/


// Query without the AND cluase keeping only null revisions
val dfnew = spark.sql("""
select cuc.wiki_db,
      --geocode(cuc_ip)['country_code'] as country_code,
      --network_origin(cuc_ip) as network_origin,
      coalesce(event_user_is_bot_by_historical, array()) as user_is_bot_by,
      cuc_type as action_type,
CASE
          WHEN array_contains(revision_tags, 'ios app edit') THEN 'iOS'
          WHEN array_contains(revision_tags, 'android app edit') THEN 'Android'
          WHEN array_contains(revision_tags, 'mobile app edit') THEN 'Mobile App'
          WHEN array_contains(revision_tags, 'mobile web edit') THEN 'Mobile web'
      ELSE 'Desktop'
      END AS platform,
      LEAST(
            COALESCE(event_user_registration_timestamp, event_user_creation_timestamp),
            COALESCE(event_user_creation_timestamp, event_user_registration_timestamp)
          ) as user_registration_timestamp, 
      if(cuc_user = 0, md5(concat(cuc_ip, cuc_agent)), cuc_user) as user_fingerprint_or_id,
      if(cuc_user = 0, 1, 0) as user_is_anonymous,
      if(cuc_namespace = 0, 1, 0) as page_is_namespace_zero,
      if(cuc_namespace % 2 = 1, 1, 0) as page_is_talk,
      if(cuc_namespace = 2, 1, 0) as page_is_user,

      concat(
          substring(cuc_timestamp, 0, 4), '-',
          substring(cuc_timestamp, 5, 2), '-',
          substring(cuc_timestamp, 7, 2)
      ) as date

 from wmf_raw.mediawiki_private_cu_changes cuc
          left join
      wmf.mediawiki_history mw on mw.snapshot = '2020-10'
                                    and mw.wiki_db = cuc.wiki_db
                                    and mw.revision_id = cuc.cuc_this_oldid
where cuc.month='2020-10'
""")

dfnew.selectExpr("user_registration_timestamp IS NULL as empty_reg").groupBy("empty_reg").count().show(100, false)

/*
+---------+--------+                                                            
|empty_reg|count   |
+---------+--------+
|true     |9947517 |
|false    |62049644|
+---------+--------+
*/

Without the filtering of non-null revisions the number of rows with a defined user-registration is 90%. Not perfect but better.

mpopov lowered the priority of this task from High to Medium.Sep 9 2021, 8:13 PM
mpopov added subscribers: Mayakp.wiki, Iflorez, mpopov.

Lowering the priority on this.

@cchen can you please touch on this in the transition of your dataset & high-level metrics work to @Mayakp.wiki and @Iflorez?

JAllemandou removed a project: Analytics-Kanban.
JAllemandou subscribed.

Removing myself as assignee as the task has no traction currently. Will revisit as needed.

Mayakp.wiki updated the task description. (Show Details)
Mayakp.wiki removed a subscriber: Nuria.

Moving to Backlog so that we re-prioritize this task and the parent Epic task T230092 : Data exploration capabilities in Superset and Turnilo for editing data at “editor” level
at our next monthly Backlog refinement meeting.

Similar to T307883, Im inclined towards declining this task since -

  1. they've been open for a while
  2. we are re-thinking our strategy for Contributors metric since it is now a core annual plan metric. @OSefu-WMF is there a task I can link here for the work you are doing?

There is a shadow worsktream going on with Marshall focused on Contributors metric measurement strategy. How to transition things from Product to the wider Movement? Marshall is also trying to find the right answer here.