Update editor_month generation to use change_tag table rather than tag_summary
Closed, ResolvedPublic

Description

As part of the grand change_tag refactoring (T185355), the tag_summary table is being dropped (T209525), so we need to change update_editor_month.sql to use change_tag.

Ideally, we'd skip this and go straight to calculating these metrics from the Data Lake (T173853), but change_tag is likely to be dropped soon, before change tags are added to the Data Lake (T161149).

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptNov 29 2018, 11:06 PM
Neil_P._Quinn_WMF claimed this task.
Neil_P._Quinn_WMF moved this task from Triage to Next Up on the Product-Analytics board.
Neil_P._Quinn_WMF added a subscriber: kzimmerman.

@kzimmerman I've pulled this into Next Up since we need to do this before calculating November's movement metrics (T210040).

@Neil_P._Quinn_WMF Hey, this should work for the first part, the second part would be similar:

select
  left(rev_timestamp, 6) as `rev_month`,
  rev_user as `local_user_id`,
  count(*) as `edits`,
  sum(page_namespace = 0 or cn.namespace is not null) as content_edits,
  sum(
    ctd_name = "mobile edit" and
    (ctd_name = "mobile web edit" or ctd_name != "mobile app edit")
  ) as mobile_web_edits,
  sum(ctd_name = "mobile app edit") as mobile_app_edits,
  sum(ctd_name = "visualeditor" and ctd_name != "visualeditor-wikitext") as visual_edits,
  sum(ctd_name = "visualeditor-wikitext") as ve_source_edits,
  0 as `deleted`
from revision
left join page on rev_page = page_id
left join change_tag on rev_id = ct_rev_id
left join change_tag_def on ct_tag_id = ctd_id
left join datasets.content_namespaces cn on database() = wiki and page_namespace = namespace
where rev_timestamp between "{start}" and "{end}"
group by left(rev_timestamp, 6), rev_user

I tested this in stat1005 and worked fine. I think it should be faster now because it doesn't use LIKE anymore.

select
  left(rev_timestamp, 6) as `rev_month`,
  rev_user as `local_user_id`,
  count(*) as `edits`,
  sum(page_namespace = 0) as content_edits,
  sum(
    ctd_name = "mobile edit" and
    (ctd_name = "mobile web edit" or ctd_name != "mobile app edit")
  ) as mobile_web_edits,
  sum(ctd_name = "mobile app edit") as mobile_app_edits,
  sum(ctd_name = "visualeditor" and ctd_name != "visualeditor-wikitext") as visual_edits,
  sum(ctd_name = "visualeditor-wikitext") as ve_source_edits,
  0 as `deleted`
from revision
left join page on rev_page = page_id
left join change_tag on rev_id = ct_rev_id
left join change_tag_def on ct_tag_id = ctd_id
where rev_timestamp between "20181201000000" and "20181210000000"
group by left(rev_timestamp, 6), rev_user;

Wow! Many, many thanks for jumping in to help out 👏

I will have to tweak this a bit, because joining revision to change_tag instead of tag_summary undermines the query's assumption that the user's overall row count is their overall edit count. But this is a excellent starting place!

Neil_P._Quinn_WMF moved this task from Tracking to Next Up on the Product-Analytics board.

I came up with a revised version of my editor-month query, but it turns out to have made the performance massively bad—the Wikidata query is still running after 25 hours! (My only defense is that I was trying to finish it up late on a Friday night 😛).

I'll continue to work on something more performant, which will probably involve overcoming my reluctance and just adding another layer of query nesting.

insert into staging.editor_month
select
  database() as wiki,
  str_to_date(concat(rev_month, "01"), "%Y%m%d") as month,
  local_user_id,
  ifnull(user_name, "") as user_name,
  ifnull(sum(edits), 0) as edits,
  ifnull(sum(content_edits), 0) as content_edits,
  ifnull(sum(edits * deleted), 0) as deleted_edits,
  ifnull(sum(mobile_web_edits), 0) as mobile_web_edits,
  ifnull(sum(mobile_app_edits), 0) as mobile_app_edits,
  ifnull(sum(visual_edits), 0) as visual_edits,
  ifnull(sum(ve_source_edits), 0) as ve_source_edits,
  if(ug_group = "bot" or ufg_group = "bot", 1, 0) as bot_flag,
  str_to_date(user_registration, "%Y%m%d%H%i%S") as user_registration
from
(
select
  left(rev_timestamp, 6) as `rev_month`,
  rev_user as `local_user_id`,
  count(*) as `edits`,
  sum(page_namespace = 0 or cn.namespace is not null) as content_edits,
  sum(
    ts_tags like "%mobile edit%" and
    (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%")
  ) as mobile_web_edits,
  sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
  sum(ts_tags like "%visualeditor%" and ts_tags not like "%visualeditor-wikitext%") as visual_edits,
  sum(ts_tags like "%visualeditor-wikitext%") as ve_source_edits,
  0 as `deleted`
from revision
left join page on rev_page = page_id
left join (
    select
        ct_rev_id as ts_rev_id,
        group_concat(ctd_name) as ts_tags
    from change_tag
    left join change_tag_def
    on ct_tag_id = ctd_id
    group by ts_rev_id
) tag_summary on rev_id = ts_rev_id
left join staging.content_namespaces cn on database() = wiki and page_namespace = namespace
where rev_timestamp between "{start}" and "{end}"
group by left(rev_timestamp, 6), rev_user

union all

select
  left(ar_timestamp, 6) as `rev_month`,
  ar_user as `local_user_id`,
  count(*) as `edits`,
  sum(ar_namespace = 0 or cn.namespace is not null) as content_edits,
  sum(
    ts_tags like "%mobile edit%" and
    (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%")
  ) as mobile_web_edits,
  sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
  sum(ts_tags like "%visualeditor%" and ts_tags not like "%visualeditor-wikitext%") as visual_edits,
  sum(ts_tags like "%visualeditor-wikitext%") as ve_source_edits,
  1 as `deleted`
from archive
left join (
    select
        ct_rev_id as ts_rev_id,
        group_concat(ctd_name) as ts_tags
    from change_tag
    left join change_tag_def
    on ct_tag_id = ctd_id
    group by ts_rev_id
) tag_summary on ar_rev_id = ts_rev_id
left join staging.content_namespaces cn on database() = wiki and ar_namespace = namespace
where ar_timestamp between "{start}" and "{end}"
group by left(ar_timestamp, 6), ar_user
) revs
left join user on local_user_id = user_id
left join user_groups on local_user_id = ug_user and ug_group = "bot"
left join user_former_groups on local_user_id = ufg_user and ufg_group = "bot"
group by month, local_user_id;
Neil_P._Quinn_WMF closed this task as Resolved.Tue, Jan 15, 11:19 PM

I've now got something that actually works! It's in the repo on GitHub.