Page MenuHomePhabricator

Update editor_month generation to use change_tag table rather than tag_summary
Closed, ResolvedPublic

Description

As part of the grand change_tag refactoring (T185355), the tag_summary table is being dropped (T209525), so we need to change update_editor_month.sql to use change_tag.

Ideally, we'd skip this and go straight to calculating these metrics from the Data Lake (T173853), but change_tag is likely to be dropped soon, before change tags are added to the Data Lake (T161149).

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptNov 29 2018, 11:06 PM
nshahquinn-wmf updated the task description. (Show Details)
nshahquinn-wmf moved this task from Triage to Next Up on the Product-Analytics board.
nshahquinn-wmf added a subscriber: kzimmerman.

@kzimmerman I've pulled this into Next Up since we need to do this before calculating November's movement metrics (T210040).

@Neil_P._Quinn_WMF Hey, this should work for the first part, the second part would be similar:

select
  left(rev_timestamp, 6) as `rev_month`,
  rev_user as `local_user_id`,
  count(*) as `edits`,
  sum(page_namespace = 0 or cn.namespace is not null) as content_edits,
  sum(
    ctd_name = "mobile edit" and
    (ctd_name = "mobile web edit" or ctd_name != "mobile app edit")
  ) as mobile_web_edits,
  sum(ctd_name = "mobile app edit") as mobile_app_edits,
  sum(ctd_name = "visualeditor" and ctd_name != "visualeditor-wikitext") as visual_edits,
  sum(ctd_name = "visualeditor-wikitext") as ve_source_edits,
  0 as `deleted`
from revision
left join page on rev_page = page_id
left join change_tag on rev_id = ct_rev_id
left join change_tag_def on ct_tag_id = ctd_id
left join datasets.content_namespaces cn on database() = wiki and page_namespace = namespace
where rev_timestamp between "{start}" and "{end}"
group by left(rev_timestamp, 6), rev_user

I tested this in stat1005 and worked fine. I think it should be faster now because it doesn't use LIKE anymore.

select
  left(rev_timestamp, 6) as `rev_month`,
  rev_user as `local_user_id`,
  count(*) as `edits`,
  sum(page_namespace = 0) as content_edits,
  sum(
    ctd_name = "mobile edit" and
    (ctd_name = "mobile web edit" or ctd_name != "mobile app edit")
  ) as mobile_web_edits,
  sum(ctd_name = "mobile app edit") as mobile_app_edits,
  sum(ctd_name = "visualeditor" and ctd_name != "visualeditor-wikitext") as visual_edits,
  sum(ctd_name = "visualeditor-wikitext") as ve_source_edits,
  0 as `deleted`
from revision
left join page on rev_page = page_id
left join change_tag on rev_id = ct_rev_id
left join change_tag_def on ct_tag_id = ctd_id
where rev_timestamp between "20181201000000" and "20181210000000"
group by left(rev_timestamp, 6), rev_user;

Wow! Many, many thanks for jumping in to help out 👏

I will have to tweak this a bit, because joining revision to change_tag instead of tag_summary undermines the query's assumption that the user's overall row count is their overall edit count. But this is a excellent starting place!

I came up with a revised version of my editor-month query, but it turns out to have made the performance massively bad—the Wikidata query is still running after 25 hours! (My only defense is that I was trying to finish it up late on a Friday night 😛).

I'll continue to work on something more performant, which will probably involve overcoming my reluctance and just adding another layer of query nesting.

insert into staging.editor_month
select
  database() as wiki,
  str_to_date(concat(rev_month, "01"), "%Y%m%d") as month,
  local_user_id,
  ifnull(user_name, "") as user_name,
  ifnull(sum(edits), 0) as edits,
  ifnull(sum(content_edits), 0) as content_edits,
  ifnull(sum(edits * deleted), 0) as deleted_edits,
  ifnull(sum(mobile_web_edits), 0) as mobile_web_edits,
  ifnull(sum(mobile_app_edits), 0) as mobile_app_edits,
  ifnull(sum(visual_edits), 0) as visual_edits,
  ifnull(sum(ve_source_edits), 0) as ve_source_edits,
  if(ug_group = "bot" or ufg_group = "bot", 1, 0) as bot_flag,
  str_to_date(user_registration, "%Y%m%d%H%i%S") as user_registration
from
(
select
  left(rev_timestamp, 6) as `rev_month`,
  rev_user as `local_user_id`,
  count(*) as `edits`,
  sum(page_namespace = 0 or cn.namespace is not null) as content_edits,
  sum(
    ts_tags like "%mobile edit%" and
    (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%")
  ) as mobile_web_edits,
  sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
  sum(ts_tags like "%visualeditor%" and ts_tags not like "%visualeditor-wikitext%") as visual_edits,
  sum(ts_tags like "%visualeditor-wikitext%") as ve_source_edits,
  0 as `deleted`
from revision
left join page on rev_page = page_id
left join (
    select
        ct_rev_id as ts_rev_id,
        group_concat(ctd_name) as ts_tags
    from change_tag
    left join change_tag_def
    on ct_tag_id = ctd_id
    group by ts_rev_id
) tag_summary on rev_id = ts_rev_id
left join staging.content_namespaces cn on database() = wiki and page_namespace = namespace
where rev_timestamp between "{start}" and "{end}"
group by left(rev_timestamp, 6), rev_user

union all

select
  left(ar_timestamp, 6) as `rev_month`,
  ar_user as `local_user_id`,
  count(*) as `edits`,
  sum(ar_namespace = 0 or cn.namespace is not null) as content_edits,
  sum(
    ts_tags like "%mobile edit%" and
    (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%")
  ) as mobile_web_edits,
  sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
  sum(ts_tags like "%visualeditor%" and ts_tags not like "%visualeditor-wikitext%") as visual_edits,
  sum(ts_tags like "%visualeditor-wikitext%") as ve_source_edits,
  1 as `deleted`
from archive
left join (
    select
        ct_rev_id as ts_rev_id,
        group_concat(ctd_name) as ts_tags
    from change_tag
    left join change_tag_def
    on ct_tag_id = ctd_id
    group by ts_rev_id
) tag_summary on ar_rev_id = ts_rev_id
left join staging.content_namespaces cn on database() = wiki and ar_namespace = namespace
where ar_timestamp between "{start}" and "{end}"
group by left(ar_timestamp, 6), ar_user
) revs
left join user on local_user_id = user_id
left join user_groups on local_user_id = ug_user and ug_group = "bot"
left join user_former_groups on local_user_id = ufg_user and ufg_group = "bot"
group by month, local_user_id;

I've now got something that actually works! It's in the repo on GitHub.