Page MenuHomePhabricator

Regenerate editor month table
Closed, ResolvedPublic

Description

This means essentially repeating T138894 with amendments to fix the following issues:

  • Null usernames were imported as the string "NULL" and null datetimes as the datetime 0000-00-00 00:00:00.
    • This is a limitation of the batch output mode of the mysql command line client, which doesn't have an option to use a character (like ") to enclose fields. This means that there's no way for mysqlimport to distinguish the string "NULL" from an actual null value. (There's also no option to use an escape sequence like \N for nulls.)
    • We can work around this using an insert select command instead of using mysql -e followed by mysqlimport. In the past, that caused problems because it locked the tables being selected from, but we should be able to deal with this by selecting a weak isolation level.
  • A very small number of users have names ending with backslashes, which messed with the import process and caused some data corruption (limited to individual lines).
    • It seems like this can be fixed by feeding mysqlimport the option --fields-escaped-by="" instead of the default --fields-escaped-by="\\". This makes very little sense, because a backslash is the escape character in the source file (e.g. the username \wowzeryest\ appears in the file as \\wowzeryest\\), but what can you do?
    • It can also be fixed by using insert select.
  • Counts of mobile edits (based on the mobile edit change tag) produces significant different results from counting the mobile web edit and mobile app edit tags separately and adding the totals.
    • This seems to be the combination of two separate effects: a number of edits before July 2014 that had only the mobile edit tag, and a number of edits (mainly in 2016) that have only the mobile app edit tag (T141667).
  • Massively undercounts Commons activity because most activity happens in the file namespace, not the main namespace.
    • We can work around this by creating a table of content namespaces and joining on it to get content edits. (Although technically Commons doesn't set any extra content namespaces, so it will be our own definition of "content edits", following in the footsteps of Wikistats).
  • It looks like we can add additional breakdowns of edits based on change tags without adding too much extra overhead to the query. For example, generating the editor-month rows for ruwiki for all time took 32 minutes for the previous query, and 36 minutes with an amended version which had mobile web, mobile app, and visual edit breakdowns instead of just a single mobile breakdown.
    • Visual edits: If you're using the visual editor when you press the save button, your edit gets tagged visualeditor. If you had the visual editor open at any point during your session and then switched to the wikitext editor before saving, it gets tagged as visualeditor-switched.
    • Mobile edits: Some mobile edits (I assume web ones) were only tagged with mobile edit before July 2014. Some mobile app edits have only the mobile app edit tag, and not the mobile edit tag.

Testing

Can test the procedure on Commons, since its database contains many of these (particularly non-main content namespaces and backslashes in usernames).

New queries

create table `editor_month_new` (
  `wiki` varbinary(255) not null,
  `month` date not null,
  `local_user_id` int(10) unsigned not null,
  `user_name` varbinary(255) not null default '',
  `edits` int(10) unsigned not null default '0',
  `content_edits` int(10) unsigned not null default '0',
  `deleted_edits` int(10) unsigned not null default '0',
  `mobile_web_edits` int(10) unsigned not null default '0',
  `mobile_app_edits` int(10) unsigned not null default '0',
  `visual_edits` int(10) unsigned not null default '0',
  `bot` tinyint(1) not null default '0',
  `user_registration` datetime default null,
  primary key (`wiki`,`month`,`local_user_id`),
  key `wiki_user` (`wiki`,`user_name`),
  key `user_name` (`user_name`)
) engine=TokuDB default charset=binary compression='tokudb_zlib' comment="See documentation at https://meta.wikimedia.org/wiki/Research:Editor_month_dataset";
set @start = now();

set session transaction isolation level read uncommitted;

insert into staging.editor_month_new
select
        database() as wiki,
        str_to_date(concat(rev_month, "01"), "%Y%m%d") as month,
        local_user_id,
        ifnull(user_name, "") as user_name,
        ifnull(sum(edits), 0) as edits,
        ifnull(sum(content_edits), 0) as content_edits,
        ifnull(sum(edits * deleted), 0) as deleted_edits,
        ifnull(sum(mobile_web_edits), 0) as mobile_web_edits,
        ifnull(sum(mobile_app_edits), 0) as mobile_app_edits,
        ifnull(sum(visual_edits), 0) as visual_edits,
        if(ug_group = "bot" or ufg_group = "bot", 1, 0) as bot,
        str_to_date(user_registration, "%Y%m%d%H%i%S") as user_registration
from
(
select
        left(rev_timestamp, 6) as `rev_month`,
        rev_user as `local_user_id`,
        count(*) as `edits`,
        sum(page_namespace = 0 or defined_in_site_config is not null) as content_edits,
        sum(
                ts_tags like "%mobile edit%" and 
                (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%") 
        ) as mobile_web_edits,
        sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
        sum(ts_tags like "%visualeditor%") as visual_edits,
        0 as `deleted`  
from revision
left join page on rev_page = page_id
left join tag_summary on rev_id = ts_rev_id
left join datasets.extra_content_namespaces on database() = wiki and page_namespace = namespace
where rev_timestamp < "201608"
group by left(rev_timestamp, 6), rev_user

union all

select
        left(ar_timestamp, 6) as `rev_month`,
        ar_user as `local_user_id`,
        count(*) as `edits`,
        sum(ar_namespace = 0 or defined_in_site_config is not null) as content_edits,
        sum(
                ts_tags like "%mobile edit%" and 
                (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%") 
        ) as mobile_web_edits,
        sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
        sum(ts_tags like "%visualeditor%") as visual_edits,
        1 as `deleted`
from archive
left join tag_summary on ar_rev_id = ts_rev_id
left join datasets.extra_content_namespaces on database() = wiki and ar_namespace = namespace
where ar_timestamp < "201608"
group by left(ar_timestamp, 6), ar_user
) revs
left join user on local_user_id = user_id
left join user_groups on local_user_id = ug_user and ug_group = "bot"
left join user_former_groups on local_user_id = ufg_user and ufg_group = "bot"
group by month, local_user_id;

show warnings;

select concat("Elapsed time: ", cast(timestampdiff(minute, @start, now()) as char), " minute(s)");
multiquery editor_month.sql -d all_project_dbs.tsv -h analytics-store.eqiad.wmnet --defaults-file=~/.my.cnf &> editor_month_logs.txt

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 21 2016, 9:49 PM
Neil_P._Quinn_WMF triaged this task as High priority.Jul 23 2016, 12:14 AM
Neil_P._Quinn_WMF renamed this task from Fix inconsistencies in editor month table to Regenerate editor month table.Jul 23 2016, 12:17 AM
Neil_P._Quinn_WMF raised the priority of this task from High to Needs Triage.
Neil_P._Quinn_WMF triaged this task as Normal priority.
Neil_P._Quinn_WMF updated the task description. (Show Details)
Neil_P._Quinn_WMF updated the task description. (Show Details)

I've started the regeneration (after a lot of wrestling with stdout and stderr redirection). It should take a day or so to complete.

Neil_P._Quinn_WMF closed this task as Resolved.Aug 16 2016, 7:29 PM

The query finished Thursday evening; I've now checked the new data over and satisfied myself that there are no obvious issues. I'm going to delete all the old tables and move the new table to editor_month to prevent confusion; hopefully, I can avoid having to regenerate this from scratch again in the future.

Neil_P._Quinn_WMF raised the priority of this task from Normal to Needs Triage.Mar 30 2018, 10:25 AM
Neil_P._Quinn_WMF moved this task from Neil's in progress to Done on the Contributors-Analysis board.