This means essentially repeating T138894 with amendments to fix the following issues:
* Null usernames were imported as the string `"NULL"` and null datetimes as the datetime `0000-00-00 00:00:00`.
** This is a limitation of the batch output mode of the `mysql` command line client, which doesn't have an option to use a character (like `"`) to enclose fields. This means that there's no way for `mysqlimport` to distinguish the string `"NULL"` from an actual null value. (There's also no option to use an escape sequence like `\N` for nulls.)
** We can work around this using an `insert select` command instead of using `mysql -e` followed by `mysqlimport`. In the past, that caused problems because it locked the tables being selected from, but we should be able to deal with this by selecting a weak isolation level.
* A very small number of users have names ending with backslashes, which messed with the import process and caused some data corruption (limited to individual lines).
** It seems like this can be fixed by feeding `mysqlimport` the option `--fields-escaped-by=""` instead of the default `--fields-escaped-by="\\"`. This makes very little sense, because a backslash is the escape character in the source file (e.g. the username `\wowzeryest\` appears in the file as `\\wowzeryest\\`), but what can you do?
** It can also be fixed by using `insert select`.
* Counts of mobile edits (based on the `mobile edit` change tag) produces significant different results from counting the `mobile web edit` and `mobile app edit` tags separately and adding the totals.
** This seems to be the combination of two separate effects: a number of edits before July 2014 that had only the `mobile edit` tag, and a number of edits (mainly in 2016) that have only the `mobile app edit` tag (T141667).
* Massively undercounts Commons activity because most activity happens in the file namespace, not the main namespace.
** We can work around this by creating a table of content namespaces and joining on it to get content edits. (Although technically Commons doesn't set any extra content namespaces, so it will be our own definition of "content edits", following in the footsteps of Wikistats).
* It looks like we can add additional breakdowns of edits based on change tags without adding too much extra overhead to the query. For example, generating the editor-month rows for ruwiki for all time took 32 minutes for the previous query, and 36 minutes with an amended version which had mobile web, mobile app, and visual edit breakdowns instead of just a single mobile breakdown.
** Visual edits: If you're using the visual editor when you press the save button, your edit gets tagged `visualeditor`. If you had the visual editor open at any point during your session and then switched to the wikitext editor before saving, it gets tagged as `visualeditor-switched`.
** Mobile edits: Some mobile edits (I assume web ones) were only tagged with `mobile edit` before July 2014. Some mobile app edits have only the `mobile app edit` tag, and not the `mobile edit` tag.
=Testing=
Can test the procedure on Commons, since its database contains many of these (particularly non-main content namespaces and backslashes in usernames).
=New queries=
```create table `editor_month_new` (
`wiki` varbinary(255) not null,
`month` date not null,
`local_user_id` int(10) unsigned not null,
`user_name` varbinary(255) not null default '',
`edits` int(10) unsigned not null default '0',
`content_edits` int(10) unsigned not null default '0',
`deleted_edits` int(10) unsigned not null default '0',
`mobile_web_edits` int(10) unsigned not null default '0',
`mobile_app_edits` int(10) unsigned not null default '0',
`visual_edits` int(10) unsigned not null default '0',
`bot` tinyint(1) not null default '0',
`user_registration` datetime default null,
primary key (`wiki`,`month`,`local_user_id`),
key `wiki_user` (`wiki`,`user_name`),
key `user_name` (`user_name`)
) engine=TokuDB default charset=binary `compression`='tokudb_zlib';
```
```
set @start = now();
set session transaction isolation level read uncommitted;
insert into staging.editor_month_new
select
database() as wiki,
str_to_date(concat(rev_month, "01"), "%Y%m%d") as month,
local_user_id,
ifnull(user_name, "") as user_name,
ifnull(sum(edits), 0) as edits,
ifnull(sum(content_edits), 0) as content_edits,
ifnull(sum(edits * deleted), 0) as deleted_edits,
ifnull(sum(mobile_web_edits), 0) as mobile_web_edits,
ifnull(sum(mobile_app_edits), 0) as mobile_app_edits,
ifnull(sum(visual_edits), 0) as visual_edits,
if(ug_group = "bot" or ufg_group = "bot", 1, 0) as bot,
str_to_date(user_registration, "%Y%m%d%H%i%S") as user_registration
from
(
select
left(rev_timestamp, 6) as `rev_month`,
rev_user as `local_user_id`,
count(*) as `edits`,
sum(page_namespace = 0 or defined_in_site_config is not null) as content_edits,
sum(
ts_tags like "%mobile edit%" and
(ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%")
) as mobile_web_edits,
sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
sum(ts_tags like "%visualeditor%") as visual_edits,
0 as `deleted`
from revision
left join page on rev_page = page_id
left join tag_summary on rev_id = ts_rev_id
left join datasets.extra_content_namespaces on database() = wiki and page_namespace = namespace
where rev_timestamp < "201608"
group by left(rev_timestamp, 6), rev_user
union all
select
left(ar_timestamp, 6) as `rev_month`,
ar_user as `local_user_id`,
count(*) as `edits`,
sum(ar_namespace = 0 or defined_in_site_config is not null) as content_edits,
sum(
ts_tags like "%mobile edit%" and
(ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%")
) as mobile_web_edits,
sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
sum(ts_tags like "%visualeditor%") as visual_edits,
1 as `deleted`
from archive
left join tag_summary on ar_rev_id = ts_rev_id
left join datasets.extra_content_namespaces on database() = wiki and ar_namespace = namespace
where ar_timestamp < "201608"
group by left(ar_timestamp, 6), ar_user
) revs
left join user on local_user_id = user_id
left join user_groups on local_user_id = ug_user and ug_group = "bot"
left join user_former_groups on local_user_id = ufg_user and ufg_group = "bot"
group by month, local_user_id;
show warnings;
select concat("Elapsed time: ", cast(timestampdiff(minute, @start, now()) as char), " minute(s)");
```
```
multiquery editor_month.sql -d all_project_dbs.tsv -h analytics-store.eqiad.wmnet --defaults-file=~/.my.cnf 2>&1 | tee editor_month_logs.txt
```