This means essentially repeating T138894 with amendments to fix the following issues:
- Null usernames were imported as the string "NULL" and null datetimes as the datetime 0000-00-00 00:00:00.
- This is a limitation of the batch output mode of the mysql command line client, which doesn't have an option to use a character (like ") to enclose fields. This means that there's no way for mysqlimport to distinguish the string "NULL" from an actual null value. (There's also no option to use an escape sequence like \N for nulls.)
- We can work around this using an insert select command instead of using mysql -e followed by mysqlimport. In the past, that caused problems because it locked the tables being selected from, but we should be able to deal with this by selecting a weak isolation level.
- A very small number of users have names ending with backslashes, which messed with the import process and caused some data corruption (limited to individual lines).
- It seems like this can be fixed by feeding mysqlimport the option --fields-escaped-by="" instead of the default --fields-escaped-by="\\". This makes very little sense, because a backslash is the escape character in the source file (e.g. the username \wowzeryest\ appears in the file as \\wowzeryest\\), but what can you do?
- It can also be fixed by using insert select.
- Counts of mobile edits (based on the mobile edit change tag) produces significant different results from counting the mobile web edit and mobile app edit tags separately and adding the totals.
- This seems to be the combination of two separate effects: a number of edits before July 2014 that had only the mobile edit tag, and a number of edits (mainly in 2016) that have only the mobile app edit tag (T141667).
- Massively undercounts Commons activity because most activity happens in the file namespace, not the main namespace.
- We can work around this by creating a table of content namespaces and joining on it to get content edits. (Although technically Commons doesn't set any extra content namespaces, so it will be our own definition of "content edits", following in the footsteps of Wikistats).
- It looks like we can add additional breakdowns of edits based on change tags without adding too much extra overhead to the query. For example, generating the editor-month rows for ruwiki for all time took 32 minutes for the previous query, and 36 minutes with an amended version which had mobile web, mobile app, and visual edit breakdowns instead of just a single mobile breakdown.
- Visual edits: If you're using the visual editor when you press the save button, your edit gets tagged visualeditor. If you had the visual editor open at any point during your session and then switched to the wikitext editor before saving, it gets tagged as visualeditor-switched.
- Mobile edits: Some mobile edits (I assume web ones) were only tagged with mobile edit before July 2014. Some mobile app edits have only the mobile app edit tag, and not the mobile edit tag.
Can test the procedure on Commons, since its database contains many of these (particularly non-main content namespaces and backslashes in usernames).
create table `editor_month_new` ( `wiki` varbinary(255) not null, `month` date not null, `local_user_id` int(10) unsigned not null, `user_name` varbinary(255) not null default '', `edits` int(10) unsigned not null default '0', `content_edits` int(10) unsigned not null default '0', `deleted_edits` int(10) unsigned not null default '0', `mobile_web_edits` int(10) unsigned not null default '0', `mobile_app_edits` int(10) unsigned not null default '0', `visual_edits` int(10) unsigned not null default '0', `bot` tinyint(1) not null default '0', `user_registration` datetime default null, primary key (`wiki`,`month`,`local_user_id`), key `wiki_user` (`wiki`,`user_name`), key `user_name` (`user_name`) ) engine=TokuDB default charset=binary compression='tokudb_zlib' comment="See documentation at https://meta.wikimedia.org/wiki/Research:Editor_month_dataset";
set @start = now(); set session transaction isolation level read uncommitted; insert into staging.editor_month_new select database() as wiki, str_to_date(concat(rev_month, "01"), "%Y%m%d") as month, local_user_id, ifnull(user_name, "") as user_name, ifnull(sum(edits), 0) as edits, ifnull(sum(content_edits), 0) as content_edits, ifnull(sum(edits * deleted), 0) as deleted_edits, ifnull(sum(mobile_web_edits), 0) as mobile_web_edits, ifnull(sum(mobile_app_edits), 0) as mobile_app_edits, ifnull(sum(visual_edits), 0) as visual_edits, if(ug_group = "bot" or ufg_group = "bot", 1, 0) as bot, str_to_date(user_registration, "%Y%m%d%H%i%S") as user_registration from ( select left(rev_timestamp, 6) as `rev_month`, rev_user as `local_user_id`, count(*) as `edits`, sum(page_namespace = 0 or defined_in_site_config is not null) as content_edits, sum( ts_tags like "%mobile edit%" and (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%") ) as mobile_web_edits, sum(ts_tags like "%mobile app edit%") as mobile_app_edits, sum(ts_tags like "%visualeditor%") as visual_edits, 0 as `deleted` from revision left join page on rev_page = page_id left join tag_summary on rev_id = ts_rev_id left join datasets.extra_content_namespaces on database() = wiki and page_namespace = namespace where rev_timestamp < "201608" group by left(rev_timestamp, 6), rev_user union all select left(ar_timestamp, 6) as `rev_month`, ar_user as `local_user_id`, count(*) as `edits`, sum(ar_namespace = 0 or defined_in_site_config is not null) as content_edits, sum( ts_tags like "%mobile edit%" and (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%") ) as mobile_web_edits, sum(ts_tags like "%mobile app edit%") as mobile_app_edits, sum(ts_tags like "%visualeditor%") as visual_edits, 1 as `deleted` from archive left join tag_summary on ar_rev_id = ts_rev_id left join datasets.extra_content_namespaces on database() = wiki and ar_namespace = namespace where ar_timestamp < "201608" group by left(ar_timestamp, 6), ar_user ) revs left join user on local_user_id = user_id left join user_groups on local_user_id = ug_user and ug_group = "bot" left join user_former_groups on local_user_id = ufg_user and ufg_group = "bot" group by month, local_user_id; show warnings; select concat("Elapsed time: ", cast(timestampdiff(minute, @start, now()) as char), " minute(s)");
multiquery editor_month.sql -d all_project_dbs.tsv -h analytics-store.eqiad.wmnet --defaults-file=~/.my.cnf &> editor_month_logs.txt