This means essentially repeating T138894 with amendments to fix the following issues:
* Null usernames were imported as the string `"NULL"` and null datetimes as the datetime `0000-00-00 00:00:00`.
** This is a limitation of the batch output mode of the `mysql` command line client, which doesn't have an option to use a character (like `"`) to enclose fields. This means that there's no way for `mysqlimport` to distinguish the string `"NULL"` from an actual null value. (There's also no option to use an escape sequence like `\N` for nulls.)
** We can work around this using an `insert select` command instead of using `mysql -e` followed by `mysqlimport`. In the past, that caused problems because it locked the tables being selected from, but we should be able to deal with this by selecting a weak isolation level.
* A very small number of users have names ending with backslashes, which messed with the import process and caused some data corruption (limited to individual lines).
** It seems like this can be fixed by feeding `mysqlimport` the option `--fields-escaped-by=""` instead of the default `--fields-escaped-by="\\"`. This makes very little sense, because a backslash is the escape character in the source file (e.g. the username `\wowzeryest\` appears in the file as `\\wowzeryest\\`), but what can you do?
** It can also be fixed by using `insert select`.
* Counts of mobile edits (based on the "mobile edit" change tag) produces significant different results from counting the "mobile web edit" and "mobile app edit" tags separately and adding the totals.
* Massively undercounts Commons activity because most activity happens in the file namespace, not the main namespace.
** We can work around this by creating a table of content namespaces and joining on it to get content edits. (Although technically Commons doesn't set any extra content namespaces, so it will be our own definition of "content edits", following in the footsteps of Wikistats).
* Additional fields?
** Visual edits?
** Split mobile edits into web and apps?
Can test the procedure on Commons, since its database contains many of these (particularly non-main content namespaces and backslashes in usernames).
CREATE TABLE `editor_month` (
`wiki` varbinary(255) NOT NULL,
`month` date NOT NULL,
`local_user_id` int(10) unsigned NOT NULL,
`user_name` varbinary(255) DEFAULT NULL,
`edits` int(10) unsigned NOT NULL,
`main_ns_edits` int(10) unsigned NOT NULL,
`deleted_edits` int(10) unsigned NOT NULL,
`mobile_edits` int(10) unsigned NOT NULL,
`bot` tinyint(1) DEFAULT 0,
`user_registration` datetime DEFAULT NULL,
PRIMARY KEY (`wiki`,`month`,`local_user_id`),
KEY `wiki_user` (`wiki`,`user_name`),
KEY `user_name` (`user_name`)
) ENGINE=TokuDB DEFAULT CHARSET=binary `compression`='tokudb_zlib';