Page MenuHomePhabricator

Create an expanded editor-month data warehouse
Closed, ResolvedPublic

Description

This will resolve a number of underlying problems with our metrics infrastructure, most notably:

  • we depend on Wikistats, which is moribund and extremely slow, for calculating active editors
  • the new active editors number which I custom calculate is not comparable to the Wikistats active editor number because it includes edits from all namespaces rather than just content namespaces
  • ever since we merged the MobileWebEditing logs into the Edit logs (T110272), they've become too massive to query for overall statistics about mobile edits (which, to be fair, isn't what they're designed for)
  • the current editor-month table doesn't include anonymous edits, so it can't be used to calculated overall numbers of edits

The query I'll use to aggregate this data is:

SELECT
	DATABASE() as wiki,
	STR_TO_DATE(CONCAT(rev_month, "01"), "%Y%m%d") as month,
	local_user_id,
	user_name,
	SUM(edits) as edits,
	SUM(main_ns_edits) as main_ns_edits,
	SUM(edits * deleted) as deleted_edits,
	SUM(mobile_edits) as mobile_edits,
	IF((ug_group = "bot" OR ufg_group = "bot"), 1, 0) as bot,
	STR_TO_DATE(user_registration, "%Y%m%d%H%i%S") as user_registration
FROM
(
SELECT
	LEFT(rev_timestamp, 6) as `rev_month`,
	rev_user as `local_user_id`,
	COUNT(*) as `edits`,
	SUM(IF(page_namespace = 0, 1, 0)) as main_ns_edits,
	SUM(IF(ct_tag = "mobile edit", 1, 0)) as mobile_edits,
	0 as `deleted`
FROM revision
LEFT JOIN page ON rev_page = page_id
LEFT JOIN change_tag ON rev_id = ct_rev_id AND ct_tag = "mobile edit"
WHERE rev_timestamp < "201607"
GROUP BY LEFT(rev_timestamp, 6), rev_user

UNION ALL

SELECT
	LEFT(ar_timestamp, 6) as `rev_month`,
	ar_user as `local_user_id`,
	COUNT(*) as `edits`,
	SUM(IF(ar_namespace = 0, 1, 0)) as main_ns_edits,
	SUM(IF(ct_tag = "mobile edit", 1, 0)) as mobile_edits,
	1 as `deleted`
FROM archive
LEFT JOIN change_tag ON ar_rev_id = ct_rev_id AND ct_tag = "mobile edit"
WHERE ar_timestamp < "201607"
GROUP BY LEFT(ar_timestamp, 6), ar_user
) revs
LEFT JOIN user ON local_user_id = user_id
LEFT JOIN user_groups ON local_user_id = ug_user AND ug_group = "bot"
LEFT JOIN user_former_groups ON local_user_id = ufg_user AND ufg_group = "bot"
GROUP BY month, local_user_id;

This looks like an unholy monstrosity to me, but it actually seems like it runs without causing too many problems. It took 33 minutes to run for all of history on the ruwiki (one of the largest), and as far as I can tell from Tendril didn't cause any server disruption.

Other logistical details:

multiquery editor_month_upgraded.sql -d all_project_dbs.tsv -h analytics-store.eqiad.wmnet --defaults-file=~/.my.cnf > ~/editor_month_upgraded.tsv
CREATE TABLE `editor_month_upgraded` (
  `wiki` varbinary(255) not null,
  `month` date not null,
  `local_user_id` int unsigned not null,
  `user_name` varbinary(255),
  `edits` int unsigned not null,
  `main_ns_edits` int unsigned not null,
  `deleted_edits` int unsigned not null,
  `mobile_edits` int unsigned not null,
  `bot` boolean,
  `user_registration` datetime,
  PRIMARY KEY (`wiki`,`month`,`user_name`),
  KEY `wiki_user` (`wiki`,`user_name`),
  KEY `user_name` (`user_name`)
) ENGINE=TokuDB DEFAULT CHARSET=binary `compression`='tokudb_zlib';
mysqlimport --host=analytics-store.eqiad.wmnet --local --fields-terminated-by="\t" --lines-terminated-by="\n" staging ~/editor_month_upgraded.tsv

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJun 29 2016, 1:28 AM

The script to create the table is running right now. I expect it'll finish well within the next 24 hours.

Neil_P._Quinn_WMF closed this task as Resolved.Jul 13 2016, 7:49 PM

This is done, and the table is available as staging.editor_month_upgraded on analytics-store. There are some oddities in the table, so I'll want to recreate it at some point but even so it's in very usable shape.