Page MenuHomePhabricator

Create an expanded editor-month data warehouse
Closed, ResolvedPublic


This will resolve a number of underlying problems with our metrics infrastructure, most notably:

  • we depend on Wikistats, which is moribund and extremely slow, for calculating active editors
  • the new active editors number which I custom calculate is not comparable to the Wikistats active editor number because it includes edits from all namespaces rather than just content namespaces
  • ever since we merged the MobileWebEditing logs into the Edit logs (T110272), they've become too massive to query for overall statistics about mobile edits (which, to be fair, isn't what they're designed for)
  • the current editor-month table doesn't include anonymous edits, so it can't be used to calculated overall numbers of edits

The query I'll use to aggregate this data is:

	DATABASE() as wiki,
	STR_TO_DATE(CONCAT(rev_month, "01"), "%Y%m%d") as month,
	SUM(edits) as edits,
	SUM(main_ns_edits) as main_ns_edits,
	SUM(edits * deleted) as deleted_edits,
	SUM(mobile_edits) as mobile_edits,
	IF((ug_group = "bot" OR ufg_group = "bot"), 1, 0) as bot,
	STR_TO_DATE(user_registration, "%Y%m%d%H%i%S") as user_registration
	LEFT(rev_timestamp, 6) as `rev_month`,
	rev_user as `local_user_id`,
	COUNT(*) as `edits`,
	SUM(IF(page_namespace = 0, 1, 0)) as main_ns_edits,
	SUM(IF(ct_tag = "mobile edit", 1, 0)) as mobile_edits,
	0 as `deleted`
FROM revision
LEFT JOIN page ON rev_page = page_id
LEFT JOIN change_tag ON rev_id = ct_rev_id AND ct_tag = "mobile edit"
WHERE rev_timestamp < "201607"
GROUP BY LEFT(rev_timestamp, 6), rev_user


	LEFT(ar_timestamp, 6) as `rev_month`,
	ar_user as `local_user_id`,
	COUNT(*) as `edits`,
	SUM(IF(ar_namespace = 0, 1, 0)) as main_ns_edits,
	SUM(IF(ct_tag = "mobile edit", 1, 0)) as mobile_edits,
	1 as `deleted`
FROM archive
LEFT JOIN change_tag ON ar_rev_id = ct_rev_id AND ct_tag = "mobile edit"
WHERE ar_timestamp < "201607"
GROUP BY LEFT(ar_timestamp, 6), ar_user
) revs
LEFT JOIN user ON local_user_id = user_id
LEFT JOIN user_groups ON local_user_id = ug_user AND ug_group = "bot"
LEFT JOIN user_former_groups ON local_user_id = ufg_user AND ufg_group = "bot"
GROUP BY month, local_user_id;

This looks like an unholy monstrosity to me, but it actually seems like it runs without causing too many problems. It took 33 minutes to run for all of history on the ruwiki (one of the largest), and as far as I can tell from Tendril didn't cause any server disruption.

Other logistical details:

multiquery editor_month_upgraded.sql -d all_project_dbs.tsv -h analytics-store.eqiad.wmnet --defaults-file=~/.my.cnf > ~/editor_month_upgraded.tsv
CREATE TABLE `editor_month_upgraded` (
  `wiki` varbinary(255) not null,
  `month` date not null,
  `local_user_id` int unsigned not null,
  `user_name` varbinary(255),
  `edits` int unsigned not null,
  `main_ns_edits` int unsigned not null,
  `deleted_edits` int unsigned not null,
  `mobile_edits` int unsigned not null,
  `bot` boolean,
  `user_registration` datetime,
  PRIMARY KEY (`wiki`,`month`,`user_name`),
  KEY `wiki_user` (`wiki`,`user_name`),
  KEY `user_name` (`user_name`)
) ENGINE=TokuDB DEFAULT CHARSET=binary `compression`='tokudb_zlib';
mysqlimport --host=analytics-store.eqiad.wmnet --local --fields-terminated-by="\t" --lines-terminated-by="\n" staging ~/editor_month_upgraded.tsv

Event Timeline

The script to create the table is running right now. I expect it'll finish well within the next 24 hours.

This is done, and the table is available as staging.editor_month_upgraded on analytics-store. There are some oddities in the table, so I'll want to recreate it at some point but even so it's in very usable shape.