Context:
We can count revisions from any user but not edit count exactly like mediawiki does it (we will need flow data for example). The difference between those two counts is actually really small (0.1%) but that might not be true for ever. We might want to monitor how much edit_count (as mw understands it) differs from per_user_revision_count and have alarms on the delta among the two counts.
It seems that the only data we're not importing is actually Flow comments, based on the definition of the user table: https://www.mediawiki.org/wiki/Manual:User_table#user_editcount
Actual data can be seen here: https://docs.google.com/spreadsheets/d/1yzI7lxLQy0J9UssDWUoZZLOQI_x6_eqiqkLSiTtHHVg/edit#gid=1413502116
Code to generate data:
// Prepare Data from mediawiki_user sqlContext.sql(""" SELECT wiki_db, user_id, user_editcount FROM wmf_raw.mediawiki_user WHERE snapshot = '2017-05' AND user_id IS NOT NULL AND user_id > 0 """).registerTempTable("uec") // Prepare Data from mediawiki_history sqlContext.sql(""" SELECT wiki_db, event_user_id, MAX(revision_user_cumulative_revision_count) as rev_count FROM joal.mediawiki_history WHERE snapshot = '2017-05' AND event_user_id IS NOT NULL AND event_user_id > 0 GROUP BY wiki_db, event_user_id """).registerTempTable("urc") // Join prepared data sqlContext.sql(""" SELECT uec.wiki_db as w, COALESCE(user_id, event_user_id) as id, COALESCE(user_editcount, 0) as ec, COALESCE(rev_count, 0) as rc FROM uec FULL OUTER JOIN urc ON (uec.wiki_db = urc.wiki_db AND uec.user_id = urc.event_user_id) """).registerTempTable("juerc") // Count differences sqlContext.sql(""" SELECT w, SUM(CASE WHEN (udiff = 0) THEN 1 ELSE 0 END) as equ, SUM(CASE WHEN (udiff = 0) THEN 0 ELSE 1 END) as nequ FROM ( SELECT w, id, (ec - rc) as udiff FROM juerc GROUP BY w, id, ec, rc ) t GROUP BY w """).collect.foreach(println) // Count differences when edit count < 10 (non autoconfirmed) sqlContext.sql(""" SELECT w, SUM(CASE WHEN (udiff = 0) THEN 1 ELSE 0 END) as equ, SUM(CASE WHEN (udiff = 0) THEN 0 ELSE 1 END) as nequ FROM ( SELECT w, id, (ec - rc) as udiff FROM juerc WHERE ec < 10 GROUP BY w, id, ec, rc ) t GROUP BY w """).collect.foreach(println) // Count differences sqlContext.sql(""" SELECT w, AVG(udiff) as avgdiff, MAX(udiff) as stddevdiff, PERCENTILE(udiff, ARRAY(0.5, 0.75, 0.9, 0.95, 0.99, 0.999)) FROM ( SELECT w, id, (ec - rc) as udiff FROM juerc GROUP BY w, id, ec, rc ) t GROUP BY w """).collect.foreach(println) // Count differences sqlContext.sql(""" SELECT w, AVG(udiff) as avgdiff, MAX(udiff) as stddevdiff, PERCENTILE(udiff, ARRAY(0.5, 0.75, 0.9, 0.95, 0.99, 0.999)) FROM ( SELECT w, id, (ec - rc) as udiff FROM juerc WHERE ec < 10 GROUP BY w, id, ec, rc ) t GROUP BY w """).collect.foreach(println)