Page MenuHomePhabricator

Productionize analysis of editcount vs per_user_revision_count
Open, LowPublic

Description

Context:
We can count revisions from any user but not edit count exactly like mediawiki does it (we will need flow data for example). The difference between those two counts is actually really small (0.1%) but that might not be true for ever. We might want to monitor how much edit_count (as mw understands it) differs from per_user_revision_count and have alarms on the delta among the two counts.

It seems that the only data we're not importing is actually Flow comments, based on the definition of the user table: https://www.mediawiki.org/wiki/Manual:User_table#user_editcount

Actual data can be seen here: https://docs.google.com/spreadsheets/d/1yzI7lxLQy0J9UssDWUoZZLOQI_x6_eqiqkLSiTtHHVg/edit#gid=1413502116

Code to generate data:

// Prepare Data from mediawiki_user
sqlContext.sql("""
SELECT
  wiki_db,
  user_id,
  user_editcount
FROM wmf_raw.mediawiki_user
WHERE snapshot = '2017-05'
  AND user_id IS NOT NULL
  AND user_id  > 0
""").registerTempTable("uec")


// Prepare Data from mediawiki_history
sqlContext.sql("""
SELECT
  wiki_db,
  event_user_id,
  MAX(revision_user_cumulative_revision_count) as rev_count
FROM joal.mediawiki_history
WHERE snapshot = '2017-05'
  AND event_user_id IS NOT NULL
  AND event_user_id > 0
GROUP BY
  wiki_db,
  event_user_id
""").registerTempTable("urc")


// Join prepared data
sqlContext.sql("""
SELECT
  uec.wiki_db as w,
  COALESCE(user_id, event_user_id) as id,
  COALESCE(user_editcount, 0) as ec,
  COALESCE(rev_count, 0) as rc
FROM uec
  FULL OUTER JOIN urc
    ON (uec.wiki_db = urc.wiki_db AND uec.user_id = urc.event_user_id)
""").registerTempTable("juerc")


// Count differences
sqlContext.sql("""
SELECT
  w,
  SUM(CASE WHEN (udiff = 0) THEN 1 ELSE 0 END) as equ,
  SUM(CASE WHEN (udiff = 0) THEN 0 ELSE 1 END) as nequ
FROM (
  SELECT
    w,
    id,
    (ec - rc) as udiff
  FROM juerc
  GROUP BY w, id, ec, rc
) t
GROUP BY w
""").collect.foreach(println)

// Count differences when edit count < 10 (non autoconfirmed)
sqlContext.sql("""
SELECT
  w,
  SUM(CASE WHEN (udiff = 0) THEN 1 ELSE 0 END) as equ,
  SUM(CASE WHEN (udiff = 0) THEN 0 ELSE 1 END) as nequ
FROM (
  SELECT
    w,
    id,
    (ec - rc) as udiff
  FROM juerc
  WHERE ec < 10
  GROUP BY w, id, ec, rc
) t
GROUP BY w
""").collect.foreach(println)


// Count differences
sqlContext.sql("""
SELECT
  w,
  AVG(udiff) as avgdiff,
  MAX(udiff) as stddevdiff,
  PERCENTILE(udiff, ARRAY(0.5, 0.75, 0.9, 0.95, 0.99, 0.999))
FROM (
  SELECT
    w,
    id,
    (ec - rc) as udiff
  FROM juerc
  GROUP BY w, id, ec, rc
) t
GROUP BY w
""").collect.foreach(println)


// Count differences
sqlContext.sql("""
SELECT
  w,
  AVG(udiff) as avgdiff,
  MAX(udiff) as stddevdiff,
  PERCENTILE(udiff, ARRAY(0.5, 0.75, 0.9, 0.95, 0.99, 0.999))
FROM (
  SELECT
    w,
    id,
    (ec - rc) as udiff
  FROM juerc
  WHERE ec < 10
  GROUP BY w, id, ec, rc
) t
GROUP BY w
""").collect.foreach(println)

Event Timeline

This task is still real. We have never taken the time to check the difference and investigate. Linking mediawiki-history and event where edit-counts from the DB is present will make it a lot easier.