Page MenuHomePhabricator

F10. Investigate Limn dashboard for Flow, make sure it's giving us the data that we think it is
Closed, ResolvedPublic5 Estimated Story Points

Description

We count all the unique users that have done any "write" action. For example:

  • posted a new topic
  • replied to a topic or a post
  • edited a topic or a post
  • edited a board description
  • edited a topic summary
  • done any moderation action

Event Timeline

DannyH raised the priority of this task from to High.
DannyH updated the task description. (Show Details)
DannyH subscribed.
Catrope set Security to None.
SBisson updated the task description. (Show Details)
DannyH renamed this task from Investigate Limn dashboard for Flow, make sure it's giving us the data that we think it is to F10. Investigate Limn dashboard for Flow, make sure it's giving us the data that we think it is.Jul 23 2015, 12:42 AM

Change 226524 had a related patch set uploaded (by Sbisson):
Include all contributing users in 'unique-users' query

https://gerrit.wikimedia.org/r/226524

Change 226524 merged by Milimetric:
Include all contributing users in 'unique-users' query

https://gerrit.wikimedia.org/r/226524

Two questions:

  1. All below actions are captured? Or the actions that taken into count are different?
MariaDB [testwiki]> select distinct rev_change_type from flow_revision;
+----------------------+
| rev_change_type      |
+----------------------+
| reply                |
| new-post             |
| create-header        |
| edit-header          |
| edit-post            |
| edit-title           |
| lock-topic           |
| create-topic-summary |
| restore-topic        |
| edit-topic-summary   |
| hide-topic           |
+----------------------+
11 rows in set (0.00 sec)
  1. Since there is a derived table, no index is used - is it ok?
MariaDB [enwiki]> explain  SELECT weekstart,  SUM(user) as unique_users   
FROM (  SELECT 1 as user,  DATE(DATE_SUB(
FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000), interval DAYOFWEEK( 
FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000)) day)) as weekstart
FROM flow_revision  
WHERE rev_user_wiki NOT IN ( 'testwiki', 'test2wiki' )  
GROUP BY rev_user_wiki, rev_user_id, rev_user_ip ) x  
WHERE weekstart < DATE_SUB(CURDATE(), interval DAYOFWEEK(NOW()) day)  
GROUP BY weekstart\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 206376
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DERIVED
        table: flow_revision
         type: index
possible_keys: NULL
          key: flow_revision_user
      key_len: 116
          ref: NULL
         rows: 206376
        Extra: Using where; Using index; Using temporary; Using filesort

All below actions are captured? Or the actions that taken into count are different?

No filtering on rev_change_type so all those actions are included plus a few others like delete and suppress. Essentially, everybody whose actions result in a db write is included.

Since there is a derived table, no index is used - is it ok?

I think this is ok. Maybe @EBernhardson or @Milimetric have a different opinion?

No index should be alright, many of the other limn queries (both flow and other projects) tend to be unoptimized as the existing indexes are the ones used in prod, not anything tailored to the actual queries run here.

Checked for writing-to-db events - they are counted correctly.

No results for the week of 2015-04-11?

MariaDB [testwiki]> SELECT weekstart,
    ->        SUM(user) as unique_users
    ->   FROM (
    ->         SELECT 1 as user,
    ->                DATE(DATE_SUB(FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000), interval DAYOFWEEK( FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000)) day)) as weekstart
    ->           FROM flow_revision
    ->          WHERE rev_user_wiki  IN ( 'testwiki' )
    ->          GROUP BY rev_user_wiki, rev_user_id, rev_user_ip
    ->        ) x
    ->  WHERE weekstart < DATE_SUB(CURDATE(), interval DAYOFWEEK(NOW()) day)
    ->  GROUP BY weekstart;
+------------+--------------+
| weekstart  | unique_users |
+------------+--------------+
| 2014-05-17 |            1 |
| 2015-04-25 |            2 |
| 2015-05-02 |            4 |
| 2015-05-30 |            4 |
| 2015-06-13 |            1 |
| 2015-06-20 |            1 |
| 2015-06-27 |            1 |
| 2015-07-04 |            2 |
+------------+--------------+
8 rows in set (0.01 sec)

Looking at this in the office -- we found that this version was still undercounting unique users. It was showing 2 users on Catalan from July 18-24, when it was actually at least 18 (verified by going to Catalan Flow boards and counting).

Roan figured out some issues -- putting this back to In Development so Roan can continue.

DannyH edited subscribers, added: SBisson; removed: Catrope.

Change 227398 had a related patch set uploaded (by Catrope):
Fix unique-users query

https://gerrit.wikimedia.org/r/227398

Ran the fixed query on testwiki

MariaDB [testwiki]> SELECT DATE_SUB(d, interval DAYOFWEEK(d)-1 day) AS Week,  
COUNT(DISTINCT rev_user_id, COALESCE(rev_user_ip, ''), rev_user_wiki) AS "Unique users" 
 FROM 
 (SELECT DATE(FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000)) AS d, rev_user_id, rev_user_ip, rev_user_wiki  FROM flow_revision 
  WHERE rev_user_wiki IN ('testwiki') ) AS x GROUP BY Week HAVING Week < DATE_SUB(CURDATE(), interval DAYOFWEEK(NOW())-1 day) 
ORDER BY Week;
+------------+--------------+
| Week       | Unique users |
+------------+--------------+
| 2014-05-18 |            1 |
| 2015-04-26 |            2 |
| 2015-05-03 |            6 |
| 2015-05-10 |            3 |
| 2015-05-31 |            5 |
| 2015-06-14 |            4 |
| 2015-06-21 |            6 |
| 2015-06-28 |            1 |
| 2015-07-05 |            6 |
| 2015-07-12 |            1 |
| 2015-07-19 |            4 |
+------------+--------------+
11 rows in set (0.01 sec)

which coincides to the previous test results(shifted by one day):

MariaDB [testwiki]> select DATE(DATE_SUB(FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000), interval DAYOFWEEK( FROM_UNIXTIME((conv(substring(hex(rev_id),1,12),16,10)>>2)/1000)) day)) as Day, count(distinct(rev_user_id)) as Users from flow_revision  group by Day;
+------------+-------+
| Day        | Users |
+------------+-------+
| 2014-05-17 |     1 |
| 2015-04-25 |     2 |
| 2015-05-02 |     5 |
| 2015-05-09 |     3 |
| 2015-05-30 |     5 |
| 2015-06-13 |     4 |
| 2015-06-20 |     6 |
| 2015-06-27 |     1 |
| 2015-07-04 |     6 |
| 2015-07-11 |     1 |
| 2015-07-18 |     4 |
| 2015-07-25 |     2 |
+------------+-------+
12 rows in set (0.02 sec)

Change 227398 merged by Matthias Mullie:
Fix unique-users query

https://gerrit.wikimedia.org/r/227398