Exclude deleted topics/posts from flow-reportcard stats
Closed, ResolvedPublic

Description

http://flow-reportcard.wmflabs.org/ shows giant spikes in unique users, active boards, active topics and messages posted, but it also shows a giant spike in moderation actions. Most (or all?) of this spike is probably spam. We should figure out a way to exclude deleted/moderated posts/topics from these graphs.

Catrope created this task.Oct 27 2015, 6:31 PM
Catrope updated the task description. (Show Details)
Catrope raised the priority of this task from to Needs Triage.
Catrope added a subscriber: Catrope.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 27 2015, 6:31 PM
matthiasmullie set Security to None.

Change 249731 had a related patch set uploaded (by Matthias Mullie):
Exclude moderated data

https://gerrit.wikimedia.org/r/249731

Change 249731 merged by Milimetric:
Exclude moderated data

https://gerrit.wikimedia.org/r/249731

Code has been merged, but new data won't regenerate before Sunday.

New data is up.

While the numbers are definitely very different now that moderated topics are excluded, the chart is still trending in the same direction. There's still a huge increase in the last couple of weeks.

These were the original (including moderated topics) & current stats for the week of Oct 18:

original (with moderated)current (without moderated)
Unique users
810597
Active boards
198121
Active topics
1.57K328
Messages posted
2.45K1.67K

Checked in production - with old queries I got the # in the first column:

mysql:research@s3-analytics-slave [flowdb]> SELECT DATE('20151018') as weekstart,     count(distinct workflow_wiki, workflow_namespace, workflow_title_text) as num_boards FROM flow_workflow  
JOIN 
( SELECT a.tree_ancestor_id, DATE_FORMAT(FROM_UNIXTIME((conv(substring(hex(a.tree_descendant_id),1,12),16,10)>>2)/ 1000),"%Y%m%d%H%i%S") as timestamp 
     FROM flow_tree_node a 
          JOIN (SELECT b.tree_descendant_id, MAX(b.tree_depth) as max FROM flow_tree_node b                
               GROUP BY b.tree_descendant_id) y ON y.max = a.tree_depth AND y.tree_descendant_id = a.tree_descendant_id) z 
               ON z.tree_ancestor_id = workflow_id 
                           WHERE workflow_wiki NOT IN ( 'testwiki', 'test2wiki' ) AND timestamp >= '20151018 AND timestamp < '20151025';
+------------+------------+
| weekstart  | num_boards |
+------------+------------+
| 2015-10-18 |        198 |
+------------+------------+
1 row in set (0.84 sec)

The new query returns 0 - did I get the timestamp format wrong or something?:

SELECT DATE('20151028013208') AS weekstart,  COUNT(DISTINCT workflow_wiki, workflow_namespace, workflow_title_text) AS num_boards FROM (       SELECT *       FROM flow_workflow       INNER JOIN flow_revision ON rev_type_id = workflow_id AND rev_type = 'post'     WHERE       workflow_wiki NOT IN ('testwiki', 'test2wiki') AND     workflow_type = 'topic' AND     DATE_FORMAT(FROM_UNIXTIME((conv(substring(hex(workflow_id),1,12),16,10)>>2)/1000),"%
    "> Y%m%d%H%i%S") BETWEEN '20151028013208' AND '20151102'     GROUP BY rev_type, rev_type_id HAVING SUBSTRING_INDEX(GROUP_CONCAT(rev_change_type), ',', -1 ) NOT IN ('hide-topic', 'delete-topic', 'suppress-topic')) AS temp;
+------------+------------+
| weekstart  | num_boards |
+------------+------------+
| 2015-10-28 |          0 |
+------------+------------+
1 row in set (0.12 sec)

Sorry for the confusion. The timestamp format should also include hours, minutes, seconds, like: 20151018000000.

My bad (in copying/pasting it mindlessly). It actually works without following a full format for the timestamp (thx @Catrope)

For active boards:

[flowdb]> SELECT DATE('20151018') AS weekstart,  COUNT(DISTINCT workflow_wiki, workflow_namespace, workflow_title_text) AS num_boards FROM (SELECT * FROM flow_workflow INNER JOIN flow_revision ON rev_type_id = workflow_id AND rev_type = 'post'WHERE workflow_wiki NOT IN ('testwiki', 'test2wiki') AND workflow_type = 'topic' AND     DATE_FORMAT(FROM_UNIXTIME((conv(substring(hex(workflow_id),1,12),16,10)>>2)/1000),"%Y%m%d%H%i%S") BETWEEN '20151018' AND '20151025' GROUP BY rev_type, rev_type_id HAVING SUBSTRING_INDEX(GROUP_CONCAT(rev_change_type), ',', -1 ) NOT IN ('hide-topic', 'delete-topic', 'suppress-topic')) AS temp; 
+------------+------------+
| weekstart  | num_boards |
+------------+------------+
| 2015-10-18 |        122 |
+------------+------------+
1 row in set (0.56 sec)
Restricted Application added a subscriber: StudiesWorld. · View Herald TranscriptNov 3 2015, 9:05 PM
Catrope closed this task as Resolved.Nov 22 2015, 10:41 PM