http://flow-reportcard.wmflabs.org/ shows giant spikes in unique users, active boards, active topics and messages posted, but it also shows a giant spike in moderation actions. Most (or all?) of this spike is probably spam. We should figure out a way to exclude deleted/moderated posts/topics from these graphs.
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Exclude moderated data | analytics/limn-flow-data | master | +49 -49 |
Event Timeline
Change 249731 had a related patch set uploaded (by Matthias Mullie):
Exclude moderated data
New data is up.
While the numbers are definitely very different now that moderated topics are excluded, the chart is still trending in the same direction. There's still a huge increase in the last couple of weeks.
These were the original (including moderated topics) & current stats for the week of Oct 18:
original (with moderated) | current (without moderated) |
Unique users | |
810 | 597 |
Active boards | |
198 | 121 |
Active topics | |
1.57K | 328 |
Messages posted | |
2.45K | 1.67K |
Checked in production - with old queries I got the # in the first column:
mysql:research@s3-analytics-slave [flowdb]> SELECT DATE('20151018') as weekstart, count(distinct workflow_wiki, workflow_namespace, workflow_title_text) as num_boards FROM flow_workflow JOIN ( SELECT a.tree_ancestor_id, DATE_FORMAT(FROM_UNIXTIME((conv(substring(hex(a.tree_descendant_id),1,12),16,10)>>2)/ 1000),"%Y%m%d%H%i%S") as timestamp FROM flow_tree_node a JOIN (SELECT b.tree_descendant_id, MAX(b.tree_depth) as max FROM flow_tree_node b GROUP BY b.tree_descendant_id) y ON y.max = a.tree_depth AND y.tree_descendant_id = a.tree_descendant_id) z ON z.tree_ancestor_id = workflow_id WHERE workflow_wiki NOT IN ( 'testwiki', 'test2wiki' ) AND timestamp >= '20151018 AND timestamp < '20151025'; +------------+------------+ | weekstart | num_boards | +------------+------------+ | 2015-10-18 | 198 | +------------+------------+ 1 row in set (0.84 sec)
The new query returns 0 - did I get the timestamp format wrong or something?:
SELECT DATE('20151028013208') AS weekstart, COUNT(DISTINCT workflow_wiki, workflow_namespace, workflow_title_text) AS num_boards FROM ( SELECT * FROM flow_workflow INNER JOIN flow_revision ON rev_type_id = workflow_id AND rev_type = 'post' WHERE workflow_wiki NOT IN ('testwiki', 'test2wiki') AND workflow_type = 'topic' AND DATE_FORMAT(FROM_UNIXTIME((conv(substring(hex(workflow_id),1,12),16,10)>>2)/1000),"% "> Y%m%d%H%i%S") BETWEEN '20151028013208' AND '20151102' GROUP BY rev_type, rev_type_id HAVING SUBSTRING_INDEX(GROUP_CONCAT(rev_change_type), ',', -1 ) NOT IN ('hide-topic', 'delete-topic', 'suppress-topic')) AS temp; +------------+------------+ | weekstart | num_boards | +------------+------------+ | 2015-10-28 | 0 | +------------+------------+ 1 row in set (0.12 sec)
Sorry for the confusion. The timestamp format should also include hours, minutes, seconds, like: 20151018000000.
My bad (in copying/pasting it mindlessly). It actually works without following a full format for the timestamp (thx @Catrope)
For active boards:
[flowdb]> SELECT DATE('20151018') AS weekstart, COUNT(DISTINCT workflow_wiki, workflow_namespace, workflow_title_text) AS num_boards FROM (SELECT * FROM flow_workflow INNER JOIN flow_revision ON rev_type_id = workflow_id AND rev_type = 'post'WHERE workflow_wiki NOT IN ('testwiki', 'test2wiki') AND workflow_type = 'topic' AND DATE_FORMAT(FROM_UNIXTIME((conv(substring(hex(workflow_id),1,12),16,10)>>2)/1000),"%Y%m%d%H%i%S") BETWEEN '20151018' AND '20151025' GROUP BY rev_type, rev_type_id HAVING SUBSTRING_INDEX(GROUP_CONCAT(rev_change_type), ',', -1 ) NOT IN ('hide-topic', 'delete-topic', 'suppress-topic')) AS temp; +------------+------------+ | weekstart | num_boards | +------------+------------+ | 2015-10-18 | 122 | +------------+------------+ 1 row in set (0.56 sec)