Page MenuHomePhabricator

Add hourly aggregate sequence stats creation to webrequest load job
Closed, ResolvedPublic

Description

It'd be nice to have more easily readable (and plottable, and maybe alertable?) stats about hourly webrequest sequence stats. We could augment christian's report, but I think an easy intermediate step would be to also create a new aggregate table that sums stats and percent_different from the webrequest_sequence_stats table. Something like:

create table otto.webrequest_sequence_stats_hourly as
select
  webrequest_source, year, month, day, hour,
  sum(count_actual) as count_actual,
  sum(count_expected) as count_expected,
  sum(count_different) as count_different,
  ((sum(count_actual) / sum(count_expected)) - 1) * 100.0 as percent_different
from
  wmf_raw.webrequest_sequence_stats
where
  webrequest_source='text' and year=2015 and month=8 and day = 11
 and percent_different <= 0.0 and sequence_min <> 0
group by
  webrequest_source, year, month, day, hour
;

Not sure about the percent_different <= 0.0 and sequence_min <> 0 bit, we should discuss that.

Event Timeline

Ottomata created this task.Aug 14 2015, 9:05 PM
Ottomata claimed this task.
Ottomata raised the priority of this task from to Needs Triage.
Ottomata updated the task description. (Show Details)
Ottomata added subscribers: Ottomata, JAllemandou, Milimetric.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 14 2015, 9:05 PM
JAllemandou set Security to None.
Ottomata added a comment.EditedAug 24 2015, 6:53 PM

The load job has been augmented. I am backfilling for the month of August with this query (using dynamic partition values):

set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE wmf_raw.webrequest_sequence_stats_hourly
PARTITION(webrequest_source,year,month,day,hour)
SELECT
    count_actual,
    count_expected,
    count_null_sequence,
    count_duplicate,
    count_lost,
    ROUND(((count_duplicate / count_expected) * 100.0), 8)  AS percent_duplicate,
    ROUND(((count_lost      / count_expected) * 100.0), 8)  AS percent_lost,
    webrequest_source, year, month, day, hour
FROM (
SELECT
    SUM(count_actual)                                       AS count_actual,
    SUM(count_expected)                                     AS count_expected,
    SUM(count_null_sequence)                                AS count_null_sequence,
    SUM(count_duplicate)                                    AS count_duplicate,
    SUM(count_different) + SUM(count_duplicate)             AS count_lost,
    webrequest_source, year, month, day, hour
FROM
    wmf_raw.webrequest_sequence_stats
WHERE
    year=2015 AND month=8
    -- sequence_min == 0 means varnishkafka restarted.
    -- Even though it skews results, don't include hosts
    -- with reset seqeuence numbers in these results, as
    -- they are a common cause of false postives in percent_loss and
    -- percent_duplicate.
    AND sequence_min <> 0
 GROUP BY
  webrequest_source, year, month, day, hour
) a;
Ottomata moved this task from In Progress to Done on the Analytics-Kanban board.Aug 25 2015, 3:36 PM
kevinator closed this task as Resolved.Aug 25 2015, 11:30 PM
kevinator added a subscriber: kevinator.