"Throughput of EventLogging NavigationTiming events" UNKNOWN
Closed, Resolved (Public)

Description

It looks like this alert has been UNKNOWN for the last 80 days in Icinga; the metric has likely disappeared or been renamed. @Ottomata, perhaps?

monitoring::graphite_threshold { 'eventlogging_NavigationTiming_throughput':
    description   => 'Throughput of EventLogging NavigationTiming events',
    metric        => "kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_NavigationTiming.OneMinuteRate",
    warning       => 1,
    critical      => 0,
    percentage    => 15, # At least 3 of the 15 readings
    from          => '15min',
    contact_group => 'analytics',
    under         => true
}

Event Timeline

Change 283667 had a related patch set uploaded (by Faidon Liambotis):
Kill eventlogging_NavigationTiming_throughput alert

https://gerrit.wikimedia.org/r/283667

Change 283667 merged by Faidon Liambotis:
Kill eventlogging_NavigationTiming_throughput alert

https://gerrit.wikimedia.org/r/283667

Change 283673 had a related patch set uploaded (by Ottomata):
Adjust eventlogging icinga alert thresholds

https://gerrit.wikimedia.org/r/283673

Change 283673 merged by Ottomata:
Adjust eventlogging icinga alert thresholds

https://gerrit.wikimedia.org/r/283673

By the way, the alert has still been showing up as UNKNOWN in Icinga for the last 2 days:

Throughput of EventLogging NavigationTiming events UNKNOWN 2016-04-18 08:24:56 2d 16h 27m 3s 3/3 UNKNOWN: No valid datapoints found

FWIW, it looks like the check is no longer UNKNOWN, but it has been WARNING for the last 4 hours:

Throughput of EventLogging NavigationTiming events
WARNING	2016-04-18 14:04:00	0d 4h 0m 50s	3/3	WARNING: 100.00% of data under the warning threshold [1.0]

@fgiunchedi: is the above warning likely related to the issue that I was experiencing this morning with EventLogging after restarting Kafka on kafka1018?

Ah! So there are some services on hafnium, managed by the Performance Team, that use this stuff. I noticed that the pykafka version used by statsv there was old, a version prone to failing after a broker restart. We had upgraded to a newer version on eventlog1001 a long time ago, but the package was never upgraded on hafnium.

I upgraded pykafka and restarted the statsv daemon. Ping @Krinkle and @ori.

@Ottomata It looks to be back up and working fine, but the restart appears to have caused a change in traffic.

https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=11&fullscreen&from=now-24h

Screen Shot 2016-04-18 at 20.03.16.png (screenshot attachment, 59 KB)

https://grafana.wikimedia.org/dashboard/db/mw-js-deprecate?from=now-24h&var-Step=1h

Screen Shot 2016-04-18 at 20.03.37.png (screenshot attachment, 145 KB)

Hits were much higher than normal in the hours after the restart. That is expected after a restart following downtime, since the consumer would be catching up from where it left off before it went down. However, in this case the amount reported during the catch-up phase is much more than what you get by adding up the lost hours. Something doesn't add up.

Perhaps it started too far back, using the wrong offset to start from?

Talked a bit with Timo on IRC. The new version of pykafka changed the default value of auto_offset_reset from latest to earliest. Since the statsv consumers are not committing offsets, and do not specify auto_offset_reset, they picked up this new default value and consumed from the earliest messages stored in Kafka.

The quick fix is to set auto_offset_reset=-1 (latest) in the topic.get_simple_consumer() call in statsv.py.
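
For reference, here is a minimal sketch of that fix using pykafka's SimpleConsumer API. This is not the actual statsv code: the broker address is a placeholder and the topic name is assumed for illustration.

from pykafka import KafkaClient
from pykafka.common import OffsetType

# Placeholder broker address and assumed topic name, for illustration only.
client = KafkaClient(hosts='localhost:9092')
topic = client.topics[b'statsv']

# With no committed offsets, the consumer falls back to auto_offset_reset.
# Newer pykafka defaults this to EARLIEST, which replays the whole topic
# after a restart; pinning it to LATEST (-1) resumes from new messages only.
consumer = topic.get_simple_consumer(auto_offset_reset=OffsetType.LATEST)

for message in consumer:
    if message is not None:
        print(message.offset, message.value)

OffsetType.LATEST is the same -1 value mentioned above; since the statsv consumers never commit offsets, this setting determines where they start reading after a restart.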

akosiaris triaged this task as Medium priority. Apr 20 2016, 11:19 AM

Change 284836 had a related patch set uploaded (by Krinkle):
statsv: Start at offset LATEST instead of the default EARLIEST

https://gerrit.wikimedia.org/r/284836

Change 284836 merged by Ori.livneh:
statsv: Start at offset LATEST instead of the default EARLIEST

https://gerrit.wikimedia.org/r/284836

Mentioned in SAL [2016-04-24T20:04:44Z] <ori> Deployed change Ib7e248ccf to statsv (commit id 5323cece2b3; task T132770)

Looks like the alert itself is no longer UNKNOWN. @Krinkle, is there anything else to do within this task?

Krinkle closed this task as Resolved. Edited May 7 2016, 2:35 PM

Nope, let's close this.