"Throughput of EventLogging NavigationTiming events" UNKNOWN
Closed, Resolved (Public)

Description

It looks like this alert has been UNKNOWN for the last 80 days in Icinga; the metric has likely disappeared or been renamed. @Ottomata, perhaps?

monitoring::graphite_threshold { 'eventlogging_NavigationTiming_throughput':
    description   => 'Throughput of EventLogging NavigationTiming events',
    metric        => "kafka.${kafka_brokers_graphite_wildcard}.kafka.server.BrokerTopicMetrics.MessagesInPerSec.eventlogging_NavigationTiming.OneMinuteRate",
    warning       => 1,
    critical      => 0,
    percentage    => 15, # At least 3 of the 15 readings
    from          => '15min',
    contact_group => 'analytics',
    under         => true
}

Event Timeline

Change 283667 had a related patch set uploaded (by Faidon Liambotis):
Kill eventlogging_NavigationTiming_throughput alert

https://gerrit.wikimedia.org/r/283667

Change 283667 merged by Faidon Liambotis:
Kill eventlogging_NavigationTiming_throughput alert

https://gerrit.wikimedia.org/r/283667

Change 283673 had a related patch set uploaded (by Ottomata):
Adjust eventlogging icinga alert thresholds

https://gerrit.wikimedia.org/r/283673

Change 283673 merged by Ottomata:
Adjust eventlogging icinga alert thresholds

https://gerrit.wikimedia.org/r/283673

By the way, the alert has still been showing up as UNKNOWN in Icinga for the last 2 days:

Throughput of EventLogging NavigationTiming events UNKNOWN 2016-04-18 08:24:56 2d 16h 27m 3s 3/3 UNKNOWN: No valid datapoints found

FWIW, it looks like the check is no longer UNKNOWN, but it has been WARNING for the last 4 hours:

Throughput of EventLogging NavigationTiming events
WARNING	2016-04-18 14:04:00	0d 4h 0m 50s	3/3	WARNING: 100.00% of data under the warning threshold [1.0]

@fgiunchedi: is the above warning likely related to the issue that I was experiencing this morning with EventLogging after restarting Kafka on kafka1018?

Ah! So there are some services on hafnium, managed by the Performance Team, that use this stuff. I noticed that the pykafka version used by statsv there was old, a version prone to failing after a broker restart. We had upgraded to a newer version on eventlog1001 a long time ago, but the package was never upgraded on hafnium.

I upgraded pykafka and restarted the statsv daemon. Ping @Krinkle and @ori.

@Ottomata It looks to be back up and working fine, but the restart appears to have caused a change in traffic.

https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=11&fullscreen&from=now-24h

Screen Shot 2016-04-18 at 20.03.16.png (screenshot attachment, 59 KB)

https://grafana.wikimedia.org/dashboard/db/mw-js-deprecate?from=now-24h&var-Step=1h

Screen Shot 2016-04-18 at 20.03.37.png (screenshot attachment, 145 KB)

Hits were much higher than normal in the hours after the restart. That is expected after a restart following downtime, since the consumer would be catching up from where it left off before it went down. However, in this case the amount reported during the catch-up phase is much more than what you get by adding up the lost hours. Something doesn't add up.

Perhaps it started too far back, using the wrong offset to start from?

Talked a bit with Timo on IRC. The new version of pykafka changed the default value of auto_offset_reset from latest to earliest. Since the statsv consumers are not committing offsets, and do not specify auto_offset_reset, they picked up this new default value and consumed from the earliest messages stored in Kafka.

The quick fix is to set auto_offset_reset=-1 (latest) in the topic.get_simple_consumer() call in statsv.py.
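
For reference, here is a minimal sketch of that fix using pykafka's SimpleConsumer API. This is not the actual statsv code: the broker address is a placeholder and the topic name is assumed for illustration.

from pykafka import KafkaClient
from pykafka.common import OffsetType

# Placeholder broker address and assumed topic name, for illustration only.
client = KafkaClient(hosts='localhost:9092')
topic = client.topics[b'statsv']

# With no committed offsets, the consumer falls back to auto_offset_reset.
# Newer pykafka defaults this to EARLIEST, which replays the whole topic
# after a restart; pinning it to LATEST (-1) resumes from new messages only.
consumer = topic.get_simple_consumer(auto_offset_reset=OffsetType.LATEST)

for message in consumer:
    if message is not None:
        print(message.offset, message.value)

OffsetType.LATEST is the same -1 value mentioned above; since the statsv consumers never commit offsets, this setting determines where they start reading after a restart.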

akosiaris triaged this task as Medium priority. Apr 20 2016, 11:19 AM

Change 284836 had a related patch set uploaded (by Krinkle):
statsv: Start at offset LATEST instead of the default EARLIEST

https://gerrit.wikimedia.org/r/284836

Change 284836 merged by Ori.livneh:
statsv: Start at offset LATEST instead of the default EARLIEST

https://gerrit.wikimedia.org/r/284836

Mentioned in SAL [2016-04-24T20:04:44Z] <ori> Deployed change Ib7e248ccf to statsv (commit id 5323cece2b3; task T132770)

Looks like the alert itself is no longer UNKNOWN. @Krinkle, is there anything else to do within this task?

Krinkle closed this task as Resolved. Edited May 7 2016, 2:35 PM

Nope, let's close this.