
Troubleshoot EventLogging validation alerts {oryx} [3 pts]
Closed, Resolved (Public)

Description

Over the last few days, EventLogging has been sending validation alerts that report a slight difference between the number of raw and validated events. These alerts normally last about 2 minutes.
We should look into this to determine the root cause, even though the alerts do not seem to be critical.

Event Timeline

mforns claimed this task.
mforns raised the priority of this task from to Needs Triage.
mforns updated the task description.
mforns added a project: Analytics-Kanban.
mforns subscribed.

After listing all alert timestamps and durations, I could not find any matching anomaly in Graphite, the logs, or the database.

Dan suggested that Icinga may be getting only partially updated metrics from Graphite. I have often seen Graphite metrics update in odd ways, taking seconds or even minutes before the values become consistent. Icinga is probably querying Graphite for the last minute's metrics while they are still incomplete, which generates false alerts.
The fact that all alerts have approximately the same 2-minute duration also makes me think this theory is correct.

As there is no sign of a malfunction at the time of the alerts, I will move this task to done.
If possible, we could consider making Icinga query the metrics with a 1-minute lag.
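
To illustrate the lagged-query idea, here is a minimal Python sketch. It is not how the Icinga check is actually configured. It assumes the metrics are read through Graphite's render API with `from`/`until` relative times and `format=json`; the host and metric names are hypothetical placeholders. The second helper shows why a not-yet-written datapoint (returned as null by Graphite) in the newest bucket can look like a raw/validated mismatch.

```python
from urllib.parse import urlencode


def build_render_url(base_url, target, lag_minutes=1, window_minutes=1):
    """Build a Graphite render URL for a window that ends lag_minutes ago.

    With lag_minutes=1 the newest, possibly incomplete minute is skipped.
    """
    params = {
        "target": target,
        "from": f"-{lag_minutes + window_minutes}min",
        "until": f"-{lag_minutes}min",
        "format": "json",
    }
    return f"{base_url}/render?{urlencode(params)}"


def total(datapoints):
    """Sum Graphite [value, timestamp] pairs, skipping null (None) values."""
    return sum(value for value, _ts in datapoints if value is not None)


def relative_difference(raw_points, valid_points):
    """Relative gap between raw and validated event counts (0.0 if no raw events)."""
    raw = total(raw_points)
    valid = total(valid_points)
    if raw == 0:
        return 0.0
    return abs(raw - valid) / raw


# Hypothetical host and metric name, for illustration only.
url = build_render_url("https://graphite.example.org", "eventlogging.raw.rate")

# Newest bucket of the validated series has not been written yet (None):
incomplete = relative_difference([[10, 60], [10, 120]], [[10, 60], [None, 120]])
# Same series once both buckets are complete:
complete = relative_difference([[10, 60], [10, 120]], [[10, 60], [10, 120]])
```

In this sketch the incomplete window reports a 50% difference while the complete one reports none, which is the kind of short-lived false alert described above.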

ggellerman renamed this task from Troubleshoot EventLogging validation alerts to Troubleshoot EventLogging validation alerts [3 pts]. Jul 22 2015, 3:49 PM
ggellerman set Security to None.
kevinator renamed this task from Troubleshoot EventLogging validation alerts [3 pts] to Troubleshoot EventLogging validation alerts {oryx} [3 pts]. Jul 22 2015, 5:23 PM

Verifying that this is done: it's not an EventLogging problem.
New task created to adjust the Icinga alerts: T106495