
Only alert for high latency if there is enough data to make a sensible average
Closed, Resolved · Public

Description

The elasticsearch cluster in eqiad has alerted a few times since the datacenter switchover, but the alerts aren't meaningful. Metrics report QPS numbers between 0 and 0.0002; basically the cluster is idle and some random slow maintenance query can trigger the alert. We should implement some way to alert only if the cluster's QPS reaches some minimum threshold (1 qps?) that ensures the average latency statistics are meaningful.
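
Roughly, the behaviour we want looks like the following Python sketch. This is purely illustrative: the function and constants are made up, not the actual Icinga/Graphite check; the 500ms figure is just the current latency alert threshold and 1 qps is the suggested minimum.

from statistics import mean

MIN_QPS = 1.0            # below this, latency averages aren't meaningful
P95_THRESHOLD_MS = 500.0 # current p95 latency alert threshold

def should_alert(qps_samples: list[float], p95_samples: list[float]) -> bool:
    # Not enough data, or the cluster is effectively idle: skip the check,
    # since a single slow maintenance query would dominate the average
    if not qps_samples or not p95_samples or mean(qps_samples) < MIN_QPS:
        return False
    # Otherwise apply the usual p95 latency threshold
    return mean(p95_samples) > P95_THRESHOLD_MS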

Event Timeline

Change 960712 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960712

Adding T346945 as a parent task, as we'll want to revisit this during the next datacenter switchover, basically to make sure we still get alerts with this new query even when eqiad is the active datacenter.

Change 960712 merged by Ryan Kemper:

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960712

Change 960728 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "elastic: don't alert p95 if request volume low"

https://gerrit.wikimedia.org/r/960728

Change 960728 merged by Ryan Kemper:

[operations/puppet@production] Revert "elastic: don't alert p95 if request volume low"

https://gerrit.wikimedia.org/r/960728

Change 960717 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960717

Change 960717 merged by Ryan Kemper:

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960717

Okay, @EBernhardson, @bking and I think we've found an approach that actually works:

fallbackSeries(useSeriesAbove(transformNull(MediaWiki.CirrusSearch.$datacenter.requestTimeMs.comp_suggest.sample_rate, 0), 10, "requestTimeMs.comp_suggest.sample_rate", "requestTime.p95"), constantLine(0))

ryankemper@alert1001:~$ sudo journalctl -u icinga -f | grep -i cirrus
Sep 25 22:27:14 alert1001 icinga[22386]: SERVICE ALERT: graphite1005;CirrusSearch eqiad 95th percentile latency;OK;HARD;3;OK: Less than 20.00% above the threshold [500.0]
The current solution

We want to alert only if qps is at an acceptably high level (i.e. the datacenter is actually getting actively used).

Graphite has a useSeriesAbove function which, if a series' maximum is above an arbitrary threshold, performs a string substitution to change the metric actually being queried.

This lets us use the actual p95 metric only when QPS is high enough. However, that led to an issue where, for short time ranges such as our 10-minute alert window, we ultimately got a null series because at no point in the range was the QPS metric sufficiently high.

We patched over that issue by using fallbackSeries, which lets us fall back to an arbitrary series if there is no metric (i.e. we have a null series). The final piece of the hack is using a constantLine of 0 as our fallback series.

So basically, for any evaluation window where our QPS is too low, we fill the values with 0s instead.

TL;DR: dark Graphite magic
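
For anyone who doesn't speak Graphite, here's a rough Python analogue of what the expression above ends up doing for a single alert window. The helper names are made up for illustration and don't correspond to any real API; in reality Graphite evaluates this per-series server-side.

from typing import Optional

QPS_THRESHOLD = 10.0  # mirrors the "10" passed to useSeriesAbove

def p95_alert_series(qps_series: list[Optional[float]],
                     p95_series: list[float]) -> list[float]:
    # transformNull(..., 0): treat missing qps datapoints as 0
    qps = [0.0 if v is None else v for v in qps_series]

    # useSeriesAbove(...): only switch to the real p95 metric if the qps
    # series peaks above the threshold; otherwise the substitution yields
    # no series at all
    if max(qps, default=0.0) > QPS_THRESHOLD:
        return p95_series

    # fallbackSeries(..., constantLine(0)): when there is no series
    # (a low-traffic window), fall back to a flat line of zeros, which
    # can never trip the latency threshold
    return [0.0] * len(qps_series)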

Change 960721 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: standardize eqiad & codfw p95 metrics

https://gerrit.wikimedia.org/r/960721

Change 960721 merged by Ryan Kemper:

[operations/puppet@production] elastic: standardize eqiad & codfw p95 metrics

https://gerrit.wikimedia.org/r/960721

RKemper claimed this task.
RKemper triaged this task as High priority.

We've also applied the fix to codfw and made the thresholds equal between both datacenters, since they're back to using the same metric again.

This ticket got worked on before it went through the triage process, so I just set the priority to High (how we treated it) and set the status to Resolved. Hopefully that was the correct thing to do!