
Only alert for high latency if there is enough data to make a sensible average
Closed, Resolved · Public

Description

The elasticsearch cluster in eqiad has alerted a few times since the datacenter switchover, but the alerts aren't meaningful. Metrics report QPS numbers between 0 and 0.0002; basically the cluster is idle and some random slow maintenance query can trigger the alert. We should implement some way to alert only if the cluster's QPS reaches some minimum threshold (1 qps?) that ensures the average latency statistics are meaningful.
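
Roughly, the behaviour we want looks like the following Python sketch. This is purely illustrative: the function and constants are made up, not the actual Icinga/Graphite check; the 500ms figure is just the current latency alert threshold and 1 qps is the suggested minimum.

from statistics import mean

MIN_QPS = 1.0            # below this, latency averages aren't meaningful
P95_THRESHOLD_MS = 500.0 # current p95 latency alert threshold

def should_alert(qps_samples: list[float], p95_samples: list[float]) -> bool:
    # Not enough data, or the cluster is effectively idle: skip the check,
    # since a single slow maintenance query would dominate the average
    if not qps_samples or not p95_samples or mean(qps_samples) < MIN_QPS:
        return False
    # Otherwise apply the usual p95 latency threshold
    return mean(p95_samples) > P95_THRESHOLD_MS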

Event Timeline

Change 960712 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960712

Adding T346945 as a parent task, as we'll want to revisit this during the next datacenter switchover, basically to make sure we still get alerts with this new query even when eqiad is the active datacenter.

Change 960712 merged by Ryan Kemper:

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960712

Change 960728 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "elastic: don't alert p95 if request volume low"

https://gerrit.wikimedia.org/r/960728

Change 960728 merged by Ryan Kemper:

[operations/puppet@production] Revert "elastic: don't alert p95 if request volume low"

https://gerrit.wikimedia.org/r/960728

Change 960717 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960717

Change 960717 merged by Ryan Kemper:

[operations/puppet@production] elastic: don't alert p95 if request volume low

https://gerrit.wikimedia.org/r/960717

Okay, @EBernhardson, @bking and I think we've found an approach that actually works:

fallbackSeries(useSeriesAbove(transformNull(MediaWiki.CirrusSearch.$datacenter.requestTimeMs.comp_suggest.sample_rate, 0), 10, "requestTimeMs.comp_suggest.sample_rate", "requestTime.p95"), constantLine(0))

ryankemper@alert1001:~$ sudo journalctl -u icinga -f | grep -i cirrus
Sep 25 22:27:14 alert1001 icinga[22386]: SERVICE ALERT: graphite1005;CirrusSearch eqiad 95th percentile latency;OK;HARD;3;OK: Less than 20.00% above the threshold [500.0]
The current solution

We want to alert only if qps is at an acceptably high level (i.e. the datacenter is actually getting actively used).

Graphite has a useSeriesAbove function which, if a series' maximum is above an arbitrary threshold, performs a string substitution to change the metric actually being queried.

This lets us use the actual p95 metric only when QPS is high enough. However, that led to an issue where, for short time ranges such as our 10-minute alert window, we ultimately got a null series because at no point in the range was the QPS metric sufficiently high.

We patched over that issue by using fallbackSeries, which lets us fall back to an arbitrary series if there is no metric (i.e. we have a null series). The final piece of the hack is using a constantLine of 0 as our fallback series.

So basically, for any evaluation window where our QPS is too low, we fill the values with 0s instead.

TL;DR: dark Graphite magic
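
For anyone who doesn't speak Graphite, here's a rough Python analogue of what the expression above ends up doing for a single alert window. The helper names are made up for illustration and don't correspond to any real API; in reality Graphite evaluates this per-series server-side.

from typing import Optional

QPS_THRESHOLD = 10.0  # mirrors the "10" passed to useSeriesAbove

def p95_alert_series(qps_series: list[Optional[float]],
                     p95_series: list[float]) -> list[float]:
    # transformNull(..., 0): treat missing qps datapoints as 0
    qps = [0.0 if v is None else v for v in qps_series]

    # useSeriesAbove(...): only switch to the real p95 metric if the qps
    # series peaks above the threshold; otherwise the substitution yields
    # no series at all
    if max(qps, default=0.0) > QPS_THRESHOLD:
        return p95_series

    # fallbackSeries(..., constantLine(0)): when there is no series
    # (a low-traffic window), fall back to a flat line of zeros, which
    # can never trip the latency threshold
    return [0.0] * len(qps_series)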

Change 960721 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: standardize eqiad & codfw p95 metrics

https://gerrit.wikimedia.org/r/960721

Change 960721 merged by Ryan Kemper:

[operations/puppet@production] elastic: standardize eqiad & codfw p95 metrics

https://gerrit.wikimedia.org/r/960721

RKemper claimed this task.
RKemper triaged this task as High priority.

We've also applied the fix to codfw and made the thresholds equal between both datacenters, since they're back to using the same metric again.

This ticket got worked on before it went through the triage process, so I just set the priority to High (how we treated it) and set the status to Resolved. Hopefully that was the correct thing to do!