Alerts on LVS services with one single realserver
Closed, Resolved, Public

Description

PyBal 1.14.0 checks for badly configured pools. An Icinga warning is emitted whenever a given pool is too small to allow depooling.

The initial check used Coordinator.canDepool(), so a warning was emitted whenever the following condition was False:

len(self.servers) - len(downServers) >= len(self.servers) * self.lvsservice.getDepoolThreshold()

The alerting code was later changed; an Icinga warning is now raised if the following is True:

total < (total * crd.lvsservice.getDepoolThreshold() + 1)

However, the second approach always generates an alert if an LVS service is configured with a single real server and the depool threshold is greater than zero: with total = 1 the condition reduces to 1 < threshold + 1, which holds for any positive threshold. This is currently the case for git-ssh4_22 and git-ssh6_22, which have only phab1001-vcs.eqiad.wmnet in their realserver list.
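The behaviour described above can be reproduced with a small standalone sketch. It does not import PyBal; the function names `can_depool` and `alerts_new`, the `down` parameter and the sample threshold values are assumptions made for illustration, and only the two expressions are taken from this task.

```python
def can_depool(total, down, threshold):
    """Initial check, modelled on the Coordinator.canDepool() expression.

    total: number of servers in the pool (hypothetical parameter name)
    down: number of servers currently down (hypothetical parameter name)
    threshold: depool threshold, a fraction between 0 and 1
    """
    return total - down >= total * threshold


def alerts_new(total, threshold):
    """Later check: True means an Icinga warning would be raised."""
    return total < (total * threshold + 1)


# A pool with a single real server alerts for every positive threshold.
for threshold in (0.1, 0.5, 0.9):
    assert alerts_new(1, threshold)

# With the threshold set to zero the same pool no longer alerts.
assert not alerts_new(1, 0)

# A larger pool with a moderate threshold does not alert.
assert not alerts_new(4, 0.5)
```

The sample values are arbitrary; the point is only that the `+ 1` term makes the later condition true for a pool of one whenever the threshold is above zero.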

We should decide whether to disallow services with just one server (after all, such a server cannot really be depooled, so the depool threshold should be set to zero for those pools) or to change the alert condition.

Event Timeline

ema triaged this task as Medium priority. Oct 10 2017, 6:09 AM
ema moved this task from Backlog to LoadBalancer on the Traffic board.

I suggest adding a condition to the alert so that it is skipped when the pool contains only one backend.
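A rough sketch of what such a condition could look like follows. This is not the merged patch; the function name `pool_is_misconfigured` and its parameters are hypothetical, and the only assumption made is that a pool of exactly one server is treated as intentional rather than misconfigured.

```python
def pool_is_misconfigured(total, threshold):
    """Return True if an Icinga warning should be raised for the pool.

    total: number of servers in the pool (hypothetical parameter name)
    threshold: depool threshold, a fraction between 0 and 1
    """
    # Assumption: a single-server pool cannot be depooled by design,
    # so it is skipped instead of being reported.
    if total == 1:
        return False
    # Otherwise apply the alert condition quoted in the description.
    return total < (total * threshold + 1)


# Single-server pools such as git-ssh4_22 would no longer alert.
assert not pool_is_misconfigured(1, 0.5)
# A two-server pool with a high threshold is still reported.
assert pool_is_misconfigured(2, 0.9)
```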

Change 383591 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] instrumentation: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383591

Change 383591 merged by Ema:
[operations/debs/pybal@master] instrumentation: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383591

Change 383804 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] instrumentation: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383804

Change 383804 merged by Ema:
[operations/debs/pybal@1.14] instrumentation: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383804

Change 383805 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] 1.14.1: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383805

Change 383807 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] 1.14.1: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383807

Change 383805 merged by Ema:
[operations/debs/pybal@master] 1.14.1: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383805

Change 383807 merged by Ema:
[operations/debs/pybal@1.14] 1.14.1: pools with one server are not misconfigured

https://gerrit.wikimedia.org/r/383807

Mentioned in SAL (#wikimedia-operations) [2017-10-13T08:18:01Z] <ema> upgrade pybal on lvs1006 to 1.14.1 T177815

Mentioned in SAL (#wikimedia-operations) [2017-10-13T08:34:09Z] <ema> upgrade pybal on lvs1003 to 1.14.1 T177815

Mentioned in SAL (#wikimedia-operations) [2017-10-16T09:54:33Z] <ema> upgrading esams LVSs to pybal 1.14.2 (T178149, T177815)

Mentioned in SAL (#wikimedia-operations) [2017-10-16T10:03:13Z] <ema> upgrading codfw LVSs to pybal 1.14.2 (T178149, T177815)

Mentioned in SAL (#wikimedia-operations) [2017-10-16T12:38:38Z] <ema> upgrading eqiad LVSs to pybal 1.14.2 (T178149, T177815)

ema claimed this task.

PyBal upgraded to 1.14.2 on all LVS hosts.