
pybal: race condition in alerts instrumentation
Closed, ResolvedPublic

Description

Due to a race condition in the alerts instrumentation, the output of /alerts can disagree with the state reported by /pools.

$ curl -s http://localhost:9090/alerts
search_9200 - Could not depool server elastic1051.eqiad.wmnet because of too many down!

However, everything is fine with the host, both pybal-wise and in ipvsadm:

$ curl -s http://localhost:9090/pools/search_9200/elastic1051.eqiad.wmnet
enabled/up/pooled
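
The manual comparison above can be scripted. The following is a minimal sketch, not part of PyBal: it assumes the alert line format and the `enabled/up/pooled` status string are exactly as shown in the two curl outputs, and the helper names (`parse_alerts`, `http_get`, `stale_alerts`) are hypothetical.

```python
import re
from urllib.request import urlopen

# Matches only the alert format quoted in this task; any other
# alert wording is an unknown here and is simply skipped.
ALERT_RE = re.compile(
    r"^(?P<pool>\S+) - Could not depool server (?P<host>\S+) "
    r"because of too many down!$"
)


def parse_alerts(text):
    """Return (pool, host) pairs found in the /alerts output."""
    pairs = []
    for line in text.splitlines():
        match = ALERT_RE.match(line.strip())
        if match:
            pairs.append((match.group("pool"), match.group("host")))
    return pairs


def http_get(url):
    """Fetch a URL and return its body as stripped text."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8").strip()


def stale_alerts(base_url, get=http_get):
    """Return alerts whose host /pools reports as enabled/up/pooled.

    `get` is injectable so the check can be exercised without a
    running PyBal instance.
    """
    stale = []
    for pool, host in parse_alerts(get(f"{base_url}/alerts")):
        state = get(f"{base_url}/pools/{pool}/{host}").strip()
        if state == "enabled/up/pooled":
            stale.append((pool, host))
    return stale
```

Calling `stale_alerts("http://localhost:9090")` on an affected host would list the (pool, server) pairs for which the alert and the pool state disagree; an empty list means no discrepancy was found for this alert type.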

An old patch from @Joe seems related. I've included it in PyBal 1.14.0, currently being tested on pybal-test200[123]. We'll see if it does fix this issue; the current workaround is restarting pybal.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Sep 21 2017, 7:06 AM
ema updated the task description.
ema added a subscriber: Gehel.

Mentioned in SAL (#wikimedia-operations) [2017-09-21T07:09:19Z] <ema> bounce pybal on lvs1003 to clear stale alert T176388

fgiunchedi claimed this task.
fgiunchedi subscribed.

AFAICT this issue hasn't reoccurred, so boldly resolving.