pybal backends health check streamlb could not depool server
Closed, ResolvedPublic

Description

During or after the rcs1* reboots, Icinga raised PyBal alerts saying it could not depool servers:

17:56  <mutante> !log rcs1001 - depool from rcstream service
17:56  <mutante> arg, the other one.. 1002
17:56 -icinga-wm:#wikimedia-operations- RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
17:57  <mutante> !log rcs1002 - the last message was about 1002
17:59  <mutante> !log rcs1002 - traffic graph flat in ganglia, reboot
18:01 -icinga-wm:#wikimedia-operations- PROBLEM - Host rcs1002 is DOWN: PING CRITICAL - Packet loss = 100%
18:04 -icinga-wm:#wikimedia-operations- RECOVERY - Host rcs1002 is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms
18:05  <mutante> !log repooling rcs1002
18:10  <mutante> !log depool rcs1001 
18:14  <mutante> !log rebooting rcs1001
18:16 -icinga-wm:#wikimedia-operations- PROBLEM - Host rcs1001 is DOWN: PING CRITICAL - Packet loss = 100%
18:19 -icinga-wm:#wikimedia-operations- PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
18:19 -icinga-wm:#wikimedia-operations- RECOVERY - Host rcs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
18:20  <mutante> !log repooling rcs1001
18:24 -icinga-wm:#wikimedia-operations- PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
18:25 -icinga-wm:#wikimedia-operations- PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
18:29 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on tin is OK: DISK OK
18:32  <mutante> the PyBal checks there should recover soon
18:32  <mutante> both rcs backends are repooled
18:33  <mutante> and the service is up

Note that the service is up; monitoring the service from labs also shows that it flapped twice.
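For context: the "Could not depool server ... because of too many down!" message presumably comes from PyBal's depool-threshold safeguard, which refuses to depool a backend if doing so would leave too few servers pooled. Below is a minimal sketch of that kind of check, assuming a two-backend streamlb pool (rcs1001/rcs1002) and a 0.5 threshold; this is an illustration, not PyBal's actual code.

# Sketch of a depool-threshold safeguard (assumed behaviour, not PyBal source).
def can_depool(total_servers, up_servers, depool_threshold=0.5):
    # Allow a depool only if at least depool_threshold * total_servers
    # servers would still be up afterwards.
    return (up_servers - 1) >= total_servers * depool_threshold

# With a 2-backend pool and a 0.5 threshold, depooling is only allowed
# while both backends are still up:
print(can_depool(total_servers=2, up_servers=2))  # True: one backend may be taken down
print(can_depool(total_servers=2, up_servers=1))  # False: "too many down", depool refused

If that is what happened here, the alert is expected while one rcs backend is rebooting and the other is flapping, and it should clear once both backends are repooled and seen as up.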

Event Timeline

Mentioned in SAL [2016-03-16T19:01:36Z] <godog> restart pybal on lvs1011 T130143

Restarted pybal on standby lvs1011, which seems to have cleared the error:

lvs1011:~$ sudo service pybal stop
lvs1011:~$ ps fwuax | grep -i pybal
filippo   9630  0.0  0.0  12728  2124 pts/0    S+   19:01   0:00              \_ grep -i pybal
lvs1011:~$ sudo service pybal start
lvs1011:~$ curl localhost:9090/alerts
OKlvs1011:~$
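The Icinga "PyBal backends health check" presumably polls this same instrumentation endpoint. A minimal probe sketch follows, using the port (9090) and /alerts path from the transcript above; the OK/CRITICAL exit-code convention mirrors the Icinga output and is otherwise my own assumption.

# Sketch of a PyBal alerts probe (assumptions: port 9090, /alerts path, plain-text body).
import sys
import urllib.request

def fetch_pybal_alerts(host="localhost", port=9090, timeout=5):
    url = "http://{}:{}/alerts".format(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace").strip()

if __name__ == "__main__":
    body = fetch_pybal_alerts()
    if body == "OK":
        print("PYBAL OK")
        sys.exit(0)
    print("PYBAL CRITICAL - {}".format(body))
    sys.exit(2)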

lvs1002 and lvs1005 (standby) also report the error; all are in class high-traffic2, together with lvs1011 and lvs1008.

Mentioned in SAL [2016-03-16T19:31:29Z] <godog> restart pybal on lvs1005 T130143

I'll leave lvs1002 alone for diagnostic purposes, but a simple service pybal restart fixes it.

@fgiunchedi pybal doesn't report any alarm anymore on lvs1002...

@Joe indeed, I was confused because not all LVS hosts in high-traffic2 were reporting the error, but that's due to T112781 and T104458.

fgiunchedi claimed this task.